[Paper Reading] SyncPerf: Categorizing, Detecting, and Diagnosing Synchronization Performance Bugs

This is a preliminary paper record that focuses on translation; it will later be revised to describe only the core ideas.

Welcome to Github: https://github.com/MercuryLc/paper_reading

This paper explains how to detect and diagnose performance problems caused by lock contention.

Abstract

Despite the obvious importance, performance issues related to synchronization primitives are still lacking adequate attention. No literature extensively investigates categories, root causes, and fixing strategies of such performance issues. Existing work primarily focuses on one type of problems, while ignoring other important categories. Moreover, they leave the burden of identifying root causes to programmers.

This paper first conducts an extensive study of categories, root causes, and fixing strategies of performance issues related to explicit synchronization primitives. Based on this study, we develop two tools to identify root causes of a range of performance issues. Compared with existing work, our proposal, SyncPerf, has three unique advantages. First, SyncPerf’s detection is very lightweight, with 2.3% performance overhead on average.

Second, SyncPerf integrates information based on callsites, lock variables, and types of threads. Such integration helps identify more latent problems. Last but not least, when multiple root causes generate the same behavior, SyncPerf provides a second analysis tool that collects detailed accesses inside critical sections and helps identify possible root causes. SyncPerf discovers many unknown but significant synchronization performance issues. Fixing them provides a performance gain anywhere from 2.5% to 42%. Low overhead, better coverage, and informative reports make SyncPerf an effective tool to find synchronization performance bugs in the production environment.

1. Introduction

Designing efficient multithreaded programs while maintaining their correctness is not an easy task. Performance issues of multithreaded programs, despite being the primary cause of more than 22% of synchronization fixes in server programs [23], get less attention than they deserve. There are many prior works [8, 19, 21, 23, 28, 41, 43] in this domain, but none of them systematically investigates performance issues related to different types of synchronization primitives. Most existing works can neither identify root causes nor provide helpful fixing strategies.

This paper studies various categories, root causes, and fixing strategies of performance issues related to different synchronization primitives such as locks, conditional variables, and barriers. Lock/wait-free techniques and other mechanisms (such as transactional memory [20]) are not covered in this paper. The study divides synchronization related performance issues into five categories: improper primitives, improper granularity, over-synchronization, asymmetric contention, and load imbalance (shown in Table 1). The first four categories are related to various locks, whereas the last one is related to other synchronizations such as conditional variables and barriers.

The study shows that the same symptom can be caused by multiple root causes. For example, high contention of locks can occur due to too many data items under the same lock, too-large critical sections, over-synchronization, or asymmetric lock contention (more details in Section 2). Without knowing the root cause, it is difficult for programmers to fix these bugs effectively. The study also shows that different categories of problems may have different symptoms and thus, different solutions. Finally, the study presents some ideas for identifying and fixing these performance issues. The study not only helps users to identify and fix synchronization performance issues, but also enables future research in this domain.

Prior work [8, 19, 21, 23, 28, 41, 43] focuses excessively on locks that are both acquired frequently and highly contended. Our first observation is that performance problems can also occur with locks that are not excessively acquired or highly contended. This is shown in Figure 1. Existing work focuses on quadrant 2 or Q2. Locks of Q2 can definitely cause performance issues but they are not the only culprits.

SyncPerf finds potential problems with the other two quadrants:

(i) locks that are not acquired many times may slow down a program if the critical sections are large and potentially introduce high contention and/or a long waiting time (Q1);

(ii) locks that are acquired excessively may cause significant performance problems, even if they are barely contended (Q4).

Intuitively, locks of Q3 (lowly contended and not acquired many times) will not cause performance problems. Our second observation is that it is not always sufficient to identify root causes of a problem based on the behavior of a single synchronization. For example, for asymmetric contention where different locks are protecting similar data with different contention rates, we have to analyze the behavior of all those locks that typically have the same initialization and acquisition sites. By checking all of those locks together, we can notice that some locks may have higher contention and acquisition than others.

  • Note: for locks acquired at the same callsite, the acquisition count and contention rate of each lock can be analyzed.

Driven by these two intuitive but novel observations, we develop SyncPerf, which not only reports the callsites of performance issues, but also helps diagnose root causes and suggests possible fixes for a range of performance issues related to synchronization primitives. Most existing works [8, 28, 41] just report the callsites of the performance issues (mostly high contention locks), while leaving the burden of analyzing root causes (and finding possible fixes) to programmers. The only work similar to ours was proposed by Yu et al. [43]. However, SyncPerf surpasses it with a better detection ability (thanks to the novel observations), a broader scope, and much lower overhead (Section 6).

SyncPerf starts by monitoring the execution of an application and collecting information about explicit synchronization primitives.

More specifically, it collects (i) for a lock, how many times it is acquired, how many times it is found to be contended, and how long a thread waits for the lock, (ii) for a try-lock, how many times it is called and how many times it fails because of contention, and finally (iii) for load imbalance, how long different threads execute, and how long they are waiting for synchronizations.

SyncPerf also collects callsites for each synchronization operation and thread creation function to help pinpoint the actual problems.

After this, SyncPerf integrates and checks the collected information to identify root causes:

(i) it checks behavior of all locks with the same callsites to identify asymmetric contention issue,

(ii) it computes and compares waiting time of different threads to identify load imbalance issue,

and (iii) it checks individual as well as collective (based on callsites) information of locks (i.e., the number of acquisitions and number of times they are contended) to identify other performance issues. This integration is very important, and helps uncover more performance issues.

SyncPerf is able to find more performance issues than any prior work (Table 2). For some of the problems, such as asymmetric contention, and load imbalance, SyncPerf’s detection tool automatically reports root causes. It also presents an optimal task assignment to solve load imbalance problems. For other problems, SyncPerf provides sufficient information as well as an informal guideline to diagnose them manually. SyncPerf also provides an additional optional tool (that programmers can use offline) to help the diagnosis process.

Contributions

This paper provides a taxonomy of categories, root causes, and fixing strategies of performance bugs related to explicit synchronization primitives. The taxonomy is useful not only to identify and fix synchronization performance problems but also to enable future research in this field.

SyncPerf uses an intuitive observation that performance problems may occur even when locks are not frequently acquired or highly contended. There is no existing work that actually uses this observation. Due to this observation, SyncPerf finds many previously unknown performance issues in widely used applications.

SyncPerf makes a novel observation that it is hard to detect problems such as asymmetric contention and load imbalance by observing the behavior of a single synchronization. To solve this problem, SyncPerf proposes to integrate information based on callsites of lock acquisitions (and initializations), lock variables, and types of threads. This integration also contributes to the detection of some unknown issues.

Finally, SyncPerf provides two tools that help diagnose root causes of performance bugs.

The first one is a detection tool that can report susceptible callsites and synchronization variables with potential performance issues, and identify some root causes such as asymmetric contention and load imbalance. This tool has extremely low overhead (only 2.3%, on average). The tool achieves such low overhead even without using the sampling mechanism. The low overhead makes the tool a good candidate for the deployment environment. When multiple root causes may lead to the same behavior and thus, cannot be diagnosed easily, SyncPerf provides a heavyweight diagnosis tool that collects detailed accesses inside susceptible critical sections to ease the diagnosis process. Both of these tools are software-only tools that do not require any modification or recompilation of applications, and custom operating system or hardware support.

2. Overview

2.1 Categorization

2.1.1 Improper Primitives

Programmers may use a variety of synchronization primitives (e.g., atomic instructions, spin locks, try-locks, read/write locks, mutex locks, etc.) to protect shared accesses. These primitives impose different runtime overhead, increasing from atomic instructions to mutex locks. The spin lock of the pthread library, for example, incurs 50% less overhead than the mutex lock when there is no contention. However, during high contention, the spin lock may waste CPU cycles unnecessarily [1, 30].


When the acquired locks are of the same type, how can we tell whether the data they protect is the same block?

For example, if threads A and B each acquire a lock, how does the system determine that the locks acquired by A and B do not guard the same region?


Different synchronization primitives have different use cases. Atomic instructions are best suited to perform simple integer operations (e.g., read-modify-write, addition, subtraction, exchange, etc.) on shared variables [9, 34]. Spin locks are effective for small critical sections that have very few instructions but cannot be finished using a single atomic instruction [1, 30]. Read/write locks are useful for read-mostly critical sections [26, 32]. Try-locks allow a program to pursue an alternative path when locks are not available [38]. Finally, mutex locks are used when the critical sections contain waiting operations (e.g., conditional wait) and have multiple shared accesses. Any deviation from the preferred use cases may result in performance issues.
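
A minimal sketch of this distinction, assuming the critical section merely increments a shared counter: the mutex-protected update can be replaced by a single C11 atomic read-modify-write. All names here are illustrative:

```c
// "Improper primitive" in miniature: a heavyweight mutex-protected update
// versus a single atomic fetch-add for the same simple integer operation.
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter_mutex = 0;
static atomic_long counter_atomic = 0;

void add_with_mutex(long n) {        // lock, add, unlock on every call
    pthread_mutex_lock(&counter_lock);
    counter_mutex += n;
    pthread_mutex_unlock(&counter_lock);
}

void add_with_atomic(long n) {       // one atomic RMW instruction
    atomic_fetch_add(&counter_atomic, n);
}
```

Both functions compute the same result; the atomic version simply avoids the lock acquisition/release overhead, which is the kind of fix the facesim example above benefits from.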

Improper primitives (usually in Q2 and Q4) typically cause extensive try-lock failures or extensive lock acquisitions, but low to moderate contention. Extensive try-lock failures, where a try-lock fails immediately because the lock is held by another thread, indicate that we should use a blocking method that combines conditional variables with mutexes to avoid continuous trial. Extensive lock acquisitions may incur significant performance degradation even without high contention. The issue of improper primitives is ignored by existing work [8, 19, 23, 28, 41]. However, its importance can be seen from facesim application of PARSEC [3] where changing mutex locks to atomic instructions boosts performance by up to 30.7% (Table 2).
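
The suggested blocking fix can be sketched as follows, assuming a single-resource producer/consumer: instead of repeatedly calling `pthread_mutex_trylock`, the consumer sleeps on a condition variable until a flag is set. The flag and function names are hypothetical:

```c
// Replace a busy try-lock loop with condition-variable blocking.
#include <pthread.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int available = 0;

void produce(void) {                 // make the resource available
    pthread_mutex_lock(&m);
    available = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
}

int consume(void) {                  // block instead of busy trying
    pthread_mutex_lock(&m);
    while (!available)               // guard against spurious wakeups
        pthread_cond_wait(&cv, &m);
    available = 0;
    pthread_mutex_unlock(&m);
    return 1;
}
```

The waiting thread consumes no CPU while blocked, which is exactly what the continuous try-lock pattern wastes.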

2.1.2 Improper Granularity

Significant performance degradation may occur when locks are not used with a proper granularity. There are several cases listed as follows.

  1. If a lock protects too many data items (e.g., an entire hash table, as in the memcached-II bug of Table 2), the lock may introduce a lot of contention. Splitting a coarse-grained lock into multiple fine-grained locks helps improve performance.
  2. If a lock protects a large critical section with many instructions, it may cause high contention and thus, a significant slowdown. canneal of PARSEC, for example, has a critical section that includes a random number generator. Only a few instructions inside the critical section access the shared data. Although the number of acquisitions is only 15, performance is boosted by around 4% when we move the random number generator outside the critical section.
  3. If a critical section has very few instructions, then the overhead of lock acquisitions and releases may exceed the overhead of actual computations inside. In that case, the program can suffer from performance degradation [14]. One possible solution is to merge multiple locks into a single coarse-grained one.
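
The first fix (splitting a coarse-grained lock) can be sketched with a per-bucket lock table, in the spirit of the memcached-II fix; the sizes and names below are illustrative, not memcached's code:

```c
// Fine-grained locking: one lock per hash bucket instead of one table lock.
#include <pthread.h>

#define NBUCKETS 16

struct bucket { pthread_mutex_t lock; long value; };
static struct bucket table[NBUCKETS];

static unsigned bucket_of(unsigned key) { return key % NBUCKETS; }

void table_init(void) {
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&table[i].lock, NULL);
        table[i].value = 0;
    }
}

void table_add(unsigned key, long delta) {
    struct bucket *b = &table[bucket_of(key)];
    pthread_mutex_lock(&b->lock);    // contends only within one bucket
    b->value += delta;
    pthread_mutex_unlock(&b->lock);
}
```

Threads touching different buckets now proceed in parallel; only keys hashing to the same bucket still serialize.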

Identification:

Locks in the first two cases may incur significant contention. However, without knowing the memory accesses inside the critical section, it is hard to identify this type of problems manually.

Therefore, SyncPerf provides an additional diagnosis tool that tracks all memory accesses protected by a specific lock. Programmers can use the tool offline after some potential problems have been identified by the detection tool. With the collected information, we can easily differentiate between the first two cases as described in Table 1. It is relatively hard to identify the third case.

2.1.3 Over-synchronization

Over-synchronization indicates a situation where a synchronization becomes unnecessary because the computations do not require any protection or they are already protected by other synchronizations. This term is borrowed from existing work [23]. There are the following cases.

  1. A lock is unnecessary if a critical section only accesses local data, but not shared data.
  2. A lock is unnecessary if the protected computations are already atomic.
  3. A lock is unnecessary if another lock already protects the computations.

MySQL-5.1 is known to have such a problem [7, 23]: it utilizes the random() routine to determine the spin waiting time inside a mutex. Unfortunately, this routine has an internal lock that unnecessarily serializes every thread invoking random(). The problem has been fixed by using a different random number generator that does not have any internal lock for the fast mutex.

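
The MySQL fix can be illustrated in miniature: `rand()` may serialize concurrent callers on an internal lock, whereas `rand_r()` keeps its state in a caller-owned seed and takes no lock at all. This sketch is illustrative, not MySQL's actual code:

```c
// Pick a bounded spin delay from a caller-owned seed; rand_r() touches no
// shared state, so concurrent callers are never serialized on a hidden lock.
#include <stdlib.h>

unsigned spin_delay(unsigned int *seed, unsigned max_delay) {
    return (unsigned)rand_r(seed) % max_delay;
}
```

Each thread simply keeps its own `seed` variable, e.g. initialized from its thread id.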

Identification:

Over-synchronization problems can cause a significant slowdown when there are extensive lock acquisitions. This situation is similar to the first two categories of improper granularity issue. Therefore, our diagnosis tool (described in Section 3.2) may help analyze this situation. After a problem is identified, unnecessary locks can be removed to improve performance. However, removing locks may introduce correctness issue, and has to be done cautiously.

2.1.4 Asymmetric Contention

Asymmetric contention occurs when some locks have significantly more contention than others that protect similar data. This category is derived from “asymmetric lock” [10]. For instance, a hash table implementation may use bucket-wise locks. If the hash function fails to distribute the accesses uniformly, some buckets will be accessed more frequently than the others. Consequently, locks of those buckets will have more contention than the others. Coz [10] finds such a problem in dedup. Changing the hash function improves performance by around 12%.


Asymmetric contention arises because lock contention is unevenly distributed. But in what situations would a hash function be used to spread accesses across locks?


Identification: To identify this type of problems, SyncPerf collects the number of lock acquisitions, how many times each lock is found to be unavailable, and their callsites. If multiple locks are protecting similar data (typically identified by the same callsites of lock acquisitions and releases), SyncPerf checks the lock contention rate and the number of acquisitions of these locks. When an asymmetric contention rate is found (e.g., when the highest contention rate is 2 × or more than the lowest one), SyncPerf reports an asymmetric contention problem. Asymmetric contention problem is reported automatically without any manual effort. Programmers, then, can fix the problem by evenly distributing the contention. Unlike SyncPerf, Coz relies on manual inspection to identify this type of problems.

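
The asymmetry check described above can be sketched as a simple pass over per-lock statistics for a group of locks sharing the same callsites; the 2x threshold follows the text, and the struct and function names are illustrative:

```c
// Flag a lock group as asymmetric when the highest contention rate is at
// least 2x the lowest one among locks acquired from the same callsites.
typedef struct { unsigned long acquisitions, contended; } lock_stats_t;

static double contention_rate(const lock_stats_t *s) {
    return s->acquisitions ? (double)s->contended / s->acquisitions : 0.0;
}

int is_asymmetric(const lock_stats_t *locks, int n) {
    double lo = 1.0, hi = 0.0;
    for (int i = 0; i < n; i++) {
        double r = contention_rate(&locks[i]);
        if (r < lo) lo = r;
        if (r > hi) hi = r;
    }
    return hi > 0.0 && hi >= 2.0 * lo;
}
```
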

2.1.5 Load Imbalance

A thread can wait due to synchronizations such as mutex locks, condition variables, barriers, semaphores, etc. A parent thread can also wait when it tries to join with its child threads. If a group of threads (i.e., threads with the same thread function) is found to have a waiting period much longer than that of other groups of threads, this may indicate a performance issue caused by load imbalance [12, 25, 33, 40].

Identification: To identify load imbalance problems, SyncPerf collects the execution and waiting time of different threads by intercepting thread creations and synchronization functions. If the waiting time or computation time of different threads differs substantially (e.g., outside a certain range, say 20%), the program can be identified as having a load imbalance problem.

Finding an optimal task assignment:

SyncPerf can suggest an optimal task assignment for load imbalance problems after the identification. It calculates the computation time of every thread by subtracting all waiting time (on condition variables, mutex locks, and barriers) from its execution time. It then computes the total computation time of different groups of threads according to their thread functions, where threads executing the same function belong to the same group. In the end, SyncPerf suggests an optimal task distribution – each group of threads will be assigned an optimal number of threads that is proportional to the total workload of that type. Section 4.4.5 presents some examples showing how SyncPerf can suggest an optimal configuration for different types of threads to fix load imbalance problems.
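A hedged sketch of such a proportional assignment (the function name and the rounding policy are assumptions, not SyncPerf's actual code): each group receives a share of the thread budget proportional to its measured total computation time.

```c
#include <stddef.h>

/* Given the total measured computation time of each thread group
 * (execution time minus all waiting time), split a thread budget
 * proportionally across the groups. */
static void suggest_threads(const double *work, size_t ngroups,
                            int total_threads, int *out)
{
    double sum = 0.0;
    for (size_t i = 0; i < ngroups; i++)
        sum += work[i];

    int assigned = 0;
    size_t biggest = 0;
    for (size_t i = 0; i < ngroups; i++) {
        out[i] = (int)(total_threads * work[i] / sum + 0.5);
        if (out[i] < 1)
            out[i] = 1;                /* keep every group alive */
        assigned += out[i];
        if (work[i] > work[biggest])
            biggest = i;
    }
    out[biggest] += total_threads - assigned;  /* absorb rounding drift */
}
```

For example, two groups with measured workloads of 3000 and 1000 time units and a budget of 8 threads would be assigned 6 and 2 threads respectively.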

2.2 Workflow

The high-level workflow of SyncPerf is shown in Figure 2. For mutex locks, SyncPerf reports locks inside three quadrants (Q1, Q2, and Q4 of Figure 1), while skipping Q3 locks that do not cause performance issues. Additionally, it reports try-lock failure rates and whether there is a load imbalance problem among different types of threads. For the load imbalance problem, SyncPerf not only reports the root cause but also suggests an optimal configuration for different types of threads. This is done without any manual intervention. Programmers can use the suggested distribution to fix the load imbalance problem. If there is an asymmetric contention problem among similar locks, the tool automatically identifies the root cause. However, it is up to the programmer to develop a possible fix.

After getting the behavior of locks in the three quadrants, if the reported code segments are simple, programmers can easily inspect them manually to determine which category a problem belongs to and take corresponding actions. This can be as simple as consulting Table 1. For complex situations, our additional diagnosis tool can collect detailed information for the critical sections reported by our detection tool, in order to help programmers determine the particular type of performance issue. Again, Table 1 can be used as an informal guideline during the categorization process. After determining the type of performance bug, Table 1 can guide programmers in developing a fix for it. Some of the fixing strategies (e.g., fixing an over-synchronization problem) might require programmers to carefully consider correctness issues.

3. Implementation Details

SyncPerf provides two tools to assist programmers in identifying bugs and fixing them: a detection tool and a diagnosis tool. By combining these two tools, SyncPerf not only answers “what” and “where” questions, but also “why” and “how to fix” (partially) questions for most synchronization related performance bugs.

The detection tool uses a lightweight profiling scheme to detect synchronizations with potential performance issues. It can also diagnose the root causes for asymmetric contention, extensive try-lock failures, and load imbalance problems without any manual intervention. The detection tool achieves a lower performance overhead than existing tools (even without using the sampling mechanism) [41]. Details of this tool are presented in Section 3.1. The diagnosis tool is based on Pin [29], a binary instrumentation tool. The diagnosis tool monitors memory accesses inside specific critical sections to help identify root causes of problems with the same behavior. This heavyweight diagnosis tool is only employed when the detection tool reports some potential problems that cannot be diagnosed easily. It utilizes prior knowledge of the particular problems that are reported by the detection tool, and thus instruments memory accesses inside the relevant critical sections only. Its overhead is about 6× lower than the existing work that instruments all memory accesses [43].

3.1 Detection Tool

The challenge of SyncPerf is to collect data efficiently and analyze it effectively.

3.1.1 Collecting Data Efficiently

To collect the data, SyncPerf intercepts pthread's different types of explicit synchronization primitives, such as mutex locks, try-locks, condition variables, barriers, and thread creation and exit functions, where the actual implementation is borrowed from the pthread library. This is similar to existing work [41]. However, SyncPerf outperforms them with a lower overhead and better detection ability.


The following implementation details are important. A follow-up should compare how the source code actually operates.

RDTSC timing


SyncPerf intercepts pthread_create function calls and passes a custom function to the actual pthread_create function. This custom function calls the actual start_routine function and collects timestamps of thread start and exit using the RDTSC timer [22]. The timestamps are saved into a thread wrapper, as shown in Figure 3(b).
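A minimal sketch of this interception pattern, assuming x86-64 (the struct and function names are illustrative; the real implementation lives inside SyncPerf's pthread wrappers):

```c
#include <stdint.h>

/* Read the x86 time-stamp counter (x86-64 only). */
static inline uint64_t rdtsc_now(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* The wrapper handed to the real pthread_create; names are illustrative. */
struct thread_wrapper {
    void *(*start_routine)(void *);
    void *arg;
    uint64_t start_ts, exit_ts;
};

/* Shim that brackets the user's start routine with two timestamps. */
static void *timed_start(void *p)
{
    struct thread_wrapper *w = p;
    w->start_ts = rdtsc_now();
    void *ret = w->start_routine(w->arg);
    w->exit_ts = rdtsc_now();
    return ret;
}

/* trivial start routine, for the usage example below */
static void *noop_routine(void *arg) { return arg; }
```

An intercepted pthread_create would allocate a thread_wrapper, stash the user's routine and argument in it, and pass timed_start to the real pthread_create instead.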

SyncPerf utilizes the following mechanisms to achieve the extremely low overhead.

Indirection and per-thread data:

To collect data for mutex locks, a possible approach (used by existing work [41]) is to store the actual profiling data for each mutex lock in a global hash table.

Upon every mutex invocation, we can look up the hash table to find the pointer to the actual data, and then update it correspondingly. However, this approach introduces significant overhead due to the hash table lookup (and possible lock protection) on every synchronization operation, and the possible cache coherence messages to update the shared data (true/false sharing effect) [16, 21]. This is especially problematic when there is a significant number of acquisitions.

Instead, SyncPerf uses a level of indirection to avoid the lookup overhead, and a per-thread data structure to avoid the cache coherence traffic. The data structure is shown in Figure 3(a).

For every mutex, SyncPerf allocates a shadow_mutex_t object and uses the first word of the original mutex_t object as a pointer to this shadow object. The shadow mutex structure contains a real mutex_t object, an index for this mutex object, and some other data.

The index is initialized during the initialization of the mutex, or during the first lock acquisition if the mutex is not explicitly initialized. This index is used to find an entry in the global Mutex Data Table, where each thread has a thread-wise entry. When a thread operates on a mutex lock, say Li, SyncPerf obtains the shadow_mutex_t object by checking the first word of the original mutex_t object, and then finds its corresponding thread-wise entry using the index value.

After that, the lock-related data can be stored in its thread-wise entry without generating any cache coherence messages. Furthermore, SyncPerf prevents the false sharing effect by carefully keeping read-mostly data in the shadow_mutex_t object and padding it properly [4, 27], while the actual profiling data (which keeps changing) is stored in thread-wise entries. The thread-wise data is collected and integrated in the reporting phase (Section 3.1.2).
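The layout described above can be sketched as follows. Field names, table sizes, and the 64-byte cache-line padding are assumptions for illustration; the key point is that the mutable counters are kept per (lock, thread) and padded, so one thread's updates never invalidate another thread's cache line:

```c
#include <stdint.h>

#define MAX_LOCKS   256
#define MAX_THREADS 128
#define CACHELINE    64

/* Read-mostly shadow object; the application's mutex keeps a pointer
 * to it in its first word. */
struct shadow_mutex {
    /* a real pthread_mutex_t would be embedded here */
    uint32_t index;                          /* row in the Mutex Data Table */
    char pad[CACHELINE - sizeof(uint32_t)];  /* keep it on its own line     */
};

/* Mutable profiling counters: one padded entry per (lock, thread) pair. */
struct per_thread_counts {
    uint64_t acquires;
    uint64_t contended;
    char pad[CACHELINE - 2 * sizeof(uint64_t)];
};

static struct per_thread_counts mutex_data_table[MAX_LOCKS][MAX_THREADS];

/* Called on each acquisition: touches only this thread's own cache line. */
static void record_acquire(uint32_t index, int tid, int was_contended)
{
    struct per_thread_counts *c = &mutex_data_table[index][tid];
    c->acquires++;
    c->contended += (was_contended != 0);
}

/* At reporting time, the thread-wise entries are summed per lock. */
static uint64_t total_acquires(uint32_t index)
{
    uint64_t sum = 0;
    for (int t = 0; t < MAX_THREADS; t++)
        sum += mutex_data_table[index][t].acquires;
    return sum;
}
```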

Fast collection of callsites:

SyncPerf collects callsite information of every synchronization operation to provide exact source code location of performance bugs. It is crucial to minimize the overhead of collecting callsites, especially when there is a large number of synchronization operations.

SyncPerf makes three design choices to reduce the overhead.

First, SyncPerf avoids the use of the backtrace API of glibc, which is extremely slow due to its heavyweight instruction analysis. Instead of using backtrace, SyncPerf analyzes frame pointers to obtain call stacks efficiently. However, this imposes a limitation: SyncPerf cannot collect callsite information for programs compiled without frame pointers.

Second, SyncPerf collects call stacks up to the depth of 5. We limit the depth because deeper stacks introduce more overhead without any significant benefit.

Third, SyncPerf avoids collecting already-existing callsites. Obtaining the callsite of a synchronization and comparing it against all existing callsites one by one (to determine whether it is a new one) may incur substantial overhead. Instead, SyncPerf utilizes the combination of the lock address and the offset between the stack pointer (the rsp register) and the top of the current thread's stack to identify the call stack. When different threads invoke a synchronization operation at the same statement, the combination of lock address and stack offset is likely to be the same. If a combination matches that of one of the existing callsites, SyncPerf does not collect callsite information. This method significantly reduces the overhead of callsite collection and comparison.
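A hedged sketch of this cheap callsite key (the mixing function and the lookup table are illustrative; the paper only specifies that the lock address and the stack offset are combined):

```c
#include <stdint.h>

#define MAX_SITES 64

/* Combine the lock address with the stack-pointer offset from the
 * thread's stack top into one key. The shift-xor mix is illustrative. */
static uint64_t callsite_key(const void *lock, uintptr_t sp,
                             uintptr_t stack_top)
{
    uint64_t offset = (uint64_t)(stack_top - sp);
    return (uint64_t)(uintptr_t)lock ^ (offset << 1);
}

static uint64_t seen_keys[MAX_SITES];
static int nseen;

/* Returns 1 for a new key: only then is the (depth-5) call stack
 * actually walked and recorded. */
static int is_new_callsite(uint64_t key)
{
    for (int i = 0; i < nseen; i++)
        if (seen_keys[i] == key)
            return 0;
    if (nseen < MAX_SITES)
        seen_keys[nseen++] = key;
    return 1;
}
```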

Other mechanisms: To further reduce the runtime overhead, SyncPerf avoids any overhead due to memory allocation by preallocating the Mutex Data Table and a pool of shadow mutex objects. This is done during the program initialization phase. SyncPerf assumes a predefined but adjustable maximum number of threads and mutex objects for this purpose. Also, SyncPerf puts data collection code outside a critical section as much as possible to avoid expanding the critical section. This avoids unnecessary serialization of threads.

Because of these careful design choices, SyncPerf imposes very low runtime overhead (2.3%, on average). Even for an application such as fluidanimate that acquires 40K locks per millisecond, SyncPerf imposes only 19% runtime overhead. Due to its low overhead, SyncPerf’s detection tool can be used in production runs.

3.1.2 Analyzing and Reporting Problems

SyncPerf reports problems when a program is about to exit or when it receives a special signal such as SIGUSR2. SyncPerf performs two steps to generate a report.

First, it combines all thread-wise data of a particular synchronization together to check the number of lock acquisitions, lock contentions, and try-lock failures. It reports potential problems if any synchronization variable shows the behavior listed in Section 2.

Second, SyncPerf integrates information of different synchronization variables and threads together in order to discover more potential problems. (1) The behavior of locks with the same callsites is compared: if some locks have significantly more contention than others, then there is an asymmetric contention problem (Section 2.1.4). (2) Even if one particular lock is not acquired many times, the total number of acquisitions of locks with the same callsite can be significant and thus cause a severe performance issue. (3) SyncPerf integrates information of different threads together to identify load imbalance problems. When one type of thread (with the same thread function) has a "disproportionate waiting time", it is considered a strong indicator of a load imbalance issue (Section 2.1.5). The integration of information helps find more potential problems.

3.2 Diagnosis Tool

The same behavior (e.g., lock contention) may be caused by different root causes, such as asymmetric contention, improper granularity, or over-synchronization. Therefore, SyncPerf provides a heavyweight diagnosis tool to help identify the root causes of such problems. This heavyweight diagnosis tool is optional and not meant for production runs. Only when some potential problems are detected but are hard to diagnose manually does this diagnosis tool provide further information (e.g., memory accesses inside critical sections), including: how many instructions are executed on average inside each critical section; how many of these instructions access shared and non-shared locations; how many different memory locations are accessed inside a critical section; and how many instructions are read or write accesses.

SyncPerf's diagnosis tool is based on a binary instrumentation framework, Pin [29]. It takes a list of problematic locks (along with their callsites) as input, which is generated from the detection tool's report. When a lock function is encountered, it checks whether the lock is one of the problematic ones. If so, it counts the instructions and monitors the memory accesses inside the critical section. The tool also maintains a hash table to keep track of the memory locations accessed inside critical sections. The hash table helps find out how many data items have been accessed inside a critical section. This information helps identify situations where a lock protects too many data items, or where too many instructions access non-shared data inside a critical section. Like the detection tool, the diagnosis tool maintains thread-wise and lock-wise counters for each synchronization. It also integrates the information together in the end.
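The kind of summary statistic the diagnosis tool derives from these counters can be sketched as follows (field names are assumptions; the real tool is a Pin client that fills such counters during instrumentation):

```c
#include <stdint.h>

/* Per-critical-section counters in the spirit of the diagnosis report. */
struct cs_stats {
    uint64_t instructions;     /* instructions executed in the section   */
    uint64_t mem_accesses;     /* of which: memory reads and writes      */
    uint64_t shared_accesses;  /* of which: touch data seen by >1 thread */
};

/* A very low shared fraction suggests the lock mostly guards
 * thread-local work, i.e. a candidate over-synchronization problem. */
static double shared_fraction(const struct cs_stats *s)
{
    if (s->mem_accesses == 0)
        return 0.0;
    return (double)s->shared_accesses / (double)s->mem_accesses;
}
```

In the canneal example discussed later, this fraction comes out under 1%, pointing to a random generator needlessly living inside the critical section.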

4. Evaluation

This section will answer the following questions:

  • Usage Example: What are the outputs of SyncPerf’s tools? How we can utilize the report to identify root causes? (Section 4.2)
  • Bug Detection Ability: Can SyncPerf detect real performance bugs related to synchronizations? (Section 4.3 and 4.4)
  • Performance Overhead: What is the performance overhead of SyncPerf’s detection and diagnosis tools? (Section 4.5)
  • Memory Overhead: What is the memory overhead of the detection tool? (Section 4.6)

4.1 Experimental Setup

We performed experiments on a 16-core idle machine, with two-socket Intel(R) Xeon(R) CPU E5-2640 processors and 256GB of memory. It has 256KB L1, 2MB L2, and 20MB L3 cache. The experiments were performed on an unmodified Ubuntu 14.10 operating system. We used GCC-4.9.1 with the -O2, -g, and -fno-omit-frame-pointer flags to compile all applications. SyncPerf utilizes the following parameters for detection: a contention rate larger than 10% is considered to be high, and a lock acquisition rate larger than 1000 per second is considered to be high. These thresholds are empirically determined. The parameters can be easily adjusted during the compilation of the detection tool. Section 4.3 evaluates false positives when using these parameters.
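Under the thresholds stated above, the reporting predicates reduce to two comparisons (a trivial sketch; in the actual tool these thresholds are fixed at compile time):

```c
/* SyncPerf's default reporting thresholds, as stated in the paper:
 * a contention rate above 10% and more than 1000 acquisitions per
 * second are each considered "high". */
static int is_high_contention(double contention_rate)
{
    return contention_rate > 0.10;
}

static int is_high_frequency(double acquisitions_per_second)
{
    return acquisitions_per_second > 1000.0;
}
```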

Evaluated Applications: We used a well-tuned benchmark suite, PARSEC [3], with native inputs. PARSEC applications have complexity comparable to real applications (see Table 3). We also evaluated three widely used real-world applications: Apache, MySQL, and Memcached. We ran the Apache-2.4 server program with the ab client that is distributed with the source code. We tested MySQL-5.6.27 using the sysbench client and the mysql-test suite. For memcached, we evaluated two different versions – memcached-1.4.4 and memcached-2.4.24 – both exercised using the memslap benchmark. Data presented in the paper is for memcached-2.4.24 unless otherwise mentioned.

4.2 Usage Examples

SyncPerf provides two tools that help identify the root causes of problems. This section shows a usage example for the application canneal of PARSEC. Figure 4 shows an example report generated by SyncPerf's detection tool. For locks, it reports the results of the three quadrants, as shown in Figure 1. For each lock, SyncPerf reports source code information. For canneal, SyncPerf only reports one lock, with a high contention rate and low acquisition frequency, in the rng.h file. The corresponding code is shown in Figure 6. This case is not very easy to understand. Therefore, we can resort to SyncPerf's diagnosis tool.

The diagnosis tool takes the reported locks from a specified file in the same directory, mostly the call stacks of the corresponding locks, as input. An example of the report is shown in Figure 5. For the canneal application, SyncPerf's diagnosis tool reports that less than 1% of instructions access shared memory. Further consultation of the source code indicates that seed is the only shared access inside the critical sections. However, canneal currently puts the whole random generator inside the critical section, as described in Section 4.4.3. Moving the random generator out of the critical section improves the performance of this application by 4%.

4.3 Effectiveness

SyncPerf is effective in detecting synchronization-related performance bugs. The results are shown in Table 2. SyncPerf detected nine performance bugs in PARSEC and six in real-world applications. Among these 15 performance bugs, seven were previously undiscovered, including three in large real applications such as MySQL and Memcached. We have notified the developers of all of these new performance bugs. The MySQL-I bug no longer exists because the corresponding functions were removed in a later version (MySQL-5.7). The remaining bug reports are still under review.


False Positives:

We evaluated the false positives of SyncPerf using thresholds of 10% for contention rate and 1000 acquisitions per second for acquisition frequency. SyncPerf has no false positives for the 12 PARSEC programs (Table 2) or for Memcached. SyncPerf reports two potential performance problems in Apache. We fixed one of them, obtaining around an 8% performance improvement. The other has a high acquisition frequency (1252 per second) and a low contention rate (4.5%), and is related to one big mutex protecting a queue. Fixing it would require significant code changes, so we did not pursue it. For MySQL, SyncPerf reports three potential performance bugs. Two of them have been fixed, with performance improvements of 19% and 11%, respectively. The third is related to keycache's cache lock, which has a high acquisition frequency (1916 per second) and a low contention rate (0.0%). We tried replacing the mutex with a spin lock but did not achieve any performance improvement, so this could be a false positive. Thus, SyncPerf reports at most two potential false positives.

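The threshold-based screening described above can be sketched as a tiny classifier. Only the 10% and 1000-per-second thresholds come from the text; the function name and flag encoding are our own illustration, not SyncPerf's code:

```c
/* Classify a lock against the quadrants of Figure 1.
 * Returns a bitmask: bit 0 = high contention rate (>= 10%),
 *                    bit 1 = high acquisition frequency (>= 1000/sec). */
int classify(double contention_rate, double acq_per_sec) {
    int flags = 0;
    if (contention_rate >= 0.10) flags |= 1;  /* high contention */
    if (acq_per_sec >= 1000.0)   flags |= 2;  /* high acquisition frequency */
    return flags;
}
```

With the numbers reported in this section, the keycache lock (0.0%, 1916/sec) lands in the high-frequency/low-contention quadrant, while MySQL-II (38.5%, 146299/sec) lands in the high/high quadrant.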

False Negatives:

It is difficult to assess whether SyncPerf has any false negatives, since there is no oracle that provides a complete list of all performance bugs in the evaluated applications. One option is to experiment with known performance bugs. Our results indicate that SyncPerf detects all known performance bugs in the evaluated applications.


4.4 Case Studies

This section provides more details about the detected performance bugs.

4.4.1 Extensive Acquisitions and High Contention

Existing tools [8, 19, 23, 28, 41, 43] mainly focus on performance bugs with this external symptom. However, only 4 out of the 15 detected bugs have this symptom, and they belong to the three different categories described below.


Asymmetric Contention: dedup is a compression program with a data de-duplication algorithm. It has extensive lock acquisitions (23531 per second) and a high contention rate (13.6%) in an array of locks (encoder.c:1051). These locks protect different buckets of a hash table. SyncPerf detects an asymmetric contention problem: these locks (with the same callsite) have different numbers of acquisitions, ranging from 3 to 8586, and the one with the most acquisitions has a contention rate of 13.6%, while the others stay below 1%. Coz also detects this bug, but identifying the root cause with Coz requires expertise [10]; SyncPerf identifies it automatically. Changing the hash function to reduce hash collisions, using the fix proposed in the Coz paper, improves performance by 12.1%.

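The asymmetry among same-callsite locks can be illustrated with a simple skew check over their acquisition counts. The ratio test below is our own sketch for illustration, not SyncPerf's exact rule:

```c
/* Flag a same-callsite lock array as asymmetric when the acquisition
 * counts are heavily skewed (dedup's counts ranged from 3 to 8586). */
int is_asymmetric(const long counts[], int n, long ratio_threshold) {
    long min = counts[0], max = counts[0];
    for (int i = 1; i < n; i++) {
        if (counts[i] < min) min = counts[i];
        if (counts[i] > max) max = counts[i];
    }
    /* Skewed if the busiest lock sees ratio_threshold times more
     * acquisitions than the least-used one. */
    return min > 0 && max / min >= ratio_threshold;
}
```

A skewed distribution like dedup's trips the check, while evenly loaded bucket locks do not.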

Improper Granularity: Memcached-1.4.4 has a known performance bug caused by improper lock granularity. It uses a single cache lock to protect an entire hash table [13]. When we used memslap to generate 10000 get and set requests to exercise Memcached (with 16 threads), SyncPerf detected 71405 lock acquisitions per second and a high contention rate (45.8%). The diagnosis tool finds that this single lock protects over 9 million different shared locations; clearly, it is too coarse-grained. Changing the global cache lock to an array of item locks, as appears in Memcached-2.4.24, improves throughput by 16.3%. This bug is shown as memcached-II in Table 2.

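The fix, replacing the single cache lock with an array of item locks indexed by the bucket's hash (lock striping), can be sketched as follows. The lock count and names are illustrative, not Memcached's exact code:

```c
#include <pthread.h>

#define NLOCKS 64  /* power of two so we can mask instead of modulo */

static pthread_mutex_t item_locks[NLOCKS];

void item_locks_init(void) {
    for (int i = 0; i < NLOCKS; i++)
        pthread_mutex_init(&item_locks[i], NULL);
}

/* Map a bucket hash to its stripe; independent buckets rarely share a lock. */
unsigned stripe_for(unsigned hash) {
    return hash & (NLOCKS - 1);
}

void item_lock(unsigned hash)   { pthread_mutex_lock(&item_locks[stripe_for(hash)]); }
void item_unlock(unsigned hash) { pthread_mutex_unlock(&item_locks[stripe_for(hash)]); }
```

Threads operating on different buckets now contend only when their hashes map to the same stripe, instead of all serializing on one global lock.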

MySQL, a popular database server, has a similar problem (MySQL-II in Table 2) [2]. When the input table data does not use the server's default character set or latin1, MySQL calls the get_internal_charset() function. SyncPerf detects extensive lock acquisitions (146299 per second) and a high contention rate (38.5%). Furthermore, SyncPerf's diagnosis tool reports that a single mutex protects 512 different shared variables, 16384 bytes in total. Replacing the lock with an array of locks, one per charset [2], improves MySQL's throughput by 10.9%.

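The one-lock-per-charset fix [2] can be sketched like this; the structure and function names below are illustrative stand-ins, not MySQL's actual types:

```c
#include <pthread.h>

#define NUM_CHARSETS 512

/* Each charset slot carries its own lock, replacing the single mutex
 * that used to guard all 512 slots (16384 bytes) at once. */
struct charset_slot {
    pthread_mutex_t lock;
    int loaded;   /* stand-in for the 32 bytes of per-charset state */
};

static struct charset_slot charsets[NUM_CHARSETS];

void charsets_init(void) {
    for (int i = 0; i < NUM_CHARSETS; i++) {
        pthread_mutex_init(&charsets[i].lock, NULL);
        charsets[i].loaded = 0;
    }
}

/* get_internal_charset-style lookup: lock only the slot being used,
 * so lookups of different charsets no longer contend. */
int load_charset(int id) {
    pthread_mutex_lock(&charsets[id].lock);
    if (!charsets[id].loaded)
        charsets[id].loaded = 1;   /* lazy initialization of this slot */
    int ok = charsets[id].loaded;
    pthread_mutex_unlock(&charsets[id].lock);
    return ok;
}
```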

SyncPerf reports a new performance bug (MySQL-I) in MySQL's end_thr_alarm function: extensive lock acquisitions (723K per second) and a high contention rate (25.5%) for the LOCK_alarm mutex. The critical section contains unnecessary conditional waits, possibly a leftover of code evolution: programmers may have restructured the code logic but forgotten to remove these waits. Removing the conditional wait improves MySQL's performance by 18.9%. We reported this problem to the MySQL developers, who replied that the corresponding code has been removed in MySQL-5.7.


4.4.2 Extensive Acquisitions but Low Contention

These locks fall into Q4 of Figure 1 and are practically ignored by existing tools. As shown in Table 2, 5 of the 15 performance bugs fall into this category. All of them are new performance bugs detected by SyncPerf.


Improper Primitives: facesim is a PARSEC application that simulates the motion of human faces. SyncPerf detects that one type of lock (those with the same callsite) has 15288 acquisitions per second but a very low contention rate (4.6%). We replaced the mutex locks and condition variables with atomic instructions, which improved performance by 31%. A code snippet of the fix is shown in Figure 7.

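As a hedged illustration of this kind of fix (not facesim's actual Figure 7 code), a shared counter that was guarded by a mutex and condition variable can instead be maintained with a single lock-free fetch-add:

```c
#include <stdatomic.h>

static atomic_int tasks_done = 0;

/* Before (conceptually): lock mutex, ++counter, signal condvar, unlock.
 * After: one atomic read-modify-write, no lock and no wakeup traffic. */
int finish_one_task(void) {
    return atomic_fetch_add(&tasks_done, 1) + 1;  /* returns the new count */
}

int all_done(int total) {
    return atomic_load(&tasks_done) == total;
}
```

When contention is low, as SyncPerf measured here, the atomic version avoids the fixed overhead the lock and condition variable pay on every acquisition.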

5. Limitations and Future Work

SyncPerf has some limitations:

First, SyncPerf cannot identify performance bugs due to ad hoc synchronizations [42], atomic instructions, or transactional memory [20]. Currently, it focuses only on performance problems related to explicit synchronization primitives. More specifically, the current implementation supports only the POSIX synchronization APIs and has been verified only on Linux. However, the same idea can easily be applied to other threading libraries.

Second, SyncPerf cannot check contention on internal locks inside the glibc library. This could be fixed by embedding the implementation inside glibc.

Finally, when a program's binary has no frame pointers, SyncPerf may need to use backtrace to acquire callsite information, or the program may require recompilation. The first method incurs more overhead in the detection tool.
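The backtrace fallback mentioned above can be sketched with glibc's backtrace(), which walks the stack without relying on frame pointers, at a higher cost than reading the frame pointer directly. This is a minimal illustration, not SyncPerf's implementation:

```c
#include <execinfo.h>

#define MAX_DEPTH 8

/* Fill frames[] with return addresses of the current call chain;
 * frames[1] and beyond identify the callsite of the caller, e.g. the
 * point where a lock acquisition was intercepted. */
int capture_callsite(void *frames[], int max_depth) {
    return backtrace(frames, max_depth);  /* returns number of addresses */
}
```

backtrace_symbols() can later translate these addresses to symbol names when reporting, which is typically done offline to keep the detection path cheap.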

In the future, we would like to extend our tools to overcome some of these limitations. In addition, we would like to include a graphical interface that provides a visual representation of the results.



Origin blog.csdn.net/Mercury_Lc/article/details/127466756