Lockless CAS

1. Spin lock

1.1 Spin lock definition

A spin lock is a lock introduced specifically to handle concurrency on multi-processor systems. It is widely used in interrupt handling and other parts of the kernel. (On a single processor, concurrency with interrupt handlers can be prevented simply by disabling interrupts, that is, by clearing/setting the interrupt flag bit in the flags register, so no spin lock is required.)

A spin lock is a locking mechanism for protecting shared resources. Spin locks are similar to mutex locks in that both solve the problem of mutually exclusive use of a resource: whether mutex or spin lock, there can be at most one holder at any time. In other words, at most one execution unit can hold the lock at any moment.


1.2 The basic form of spin lock

#include <pthread.h>

static pthread_spinlock_t spinlock;
pthread_spin_init(&spinlock, PTHREAD_PROCESS_PRIVATE);

pthread_spin_lock(&spinlock);
// critical section
pthread_spin_unlock(&spinlock);

1.3 Spinlock characteristics

1. Spin locks have the following characteristics:
1) When a thread tries to acquire the lock and the lock is held by another thread, the current thread waits in a loop until it acquires the lock.
2) While spinning, the thread's state does not change: it stays in user mode and remains active.
3) If a holder keeps a spin lock for too long, it prevents other threads from running and being scheduled, and the threads waiting for the lock burn CPU time uselessly.
4) A spin lock by itself guarantees neither fairness nor reentrancy.

2. Advantages
A spin lock does not switch the thread's state; the thread stays in user mode and remains active. It never puts the thread into the blocked state, which avoids unnecessary context switches and makes execution fast.

A non-spinning lock puts the thread into the blocked state when the lock cannot be acquired, which means entering the kernel; when the lock later becomes available, the thread must be woken from the kernel, requiring a thread context switch. (Once a thread blocks, it enters the kernel's (Linux) scheduler. The resulting back-and-forth switching between user mode and kernel mode seriously hurts lock performance.)


1.4 The difference between spin lock and mutex lock

With a mutex, if the resource is already occupied, the applicant can only go to sleep.
A spin lock never puts the caller to sleep: if the lock is already held by another execution unit, the caller loops on the spot, repeatedly checking whether the holder has released the lock. This is where the term "spin" comes from.

Spin lock: if the lock cannot be acquired promptly, the waiter busy-waits, occupying the CPU.
Suitable task characteristics:

  • No blocking
  • Short task time

The following two statements, taken together, cannot be handled with atomic variables alone (the pair touches two memory locations, so it is not a single atomic operation):

count++;
sum+=count;

In summary, there are the following differences:

  • Mutex: if the lock cannot be acquired, sleep and give up the CPU. Trying to acquire a lock is cheap; waiting for a lock is what costs time.
    Spin lock: if the lock cannot be acquired, keep testing it.
  • Mutex: suitable when the lock is held for a long time; it puts the caller to sleep, so it can only be used in process context.
    Spin lock: suitable when the lock is held briefly; it never puts the caller to sleep, so it can be used in any context.
  • If the protected shared resource must be accessed in interrupt context (including the top half, i.e. the hardware interrupt handler, and the bottom half, i.e. the softirq), a spin lock must be used.
  • A semaphore or read-write semaphore can be preempted while held.
    Preemption is disabled while a spin lock is held.
  • Spin locks are only really needed on preemptible kernels or SMP (multi-processor) systems. On a single-CPU, non-preemptible kernel, all spin lock operations are no-ops with essentially no effect. In the kernel, spin locks mainly prevent concurrent access to critical sections on multiprocessors and prevent races caused by kernel preemption.
  • A task must not sleep while holding a spin lock: sleeping with a spin lock held can deadlock (sleeping may cause the kernel task holding the lock to be rescheduled and then try to acquire the lock it already holds).

A mutex is generally used for access to shared resources that may be held for a while.
A spin lock is better suited to cases where the lock is held only briefly. Precisely because spin lock users generally hold the lock for a very short time, spinning instead of sleeping pays off: the spin lock is more efficient than a mutex and can be used in any context.

For example:
MySQL has only one connection and three threads request it: choose a mutex.
With a spin lock, while the connection serves thread A, the requests of B and C would spin for a long time.
If each request takes 30 ms, B runs only after spinning for 30 ms and C after 60 ms, which is very CPU intensive.


1.5 Lock selection

When should you use a mutex, an atomic variable, or a spinlock?
mutex: the critical section runs for a long time.
atomic: simple numeric increment/decrement operations.
spinlock: few statements to execute, no blocking.



2. CAS

2.1 CAS definition

Compare-and-swap (CAS) is an atomic operation. In multi-threaded programming it provides an uninterruptible read-compare-write, avoiding the data inconsistency that arises when several threads rewrite the same data and the execution order or interruption points are unpredictable. The operation compares the value in memory with an expected value and, only if they are equal, replaces the value in memory with a new value.

bool CAS(int *pAddr, int nExpected, int nNew)
atomically {
    if (*pAddr == nExpected) {
        *pAddr = nNew;
        return true;
    } else {
        return false;
    }
}
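The pseudocode above can be expressed in C11 with `atomic_compare_exchange_strong`, which performs the compare-and-swap as one indivisible operation. The wrapper name `cas` is ours, chosen to mirror the pseudocode's signature.

```c
#include <stdatomic.h>
#include <stdbool.h>

bool cas(atomic_int *pAddr, int nExpected, int nNew)
{
    /* On failure, the C11 API writes the current memory value back into
     * nExpected; we discard it to keep the pseudocode's signature. */
    return atomic_compare_exchange_strong(pAddr, &nExpected, nNew);
}
```

For example, starting from a value of 10, `cas(&v, 10, 11)` succeeds and stores 11, while a second `cas(&v, 10, 12)` fails because the memory value no longer matches the expected 10.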

2.2 CAS algorithm understanding

CAS is a lock-free algorithm. CAS has 3 operands, the memory value V, the old expected value A, and the new value B to be modified. If and only if the old expected value A is the same as the old memory value V, modify the memory value V to B, otherwise do nothing.

The CAS compare-and-retry loop can be expressed in pseudocode as:

do {
    back up the old data;
    construct the new data from the old data;
} while (!CAS(memory address, backed-up old data, new data))
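The retry loop above, applied to a lock-free increment: back up the old value, compute the new one, and retry the CAS until no other thread has changed the variable in between. The function name is illustrative.

```c
#include <stdatomic.h>

int lockfree_increment(atomic_int *addr)
{
    int old_value, new_value;
    do {
        old_value = atomic_load(addr);   /* back up the old data */
        new_value = old_value + 1;       /* build new data from the old */
    } while (!atomic_compare_exchange_weak(addr, &old_value, new_value));
    return new_value;
}
```

The `weak` variant may fail spuriously on some architectures, which is harmless here because the loop simply retries with a freshly loaded value.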

Process example:
Scenario: thread 1 and thread 2 both update a variable whose value is 10.
Analysis: because thread 1 and thread 2 access the same variable at the same time, each copies its value from main memory into its own working memory, so both threads hold an expected value of 10. Suppose thread 1 wins the race: it gets to update the variable, and thread 2 fails (the failing thread is not suspended; it is simply told that it lost the race and may try again). Thread 1 updates the value to 11 and writes it back to memory. For thread 2 the memory value is now 11, which no longer matches its expected value of 10, so its operation fails.

That is, at comparison time: if the two values are equal, the shared data has not been modified, so it is replaced with the new value and execution continues; if they are not equal, the shared data has been modified, so the work done so far is discarded and the operation is redone from the start.

It is easy to see that CAS rests on the assumption that the shared data will usually not be modified concurrently, using a commit-retry model similar to a database's optimistic locking. When synchronization conflicts are rare, this assumption brings a large performance gain.


2.3 CAS overhead

CAS is a CPU-instruction-level operation, a single-step atomic operation, so it is very fast.
Moreover, CAS avoids asking the operating system to arbitrate the lock; everything is done inside the CPU without involving the OS.

Does CAS therefore have no overhead? No! There is the cache miss.

To understand the cost of CAS, we first need to look at the CPU's hardware architecture:
Consider an 8-core CPU system: each CPU core has its own cache (the CPU-internal cache and registers), and each die contains an interconnect module that lets the two cores on the die communicate with each other.

The system interconnect module in the middle lets the four dies communicate with one another and connects them to main memory. Data moves through the system in units of "cache lines", which correspond to power-of-two-sized blocks of memory, usually between 32 and 256 bytes in size.

When a CPU reads a variable from memory into a register, it must first read the cache line containing the variable into its cache. Similarly, when a CPU stores a value from a register to memory, it must not only read the cache line containing that location into its cache, but also ensure that no other CPU holds a copy of that cache line.

Example:
If CPU0 performs a compare-and-swap (CAS) on a variable whose cache line resides in CPU7's cache, the following simplified sequence of events occurs:
1) CPU0 checks its local cache and does not find the cache line.
2) The request is forwarded to the interconnect module of CPU0 and CPU1, which checks CPU1's local cache and does not find the cache line.
3) The request is forwarded to the system interconnect module, which checks the other three dies and learns that the cache line is held by the die containing CPU6 and CPU7.
4) The request is forwarded to the interconnect module of CPU6 and CPU7, which checks both CPUs' caches and finds the cache line in CPU7's cache.
5) CPU7 sends the cache line to its interconnect module and flushes the line from its own cache.
6) The interconnect module of CPU6 and CPU7 sends the cache line to the system interconnect module.
7) The system interconnect module sends the cache line to the interconnect module of CPU0 and CPU1.
8) The interconnect module of CPU0 and CPU1 sends the cache line to CPU0's cache.
9) CPU0 can now perform the CAS operation on the variable in its cache.

Analysis:
1) In the best case, a CAS operation takes about 40 nanoseconds, more than 60 clock cycles. "Best case" here means the CPU performing the CAS is the CPU that last operated on the variable, so the corresponding cache line is already in that CPU's cache.
2) In the best case, a lock operation (one "round trip pair": acquiring the lock and subsequently releasing it) takes more than 60 nanoseconds, over 100 clock cycles. "Best case" here means the data structure representing the lock is already in the cache of the CPU acquiring and releasing it.
3) A lock operation is thus more expensive than a CAS operation.
4) A lock operation requires two atomic operations on the lock's data structure. A cache miss costs about 140 nanoseconds, more than 200 clock cycles. A CAS that must fetch the variable's old value while storing the new one costs about 300 nanoseconds, more than 500 clock cycles. Consider: in the time of one such CAS, the CPU could execute 500 ordinary instructions. This shows the limitations of fine-grained locking.


2.4 CAS attributes

  • CAS is in effect a kind of lock, but at a much finer granularity than ordinary locks.
  • Open-source projects that use CAS: ZeroMQ, Disruptor.
  • Under heavy multi-threaded contention, CAS performance is not very high.

2.5 ABA problem

1. Problem description:
Assume two threads T1 and T2 access the same variable V. T1 reads V and sees value A; T1 is then preempted and T2 runs. T2 first changes V from A to B, then changes V from B back to A. When T1 regains the CPU and resumes, it finds the value of V still equal to A, concludes that nothing has changed, and continues. In this process V went from A to B and then from B back to A, which is vividly called the ABA problem.

The description above does not seem to cause any problem: whether V holds the initial A or the A left after the A→B→A sequence, T1's comparison yields the same result.

2. Solution
The idea is to introduce version-number control, similar to optimistic locking: compare not only the expected value with the value at the memory location, but also check that the version number matches.

3. Implementation case
Starting with JDK 5, the atomic package provides the AtomicStampedReference class to solve the ABA problem. Compared with plain CAS it introduces a stamp: after comparing the expected value with the value at the memory location, it also compares the expected stamp with the current stamp, and performs the update only if both match.
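The stamped-reference idea can also be sketched in C (to stay in one language for this article's examples): pack a 32-bit value and a 32-bit version counter into one 64-bit word and CAS the pair, so an A→B→A change still bumps the version and the stale CAS fails. All names here are illustrative, not a standard API.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef _Atomic uint64_t stamped_t;

/* pack a value and its version stamp into a single CAS-able word */
static uint64_t pack(uint32_t value, uint32_t stamp)
{
    return ((uint64_t)stamp << 32) | value;
}

bool stamped_cas(stamped_t *s, uint32_t expect_value, uint32_t expect_stamp,
                 uint32_t new_value)
{
    uint64_t expected = pack(expect_value, expect_stamp);
    /* the new word carries the new value AND an incremented stamp */
    uint64_t desired = pack(new_value, expect_stamp + 1);
    return atomic_compare_exchange_strong(s, &expected, desired);
}
```

A thread that read the value A at stamp 0 will fail its CAS after another thread has done A→B→A, because the stamp has advanced to 2 even though the value is A again.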

Origin blog.csdn.net/locahuang/article/details/111030632