Deep thinking about JMM and memory barriers


Memory barriers and their application within the JVM

Memory barrier related issues

After a StoreLoad barrier, once the data has been written back to memory, it may have to be reloaded into the cache line. Is this the reason StoreLoad is so time-consuming?

The StoreLoad memory barrier is the most expensive memory barrier. Its main job is to prevent "Store-Load" reordering in a multiprocessor environment and thereby preserve consistency.

The StoreLoad barrier guarantees that all writes before the barrier complete before any read after the barrier executes; it does this by draining the write buffer and stalling subsequent reads until those writes finish.

This can stall the processor: it must wait for all earlier writes to complete, which may involve flushing data from the write buffer back to main memory and possibly reloading that data from main memory afterwards. The latency comes mainly from the fact that memory accesses are far slower than cache accesses, plus the extra communication between the processor and main memory, both of which lengthen the processor's wait.

So part of the cost of the StoreLoad barrier does come from the reload of data into the cache line that may be required. More critically, though, it must wait for all earlier writes to complete, including flushing the write buffer to main memory, and the processor stalls until those operations finish.
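The classic store-buffering litmus test below illustrates why Store-Load ordering matters. This is a hedged sketch, not code from the original article: with the volatile markings the JVM must emit a StoreLoad barrier between each thread's write and read, so the outcome (r1, r2) = (0, 0) is forbidden; remove volatile and that outcome becomes legal (and observable under a stress harness such as jcstress, though a single run will rarely show it).

public class StoreLoadLitmus {
    static volatile int x = 0, y = 0;  // remove 'volatile' and (0,0) becomes possible
    static int r1, r2;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { x = 1; r1 = y; });  // Store x, then Load y
        Thread t2 = new Thread(() -> { y = 1; r2 = x; });  // Store y, then Load x
        t1.start(); t2.start();
        t1.join(); t2.join();
        // With volatile, at least one thread must observe the other's store.
        System.out.println("r1=" + r1 + ", r2=" + r2);     // never 0,0 with volatile
    }
}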

Doesn't StoreStore also need to flush the data in the cache line to the main memory?

The StoreStore memory barrier does require that all store (write) operations before it complete before any store after it. In some cases that may indeed mean flushing data from cache lines to main memory. However, it differs from the StoreLoad barrier in both how it works and what it costs.

The main goal of a StoreStore barrier is to prevent store operations from being reordered, thereby preserving the order of writes. For example, if two writes W1 and W2 precede the barrier, then no write after the StoreStore barrier can be reordered before W1 or W2.

However, a StoreStore barrier usually does not have to flush all data to main memory at once. It may only require that all earlier stores complete before any later store executes, which means the data may still sit in the write buffer rather than being flushed to main memory immediately.

So while a StoreStore barrier may involve flushing data from a cache line to main memory, that is neither always necessary nor its primary goal. This is why StoreStore barriers generally have less performance impact than StoreLoad barriers.
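A typical use of StoreStore ordering is safe publication: write the payload, then set a flag, and rely on the barrier to keep the two stores in order. A minimal sketch, assuming a volatile flag supplies the barrier (class and field names here are illustrative, not from the article):

class StoreStoreExample {
    int data = 0;                 // plain field: the payload
    volatile boolean ready = false;

    void writer() {
        data = 42;                // plain store
        // a StoreStore barrier is emitted before the volatile store below,
        // so 'data' is visible no later than 'ready'
        ready = true;             // volatile store publishes the payload
    }

    void reader() {
        if (ready) {              // volatile load (pairs with the writer side)
            assert data == 42;    // guaranteed once ready is observed as true
        }
    }
}

(Run with assertions enabled, -ea, for the assert to have any effect.)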

What about LoadStore?

A LoadStore memory barrier ensures that all load (read) operations before the barrier complete before any store (write) operation after it. Its main purpose is to prevent reads from being reordered with later writes.

A LoadStore barrier does not need to flush data from the cache to main memory; it only ensures that all reads finish before subsequent writes execute. In practice this means the read has completed and its data is already loaded into the cache before the write proceeds.

Therefore a LoadStore memory barrier generally has less performance impact than StoreStore or StoreLoad barriers, since it does not need to wait for data to be flushed from cache to main memory.
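A small reader-side sketch (hypothetical names, shown only to make the rule concrete): the LoadStore barrier emitted after a volatile load keeps a later plain store from being hoisted above it.

class LoadStoreExample {
    volatile boolean ready;
    int result;

    void consume() {
        if (ready) {      // volatile load; LoadLoad + LoadStore barriers follow it
            result = 1;   // this plain store cannot be moved above the load
        }
    }
}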

Note that specific implementation details and performance impacts may vary depending on factors such as the architecture and memory model of the processor and the specific implementation of the JVM.

In MESI terms, does the Store side of StoreLoad amount to invalidating the copies in other CPUs' caches and then writing the current cache line's data back to memory, while the Load side means the other CPUs fetch the newly written data from main memory?

Your understanding is basically correct. In the MESI (Modified, Exclusive, Shared, Invalid) protocol, when a processor core performs a write (Store), it first invalidates the corresponding cache lines in the other cores (setting them to Invalid) and then modifies its local cache line. Once that cache line's data is written back to main memory, the line's state changes from Modified to Exclusive, indicating that it is now consistent with main memory.

On a subsequent read (Load), if other processor cores need this data, they load the freshly written value from main memory into their own cache lines and set those lines' state to Shared.

StoreLoad barriers ensure that writes from one core are visible to reads from other cores by forcing the write to complete (including any necessary synchronization of data to main memory) before the read proceeds. This may involve multiple memory operations: performing the write, invalidating other cores' cache lines, synchronizing data to main memory, and loading data from main memory, which is why StoreLoad is the most expensive memory barrier.

Note that this process is handled automatically by the hardware; the programmer usually does not (and cannot) control it directly. And thanks to optimizations in modern processors and operating systems, the actual process may be both more complex and more efficient than described above.

Relationship between MESI and memory barriers? The JMM memory barrier is an abstraction that maintains the visibility and ordering of variables, while MESI is one concrete mechanism underlying it, applied in specific scenarios.

A memory barrier in the Java Memory Model (JMM) is an abstraction mechanism for ensuring the visibility and ordering of memory operations (read/write) in a multithreaded environment. The MESI (Modified, Exclusive, Shared, Invalid) protocol is a specific mechanism for achieving cache coherence in a multi-core processor system.

The MESI protocol operates in a multiprocessor (or multi-core) system, where each processor (or core) has its own cache. When multiple processors operate on the same memory at the same time, the value in one processor's cache may have been modified while the values in the other processors' caches are still stale; this is the cache coherence problem, and the MESI protocol was proposed to solve it.

In the MESI protocol, each cache line is in one of four states: Modified, Exclusive, Shared, or Invalid. Through transitions among these four states, the MESI protocol ensures that every processor sees the latest data.

The JMM does not care how the lower layer is implemented; it only needs to know that some mechanism (i.e., memory barriers) guarantees the visibility and ordering of memory operations. To realize that mechanism, the lower layer may use a protocol like MESI. So you can think of the memory barrier as a "bridge" between the upper layer (the JMM) and the lower layer (MESI).

In other words, the MESI protocol is one possible concrete means of realizing the JMM's memory barriers, but it is not the only one; other hardware mechanisms or protocols may serve, depending on the specific hardware and operating system.

DCL implementation of the singleton pattern and volatile memory barriers

Why is a DCL implementation without the volatile keyword at risk of reordering?

A code example for DCL (Double-Checked Locking), deliberately without volatile, is as follows:

public class Singleton {
    private static Singleton instance;   // note: not volatile

    public static Singleton getInstance() {
        if (instance == null) {                  // first check
            synchronized (Singleton.class) {
                if (instance == null) {          // second check
                    instance = new Singleton();
                }
            }
        }
        return instance;
    }

    private Singleton() {
        // initialize
    }
}

The JVM bytecode for this code is roughly as follows:

 0: getstatic     #2   // Field instance:LSingleton;
 3: ifnonnull     37   // if instance is not null, jump to the return
 6: ldc           #3   // class reference to Singleton (the lock)
 8: dup                // duplicate the class reference
 9: astore_1           // store the lock reference in local variable 1
10: monitorenter       // enter the synchronized block
11: getstatic     #2   // Field instance:LSingleton;
14: ifnonnull     27   // second check: if not null, skip the creation
17: new           #3   // allocate memory for a new Singleton
20: dup                // duplicate the object reference
21: invokespecial #4   // Method "<init>":()V -- run the constructor
24: putstatic     #2   // publish the reference to the instance field
27: aload_1            // load the lock reference
28: monitorexit        // exit the synchronized block
29: goto          37   // jump to the return
32: astore_2           // exception handler: store the exception
33: aload_1            // load the lock reference
34: monitorexit        // exit the synchronized block on the exception path
35: aload_2            // reload the exception
36: athrow             // re-throw it
37: getstatic     #2   // Field instance:LSingleton;
40: areturn            // return the instance

In this bytecode, the new instruction allocates a new Singleton object, invokespecial runs its constructor, and putstatic assigns the reference of the newly created object to the instance field. The important problem here is a feature of the Java memory model: without proper synchronization, one thread's writes to an object (here, the initialization of the Singleton) may not be visible to other threads, or other threads may see a partially initialized object.

In the bytecode above, the main problem occurs between the invokespecial and putstatic instructions: without volatile, the JVM (or the CPU) may reorder the constructor call with the publishing store. Specifically, it may execute putstatic first, assigning a Singleton object that has not been fully initialized to the instance field, and only then run the constructor to initialize the object. In a multithreaded environment, if another thread executes getInstance() at that moment, it passes the first null check and gets back a Singleton that has not been fully initialized.
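Conceptually, instance = new Singleton() decomposes into (1) allocate memory, (2) run the constructor, (3) publish the reference; the reordering above swaps (2) and (3). The sketch below (hypothetical class names, not from the article) shows the hazard in its simplest form. Note the race is hard to reproduce on strongly ordered hardware such as x86, so treat it as an illustration rather than a reliable demo.

public class UnsafePublication {
    static class Config {
        int value;
        Config() { value = 42; }  // step (2): constructor initializes the field
    }

    static Config shared;         // NOT volatile: publication may be reordered

    public static void main(String[] args) {
        // Writer: conceptually (1) allocate, (2) construct, (3) publish.
        new Thread(() -> shared = new Config()).start();

        // Reader: if (3) is reordered before (2), it can observe shared != null
        // while shared.value is still the default 0.
        new Thread(() -> {
            Config c = shared;
            if (c != null && c.value != 42) {
                System.out.println("Saw a partially constructed object!");
            }
        }).start();
    }
}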

What kind of barriers are used when volatile is applied in a mechanism like DCL? (ByteDance interview question: the effect of where the LoadLoad/LoadStore barriers are placed, which in fact corresponds to how memory barriers are used when reading a volatile variable)

In Java, the Double-Checked Locking (DCL) pattern is often used to implement lazy initialization of singletons. In this pattern, the volatile keyword is used to guarantee the safe publication and visibility of the initialized object. This works because the Java memory model surrounds reads and writes of volatile variables with special memory barriers:

  • For a write to a volatile variable, a StoreStore barrier is inserted before the write, preventing earlier writes from being reordered after the volatile write, and a StoreLoad barrier is inserted after the write, preventing the volatile write from being reordered with subsequent reads.

  • For a read of a volatile variable, a LoadLoad barrier is inserted after the read, preventing subsequent reads from being reordered before the volatile read, and a LoadStore barrier is also inserted after it, preventing subsequent writes from being reordered before the volatile read.

In the DCL pattern, the volatile keyword ensures that by the time the singleton reference becomes visible, all fields of the object have been correctly initialized, and that this initialization is visible to all threads. This is achieved mainly through the StoreStore barrier before the volatile write (with the StoreLoad barrier after it).

At the same time, the volatile keyword ensures that when a thread checks whether the singleton has been initialized and finds that it has, all operations after that check (a volatile read) see the object in its correct, fully constructed state. This is achieved mainly through the LoadLoad and LoadStore barriers after the read. The corrected singleton looks like the sketch below.
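Here is the standard volatile-based DCL for reference; compared with the earlier listing, the only change is the volatile modifier on instance, which forbids reordering the constructor with the publishing store:

public class Singleton {
    private static volatile Singleton instance;  // volatile forbids the reordering

    public static Singleton getInstance() {
        if (instance == null) {                  // first check (volatile read)
            synchronized (Singleton.class) {
                if (instance == null) {          // second check
                    instance = new Singleton();  // volatile write: safe publication
                }
            }
        }
        return instance;
    }

    private Singleton() {
        // initialize
    }
}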

Why are these memory barriers placed where they are around volatile reads and writes? (important)

The placement of the memory barriers follows directly from the semantics the volatile keyword must provide in the Java memory model.

For a write to a volatile variable, the goal is that once the write happens, other threads see the latest value. The StoreStore barrier before the write ensures that all earlier ordinary writes are committed first, and the StoreLoad barrier after the write ensures that the volatile write itself completes before any subsequent read, so the write cannot be reordered with later operations and other threads can see the latest value.

For a read of a volatile variable, the barriers are inserted after the read, because the goal is to read the latest value and then keep later operations from floating above that read. The LoadLoad and LoadStore barriers prevent any read or write that follows the volatile read from being reordered before it. (Why prevent later writes from moving before the read, when we only want to read the latest value? Because the volatile Load is what actually fetches fresh data from main memory into the cache; an operation allowed to move above it could still be working with stale data in the cache.)

In general, the insertion points of the barriers around the volatile keyword exist to enforce its semantics: when one thread writes a volatile variable, other threads can immediately see the effect of that write, which realizes the visibility of volatile variables. At the same time, the volatile keyword prevents the compiler from reordering code around the volatile accesses, so the instruction execution order matches the programmer's expectations, which realizes the ordering of volatile variables.
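The annotated sketch below summarizes the placement, following the conservative JSR-133 cookbook strategy; the comments mark where a JIT would conceptually emit each barrier (on x86, only the StoreLoad barrier costs a real fence instruction):

class BarrierPlacement {
    int a;              // plain field
    volatile int v;     // volatile field

    void write() {
        a = 1;          // plain store
        // -- StoreStore barrier -- (a = 1 cannot sink below v = 2)
        v = 2;          // volatile store
        // -- StoreLoad barrier --  (v = 2 completes before any later load)
    }

    void read() {
        int r = v;      // volatile load
        // -- LoadLoad barrier --   (later loads cannot rise above r = v)
        // -- LoadStore barrier --  (later stores cannot rise above r = v)
        a = r;          // plain store, ordered after the volatile load
    }
}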

Can operations between volatile accesses still be reordered?

The reads and writes of volatile variables themselves are never reordered with one another, but that does not mean nothing can be reordered. Ordinary (non-volatile) operations that sit between two volatile accesses may still be reordered with each other, as long as they do not cross the barriers and have no data dependency between them. See the sketch below.
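A small sketch (illustrative names): the two plain stores between the volatile accesses carry no dependency on each other, so they may execute in either order; what the barriers forbid is movement across the volatile read and write.

class BetweenVolatiles {
    volatile int v1, v2;
    int a, b;

    void run() {
        int r = v1;   // volatile read
        a = 1;        // these two plain stores have no dependency between them,
        b = 2;        // so they may execute as b = 2; a = 1
        v2 = r;       // volatile write: both plain stores complete before it
    }
}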

Why does volatile not support atomic operations?

Why can't volatile guarantee atomicity?

If variable i is modified by volatile, how many memory barriers are inserted when multiple threads call i++?

The volatile keyword guarantees the visibility and ordering of a variable, but not atomicity. When multiple threads perform i++ on a volatile variable, the operation consists of three steps (read the value, increment it, write it back), and these three steps are not atomic: other threads can interleave between them.

Regarding memory barriers, the Java memory model specifies that for a write to a volatile variable, the JVM inserts a StoreStore barrier before the write and a StoreLoad barrier after it; these two barriers guarantee the ordering and visibility of the write. For a read of a volatile variable, the JVM inserts a LoadLoad barrier and a LoadStore barrier after the read; these two barriers guarantee the ordering and visibility of the read.

For an i++ on a volatile variable in a multithreaded environment:

  1. First, a read is performed to fetch the current value; a LoadLoad barrier and a LoadStore barrier are inserted after this read.

  2. Then the value is incremented. This step inserts no memory barrier, because the increment itself does not operate on the volatile variable directly.

  3. Finally, the incremented value is written back to the volatile variable; a StoreStore barrier is inserted before this write and a StoreLoad barrier after it.

So an i++ on a volatile variable may involve four memory barriers. Even so, this does not solve the concurrency problem: i++ is not an atomic operation, and data inconsistency can still occur in a multithreaded environment. To solve it, consider using AtomicInteger or the synchronized keyword to make the increment atomic, as the sketch below shows.
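A runnable sketch of both the problem and the fix (class and field names are mine, not the article's): two threads each increment a volatile int and an AtomicInteger 100,000 times; the volatile counter typically loses updates, while the atomic one never does.

import java.util.concurrent.atomic.AtomicInteger;

public class CounterDemo {
    static volatile int volatileCounter = 0;                 // NOT atomic under ++
    static final AtomicInteger atomicCounter = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                volatileCounter++;                 // read-modify-write: updates can be lost
                atomicCounter.incrementAndGet();   // CAS loop: atomic
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("volatile: " + volatileCounter);      // usually < 200000
        System.out.println("atomic:   " + atomicCounter.get());  // always 200000
    }
}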

Can you give an example, in terms of data movement among the CPU, cache, and main memory, of why i++ on a volatile variable cannot guarantee atomicity?

Sure. Suppose two threads A and B both try to perform volatileVariable++. One possible execution:

  1. Thread A reads volatileVariable from main memory into its own working memory; say the value is 1.

  2. At the same time, thread B also reads volatileVariable from main memory into its own working memory, also getting the value 1.

  3. Thread A performs the +1 operation in its working memory; its copy of volatileVariable is now 2.

  4. Thread A writes the new value 2 of volatileVariable back to main memory.

  5. Thread B also performs the +1 operation in its working memory; its copy of volatileVariable is now 2 (but it should be 3).

  6. Thread B writes the new value 2 of volatileVariable back to main memory.

volatile only guarantees that each read observes the most recently written value at the moment of the read; it cannot prevent the interleaving above. Because thread A had not yet written its new value back when thread B performed its read, one increment is lost, and this is exactly the non-atomicity of the i++ operation.

So even though volatileVariable is modified by volatile, we still hit a non-atomicity problem in a concurrent environment: in theory volatileVariable should be 3, but in reality it is 2. volatile can guarantee the visibility and ordering of a variable, but it cannot guarantee the atomicity of a compound operation such as volatileVariable++.

For cases like this, we usually use an atomic class such as AtomicInteger, or the synchronized keyword, to guarantee the atomicity of the operation.

Note that "thread working memory" here is an abstract concept; it does not correspond directly to registers or caches on the hardware. It simply denotes the memory area a thread uses, which may cover CPU caches and registers, along with other hardware and compiler optimizations.

When we say that a thread reads a variable from main memory into its own working memory, we mean that the thread obtains the latest value of the variable and keeps that value somewhere it can access. How that area is realized in hardware is transparent to a Java program: the Java memory model is concerned with the ordering and visibility of read and write operations, not with how they are implemented on the hardware.

In other words, even though a volatile write is flushed to memory immediately, a thread may still read the old value from memory before the flush happens (for example, while the register is still performing the increment, another thread has already fetched the value)? For instance, a thread reads the old value just before the CPU modifies the variable.

Yes, your understanding is correct. The volatile keyword ensures that after a thread writes to a volatile variable, the change is flushed to main memory immediately. However, if another thread reads the variable before that flush completes, it still sees the old value.

This is why the compound read-modify-write of i++ is not thread-safe in a multithreaded environment even when the variable is volatile: between the read and the write-back of i++, another thread may read the old value and act on it, producing an incorrect result. To avoid this, some other synchronization mechanism (such as a lock or an atomic variable) must be used to make the compound operation thread-safe.

Why can thread B start its operation before thread A has had time to write the new value back?

Because modern machines have multiple cores and CPUs, each with its own registers and cache lines, the threads execute genuinely in parallel, so nothing prevents thread B's read from landing between thread A's read and thread A's write-back.


Origin blog.csdn.net/yxg520s/article/details/131897150