Computer Organization in Plain Language: Understanding Disruptor (Part 1) - Experiencing the Speed of the CPU Cache (Lecture 54)

I. Introduction

Persistence pays off in the end: we have at last arrived at the final topic of this column. Let me show you just how fast a CPU can be. Over these last two lectures, I will walk you through an open-source project called Disruptor, and we will see how it uses the hardware features of the CPU and its caches to build a system that pursues the absolute limit of performance.

You may remember the story from Lecture 37 about laying dedicated fiber-optic cable to save 4 milliseconds. In fact, the companies most obsessed with extreme performance are not Internet companies but high-frequency trading firms. The Disruptor we will look at today was open-sourced by LMAX, a firm that specializes in high-frequency trading.

Interestingly, Disruptor is written not in C/C++, the language most people would assume is needed to reach the limits of performance, but in Java, whose performance is constrained by the JVM. What does this mean? It means, as this lecture will show, that as long as you are thoroughly versed in the hardware-level principles, even a high-level language like Java can push CPU performance to its limit.

II. Padding the Cache Line: Experiencing the Power of the CPU Cache

Let's first look at a magical-looking piece of code inside Disruptor. In the RingBufferPad class, Disruptor defines seven variables of type long, named p1 through p7.

abstract class RingBufferPad
{
    protected long p1, p2, p3, p4, p5, p6, p7;
}

1. Cache line padding

My first reaction on seeing this code was that the variable naming is not standardized: names like p1 through p7 carry no obvious meaning. But once I understood Disruptor's design and source code, I realized the names are entirely appropriate, because these variables have no actual meaning. They exist purely for cache line padding (Padding Cache Line), so that we can use the CPU cache (CPU Cache) as fully as possible. So what kind of black magic is cache line padding? Let's read on.

You may remember the table from Lecture 35: accessing the CPU's built-in L1 or L2 cache takes only about 1/15th, or even 1/100th, of the latency of a memory access. Memory is far slower than the CPU. To chase extreme performance, we need to read data from the CPU cache as much as possible, rather than from memory.

2. When loading data from memory into the CPU cache, the CPU loads a fixed-length cache line, not a single variable

When the CPU cache loads data from memory, it does not load one field at a time; it loads an entire cache line at once. Suppose we define an array of type long with 64 elements. When that data is loaded from memory into the CPU cache, the elements are not brought in one by one; instead, a fixed-length cache line is loaded in a single operation.

On a modern 64-bit Intel CPU, a cache line is usually 64 bytes. A long takes 8 bytes, so a single load brings in eight longs at once; that is, one load pulls in eight consecutive array values. This makes iterating over array elements very fast: after the first load, the next seven accesses hit the cache and never touch memory. I demonstrated the scale of this benefit with the first example in Lecture 37; if it did not leave a deep impression, you can go back and take another look.
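To see this effect, here is a minimal sketch (the class name and sizes are my own, and it is not a rigorous benchmark): it sums the same array twice, once sequentially and once with a 16-element stride. Both passes touch every element exactly once, but the sequential pass reads 8 consecutive longs out of each 64-byte cache line, while the strided pass jumps two cache lines between reads.

```java
public class CacheLineTraversal {
    // 4M longs = 32 MB, far larger than any CPU cache level
    static final int N = 1 << 22;

    static long sequentialSum(long[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i]; // 8 longs per 64-byte line
        return sum;
    }

    // Visits every element exactly once, but jumps 16 longs (128 bytes,
    // i.e. two cache lines) between accesses, so most reads miss the cache.
    static long stridedSum(long[] a, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++)
            for (int i = start; i < a.length; i += stride)
                sum += a[i];
        return sum;
    }

    public static void main(String[] args) {
        long[] a = new long[N];
        for (int i = 0; i < N; i++) a[i] = i;
        long t0 = System.nanoTime();
        long s1 = sequentialSum(a);
        long t1 = System.nanoTime();
        long s2 = stridedSum(a, 16);
        long t2 = System.nanoTime();
        System.out.println("sequential: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("strided:    " + (t2 - t1) / 1_000_000 + " ms");
        if (s1 != s2) throw new AssertionError("sums differ");
    }
}
```

On most machines the sequential pass is clearly faster, though the hardware prefetcher and the JIT can narrow the gap; the assertion only checks that both passes compute the same sum.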

3. Individual variables defined in a class do not easily enjoy the CPU cache bonus

However, when we use individual variables rather than an array, a problem appears. In the Disruptor RingBuffer (ring buffer) code, a single variable of type long is defined. This variable, INITIAL_CURSOR_VALUE, stores the starting position of elements in the RingBuffer.

When the CPU loads this data, it naturally brings it from memory into the cache. At that point, however, the cache line holds not just this one value but also the variables defined immediately before and after it. And here the problem arises: Disruptor is a multithreaded server-side framework, and those neighboring variables may be updated and read by several different threads.

These writes and reads may come from different CPU cores. So, to keep the data consistent across cores, the cache line has to be written back to memory, or reloaded from memory, over and over again.

And as we just said, these cache write-backs and loads do not operate on a single variable; they operate on a whole cache line at a time. So when the variables before and after INITIAL_CURSOR_VALUE are written back to memory, this field is written back along with them, and the cached copy of the constant is invalidated. The next time we want to read the value, we have to fetch it from memory again, and that read is much slower.

......


abstract class RingBufferPad
{
    protected long p1, p2, p3, p4, p5, p6, p7;
}

abstract class RingBufferFields<E> extends RingBufferPad
{
    ......
}

public final class RingBuffer<E> extends RingBufferFields<E> implements Cursored, EventSequencer<E>, EventSink<E>
{
    public static final long INITIAL_CURSOR_VALUE = Sequence.INITIAL_VALUE;
    protected long p1, p2, p3, p4, p5, p6, p7;
    ......
}

Faced with this situation, the Disruptor authors came up with a magical coding trick: cache line padding. Disruptor defines seven variables of type long on each side of INITIAL_CURSOR_VALUE. The first seven are inherited from the RingBufferPad parent class; the last seven are defined directly inside the RingBuffer class itself. These 14 variables serve no actual purpose: we neither read them nor write them.

INITIAL_CURSOR_VALUE itself is a constant and is never modified, and with seven longs (56 bytes) of padding on either side, no variable that gets written can share its cache line. So once it has been loaded into the CPU cache, as long as it is read frequently it will not be evicted. That means reads of this value always run at CPU cache speed, never at memory speed.
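The effect this padding guards against is known as false sharing. Here is a minimal sketch of it (my own illustrative code, not Disruptor's; note also that the JVM does not strictly guarantee in-memory field order, so the padding here is indicative rather than contractual): two threads each increment their own counter, first with the counters in adjacent fields, then with seven unused longs between them, in the style of RingBufferPad.

```java
public class FalseSharingDemo {
    static final long ITERATIONS = 10_000_000L;

    // The two counters sit next to each other and likely share a cache line.
    static class Shared { volatile long a, b; }

    // Seven unused longs (56 bytes) keep the counters on separate lines.
    static class Padded {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7; // padding, never read or written
        volatile long b;
    }

    static long raceMillis(Runnable left, Runnable right) throws InterruptedException {
        Thread t1 = new Thread(left), t2 = new Thread(right);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        Shared s = new Shared();
        Padded p = new Padded();
        long sharedMs = raceMillis(
                () -> { for (long i = 0; i < ITERATIONS; i++) s.a++; },
                () -> { for (long i = 0; i < ITERATIONS; i++) s.b++; });
        long paddedMs = raceMillis(
                () -> { for (long i = 0; i < ITERATIONS; i++) p.a++; },
                () -> { for (long i = 0; i < ITERATIONS; i++) p.b++; });
        System.out.println("adjacent fields: " + sharedMs + " ms");
        System.out.println("padded fields:   " + paddedMs + " ms");
        if (s.a != ITERATIONS || s.b != ITERATIONS) throw new AssertionError();
        if (p.a != ITERATIONS || p.b != ITERATIONS) throw new AssertionError();
    }
}
```

On most multi-core machines the padded run is noticeably faster, because the two counters no longer ping-pong the same cache line between cores; the exact numbers depend on the CPU and JVM.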

III. Using a RingBuffer to Exploit the Cache and Branch Prediction

In fact, this mindset of squeezing performance out of the CPU cache runs through the whole of Disruptor. The entire Disruptor framework is, in essence, a very fast producer-consumer (Producer-Consumer) queue. Producers keep pushing new tasks to be processed into the queue, and consumers keep taking tasks out of the queue and handling them.

1. To implement a queue, a linked list seems the most natural data structure

If you are familiar with algorithms and data structures, you know very well that the most natural data structure for implementing a queue is a linked list. As long as we maintain the head and tail of the list, a queue is easy to build: the producer simply keeps appending new nodes at the tail, and the consumer simply keeps removing the oldest node from the head. With that, we have a working producer-consumer model.

In fact, the Java standard library already ships with such a queue, LinkedBlockingQueue, which can be used directly for a producer-consumer model.
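As a baseline, here is a minimal producer-consumer sketch over LinkedBlockingQueue (the class and method names are my own): one thread enqueues the numbers 0 through 99, another dequeues them and sums them up.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class LinkedQueueDemo {
    // Produce the numbers 0..count-1 on one thread, consume and sum them
    // on another, coordinated entirely by the blocking queue.
    static long runPipeline(int count) throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        Thread producer = new Thread(() -> {
            for (int i = 0; i < count; i++) {
                try { queue.put(i); } catch (InterruptedException e) { return; }
            }
        });
        long[] sum = {0}; // one-element array so the lambda can write to it
        Thread consumer = new Thread(() -> {
            for (int i = 0; i < count; i++) {
                try { sum[0] += queue.take(); } catch (InterruptedException e) { return; }
            }
        });
        producer.start(); consumer.start();
        producer.join(); consumer.join();
        return sum[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("sum = " + runPipeline(100)); // 0 + 1 + ... + 99 = 4950
    }
}
```

The blocking put and take calls handle all of the coordination; this convenience is exactly what Disruptor re-implements on top of a RingBuffer in pursuit of far higher throughput.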

2. Disruptor uses a RingBuffer rather than a LinkedBlockingQueue

However, Disruptor does not use a LinkedBlockingQueue. It uses a data structure called a RingBuffer, whose underlying implementation is a fixed-length array. Compared with a linked-list implementation, the array's data enjoys spatial locality in memory.

As we saw above, multiple consecutive elements of an array are loaded into the CPU cache together, so traversing them is faster. The data in the nodes of a linked list, by contrast, will generally not sit in adjacent memory locations, and so cannot enjoy the benefit of having a whole cache line of consecutive data served from the cache after a single load.
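To make the contrast concrete, here is a minimal single-threaded ring-buffer sketch over a fixed-length array (my own illustrative code; the real RingBuffer adds sequences, memory barriers, and multi-producer coordination). The capacity is a power of two, so a bit-mask replaces the modulo when the indices wrap around.

```java
public class SimpleRingBuffer {
    private final long[] entries; // fixed-length backing array
    private final int mask;       // capacity - 1, valid because capacity is 2^n
    private long head = 0;        // next slot to read
    private long tail = 0;        // next slot to write

    public SimpleRingBuffer(int capacityPowerOfTwo) {
        if (Integer.bitCount(capacityPowerOfTwo) != 1)
            throw new IllegalArgumentException("capacity must be a power of two");
        entries = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Returns false when the buffer is full instead of growing.
    public boolean offer(long value) {
        if (tail - head == entries.length) return false;
        entries[(int) (tail & mask)] = value; // mask replaces tail % capacity
        tail++;
        return true;
    }

    // Returns null when the buffer is empty.
    public Long poll() {
        if (head == tail) return null;
        long value = entries[(int) (head & mask)];
        head++;
        return value;
    }
}
```

Because the slots live in one contiguous array that is never reallocated, a consumer draining the buffer walks consecutive memory, touching each 64-byte cache line only once per eight long entries.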

In addition, traversing data sequentially has another big advantage: the CPU's branch prediction becomes very accurate. If you no longer remember how this works, you can go back and review Lecture 25 on branch prediction.
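The classic way to feel branch prediction at work is to run the same branchy loop over sorted and over unsorted data. The sketch below (my own, not from Disruptor) counts the elements that are at least 128: on the sorted array the branch outcome forms two long runs and the predictor is almost always right, while on random data it mispredicts roughly half the time.

```java
import java.util.Arrays;
import java.util.Random;

public class BranchPredictionDemo {
    // Count elements >= threshold, repeated 100 times so the timing is visible.
    static long countAtLeast(int[] data, int threshold) {
        long count = 0;
        for (int pass = 0; pass < 100; pass++)
            for (int v : data)
                if (v >= threshold) count++; // the branch in question
        return count;
    }

    public static void main(String[] args) {
        int[] unsorted = new Random(42).ints(500_000, 0, 256).toArray();
        int[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        long t0 = System.nanoTime();
        long c1 = countAtLeast(unsorted, 128);
        long t1 = System.nanoTime();
        long c2 = countAtLeast(sorted, 128);
        long t2 = System.nanoTime();

        System.out.println("unsorted: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("sorted:   " + (t2 - t1) / 1_000_000 + " ms");
        if (c1 != c2) throw new AssertionError("counts differ");
    }
}
```

On many machines the sorted pass is several times faster, though a JIT that compiles the branch to a conditional move can erase the difference; the assertion only checks that both counts agree.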

IV. Summary and Extension

Well, having walked through all of this, can you feel the magic of the Disruptor framework?

When the CPU loads data from memory into the CPU cache, it loads a fixed-length cache line, not a single variable. If the data being loaded is an array, the CPU loads several consecutive elements in one go, so looping over an array easily collects the speed dividend the CPU cache offers.

A single variable defined inside a class, however, does not easily enjoy this CPU cache bonus. Although such fields are laid out together, at the memory level they often have little to do with one another, so when multiple CPU cores access them, the data ends up shuttling back and forth between memory and the CPU caches. Disruptor's very clever trick is to define seven long variables, which receive no reads or writes whatsoever, on each side of the constant INITIAL_CURSOR_VALUE that needs frequent high-speed access.

This way, no matter where the object sits in memory, the cache line holding INITIAL_CURSOR_VALUE is never updated by any write request. We can always read its value from within that cache line, without fetching from memory, which greatly accelerates Disruptor's performance.

This line of thinking permeates every corner of the Disruptor open-source framework. As a producer-consumer model, Disruptor implements its queue not with a linked list but with a RingBuffer, whose underlying data structure is a fixed-length array. Such an array not only makes it easier to exploit the CPU cache, it is also very friendly to the branch prediction the CPU performs while executing the code. More accurate branch prediction makes better use of the CPU pipeline and makes the code run faster.

Origin www.cnblogs.com/luoahong/p/11518304.html