Difficult and Miscellaneous Diseases: Why Does the System Avalanche?

This Tuesday I took part in the "2021 Creation Support Program" presentation held by CSDN, and I was thoroughly infected by the passion and perseverance of Vice President Yu Bangxu. To be honest, it has been a long time since an IT veteran like me was moved like this. Besides contributing commentary articles to the CSDN public account, I now also have the chance to share some purely technical experience with readers.

I plan to write a few more articles in the "Difficult and Miscellaneous Diseases" series. When analyzing the cause of a failure, you need to quickly determine the direction in which to look for the problem, and that requires "metacognition" of the underlying mechanisms. Take the "2021 Creation Support Program" as an example: during the live broadcast a netizen asked why the CSDN website always crashes. Mr. Yu mentioned that a website deployed on the Spring Boot stack rarely crashes, and when it does, it is basically a misuse of Redis. That judgement itself comes from understanding the underlying mechanisms behind intractable problems. A detailed summary of Redis-related problem handling will come in a later article of this series; this article first answers how to build the intuition needed to quickly judge the direction of problem solving.

Types of avalanches

The avalanche effect is a situation often encountered in production environments. It generally refers to a system that runs well under normal conditions, with plenty of spare resources, yet collapses instantly once it is hit by sudden traffic or other abnormal conditions that exceed expectations. Avalanches mainly fall into two categories, both related to critical values:

Transaction concurrency: once the number of concurrent transactions exceeds a certain critical value, the system's processing efficiency drops exponentially. For example, under normal circumstances the system can handle 100 concurrent transactions per second, but once the transaction volume reaches a critical value of 200, its processing capacity may plummet to fewer than 10 transactions per second.

Proportion of failed transactions: once the proportion of failed transactions exceeds a certain threshold, the system's processing capacity likewise drops exponentially.

There are many reasons why a system's processing capacity can plummet, but in essence the cause is usually the penalty mechanism of the CPU pipeline together with the CPU cache hit mechanism. Let's start with the processor's pipeline model.

Why do systems avalanche even when resources are redundant?

We know that every action of an electronic computer is driven by crystal oscillation, so the CPU's oscillation frequency, also known as its clock frequency, is a direct reflection of its processing performance. However, as Moore's law gradually comes to an end, simply raising the clock frequency to obtain higher performance has become impractical, so chip manufacturers have turned to improving instruction-processing efficiency instead.

Take the ADD (addition) instruction as an example. Completing it requires several steps: fetching the instruction, decoding it, fetching the operands, executing, and fetching (writing back) the result, and each step needs one crystal oscillation to advance. So before pipelining appeared, executing one instruction took at least 5 to 6 oscillation cycles, as shown in the figure below:

To address this, chip designers borrowed the idea of the factory assembly line. Since the fetch and decode units are actually independent, they can work in parallel, so the corresponding steps of several instructions can be executed at the same time: instruction 1 is being fetched while instruction 2 is being decoded and instruction 3 is fetching its operands, and so on. This significantly increases the CPU's execution throughput. Taking a 5-stage pipeline as an example, the principle is shown in the figure below:

As can be seen from the figure above, T5 is the fifth oscillation cycle, at which point the instruction pipeline has been fully established. From then on, every oscillation cycle T produces the result of one instruction, which means that on average an instruction takes only one oscillation cycle to complete.
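To make the gain concrete, here is a rough back-of-the-envelope calculation under the idealized assumptions above (5 stages, one stage per oscillation cycle, no stalls): executing N instructions takes 5 + (N - 1) cycles with the pipeline, versus 5 * N cycles without it. For N = 100 that is 104 cycles instead of 500, close to the theoretical 5x speedup.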

 

Of course, establishing the instruction pipeline has a prerequisite: the CPU must be able to know the execution order of instructions in advance, which is what instruction (branch) prediction technology is for. Once the prediction of the next instruction to execute turns out to be wrong, the CPU must flush the entire current pipeline and restart fetching from the correct instruction, and at that point it degrades back to needing five oscillation cycles per instruction. This is one of the culprits behind avalanches.

According to Intel's statistics, the CPU encounters a possible jump on average every 7 instructions it executes. In other words, roughly every 7 lines of code there is an if judgment, and every judgment may make the program jump. Once a jump causes the instruction prediction to be wrong, the CPU's efficiency drops severalfold.
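To make this penalty visible, here is a minimal sketch in C, assuming a large array of random values and a threshold of 128 (the sizes and the use of clock() for timing are my own choices, not from the original article). Summing only the elements above the threshold is usually noticeably faster once the data has been sorted, because the if judgment then becomes easy for the predictor to guess, even though both passes do exactly the same amount of arithmetic:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000

/* sum the elements above a threshold; the if is the branch the predictor must guess */
static long sum_above(const int *a, int n, int threshold)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] > threshold)
            sum += a[i];
    }
    return sum;
}

static int cmp_int(const void *x, const void *y)
{
    return *(const int *)x - *(const int *)y;
}

int main(void)
{
    int *a = malloc(N * sizeof(int));
    if (!a) return 1;
    for (int i = 0; i < N; i++)
        a[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = sum_above(a, N, 128);        /* random data: branch outcome is unpredictable */
    clock_t t1 = clock();

    qsort(a, N, sizeof(int), cmp_int);     /* sorted data: branch outcome becomes predictable */
    clock_t t2 = clock();
    long s2 = sum_above(a, N, 128);
    clock_t t3 = clock();

    printf("unsorted: %ld (%.3f s)\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("sorted:   %ld (%.3f s)\n", s2, (double)(t3 - t2) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}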

 

Therefore, CPUs generally provide branch-prediction hints at the assembly-language level, and high-level languages also provide modifiers that feed the CPU's instruction prediction. In the Linux kernel code we often see the unlikely modifier: it has no effect on the business logic itself, but it tells the compiler and CPU that this judgment result is very unlikely to occur, so that the corresponding code is kept off the hot path and out of the pipeline, ensuring maximum efficiency during normal operation.
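A minimal sketch of what these hints look like in C follows. The likely/unlikely macros mirror how the Linux kernel defines them on top of GCC's __builtin_expect; the request structure and the handle_request/process/reject functions are made up purely for illustration:

#include <stddef.h>

/* hint macros, essentially as defined in the Linux kernel */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* a made-up request type, purely for illustration */
struct request { int valid; int payload; };

static int process(struct request *req) { return req->payload; }
static int reject(struct request *req)  { (void)req; return -1; }

int handle_request(struct request *req)
{
    if (unlikely(req == NULL))   /* almost never true in normal operation: cold path */
        return -1;

    if (likely(req->valid))      /* the common case the predictor should assume */
        return process(req);     /* hot path stays compact and well predicted */

    return reject(req);
}

int main(void)
{
    struct request r = { 1, 42 };
    return handle_request(&r) == 42 ? 0 : 1;
}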

 

Generally speaking, during an avalanche it is precisely these unlikely branches that get executed on a large scale. Since an unlikely branch is almost never taken under normal circumstances, adding the modifier is very helpful to execution efficiency in the normal case; but once such a rarely-taken branch suddenly becomes hot on a large scale, the penalty effect is very obvious. This is why, once the error rate exceeds a threshold, the impact on efficiency multiplies.

CPU caching can also cause trouble

Programmers often have the misconception that memory access is very fast, but this is not the case. Taking Intel CPUs as an example, there are generally three levels of cache: the first-level cache runs at close to register speed and needs only a few instruction cycles per access; the second-level cache is roughly 6 to 8 times slower than the first level; the third-level cache is roughly 8 times slower than the second level; and main memory is roughly 10 times slower again than the third-level cache. In fact, memory is only fast relative to disk; compared with the CPU caches it is nowhere near fast enough.
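A rough way to feel this gap yourself is to time random reads over a buffer that fits in the first-level cache versus a buffer far larger than all the caches (the 16 KB and 256 MB sizes and the iteration count below are my own assumptions; exact numbers vary by machine):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* time many pseudo-random reads over a buffer of n ints */
static double random_reads(const int *buf, size_t n, long iters)
{
    unsigned int x = 12345;       /* cheap pseudo-random index generator */
    volatile long sink = 0;       /* keep the reads from being optimized away */
    clock_t t0 = clock();
    for (long i = 0; i < iters; i++) {
        x = x * 1103515245u + 12345u;
        sink += buf[x % n];
    }
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    size_t small = 16 * 1024 / sizeof(int);          /* ~16 KB: fits in the L1 cache */
    size_t big   = 256UL * 1024 * 1024 / sizeof(int);/* ~256 MB: far larger than L3 */
    int *a = calloc(small, sizeof(int));
    int *b = calloc(big, sizeof(int));
    if (!a || !b) return 1;

    long iters = 20 * 1000 * 1000;
    printf("small buffer: %.3f s\n", random_reads(a, small, iters));
    printf("large buffer: %.3f s\n", random_reads(b, big, iters));
    free(a); free(b);
    return 0;
}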

To clarify this problem, here is a brief review of the three mechanisms by which memory is mapped into the CPU cache: direct mapping, fully associative mapping, and set-associative mapping.

Direct mapping: the simplest scheme is direct mapping. A CPU using this strategy generally splits a memory address into area number + block number + offset within the block for addressing. The principle is as follows:

Under this strategy, main memory is divided into many areas, each the same size as the cache, so there are far more memory blocks than cache blocks; for example, main memory may contain 1024 areas while the cache only holds the blocks of a single area. Assuming each area contains 2^k blocks, then no matter which area of main memory a block sits in, block 0 of that area can only be placed into cache block 0, block 1 only into cache block 1, and so on; equivalently, memory block m can only be mapped to cache block m mod 2^k and nowhere else. This causes problems: the cache may clearly have plenty of free space, yet if the cache slot for block 0 is already occupied, no other memory unit that also falls on block 0 of its area can be brought into the cache. For this reason the strategy is rarely used in practice.
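A minimal sketch of the address arithmetic under direct mapping, assuming a 64-byte block and a 512-block cache (both numbers are illustrative, not taken from any particular CPU):

#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE   64          /* bytes per cache block (assumed) */
#define CACHE_BLOCKS 512         /* number of blocks in the cache (assumed) */

int main(void)
{
    uint64_t addr   = 0x12345678;              /* an arbitrary memory address */
    uint64_t block  = addr / BLOCK_SIZE;        /* memory block number */
    uint64_t slot   = block % CACHE_BLOCKS;     /* the ONE cache block it may occupy */
    uint64_t offset = addr % BLOCK_SIZE;        /* offset within the block */

    printf("memory block %llu -> cache slot %llu, offset %llu\n",
           (unsigned long long)block, (unsigned long long)slot,
           (unsigned long long)offset);
    return 0;
}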

Fully associative mapping: this should be the most efficient of all the caching strategies, because it imposes no placement restriction at all; any memory block can be freely stored in any cache block, as shown in the figure below:

This greatly improves the efficiency of the CPU cache, but as the dense wiring in the figure suggests, the drawback is that the cost is too high: every memory unit must be able to be placed into, and compared against, every cache block, which makes the circuit design extremely difficult.

Set-associative mapping: in this mode the cache is divided into 2^k sets, and memory is divided into u * 2^k groups (a positive multiple of the number of cache sets). Direct mapping is used between a memory group and a cache set, while full associativity is used among the blocks within a set. In other words, memory group m can only be mapped to cache set j = m mod 2^k, but a block within memory group m can be placed into any block of cache set j, as shown below:

Set-associative mapping is in fact a compromise between cost and efficiency, and it is the mapping method most commonly used by mainstream CPUs such as Intel and ARM.
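The corresponding arithmetic for set-associative mapping is sketched below, again with assumed parameters (64-byte blocks, 128 sets, 8 ways); direct mapping is simply the special case of one block per set:

#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE 64            /* bytes per cache block (assumed) */
#define NUM_SETS   128           /* 2^k sets (assumed) */
#define WAYS       8             /* blocks per set: the block may land in any of these */

int main(void)
{
    uint64_t addr  = 0x12345678;
    uint64_t block = addr / BLOCK_SIZE;    /* memory block number */
    uint64_t set   = block % NUM_SETS;     /* which set it must go to (the direct part) */

    /* within that set, the block may occupy any of the WAYS slots (the associative part) */
    printf("address 0x%llx -> set %llu (any of %d ways)\n",
           (unsigned long long)addr, (unsigned long long)set, WAYS);
    return 0;
}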

Derivative problems caused by the cache mapping strategy: since set-associative mapping is the most widely used strategy, and the blocks within a set are fully associative, the cache generally behaves greedily: if there is still free space in the set involved in a scheduling operation, the CPU will also map neighbouring blocks of that memory group into the cache, because doing so costs almost no extra time. In other words, data that is adjacent in memory is very likely to be mapped into the cache together.

This leads to an interesting corollary: when you traverse a two-dimensional array with a[i][j], it is far more efficient than traversing it with a[j][i]:

for (i = 0; i < len1; i++)
{
    for (j = 0; j < len2; j++)
    {
        printf("%d ", a[i][j]);
    }
}

That is, the execution efficiency of the code above is much higher than that of the code below:

for (i = 0; i < len1; i++)
{
    for (j = 0; j < len2; j++)
    {
        printf("%d ", a[j][i]);
    }
}

The reason is simple: the rows of a two-dimensional array are laid out contiguously in memory, so when the code reads a[i][j], the neighbouring elements a[i][j+1] and a[i][j+2] have most likely already been mapped into the cache by the CPU. In contrast, a[j][i] and a[j+1][i] are not contiguous in memory, so when reading a[j][i], the next element a[j+1][i] is unlikely to be served from the cache. Because the cache is so much faster than memory, this produces the striking difference between the two loops.
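A small self-contained benchmark of the two traversal orders, assuming a 4096 x 4096 int matrix (about 64 MB, far larger than the caches); the absolute times depend on the machine, but the a[i][j] order should be clearly faster:

#include <stdio.h>
#include <time.h>

#define N 4096                   /* an N x N int matrix: ~64 MB */

static int a[N][N];

int main(void)
{
    volatile long sum = 0;       /* volatile so the loops are not optimized away */
    clock_t t0, t1;

    t0 = clock();
    for (int i = 0; i < N; i++)          /* row-major: walks memory contiguously */
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    t1 = clock();
    printf("a[i][j]: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (int i = 0; i < N; i++)          /* column-major: jumps a whole row per read */
        for (int j = 0; j < N; j++)
            sum += a[j][i];
    t1 = clock();
    printf("a[j][i]: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    (void)sum;
    return 0;
}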

A similar phenomenon occurs in real production environments. When concurrency is too high, the CPU can no longer keep up with the stream of incoming requests and has to keep switching contexts just to read each message out of memory in time and avoid buffer overflows at the network card. While the CPU is constantly switching contexts, the cache is constantly being remapped, which further drags down the CPU's execution efficiency. This is another important reason why systems avalanche.

The above is my summary, accumulated over recent years, of the underlying mechanisms behind system avalanches. Of course, solving production problems quickly is very much a matter of "one minute on stage, ten years of practice off stage". This article has covered the CPU-related mechanisms; the next article will discuss the underlying mechanisms of memory management, and after that I tentatively plan a piece dedicated to Redis, which may also help CSDN with the Redis problems mentioned above. I hope that through this series readers can accumulate experience quickly and become the kind of veteran who can rapidly resolve difficult and miscellaneous diseases.


Origin: blog.csdn.net/BEYONDMA/article/details/115264503