In the pursuit of speed, what efforts have the CPU, memory, and I/O made?

Background

Some time ago I wrote an article, "Top Ten Technologies for High-Performance Development", and some readers sent me private messages...

When I was interviewing, there were two things I feared most: one was algorithm questions, the other was high concurrency.

Algorithms first. After following "Xiaohao Algorithm" and grinding through a lot of LeetCode, I found there are patterns to follow. I won't claim to be great at algorithms, but oddly enough, at least they no longer scare me the way they used to.

As for the second, high-performance and high-concurrency technology, there seemed to be endless things to learn. I picked up a bit here and a bit there, with no system to it. Then in one interview I met a real expert who asked about exactly this, and I got thoroughly beaten up. Fortunately, this expert was not only technically first-rate but also generously shared his learning experience with me: how to organize this technical knowledge into a system, instead of just learning it piecemeal.

CPU

No matter the programming language or code framework, in the end the code is executed by the CPU (strictly speaking this is not always true; there are also GPUs, TPUs, coprocessors, and so on, but those are not the focus of this article).

So if you want to improve performance and increase concurrency, the first question is: how do you make the CPU run faster?

This is also a direction CPU manufacturers have been pursuing all along.

How to make the CPU faster? CPU manufacturers have worked in two directions:

  • Speed up instruction execution

  • Speed up the CPU's access to data

For the first direction, the speed at which the CPU executes instructions is closely tied to its clock frequency. Fetching, decoding, and executing instructions faster to shorten the instruction cycle, and raising the clock frequency, have both long been very effective approaches.

From a few hundred MHz to several GHz today, the CPU clock frequency has made enormous progress, and so has the number of instructions that can be executed in the same amount of time.

For the second direction, improving the speed at which the CPU reads data, the answer is caching: using the principle of locality, frequently accessed data in memory is brought close to the CPU, which greatly improves access speed.

From the L1 cache to the L2 cache and even the L3 cache, the levels and capacity of the CPU cache keep growing, saving a great deal of time on reading and writing data.
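
To make the locality principle concrete, here is a small illustrative sketch of my own (not from the original article): summing a matrix row by row walks memory sequentially and plays to the cache, while summing it column by column strides across cache lines and is typically much slower, even though both loops do exactly the same arithmetic.

```c
#include <stdio.h>

#define N 2048
static double a[N][N];          /* ~32 MB matrix, far larger than the caches */

/* Cache-friendly: consecutive addresses, every cache line fully used. */
static double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Cache-hostile: stride of N doubles, a new cache line on almost every access. */
static double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}
```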

But as time went on, especially after entering the 21st century, processor manufacturers found it harder and harder to push the clock frequency further, and expanding the CPU cache further was also difficult.

What to do? If one worker can't go much faster, why not get a few more to work together? So multi-core technology arrived: a single CPU contains multiple cores, everyone rows the boat together, and CPU throughput takes off again~

Going even further, letting a core use its idle resources to run another thread in its spare cycles gave birth to hyper-threading, which lets one core execute two threads "simultaneously".

The above briefly covers the efforts the CPU has made to improve performance. But a fast CPU alone is not enough; our software has to make good use of it, otherwise its computing power goes to waste.

Threads were just mentioned above. So yes, how do you improve performance and increase concurrency? Using multithreading is certainly a very good idea.

But once multithreading is introduced, two thread-related topics have to be mentioned:

  • Thread synchronization

  • Thread blocking

When multiple threads work together, synchronization problems inevitably arise. The conventional solution is locking, and a thread waiting on a lock generally blocks.
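
As a minimal sketch of the conventional lock-based approach (illustrative only; the thread count and iteration numbers are arbitrary), two POSIX threads increment a shared counter under a mutex; whichever thread fails to grab the lock blocks until the other releases it:

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* may block; a blocked thread is switched out */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* always 2000000 thanks to the lock */
    return 0;
}
```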

When a thread blocks, it has to be switched out, and switching has a cost: not only the time spent in kernel scheduling, but also the loss from CPU cache invalidation.

If threads frequently lock and block, the losses add up. To improve performance, lock-free programming emerged, using atomic mechanisms provided by the CPU as a lighter-weight alternative to locks.
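
A minimal lock-free sketch of the same counter, using C11 atomics that compile down to the CPU's atomic instructions (e.g. a locked add or compare-and-swap); no mutex is taken, so no thread is ever put to sleep waiting for one:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        atomic_fetch_add(&counter, 1);   /* single atomic instruction, no blocking */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", atomic_load(&counter));
    return 0;
}
```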

At the same time, to let a switched-out thread resume on the same CPU core it ran on before and reduce cache loss, CPU affinity binding for threads also appeared.
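
A minimal Linux-only sketch of affinity binding via the GNU extension pthread_setaffinity_np (the choice of core 0 here is arbitrary):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                       /* allow this thread to run on core 0 only */

    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
    else
        printf("thread pinned, now running on core %d\n", sched_getcpu());
    return 0;
}
```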

Modern operating systems schedule multiple threads onto the CPU in time slices. If a thread hands over its execution rights for one reason or another before its time slice is used up, that thread really loses out.

So some people proposed making full use of the CPU by not letting threads block and give up execution, and instead scheduling multiple execution streams at the application layer. Thus coroutine technology was born: if one task would block, no matter, just switch to doing something else within the same thread, without a costly thread switch.
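
Coroutine libraries differ widely, but on Linux many are built on user-level context switching such as ucontext(3). A bare-bones sketch of the idea, illustrative only (no I/O hooking, no scheduler):

```c
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];           /* stack for the coroutine */

static void coroutine_body(void) {
    puts("coroutine: step 1, yielding back to main");
    swapcontext(&co_ctx, &main_ctx);       /* "yield": save self, resume main */
    puts("coroutine: step 2, finishing");
    /* returning here resumes uc_link (main_ctx) */
}

int main(void) {
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp   = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link          = &main_ctx;   /* where to go when the coroutine returns */
    makecontext(&co_ctx, coroutine_body, 0);

    puts("main: resuming coroutine");
    swapcontext(&main_ctx, &co_ctx);       /* "resume": run coroutine until it yields */
    puts("main: coroutine yielded, resuming it again");
    swapcontext(&main_ctx, &co_ctx);
    puts("main: coroutine done");
    return 0;
}
```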

Memory

The CPU's closest working partner is memory; only together can the two put on a good show.

Improving the speed of memory access is also an important part of high-performance development!

How to improve it? The hardware is hard for programmers to change, so we have to work at the software level.

Memory management has evolved from real-mode addressing to paged memory management. In today's computers, the addresses the CPU works with are all virtual addresses, which means address translation is involved. There is plenty of room for optimization here, in two directions:

  • Reduce page faults

  • Use huge page technology

Modern operating systems basically all use a technique called the page file / swap space: memory is limited, but processes keep multiplying and their demand for memory keeps growing, so what happens when it runs out? The OS sets aside an area on the hard disk and moves memory pages that haven't been used for a long time there; when a program later touches that data, an access exception (page fault) is triggered, and the exception handler reads the page back from disk.

You can imagine that if the data a program accesses is often not in memory but swapped out to disk, page faults will be triggered frequently and the program's performance will suffer badly. So reducing page faults is a good way to improve performance.
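
A quick way to see how a process is doing on this front (a small sketch, Linux/POSIX): getrusage(2) reports minor faults, which are resolved from memory, and major faults, which had to go to disk; it is the major faults we want to keep near zero.

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        printf("minor page faults: %ld\n", ru.ru_minflt);  /* satisfied from memory */
        printf("major page faults: %ld\n", ru.ru_majflt);  /* required disk I/O */
    }
    return 0;
}
```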

Translating a virtual address into the real physical address is done by the CPU, specifically by table lookups: first-level page directory -> second-level page directory -> ... -> physical memory.

The page directories and page tables themselves live in memory. Address translation is obviously an extremely high-frequency event, happening all the time, and multiple memory lookups per translation would be very slow. For this reason the CPU introduced the TLB (Translation Lookaside Buffer), which caches page table entries to cut down on memory lookups and speed up address translation.

By default, the operating system manages memory in 4 KB pages. For server programs that need a lot of memory (Redis, the JVM, Elasticsearch, and so on), tens of gigabytes is common; divided into 4 KB units, think how many page table entries that creates!

The TLB inside the CPU has limited capacity. The more memory there is, the more page table entries there are, and the higher the chance of a TLB miss. Hence huge page memory technology: if 4 KB is too small, just make the page bigger. Huge pages reduce the number of page faults and raise the TLB hit rate, which helps performance considerably.
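
A hedged sketch of requesting a huge page explicitly on Linux with mmap(MAP_HUGETLB); it only succeeds if the kernel has huge pages reserved (for example via /proc/sys/vm/nr_hugepages), and the 2 MB size assumes the common x86-64 huge page size:

```c
#include <stdio.h>
#include <sys/mman.h>

#define LEN (2UL * 1024 * 1024)   /* one 2 MB huge page */

int main(void) {
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   /* most likely: no huge pages reserved */
        return 1;
    }
    ((char *)p)[0] = 1;                /* touch the page so it is actually backed */
    puts("allocated one 2 MB huge page");
    munmap(p, LEN);
    return 0;
}
```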

On some high-end servers the amount of memory is huge, and many CPU cores all have to reach it over the memory bus. You can imagine that as the core count grows, contention on the memory bus inevitably intensifies. So the NUMA architecture appeared: CPU cores are divided into groups, each with its own local memory and memory bus, improving memory access speed.
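
A hedged sketch using libnuma (an assumption: the library is installed, you link with -lnuma, and the machine actually has multiple NUMA nodes) to allocate memory on the node local to the CPU the thread is currently running on:

```c
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        puts("NUMA is not supported on this system");
        return 1;
    }
    int node = numa_node_of_cpu(sched_getcpu());    /* node local to our current core */
    char *buf = numa_alloc_onnode(1 << 20, node);   /* 1 MB allocated on that node */
    if (buf != NULL) {
        buf[0] = 1;                                 /* touch it so the page is placed */
        printf("allocated 1 MB on NUMA node %d\n", node);
        numa_free(buf, 1 << 20);
    }
    return 0;
}
```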

I/O

Even with a fast CPU and fast memory, it is still not enough. Apart from CPU-intensive programs (heavy math, encryption and decryption, machine learning, and so on), our programs spend a considerable part of their daily work doing I/O: reading and writing files on disk, sending and receiving network packets, and more.

So how to improve the speed of I/O is an important topic in high-performance development.

Because I/O involves interacting with peripherals (hard disks, network cards, and so on), and those peripherals are usually very slow relative to the CPU, under normal circumstances the thread performing the I/O unavoidably blocks, as mentioned in the CPU section above.

Once blocked, a thread can't do any work. To keep working, you open more threads. But threads are expensive resources and can't be created in huge numbers, and with too many threads, switching and scheduling them becomes a burden in itself.

Can a thread avoid blocking when performing I/O? New technologies appeared for exactly this:

  • Non-blocking I/O

  • I/O multiplexing

  • Asynchronous I/O

With the original blocking I/O, you wait until the I/O completes. With non-blocking I/O you generally poll: go do something else, then come back every so often and ask, "Is it ready yet?"
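
A minimal sketch of non-blocking mode on a file descriptor using fcntl: when there is nothing to read, read() returns immediately with EAGAIN instead of putting the thread to sleep.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);          /* read current flags */
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

int main(void) {
    set_nonblocking(STDIN_FILENO);
    char buf[128];
    ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        puts("no data yet, doing something else instead of blocking");
    else
        printf("read %zd bytes\n", n);
    return 0;
}
```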

But having every thread do its own polling isn't great either; just hand it all to one thread to take care of. This is I/O multiplexing: through select/poll, a single thread can handle multiple I/O targets. Then came a further improvement: with epoll, even the polling is no longer needed; the kernel's wake-up notification mechanism is used instead, and even more I/O targets can be handled at once.
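
A compressed sketch of such an event loop with epoll (illustrative; listen_fd is assumed to be an already-listening socket, and error handling is omitted): the single thread sleeps in epoll_wait and the kernel wakes it only when some descriptor is actually ready.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

void event_loop(int listen_fd) {
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);        /* watch the listening socket */

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  /* sleep until the kernel wakes us */
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                int conn = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev); /* watch the new connection too */
            } else {
                char buf[4096];
                ssize_t r = read(events[i].data.fd, buf, sizeof(buf));
                if (r <= 0)
                    close(events[i].data.fd);               /* peer closed or error */
                /* else: handle r bytes of request data here */
            }
        }
    }
}
```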

Asynchronous I/O goes even further: submit the request (and perhaps a callback), go off and do your own thing, and the operating system will notify you when the data is ready to collect.
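
A hedged sketch of the idea with POSIX AIO (aio_read from <aio.h>; on older glibc you may need to link with -lrt, and the file path is just an example): the read is submitted and returns immediately, and completion is collected later; a signal or callback thread can also be requested via the aiocb instead of polling as done here.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hostname", O_RDONLY);  /* example path */
    char buf[256];

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    aio_read(&cb);                             /* submit, returns immediately */

    /* ... do other useful work here ... */

    while (aio_error(&cb) == EINPROGRESS)
        ;                                      /* or wait for a completion signal instead of spinning */
    ssize_t n = aio_return(&cb);
    printf("async read completed: %zd bytes\n", n);
    close(fd);
    return 0;
}
```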

Coming back to the I/O itself, data has to move between memory and the peripherals. If the volume is large and the CPU does the carrying itself, that is time-consuming, mindless work and a huge waste of CPU power.

So, to free the CPU from this, another technology was born: direct memory access (DMA). The data-moving work is outsourced to the DMA controller, and the CPU only needs to issue commands from the sidelines.

With DMA, the CPU no longer has to be bothered with moving the data. But for an application that wants to send a file over the network, the data still gets tossed back and forth between kernel space and user space, and those copies still need the CPU, which is wasteful. Solving this problem and improving performance further gave us zero-copy technology, which takes the remaining burden off the CPU.
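
A minimal sketch of zero-copy on Linux with sendfile(2) (illustrative; sock_fd is assumed to be an already-connected socket): the file's pages go from the page cache to the socket inside the kernel, never being copied into a user-space buffer.

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Send an entire file over an already-connected socket without copying it
 * through user space. Returns bytes sent, or -1 on error. */
ssize_t send_file(int sock_fd, const char *path) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    fstat(file_fd, &st);

    off_t offset = 0;
    ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size); /* kernel-internal transfer */
    close(file_fd);
    return sent;
}
```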

Algorithms and architecture

Once the CPU, memory, and I/O are all fast enough, the performance of a single machine is hard to push further. But these days a server rarely fights alone, so next we turn our attention to algorithms and architecture.

If one server can't handle the load, pile on hardware: distributed clusters and load balancing come into play.

These days, what back-end service doesn't have a database? How do you make the database faster? Now it's indexing's turn: building indexes on the database improves retrieval speed.

But the database's data ultimately lives on the hard disk, so reads are bound to be slow, and when a flood of requests comes in, who can withstand it? Hence the in-memory caches Redis and Memcached: accessing memory is, after all, far faster than querying the database.

There are far too many technologies under algorithms and architecture to cover here, and mastering them is the necessary path from ordinary coder to architect. Let's save them for next time.

Summary

High performance and high concurrency are eternal topics of pursuit in back-end development.

No technology appears out of thin air; each was proposed to solve some problem. When learning these technologies, grasping why they appeared and how they connect to one another, and building a layered map of this knowledge in our own heads, lets us get twice the result with half the effort.

For most of the technologies in this map, I have previously written corresponding story-style articles explaining them; everyone is welcome to revisit them~

If you have anything to say about this article, or if there's an important technology I've missed, please leave a comment and let's discuss.

