How many times do SSD, memory, and L1 Cache differ in speed?

An interview question: how many times do SSD, memory, and L1 Cache differ in speed?

In fact, compared with complex technical questions, I prefer to ask simple, almost common-sense questions in interviews, because complex problems are built from simple ones. If your understanding of the simple problems is solid, you can derive the complex ones yourself.

If you don't know about the L1 Cache, you may misjudge how fast memory really is. When we write programs we use registers, memory, and disks, and according to Murphy's law, if any one of these assumptions is wrong, it will eventually cause problems.

Why is there a storage tiering strategy?

To understand the storage tiering strategy, you first have to figure out "what do we want memory to look like", that is, "what are our requirements?"

Then, you have to figure out what constraints get in the way of implementing those requirements.

In terms of requirements, we want memory that is fast, small, large in capacity, low in power consumption, easy to cool, and that keeps its data when the power is off. In reality, we cannot satisfy all of these at once.

A few examples will make this concrete:

  • If a memory device is physically small, its storage capacity is limited.

  • If the density of its electronics is very high, heat dissipation becomes a problem. Electronic components generate heat, which is why a CPU, with its very densely packed components, needs a dedicated fan or water cooling.

  • If a memory device is farther from the CPU, transfers necessarily take longer, so the effective speed drops.

You may have doubts here: in most people's minds the speed of light is extremely fast, and signals travel at roughly the speed of light, so the delay should be negligible. That is not the case. Take a 1 GHz CPU as an example: 1 G means 10⁹, so one clock cycle lasts one billionth of a second, i.e. 1 ns. The speed of light is 3×10⁸ meters per second, that is, 300 million meters per second. So within one clock cycle, light can travel only about 30 cm.
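
To make the numbers concrete, here is a tiny C calculation using only the figures above (a 1 GHz clock and light at 3×10⁸ m/s):

```c
#include <stdio.h>

int main(void) {
    double clock_hz  = 1e9;   /* 1 GHz CPU clock                      */
    double light_mps = 3e8;   /* speed of light in meters per second  */

    double cycle_s    = 1.0 / clock_hz;        /* one clock cycle: 1 ns           */
    double distance_m = light_mps * cycle_s;   /* distance light covers per cycle */

    printf("one clock cycle : %g s\n", cycle_s);      /* 1e-09 s       */
    printf("light per cycle : %g m\n", distance_m);   /* 0.3 m = 30 cm */
    return 0;
}
```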

See? Although the speed of light seems enormous at human scale, inside a computer it is not as fast as we think. Even a small increase in the distance between a component and the CPU causes a noticeable drop in speed.

You may still ask: why not just put the memory inside the CPU?

If you did that, besides the heat and size problems for the whole chip, there would be no way to configure a server's memory separately. The memory capacity would be fixed when the CPU left the factory, and upgrading to more memory would mean replacing the CPU. Since configurable, customizable assembly is an important requirement, this is definitely unacceptable.

Also, at the same price point, the faster a memory is, the more power it usually consumes, and the higher the power consumption, the more heat it generates.

Therefore, the requirements above cannot all be satisfied at once, unless storage technology makes some disruptive breakthrough in the future.

Storage Tiering Strategy

Since no single kind of memory can satisfy all of these requirements, we have to split the requirements up.

A feasible approach is to use different memories according to how frequently data is used: data that is used often should be read and written as fast as possible, so it is kept in the most expensive material placed closest to the CPU; the less frequently data is used, the farther from the CPU we put it and the cheaper the material.


Specifically, we usually divide the memory into several levels:

  1. Register;

  2. L1-Cache;

  3. L2-Cache;

  4. L3-Cache;

  5. Memory;

  6. Hard disk/SSD.

Register

Registers sit right next to the CPU's control unit and arithmetic/logic unit, and they are built from the fastest material. As mentioned earlier, the faster a memory is, the more power it consumes, the more heat it generates, and the more it costs, so there cannot be many registers.

A CPU usually has somewhere between tens and hundreds of registers, and each register holds a fixed number of bytes. For example (a quick check in code follows this list):

  • Most registers in a 32-bit CPU can store 4 bytes;

  • Most registers in a 64-bit CPU can store 8 bytes.
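
A rough way to see this from a program is to print the size of pointer-width types in C; pointer width normally matches the general-purpose register width of the target, so treat this as an indirect check rather than a guarantee:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* On a typical 64-bit build these print 8; on a 32-bit build, 4. */
    printf("sizeof(void *)    = %zu bytes\n", sizeof(void *));
    printf("sizeof(uintptr_t) = %zu bytes\n", sizeof(uintptr_t));
    return 0;
}
```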

Register access is very fast, generally required to complete a read or write within half a CPU clock cycle. Consider an instruction that must complete within 4 cycles: besides reading and writing registers, it also has to be decoded, its execution controlled, and its computation performed. If registers were too slow, the instruction could not finish within 4 cycles.

L1-Cache

The L1 cache is inside the CPU. Although it sits farther from the CPU core than the registers, it is cheaper. An L1 cache usually ranges from tens of KB to a few hundred KB, with a read/write latency of 2~4 CPU clock cycles.

L2-Cache

The L2 cache is also inside the CPU, farther from the CPU core than the L1 cache. It is larger than the L1 cache; the exact size depends on the CPU model, around 2 MB for example, sometimes smaller or larger, and its latency is 10~20 CPU cycles.

L3-Cache

The L3 cache is also inside the CPU, farther from the CPU core than the L2 cache. It is usually larger than the L2 cache, with a read/write latency of 20~60 CPU cycles. Its size also depends on the model; for example, an i9 CPU has a 512 KB L1 cache, a 2 MB L2 cache, and a 16 MB L3 cache.

Memory

Main memory is made mainly of semiconductor silicon and plugs into the motherboard. Because it sits some distance from the CPU, it has to be connected to the CPU by a bus. Since memory occupies its own independent space, it can be physically larger and is far cheaper than the storage levels above. Personal computers today often have 16 GB of memory, while some servers have several TB. Memory access latency is around 200~300 CPU cycles.

SSD and HDD

An SSD is also called a solid-state drive. Its structure is similar to that of memory, but its advantage is that data survives a power loss, whereas data in memory, registers, and caches disappears when the power is cut. Memory reads and writes are roughly 10~1000 times faster than an SSD. In the past there were also mechanical disks that read and write physically, which we also call hard disks (HDDs); they are about a million times slower than memory, and because they are so slow they have gradually been replaced by SSDs.


When the CPU needs a piece of data, if it is already in a register, it is used directly; if not, the CPU first checks the L1 cache; if it is not in L1, it checks the L2 cache; if it is not in L2, it checks the L3 cache; and if it is not in L3 either, it fetches the data from memory.
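
Below is a schematic C sketch of that lookup order. It is purely a software model for illustration: the Level struct is hypothetical, and the latencies are the rough cycle counts quoted in this article; in real hardware this lookup is done by circuitry, not code.

```c
#include <stdbool.h>
#include <stdio.h>

/* A toy model of the memory hierarchy: each level is reduced to a
   hit-or-miss flag plus an approximate access latency. */
typedef struct {
    const char *name;
    bool        hit;     /* does this level hold the requested data? */
    int         cycles;  /* rough access latency in CPU cycles       */
} Level;

int main(void) {
    Level levels[] = {
        { "L1 cache", false,   4 },   /* 2~4 cycles     */
        { "L2 cache", false,  20 },   /* 10~20 cycles   */
        { "L3 cache", true,   60 },   /* 20~60 cycles   */
        { "memory",   true,  300 },   /* 200~300 cycles */
    };

    /* Walk the hierarchy top-down and stop at the first level that hits. */
    for (size_t i = 0; i < sizeof levels / sizeof levels[0]; i++) {
        if (levels[i].hit) {
            printf("served by %s in ~%d cycles\n",
                   levels[i].name, levels[i].cycles);
            break;
        }
    }
    return 0;
}
```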

Cache Entry Structure

Above, we introduced the levels of the memory hierarchy and their characteristics. Next there are some design difficulties in cache algorithms and data structures that I want to discuss with you. For example, when the CPU wants to access a memory address, how does it check whether that data is already in the L1 cache? In other words, what data structures and algorithms does the cache use?

Both the cache and main memory are linear storage: data is laid out one item after another. If we imagine memory as a table with a single column, then the cache is a table with several columns, and each row of that table is called a cache entry.

Plan 1

A cache is essentially a key-value store: the key is a memory address, and the value is the contents of that address at the moment it was cached. Let's consider a simple design first, a cache entry with 2 columns:

  1. the memory address;

  2. the cached value.

When the CPU reads a memory address, we add an entry. When we want to check whether the data at some memory address is in the L1 cache, we traverse the entries and compare the address stored in each entry with the address being queried. If they match, we take the value cached in that entry.

This method has to traverse the entries in the cache, so it is very slow: in the worst case every entry must be checked, which makes it infeasible.
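
A minimal C sketch of this naive design; the Entry struct and the lookup function are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Plan 1: every entry stores the full memory address plus the cached value. */
typedef struct {
    bool     valid;
    uint64_t addr;    /* the memory address (the key)    */
    uint64_t value;   /* the cached contents (the value) */
} Entry;

/* Linear scan over all entries: O(n), and in the worst case (a miss)
   every entry is examined. */
bool lookup(const Entry *cache, size_t n, uint64_t addr, uint64_t *out) {
    for (size_t i = 0; i < n; i++) {
        if (cache[i].valid && cache[i].addr == addr) {
            *out = cache[i].value;   /* hit */
            return true;
        }
    }
    return false;                    /* miss */
}
```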

Plan 2

In fact, many excellent solutions evolve from the most naive ones. We now have a solution, but it cannot quickly determine which row a memory address is cached in. So we want a better scheme: given a memory address, we should immediately know which row it belongs to.

Here we can use a bit of math. Say there are 1000 memory addresses but only 10 cache entries. The memory addresses are numbered 0, 1, 2, 3, ..., 999, and the cache entries are numbered 0~9. Take a memory address, say 701, and map it mathematically to a cache entry: 701 modulo 10 (the remainder after dividing by 10) gives cache entry 1.

In this way, every time we get a memory address we can immediately determine its cache entry; then we compare the memory address stored in the first column of that entry with the address being queried, which tells us whether the address is cached.

To generalize: the method used here is essentially a hash table, and address % 10 is a simple hash function.
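
A small C sketch of Plan 2, following the decimal example in the text (10 entries, index computed as address % 10). The names are again just for illustration; real hardware indexes with address bits and keeps only part of the address as a tag.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_ENTRIES 10   /* 10 cache entries for the 1000 addresses in the example */

typedef struct {
    bool     valid;
    uint64_t addr;    /* full address kept so we can verify the match */
    uint64_t value;
} Entry;

static Entry cache[NUM_ENTRIES];

/* Plan 2: the entry index is derived directly from the address,
   so a lookup touches exactly one entry instead of scanning them all. */
bool lookup(uint64_t addr, uint64_t *out) {
    Entry *e = &cache[addr % NUM_ENTRIES];   /* e.g. 701 % 10 == 1 */
    if (e->valid && e->addr == addr) {
        *out = e->value;    /* hit */
        return true;
    }
    return false;           /* miss: empty slot, or occupied by another address */
}

void insert(uint64_t addr, uint64_t value) {
    Entry *e = &cache[addr % NUM_ENTRIES];
    e->valid = true;        /* silently replaces whatever was in this slot */
    e->addr  = addr;
    e->value = value;
}
```

This is essentially a direct-mapped cache: lookups take constant time, at the cost that two addresses mapping to the same slot keep evicting each other.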

Instruction Prefetch

Next, let's discuss instruction prefetch.

We learned earlier that the CPU executes the instructions in memory sequentially and that executing an instruction is very fast, usually 2 to 6 CPU clock cycles. In this lesson we learned the memory tiering strategy and found that reading and writing memory is actually very slow, around 200~300 clock cycles.

Have you noticed the problem this creates? The CPU cannot simply fetch instructions from memory one at a time and then execute them; if it did, every instruction would cost 200~300 clock cycles just to fetch.

So, how to deal with this problem?

As an aside, you will often run into the same situation in business development with RPC calls: remote calls drag down overall execution efficiency. Let's look at how this class of problem can be solved.

One solution is for the CPU to prefetch dozens or even hundreds of upcoming instructions from memory into the much faster L1 cache. Because the L1 cache takes only 2~4 clock cycles to read or write, it can keep up with the CPU's execution speed.

This raises another problem: if both data and instructions are stored in the L1 cache, cached data could overwrite the cached instructions, with very serious consequences. Therefore, the L1 cache is usually split into two areas, an instruction area and a data area.

At the same time, one more question arises: since the L1 cache is split into instruction and data areas, do L2/L3 need to be split the same way? Actually, no, because L2 and L3 do not need to assist directly with instruction prefetch.

Cache Hit Rate

Next, there is one more important question to address: taking L1/L2/L3 together, what is the cache hit rate?

A hit means the required data is found in the cache. The opposite of a hit is a miss, sometimes called penetration, meaning a read does not find the corresponding data in the cache.

Statistically, the hit rate of the L1 cache is about 80%, and the combined hit rate of L1/L2/L3 is about 95%. So the CPU cache design is quite reasonable: only about 5% of reads fall through to memory, while 95% are served from the cache. This is also why programming languages have gradually dropped syntax that lets programmers manipulate registers directly: with the cache guaranteeing such a high hit rate, that kind of manual optimization adds little and is easy to get wrong.
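
With these figures we can estimate the average cost of a read. Only the 95% combined hit rate and the 200~300-cycle memory latency come from the text; the 40-cycle blended cache latency below is an assumption made up for this example.

```c
#include <stdio.h>

int main(void) {
    double hit_rate     = 0.95;    /* combined L1/L2/L3 hit rate (from the text) */
    double cache_cycles = 40.0;    /* assumed average cost of a cache hit        */
    double mem_cycles   = 250.0;   /* memory access, roughly 200~300 cycles      */

    /* Average cycles per read: hits served by the cache plus misses
       that fall through to memory. */
    double avg = hit_rate * cache_cycles + (1.0 - hit_rate) * mem_cycles;

    printf("average read cost: %.1f cycles\n", avg);   /* 0.95*40 + 0.05*250 = 50.5 */
    return 0;
}
```

Even the 5% of misses contributes a sizable share of the average cost, which is why the hit rate matters so much.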

Cache Replacement Problem

One last question: suppose the L1 cache entries are all full, and the CPU reads memory again, so a new entry needs to be stored in the L1 cache. For a new entry to come in, an old entry has to go out, so we need an algorithm to decide which entry should be replaced. This is called the cache replacement problem. I will discuss it with you in "21 | Process Scheduling: What methods are there for process scheduling?".

Summary

In this lesson we covered the memory tiering strategy and discussed how the L1/L2/L3 caches work. This material is the foundation of all cache knowledge: every cache system is, at its core, a tiering of storage resources. When designing a cache, besides the overall architecture, we also need to pay attention to details such as:

  • How are the cache entries designed?

  • How is the lookup algorithm designed?

  • How is the hit rate measured?

  • How is the cache replaced, that is, which entry gets evicted?

Back to the question: how many times do SSD, memory, and L1 Cache differ in speed?

[Analysis] Memory is roughly 10~1000 times faster than an SSD, and the L1 Cache is roughly 100 times faster than memory, so the L1 Cache is roughly 1,000~100,000 times faster than an SSD. Notice how much potential SSDs have: a good SSD is already approaching memory speed, though its cost is still somewhat higher.
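
The ratio arithmetic spelled out, simply multiplying the two ranges quoted above:

```c
#include <stdio.h>

int main(void) {
    double mem_vs_ssd_low  = 10.0;     /* memory vs SSD, lower bound  */
    double mem_vs_ssd_high = 1000.0;   /* memory vs SSD, upper bound  */
    double l1_vs_mem       = 100.0;    /* L1 Cache vs memory          */

    /* Speed-ups compose multiplicatively across levels. */
    printf("L1 Cache vs SSD: %.0fx ~ %.0fx\n",
           l1_vs_mem * mem_vs_ssd_low,     /* 1,000x   */
           l1_vs_mem * mem_vs_ssd_high);   /* 100,000x */
    return 0;
}
```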

The performance gap between different kinds of memory is enormous, which is exactly what makes building a memory hierarchy worthwhile: the purpose of the hierarchy is to build a caching system.


Origin: blog.csdn.net/qq_37247026/article/details/131199767