foreword
Many of you have probably heard the term "zero copy". The technique is used in all kinds of open source components, such as Kafka, RocketMQ, Netty, and Nginx. So today I want to share some knowledge about zero copy with you.
data transfer in computer
Before introducing zero copy, I want to talk about how data is transferred inside a computer system and how that transfer mechanism evolved. To write this part, I dusted off my Computer Organization textbook, which had been sitting on the shelf for years:
Early stage:
Scattered connections, serial work, program-driven polling. At this stage, the CPU acts like a nanny: it has to read data from the I/O interface itself and then deliver it to main memory. The specific process is:
- The CPU actively starts the I/O device
- Then the CPU keeps asking the I/O device: "Are you ready yet?" Note that this polling never stops.
- When the I/O device finally tells the CPU it is ready, the CPU reads the data from the I/O interface.
- The CPU then carries this data to main memory, like a courier.
This inefficient transfer process keeps the CPU fully occupied, so it cannot do anything more meaningful in the meantime.
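As a toy illustration (not real hardware access; the `AtomicBoolean` here merely stands in for a device status register, and the background thread plays the I/O device), the busy-wait polling above can be sketched in Java:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class PollingDemo {

    // The CPU does nothing but ask "are you ready yet?" until the device
    // is ready -- this spin loop is exactly the wasted CPU time described above.
    static long pollUntilReady(AtomicBoolean ready) {
        long polls = 0;
        while (!ready.get()) {
            polls++;
        }
        return polls;
    }

    public static void main(String[] args) {
        AtomicBoolean ready = new AtomicBoolean(false);

        // A background thread plays the I/O device: it becomes ready after ~10 ms.
        new Thread(() -> {
            try { Thread.sleep(10); } catch (InterruptedException ignored) {}
            ready.set(true);
        }).start();

        System.out.println("device ready after " + pollUntilReady(ready) + " polls");
    }
}
```

Even in this tiny demo, the poll counter typically reaches a huge number of iterations during a 10 ms wait, all of it CPU time that could have been spent elsewhere.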
Interface module and DMA stage
This stage is also the most relevant to what we will discuss later.
interface module
In the von Neumann structure, each component originally had its own dedicated connection. This not only cost a lot of wiring but also made it difficult to expand I/O devices. The early stage above used this scheme, called scattered connection: adding one I/O device meant running many new wires. Therefore the bus connection was introduced: multiple devices attach to the same group of bus lines, forming a common transmission channel between devices. This is also the data-exchange structure of our home computers and small calculators.
In this mode, data exchange uses the program interrupt method. As we saw above, after starting the I/O device the CPU previously had to keep asking whether it was ready. Program interrupts remove that polling phase entirely, fulfilling a long-cherished wish:
- The CPU actively starts the I/O device.
- After starting the device, the CPU no longer needs to poll it; it goes off to do other things, similar to asynchronous execution.
- Once the I/O device is ready, it tells the CPU via a bus interrupt.
- The CPU reads the data and transfers it to the main memory.
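The interrupt-driven steps above can be mimicked in Java with a callback. Note that the `InterruptHandler` interface and `startDevice` method below are illustrative stand-ins for the bus interrupt mechanism, not a real driver API:

```java
import java.util.concurrent.CountDownLatch;

public class InterruptDemo {

    // Stand-in for an interrupt handler: invoked when the device signals readiness.
    interface InterruptHandler {
        void onReady(byte[] data);
    }

    // Simulated device: it prepares its data in the background and then
    // "raises an interrupt" by invoking the handler. The CPU never polls.
    static void startDevice(InterruptHandler handler) {
        new Thread(() -> {
            byte[] data = "payload".getBytes();  // the device fills its buffer
            handler.onReady(data);               // interrupt -> CPU now reads the data
        }).start();
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        startDevice(data -> {
            System.out.println("interrupt: received " + data.length + " bytes");
            done.countDown();
        });
        // Between startDevice() and the interrupt, the CPU is free to do other work.
        done.await();  // only so the demo does not exit before the callback runs
    }
}
```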
DMA
Although the above method improves CPU utilization, the CPU is still occupied while servicing the interrupt and copying the data. To reduce this further, the DMA (Direct Memory Access) method was introduced: a dedicated data path connects main memory and the I/O device, so data can be exchanged between them without interrupting the CPU for every transfer.
Generally speaking, we only need to pay attention to DMA and interrupts. The stages below apply mainly to large computers, so we will only touch on them briefly:
stage with channel structure
In small computers, DMA is enough to exchange data between high-speed I/O devices and the host. But large and medium-sized computers have many I/O devices and frequent, fragmented transfers; using plain DMA there causes a series of problems:
- Each I/O device needs its own dedicated DMA interface, which not only raises hardware cost but also requires resolving memory-access conflicts between DMA and the CPU, making control very complicated.
- The CPU has to manage numerous DMA interfaces, which also hurts efficiency.
Therefore, the channel was introduced. A channel is a component that manages I/O devices and the exchange of data between main memory and those devices. It can be regarded as a processor with special functions: a dedicated processor subordinate to the CPU. Since the CPU no longer participates directly in this management, its resource utilization improves.
Stages with I/O Handlers
The input/output system's fourth stage brought the I/O processor. The I/O processor, also known as a peripheral processor, works independently of the host: it can not only do everything an I/O channel does, but also handle format processing, error correction, and other operations. A system with an I/O processor runs with a higher degree of parallelism relative to the CPU, showing that the I/O system has gained greater independence from the host.
summary
We can see that the goal of this evolution has always been the same: reduce the CPU's involvement in data transmission and improve CPU resource utilization.
data copy
Now let's introduce today's requirement: there is a file on disk, and it needs to be sent over the network. How should you do it? After the introduction above, you should already have some ideas in mind.
traditional copy
If we implement it in Java, we get something like the following pseudo code:
public static void main(String[] args) {
    File file = new File("test.file");
    byte[] b = new byte[(int) file.length()];
    // try-with-resources closes both the stream and the socket;
    // the target host and port are placeholders
    try (InputStream in = new FileInputStream(file);
         Socket socket = new Socket("localhost", 8080)) {
        readFully(in, b);                   // CPU copy: kernel buffer -> application buffer
        socket.getOutputStream().write(b);  // CPU copy: application buffer -> socket buffer
    } catch (IOException e) {
        e.printStackTrace();                // don't swallow the exception silently
    }
}

private static boolean readFully(InputStream in, byte[] b) throws IOException {
    int size = b.length;
    int offset = 0;
    int len;
    while (size > 0) {
        len = in.read(b, offset, size);
        if (len == -1) {
            return false;  // stream ended before the buffer was filled
        }
        offset += len;
        size -= len;
    }
    return true;
}
This is the traditional copying method. The specific data flow diagram is as follows. PS: this ignores the fact that, in Java, data in the heap must additionally be copied into direct memory before being sent.
It can be seen that the data goes through four stages: 2 DMA copies and 2 CPU copies, four copies in total, along with four context switches, and the CPU is occupied for two of those copies.
- The CPU issues an instruction to the I/O device's DMA controller, and DMA transfers the data from the disk into the kernel buffer in kernel space.
- The second stage triggers a CPU interrupt, and the CPU copies the data from the kernel buffer into the application buffer.
- The CPU copies the data from the application buffer into the socket buffer in kernel space.
- DMA copies data from the socket buffer to the network card buffer.
Advantages: low development cost. It is good enough for scenarios with modest performance requirements, such as typical management systems.
Disadvantages: multiple context switches and CPU-occupied copies, so performance is low.
sendFile implements zero copy
So what exactly is zero copy? Wikipedia's definition: it usually refers to a way of sending a file over the network in which the file content is not copied into user space (User Space) but is transmitted to the network directly from kernel space (Kernel Space).
In Java NIO, FileChannel.transferTo() wraps the operating system's sendFile. We can satisfy the requirement above with the following pseudo code:
public static void main(String[] args) throws IOException {
    // the target address and file name are placeholders
    try (SocketChannel socketChannel = SocketChannel.open(new InetSocketAddress("localhost", 8080));
         FileChannel fileChannel = new FileInputStream("test").getChannel()) {
        // hand the transfer to the kernel; nothing is copied into the application buffer
        fileChannel.transferTo(0, fileChannel.size(), socketChannel);
    }
}
We replaced the Socket and FileInputStream above with channels from java.nio, thereby achieving zero copy.
The above specific process is as follows:
- When sendfile() is called, the CPU instructs the DMA controller to copy the disk data into the kernel buffer.
- After the DMA copy completes, an interrupt request is issued, and the CPU copies the data from the kernel buffer into the socket buffer; the sendFile call then completes and returns.
- DMA copies the data from the socket buffer into the network card buffer.
You can see that the data is never copied into the application buffer at all, which is why this approach counts as zero copy. However, this method is still not ideal: although we are down to three data copies, the CPU must still be interrupted to perform one of them. Why? Because DMA needs to know the memory address of the data before it can send it. Therefore, an improvement was made in the Linux 2.4 kernel: the data's description information (memory address and offset) in the kernel buffer is recorded into the corresponding socket buffer, so DMA can gather the data straight from the kernel buffer to the network card. The following process is finally formed:
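Back in Java, one practical caveat about the FileChannel.transferTo call used earlier: according to its Javadoc, a single call may transfer fewer bytes than requested (for example when the socket send buffer fills up), so robust code loops until everything is sent. A sketch, with the target address and file name as placeholders:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.channels.WritableByteChannel;

public class SendFileDemo {

    // Transfers the whole file, looping because transferTo may stop short.
    static long sendAll(FileChannel file, WritableByteChannel target) throws IOException {
        long position = 0;
        long size = file.size();
        while (position < size) {
            long sent = file.transferTo(position, size - position, target);
            if (sent <= 0) {
                break;  // defensive: stop if no progress is being made
            }
            position += sent;
        }
        return position;
    }

    public static void main(String[] args) throws IOException {
        try (SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 8080));
             FileChannel file = new FileInputStream("test").getChannel()) {
            sendAll(file, socket);
        }
    }
}
```

Taking a WritableByteChannel rather than a SocketChannel keeps sendAll usable with any writable target, which also makes it easy to test against an in-memory channel.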