4. page_fault, memory-I/O interaction, VSS, LRU

Full table of contents

1. Basic principles of Linux virtual memory, MMU, and paging
2. OOM scoring factors: oom_adj and oom_score
3. Page alloc and free, the Buddy algorithm, and CMA
4. page_fault, memory-I/O interaction, VSS, LRU
5. DMA and cache coherency

=================================================================================

page_fault: page-fault handling

A page in a process's linear address space does not need to be resident in memory. When an instruction accesses a page that is not in memory (i.e. the present bit in its page-table entry is 0), execution of the instruction stops and a page-fault exception is raised. The fault handler can resolve the fault by loading the page from external storage, after which the instruction that caused the exception is re-executed and no longer faults.
When a page fault occurs, the process traps into kernel mode and the kernel performs the following steps:
1. Check whether the faulting virtual address is legal
2. Find/allocate a physical page
3. Fill the physical page (read from disk, zero it, or do nothing)
4. Establish the mapping from virtual address to physical address, then re-execute the instruction that caused the page fault
If step 3 requires reading from disk, the fault counts as a major fault (majflt); otherwise it is a minor fault (minflt).
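As a quick check, the per-process fault counters can be read with Python's resource module. A minimal sketch (on a warm page cache, majflt will usually stay at 0):

```python
import resource

# Touch a freshly allocated buffer so the kernel must service
# minor page faults (anonymous pages are populated lazily).
buf = bytearray(4 * 1024 * 1024)       # 4 MiB of anonymous memory
for i in range(0, len(buf), 4096):     # write one byte per 4 KiB page
    buf[i] = 1

usage = resource.getrusage(resource.RUSAGE_SELF)
print("minflt:", usage.ru_minflt)  # faults served without disk I/O
print("majflt:", usage.ru_majflt)  # faults that required disk I/O
```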

Memory-I/O interaction

There are two types of memory pages for user processes:
file-backed pages
anonymous pages

For example, a process's code segment and mmap'd files are file-backed, while its heap and stack correspond to no file and are anonymous pages.
When memory is tight, file-backed pages can be written back directly to their corresponding files on disk, which is called page-out; no swap area is needed. Anonymous pages, by contrast, can only be written to a swap area on disk when memory is tight, which is called swap-out.
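The distinction is visible in /proc/&lt;pid&gt;/maps: entries with a backing path are file-backed; entries with no path, or pseudo-paths like [heap] and [stack], are anonymous. A hypothetical classifier over a made-up sample (field layout is real, values are invented):

```python
# Sample lines in /proc/<pid>/maps format; addresses/inodes are made up.
SAMPLE_MAPS = """\
00400000-004a0000 r-xp 00000000 08:01 1310722 /bin/bash
7f1c2c000000-7f1c2c021000 rw-p 00000000 00:00 0
7ffd4b2d0000-7ffd4b2f1000 rw-p 00000000 00:00 0 [stack]
55a1b2c00000-55a1b2c21000 rw-p 00000000 00:00 0 [heap]
"""

def classify(maps_text):
    kinds = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        path = fields[5] if len(fields) > 5 else ""
        if path and not path.startswith("["):
            kinds.append((path, "file-backed"))     # has a backing file
        else:
            kinds.append((path or "(anon)", "anonymous"))
    return kinds

for path, kind in classify(SAMPLE_MAPS):
    print(f"{kind:12} {path}")
```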

file-backed pages

For file-backed pages, a program can read the file either with read() or with mmap(). Whichever way the file is read from disk, the kernel allocates page-cache pages to cache the on-disk content. Data that has been read once is then served directly from the page cache on the next read, whether by the same process or another, which greatly improves overall system performance. A user's read()/write() is therefore really a copy between the user buffer and the page cache.
mmap() maps a range of the user virtual address space (below the 3 GB boundary on 32-bit Linux) directly onto the page cache. The user can then read and modify the file's contents through that address range, eliminating the copy between kernel and user space.
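Both paths can be sketched with Python's mmap module; a minimal round trip over a throwaway temp file:

```python
import mmap, os, tempfile

# Write a small file, then read it back two ways.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello page cache")
os.close(fd)

with open(path, "rb") as f:
    via_read = f.read()        # copies: page cache -> user buffer
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    via_mmap = bytes(mm)       # the mapping itself adds no kernel->user copy
    mm.close()

os.unlink(path)
print(via_read == via_mmap)    # both see the same cached content
```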
To a user program, then, the file is effectively memory: the page cache is the in-memory copy of the on-disk file. The cache can be dropped with "echo 3 > /proc/sys/vm/drop_caches"; afterward, a process's first read of a file will be slower.

The free command shows how much memory the page cache currently occupies, printed as buffers and cached (some versions of free combine the two). Cache generated by accessing files through a filesystem (mounting a filesystem, opening a file by name) is counted under cached, while cache generated by operating on the raw block device directly (opening /dev/sda for reading and writing) is counted under buffers.
In fact, the filesystem itself reads and writes files by manipulating the raw partition, and user space can also manipulate the raw disk directly: commands like dd that operate on a device name access the raw partition. So reading and writing through a filesystem populates both cached and buffers: the file data is counted under cached, while metadata such as file names, which the filesystem reads through the block device, shows up under buffers. For example, when reading a file on ext4, a page-cache hit is satisfied and returned at the VFS layer without descending into the ext4 layer.
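These counters also appear as the Buffers and Cached lines of /proc/meminfo. A small parser over a hypothetical sample (field names are real; the values are made up for illustration):

```python
# Hypothetical /proc/meminfo excerpt; the kB values are invented.
SAMPLE_MEMINFO = """\
MemTotal:        8029768 kB
MemFree:          234044 kB
Buffers:          302432 kB
Cached:          2517900 kB
"""

def meminfo_kb(text, key):
    """Return the kB value for one field of meminfo-formatted text."""
    for line in text.splitlines():
        name, rest = line.split(":", 1)
        if name == key:
            return int(rest.split()[0])
    raise KeyError(key)

print("Buffers:", meminfo_kb(SAMPLE_MEMINFO, "Buffers"), "kB")  # raw-device cache
print("Cached: ", meminfo_kb(SAMPLE_MEMINFO, "Cached"), "kB")   # file page cache
```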

Of course, you can also pass the O_DIRECT flag at open time and do direct I/O, bypassing the page cache (and buffers) entirely to read and write the disk directly.

anonymous pages

Pages without a file background, i.e. anonymous pages such as the heap, stack, and data segment, do not exist in the form of files, so they cannot be exchanged with disk files. They can, however, be swapped via a dedicated swap partition or a swap file on the hard disk. Swapping inactive pages out to disk relieves memory pressure; the swap area effectively serves as a fake file background for anonymous pages.

Page reclaim (reclaim)

File-backed data is just the page cache, but the page cache cannot grow without bound; you cannot gradually cache every file in memory. There must be a mechanism to evict rarely used file data from the page cache. The kernel has a watermark control mechanism that triggers page reclaim when system memory runs low.
For pages without a file background, i.e. anonymous pages such as the heap, stack, and data segment, if there is no swap area they cannot be exchanged with disk and must stay resident, consuming memory. Creating a swap partition, or a swap file, on the hard disk lets anonymous pages be swapped to disk as well; it can be thought of as forging a file background for them. The swap area ultimately has the effect of enlarging memory. Of course, if swapping is frequent, access to the swapped-out data becomes slower, since it involves I/O operations.

1. Watermark control:

The kernel defines three watermarks:
min: if free memory falls to this level, memory is considered critically low; the allocating process blocks and the kernel reclaims memory directly in that process's context (direct reclaim).
low: when free memory gradually drops to this level, memory reclaim by the kswapd thread is triggered.
high: as reclaim proceeds and free memory slowly rises back to this level, reclaim stops.
Since each zone manages its own memory independently, every zone has its own three watermarks.

2. swappiness:

When reclaiming, should file-backed pages or anonymous pages be reclaimed first? /proc/sys/vm/swappiness controls which side is reclaimed more. The larger swappiness is, the more the kernel tends to reclaim anonymous pages (swap them out); the smaller it is, the more it tends to reclaim file-backed pages. Either way, the reclaim policy is the same LRU algorithm: the least recently used pages are reclaimed first.

3. How the watermarks are calculated:

/proc/sys/vm/min_free_kbytes is a user-configurable value; its default is min_free_kbytes = 4 * sqrt(lowmem_kbytes). The low and high watermarks are then derived from min: low = 5/4 * min, high = 6/4 * min.
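As a worked example of those formulas (the real kernel also clamps and rounds the result; this is the idealized form from the text):

```python
import math

def watermarks_kb(lowmem_kbytes):
    """min = 4*sqrt(lowmem_kbytes), low = 5/4*min, high = 6/4*min."""
    wm_min = 4 * math.sqrt(lowmem_kbytes)
    return {"min": wm_min, "low": wm_min * 5 / 4, "high": wm_min * 6 / 4}

# With 4 GiB of lowmem: sqrt(4194304) = 2048, so min = 8192 kB.
wm = watermarks_kb(4 * 1024 * 1024)
print(wm)  # min = 8192 kB, low = 10240 kB, high = 12288 kB
```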

Dirty-page writeback

Dirty pages are written back to disk by sync mechanisms. They must not stay in memory too long: a sudden power failure would lose any dirty data not yet written to disk. On the other hand, hoarding many dirty pages and writing them all back at once creates a heavy burst of load.
So when are dirty pages written back? Writeback timing is controlled by both time and space:
Time:
dirty_expire_centisecs: the expiry (aging) time of a dirty page, in units of 1/100 s. The kernel's flusher threads write back dirty pages that have been resident in memory longer than dirty_expire_centisecs.
dirty_writeback_centisecs: the interval at which the flusher threads are periodically woken (wakeup_flusher_threads()); each wakeup checks whether any dirty pages have expired. If this value is set to 0, the flusher threads are never woken periodically.
Space:
dirty_ratio: when the dirty pages generated by a writing process reach this percentage of memory, the process writes back dirty pages itself.
dirty_background_ratio: when dirty pages exceed this percentage, the flusher threads begin background writeback.
So:
Even a single dirty page is written back once it expires, which prevents dirty pages from lingering in memory too long. The default dirty_expire_centisecs is 3000, i.e. 30 s. It can be set shorter, so that less data is lost on power failure, but disk writes then become more frequent.
Nor may dirty pages accumulate too far, or they put great pressure on disk I/O; for example, memory reclaim becomes time-consuming when it must first write dirty pages back.
Note that once dirty_background_ratio is reached, the flusher thread (named "[flush-devname]") starts writing back; but disks are slow, so if the application keeps writing faster than the flusher can drain, the process's dirty pages will reach dirty_ratio, at which point the process writes back dirty pages itself and its writes block. This is why dirty_background_ratio is normally set smaller than dirty_ratio.
Note also that dirty pages are always file-backed; anonymous pages are never dirty in this sense. The 'Dirty' line of /proc/meminfo shows how many dirty pages the system currently holds; they can be flushed with the sync command.
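The space thresholds above can be sketched as a toy decision function. The threshold values here are common defaults (10% background, 20% ratio) but are assumptions, not read from any system:

```python
# Assumed defaults; real values live in /proc/sys/vm/.
DIRTY_BACKGROUND_RATIO = 10
DIRTY_RATIO = 20

def writeback_action(dirty_percent):
    """Who writes back, given the current dirty-page percentage."""
    if dirty_percent >= DIRTY_RATIO:
        return "process writes back itself (its writes may block)"
    if dirty_percent >= DIRTY_BACKGROUND_RATIO:
        return "flusher thread writes back in the background"
    return "no space-triggered writeback (expiry timers may still fire)"

for pct in (5, 12, 25):
    print(f"{pct}% dirty -> {writeback_action(pct)}")
```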

=============================================================

zRAM mechanism

Even without a swap partition, the zRAM mechanism can relieve memory pressure. A region of memory (a compressed block) is set aside and used as swap space, simulating a hard-disk swap partition for exchanging anonymous pages; the physical memory size the kernel sees excludes this region. This swap space is transparently compressed: when an anonymous page is swapped into the zRAM area, Linux automatically compresses it before storing it. When the system accesses the page's content again, a page fault fetches it back from this swap area, and Linux transparently decompresses it for you.
The advantage of zRAM is that accessing memory is much faster than accessing a hard disk or flash, and there is no wear-life to worry about. Because the data is stored compressed, the region holds more than its raw size: although it occupies some memory, it can store more data than that, so it too achieves the effect of enlarging memory. The disadvantage is that compression costs CPU time.
zRAM is widely used on Android. Since zRAM spends CPU time, swapping should happen as rarely as possible; for Android, as for Windows, more RAM is better because it lowers the chance of swapping. With ample memory, switching between two apps (say Weibo and WeChat) stays smooth, because background processes need not be swapped out or killed by the OOM killer. Of course, a phone used only for calls has no need for large memory.
Excerpted from "The interaction between memory and I/O"
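Why transparent compression effectively enlarges memory can be illustrated with zlib, which zRAM's idea resembles (the in-kernel compressors are lzo/lz4/zstd; this page's contents are invented for the demo):

```python
import zlib

# One fabricated 4 KiB "anonymous page": partly repetitive, as real
# heap pages often are, so it compresses well.
page = b"A" * 2048 + bytes(range(256)) * 8
assert len(page) == 4096

compressed = zlib.compress(page)
print(f"4096 -> {len(compressed)} bytes "
      f"({len(compressed) / 4096:.0%} of original)")

# Swap-in is a transparent, lossless decompression.
assert zlib.decompress(compressed) == page
```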

=============================================================

Interpretation of Linux process memory consumption indicators

Memory consumption indicator

VSS (Virtual Set Size): virtual memory consumed, including memory occupied by shared libraries
RSS (Resident Set Size): physical memory actually used, including memory occupied by shared libraries
PSS (Proportional Set Size): physical memory actually used, with shared-library memory divided proportionally among the processes sharing it
USS (Unique Set Size): physical memory occupied by the process alone, excluding memory occupied by shared libraries
Excerpted from "Interpretation of Linux process memory consumption indicators"
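The relationship between RSS, PSS, and USS can be made concrete with a toy computation over per-page share counts (each entry says how many processes map that resident page; the numbers are invented):

```python
# Five resident 4 KiB pages of one process: two private (count 1),
# one shared by 2 processes, two shared by 4.
share_counts = [1, 1, 2, 4, 4]
PAGE_KB = 4

rss = len(share_counts) * PAGE_KB                       # every resident page
pss = sum(1 / n for n in share_counts) * PAGE_KB        # shared pages split n ways
uss = sum(1 for n in share_counts if n == 1) * PAGE_KB  # private pages only

print(f"RSS={rss} kB  PSS={pss} kB  USS={uss} kB")
# USS <= PSS <= RSS always holds.
```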

=============================================================

LRU algorithm

LRU is a page-replacement algorithm for memory management. The pages that are in memory but have gone the longest without being used are the LRU candidates; based on which data is least recently used, the operating system moves it out of memory to make room for other data.
What is the LRU algorithm? LRU is the abbreviation of Least Recently Used. It is commonly used in page-replacement algorithms, in service of virtual page-based memory management.
In operating-system memory management, how to make a small memory serve as many processes as possible has long been an important research direction. Virtual memory management is now the most versatile and successful approach: with limited physical memory, part of external storage is enlisted as an extension of memory, and real memory holds only the information in use at the moment. This greatly expands the effective memory and improves the machine's degree of concurrency. Virtual page-based management divides the space a process needs into pages, keeps only the currently needed pages in memory, and leaves the remaining pages in external storage.
There is a trade-off, however. Virtual paging expands the memory available to a process but lengthens its running time: as the process runs, some information in external storage must inevitably be exchanged with what is already in memory, and since external storage is slow, the time spent on this cannot be ignored. Adopting the best possible replacement algorithm to reduce the number of reads from external storage is therefore well worth the effort.
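A minimal LRU replacement sketch, using an ordered dictionary as the recency list (this illustrates the policy only; the kernel's actual implementation uses active/inactive page lists, not this structure):

```python
from collections import OrderedDict

class LRUPageTable:
    """Holds at most `capacity` pages; on overflow, evicts the
    least recently used page and returns it."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # ordered oldest -> newest

    def access(self, page):
        evicted = None
        if page in self.pages:
            self.pages.move_to_end(page)   # refresh recency
        else:
            if len(self.pages) >= self.capacity:
                evicted, _ = self.pages.popitem(last=False)  # evict LRU
            self.pages[page] = True
        return evicted

lru = LRUPageTable(3)
for p in ["A", "B", "C", "A", "D"]:  # D overflows; B is least recent
    victim = lru.access(p)
    if victim:
        print(f"accessing {p}: evicted {victim}")  # prints: accessing D: evicted B
```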

Origin: blog.csdn.net/baidu_38410526/article/details/104109077