Can the Cache in Linux memory really be reclaimed?

In Linux systems, we often use the free command to view the usage status of system memory. On a RHEL6 system, the output of the free command looks something like this:

[root@tencent64 ~]# free
             total       used       free     shared    buffers     cached
Mem:     132256952   72571772   59685180          0    1762632   53034704
-/+ buffers/cache:   17774436  114482516
Swap:      2101192        508    2100684

The default display unit here is kB, and my server has 128 GB of memory, so the numbers look relatively large. Almost everyone who has used Linux has run this command, yet the more ubiquitous a command is, the smaller the proportion of people who truly understand its output seems to be. In general, understanding of this command's output can be divided into the following levels:

  1. Doesn't understand it at all. Such a person's first reaction is: my God, I'm using a lot of memory, more than 70 GB, but I'm hardly running any big programs. Why is this happening? Linux is eating my memory!
  2. Thinks they understand it. Such people usually conclude after a quick self-assessment: well, according to my professional judgment, only about 17 GB of memory is really in use, and there is still plenty available. Buffers/cache occupies a lot, which shows that some processes have been reading and writing files, but it doesn't matter; that part of memory can be treated as free.
  3. Really understands it. This kind of person's reaction makes them seem like the ones who understand Linux the least. Their reaction is: free shows this, fine, I see. What? You ask me whether the memory is enough? Of course I don't know! How would I know how your programs are written?

Judging from the technical documents on the Internet, I believe the vast majority of people who know a little about Linux are at the second level. It is generally believed that the memory occupied by buffers and cached can be released as free space when memory pressure is high. But is that really so? Before discussing this topic, let's briefly introduce what buffers and cached mean:

What is buffer/cache?

Buffer and cache are two heavily overloaded terms in computing, with different meanings in different contexts. In Linux memory management, buffer refers to the buffer cache, and cache refers to the page cache. Historically, one of them (the buffer cache) was used as a write cache for I/O devices, and the other (the page cache) as a read cache, where the I/O devices are mainly block device files and regular files on a file system. Today their meanings have changed. In the current kernel, the page cache is, as its name implies, a cache of memory pages: any memory that is allocated and managed in units of pages can use the page cache as its cache. Of course, not all memory is managed in pages; some is managed in blocks, and when that memory needs caching, the buffer cache handles it. (From this point of view, wouldn't "block cache" be a better name for the buffer cache?) Blocks, however, do not all have the same length: the block size on a system is mainly determined by the block device in use, while the page size on x86 is 4 KB whether the system is 32-bit or 64-bit.

After understanding the difference between these two cache systems, you can understand what they can be used for.

What is the page cache?

The page cache is mainly used to cache file data on a file system, especially when a process performs read/write operations on files. Think about it: mmap, the system call that maps files into memory, should naturally use the page cache too, shouldn't it? In the current implementation, the page cache is also used to cache other file types, so in practice it is responsible for caching most block device files as well.

What is the buffer cache?

The buffer cache is mainly designed to cache data in units of blocks when the system reads and writes block devices. This means that certain operations on blocks, such as formatting a file system, are cached in the buffer cache. In general, the two cache systems work together. For example, when we write to a file, the contents of the page cache are changed, and the buffer cache can be used to mark which buffers within the page were modified. This way, when writing back dirty data, the kernel does not need to write back the whole page; it only writes back the modified parts.

How to reclaim cache?

The Linux kernel triggers memory reclaim when memory is about to be exhausted, in order to release memory for the processes that urgently need it. Under normal circumstances, most of the memory released by this operation comes from buffer/cache, especially when a lot of cache space is in use. Since the cache mainly exists to speed up file reads and writes while memory is plentiful, it naturally should be cleared and released under memory pressure so the space can serve as free memory for the processes that need it. So, in general, we consider buffer/cache space releasable, and this understanding is correct.

But clearing the cache is not free. From what the cache does, you can see that the kernel must ensure the data in the cache is consistent with the data in the corresponding file before the cache can be released. That is why system I/O usually spikes while the cache is being cleared: the kernel must write back any dirty (modified) cached data to the corresponding file on disk before that memory can be reclaimed.

Besides the kernel clearing the cache when system memory is about to be exhausted, we can also use the following file to trigger a cache-clearing operation manually:

[root@tencent64 ~]# cat /proc/sys/vm/drop_caches 
1

The method is:

echo 1 > /proc/sys/vm/drop_caches

Of course, the values that can be set in this file are 1, 2, and 3. Their meanings are:

echo 1 > /proc/sys/vm/drop_caches : clears the pagecache.

echo 2 > /proc/sys/vm/drop_caches : clears reclaimable slab objects (including the dentry and inode caches). The slab allocator is one of the kernel's memory management mechanisms, and many kernel caches are built on it.

echo 3 > /proc/sys/vm/drop_caches : clears both the pagecache and the slab cache objects.

Can all cache be reclaimed?

We have analyzed the cases where the cache can be reclaimed; is there cache that cannot be reclaimed? Of course there is. Let's look at the first case:

tmpfs

As you may know, Linux provides a "temporary" file system called tmpfs, which takes part of memory and uses it as a file system, so that memory space can be used as directories and files. Nowadays the vast majority of Linux systems have a tmpfs directory called /dev/shm, which is exactly such a thing. Of course, we can also create our own tmpfs by hand, like this:

[root@tencent64 ~]# mkdir /tmp/tmpfs
[root@tencent64 ~]# mount -t tmpfs -o size=20G none /tmp/tmpfs/

[root@tencent64 ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             10325000   3529604   6270916  37% /
/dev/sda3             20646064   9595940  10001360  49% /usr/local
/dev/mapper/vg-data  103212320  26244284  71725156  27% /data
tmpfs                 66128476  14709004  51419472  23% /dev/shm
none                  20971520         0  20971520   0% /tmp/tmpfs

So we have created a new 20 GB tmpfs, and we can create files of up to 20 GB inside /tmp/tmpfs. If the files we create actually occupy memory, which part of memory should the data occupy? Given what the page cache does, since tmpfs is a file system of sorts, it should naturally be managed in page cache space. Let's test whether that is so:

[root@tencent64 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:           126         36         89          0          1         19
-/+ buffers/cache:         15        111
Swap:            2          0          2
[root@tencent64 ~]# dd if=/dev/zero of=/tmp/tmpfs/testfile bs=1G count=13
13+0 records in
13+0 records out
13958643712 bytes (14 GB) copied, 9.49858 s, 1.5 GB/s
[root@tencent64 ~]# 
[root@tencent64 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:           126         49         76          0          1         32
-/+ buffers/cache:         15        110
Swap:            2          0          2

We created a 13 GB file in the tmpfs directory, and comparing the free output before and after shows that cached grew by 13 GB, which means the file really is held in memory and the kernel stores it in the cache. Now look at the metric we care about: the "-/+ buffers/cache" line. We find that in this situation free still tells us 110 GB of memory is available. But is there really that much? We can trigger memory reclaim manually and see how much memory can actually be reclaimed right now:

[root@tencent64 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@tencent64 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:           126         43         82          0          0         29
-/+ buffers/cache:         14        111
Swap:            2          0          2

As you can see, the space occupied by cached was not fully released as we might have imagined; 13 GB of it is still occupied by the file in /tmp/tmpfs. Of course, my system has other unreclaimable cache occupying the remaining 16 GB. So when is the cache space occupied by tmpfs released? When its files are deleted. If the files are not deleted, then no matter how badly memory is exhausted, the kernel will never automatically delete files in tmpfs for you to release the cache. Let's delete the file:

[root@tencent64 ~]# rm /tmp/tmpfs/testfile 
[root@tencent64 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:           126         30         95          0          0         16
-/+ buffers/cache:         14        111
Swap:            2          0          2

This is the first case we have analyzed in which the cache cannot be reclaimed. There are other cases, for example:

Shared memory

Shared memory is a common inter-process communication (IPC) mechanism the system provides us. But this communication mechanism cannot be requested and used from the shell, so we need a simple test program. (Due to the length limit of the WeChat public platform, please see the original post on my blog for the code.)

The program is simple: it requests a shared memory segment of a little under 2 GB, then forks a child process that initializes the segment; the parent waits for the child to finish initializing, prints the contents of the shared memory, and exits. But it does not delete the shared memory before exiting. Let's look at memory usage before and after this program runs:

[root@tencent64 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:           126         30         95          0          0         16
-/+ buffers/cache:         14        111
Swap:            2          0          2
[root@tencent64 ~]# ./shm 
shmid: 294918
shmsize: 2145386496
shmid: 294918
shmsize: -4194304
Hello!
[root@tencent64 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:           126         32         93          0          0         18
-/+ buffers/cache:         14        111
Swap:            2          0          2

Cached grew from 16 GB to 18 GB. So, can this cache be reclaimed? Let's keep testing:

[root@tencent64 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@tencent64 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:           126         32         93          0          0         18
-/+ buffers/cache:         14        111
Swap:            2          0          2

The result: it still cannot be reclaimed. You can observe that even when nobody is using it, this shared memory stays in the cache for a long time, until it is deleted. There are two ways to delete it: calling shmctl() with IPC_RMID from a program (note that shmdt() alone only detaches the segment), or using the ipcrm command. Let's try deleting it:

[root@tencent64 ~]# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x00005feb 0          root       666        12000      4                       
0x00005fe7 32769      root       666        524288     2                       
0x00005fe8 65538      root       666        2097152    2                       
0x00038c0e 131075     root       777        2072       1                       
0x00038c14 163844     root       777        5603392    0                       
0x00038c09 196613     root       777        221248     0                       
0x00000000 294918     root       600        2145386496 0                       

[root@tencent64 ~]# ipcrm -m 294918
[root@tencent64 ~]# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x00005feb 0          root       666        12000      4                       
0x00005fe7 32769      root       666        524288     2                       
0x00005fe8 65538      root       666        2097152    2                       
0x00038c0e 131075     root       777        2072       1                       
0x00038c14 163844     root       777        5603392    0                       
0x00038c09 196613     root       777        221248     0                       

[root@tencent64 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:           126         30         95          0          0         16
-/+ buffers/cache:         14        111
Swap:            2          0          2

After the shared memory was deleted, the cache was released normally. This behavior is similar to the tmpfs logic. When implementing the in-memory storage for the System V (XSI) IPC mechanisms of shared memory (shm), message queues (msg), and semaphore arrays (sem), the kernel uses tmpfs underneath. This is also why the behavior of shared memory looks so much like tmpfs. Of course, shm normally occupies the most memory, so we focus on shared memory here. Speaking of shared memory, Linux also provides us another way to share memory, namely:

mmap

mmap() is a very important system call, though you cannot tell that just from its functional description. Literally, mmap maps a file into a process's virtual address space, after which the file's contents can be manipulated by operating on memory. But in practice this call has a very wide range of uses. When malloc requests memory, the kernel serves small allocations via brk/sbrk and large ones via mmap. When an exec family function runs, since this essentially loads an executable file into memory for execution, the kernel naturally handles it with mmap as well. Here we consider only one question: when mmap is used to request shared memory, does it use the cache just like shmget()?

Again, we need a simple test program:

[root@tencent64 ~]# cat mmap.c 
#include <stdlib.h>
#include <stdio.h>
#include <strings.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define MEMSIZE 1024*1024*1023*2
#define MPFILE "./mmapfile"

int main()
{
    void *ptr;
    int fd;

    fd = open(MPFILE, O_RDWR);
    if (fd < 0) {
        perror("open()");
        exit(1);
    }

    /* Note: with MAP_ANON in the flags, the fd is actually ignored and
       the mapping is anonymous shared memory. */
    ptr = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANON, fd, 0);
    if (ptr == MAP_FAILED) {
        perror("mmap()");
        exit(1);
    }

    printf("%p\n", ptr);
    bzero(ptr, MEMSIZE);

    sleep(100);

    munmap(ptr, MEMSIZE);
    close(fd);

    exit(1);
}

This time we skip the parent/child approach altogether: a single process requests a 2 GB shared mapping with mmap, initializes the space, waits 100 seconds, and then unmaps it. So we need to check the system's memory usage during those 100 seconds of sleep to see which space it uses. Of course, before that we first create a 2 GB file, ./mmapfile. The results:

[root@tencent64 ~]# dd if=/dev/zero of=mmapfile bs=1G count=2
[root@tencent64 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@tencent64 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:           126         30         95          0          0         16
-/+ buffers/cache:         14        111
Swap:            2          0          2

Then run the test program:

[root@tencent64 ~]# ./mmap &
[1] 19157
0x7f1ae3635000
[root@tencent64 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:           126         32         93          0          0         18
-/+ buffers/cache:         14        111
Swap:            2          0          2

[root@tencent64 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@tencent64 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:           126         32         93          0          0         18
-/+ buffers/cache:         14        111
Swap:            2          0          2

We can see that while the program is running, cached stays at 18 GB, 2 GB more than before, and this cache still cannot be reclaimed. Then we wait 100 seconds for the program to finish.

[root@tencent64 ~]# 
[1]+  Exit 1                  ./mmap
[root@tencent64 ~]# 
[root@tencent64 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:           126         30         95          0          0         16
-/+ buffers/cache:         14        111
Swap:            2          0          2

After the program exits, the space occupied by cached is released. So we can see that for memory requested via mmap with the MAP_SHARED flag, the kernel also stores it in the cache, and this cache cannot be released normally before the process releases the memory. In fact, MAP_SHARED memory requested with mmap is also implemented by tmpfs in the kernel. From this we might also infer that, since the read-only parts of shared libraries are managed in memory via mmap mappings, they too occupy cache space that cannot simply be dropped.

Finally

Through three test cases, we have found that the cache in Linux system memory cannot be released as free space in every situation, and we have also made clear that even when the cache can be released, it is not without cost to the system. To summarize, we should remember the following points:

  1. When the cache is released as a file cache, I/O rises; this is the cost the cache pays for speeding up file access.
  2. Files stored in tmpfs occupy cache space, and this cache is not automatically released unless the files are deleted.
  3. Shared memory requested via shmget occupies cache space; unless the segment is removed with ipcrm or shmctl(IPC_RMID), the related cache space is not automatically released.
  4. Memory requested via mmap with the MAP_SHARED flag occupies cache space; unless the process munmaps this memory, the related cache space is not automatically released.
  5. In fact, both shmget and mmap shared memory are implemented through tmpfs in the kernel, and tmpfs's storage is entirely cache.

With this understanding, I hope everyone's grasp of the free command can reach the third level we described. We should realize that memory usage is not a simple concept, and the cache cannot really be treated as free space. To truly judge whether the memory on your system is being used reasonably, you need a clear understanding of many finer details and more detailed judgment about how the relevant applications are implemented. Our experiments here were done in a CentOS 6 environment; free's output may differ across Linux versions, and you can investigate the reasons for the differences yourself.

Of course, this article does not cover all the situations in which the cache cannot be released. So, in your application scenarios, what other scenarios are there where the cache cannot be released?

Reposted from http://blog.csdn.net/bingqingsuimeng/article/details/52084339
