Linux memory management: (1) The buddy system

1. The memory fragmentation problem and the buddy system algorithm

Memory fragmentation is a difficult problem in memory management. The Linux kernel takes care to reduce memory fragmentation in its implementation of the buddy system algorithm. In the buddy system, what kinds of memory blocks can become buddies? In fact, three basic conditions must all be met:

  1. Both memory blocks are the same size
  2. The addresses of the two memory blocks are consecutive
  3. Both memory blocks were split from the same larger memory block

The division of memory blocks is shown in the figure below:

[Figure: how memory blocks are split in the buddy system]

During the process of merging memory blocks, the following situations may occur:

[Figure: merging of memory blocks]

At this point, P0 and P3 become holes, which results in external fragmentation. As time goes by, external fragmentation becomes more and more severe, and memory utilization decreases.

A commonly used technique for solving external fragmentation is memory compaction, which moves pages around so that free pages are joined into one contiguous region. Judging by a page's migration attributes, memory allocated to and used by user processes can be migrated, but pages used by the kernel itself cannot be migrated casually.

Why can't pages used by the kernel itself be migrated?

Migrating such a page requires first breaking the mapping of the physical page and then re-establishing it. While the mapping is broken, if the kernel continues to access the page, it will dereference invalid pointers and memory, causing a kernel Oops or even a system crash. As a sensitive area, the kernel must ensure that the memory it uses is safe.

This is different for user processes. After the mapping of a page used by a user process is broken, if the process continues to access the page, a page fault exception occurs. In the page fault handler, a physical page can be reallocated and then mapped back into virtual memory. This process is safe for user processes.

During the development of Linux 2.6.24, community developers introduced a feature to prevent fragmentation, called anti-fragmentation. It is implemented using migration types, which are tracked per page block. The size of a page block is exactly the largest memory block the page allocator can allocate, that is, 2^(MAX_ORDER-1) pages, which is usually 4MB (with 4KB pages).

#define MAX_ORDER 11

The types of pages are as follows:

  • Non-migratable type MIGRATE_UNMOVABLE: these pages have a fixed location in memory and cannot be moved elsewhere. Memory used by the kernel itself falls into this category.
  • Movable type MIGRATE_MOVABLE: pages that can be moved freely, typically pages belonging to user applications.
  • Reclaimable type MIGRATE_RECLAIMABLE: these pages cannot be moved directly, but they can be reclaimed, and their contents can be re-read or rebuilt later. The most typical example is objects allocated by the slab mechanism.

Therefore, the free_area data structure in the buddy system contains MIGRATE_TYPES free lists. A memory management zone (zone) has MAX_ORDER free_area entries, indexed by order from 0 to MAX_ORDER-1, and each free_area holds one linked list per MIGRATE_TYPES type, as shown in the figure below:

[Figure: the free_area[] array and per-migration-type free lists in a zone]

// the zone structure is accessed frequently, so it is required to be
// aligned to the L1 cache
struct zone {
	...
	// array of free areas, one per order, each containing the free lists
	struct free_area	free_area[MAX_ORDER];
	...
}

struct free_area {
	struct list_head	free_list[MIGRATE_TYPES];
	unsigned long		nr_free;
};

enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_MOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
	MIGRATE_TYPES
};

In a Linux kernel using this technique, all pages within a page block share the same migration type, with no pages of other types mixed in.

Finally, here is the definition of the buddy system: the buddy system is a dynamic storage management method commonly used in operating systems. When a user makes a request, a memory block of appropriate size is allocated, and when the user releases the block, it is reclaimed. In the buddy system, the size of a memory block is 2^order pages. The maximum order in the Linux kernel is represented by MAX_ORDER, usually 11; that is, all free pages are grouped into 11 free lists, holding blocks of 1, 2, 4, 8, 16, 32, ..., 1024 contiguous pages respectively. 1024 pages correspond to 4MB of contiguous physical memory (with 4KB pages).

2. Physical memory initialization in the buddy system

When the kernel starts, it knows the size of the DDR physical memory and computes the starting address of high memory and the memory layout of kernel space. Physical memory pages are then added to the buddy system. So how are physical pages added to the buddy system?

When the kernel initializes, all pages are initially marked as the MIGRATE_MOVABLE type; see the free_area_init_core()→memmap_init_zone() function.

void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
		unsigned long start_pfn, enum memmap_context context,
		struct vmem_altmap *altmap)
{
	...
	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		...
		// set the MIGRATE_TYPES type of the page block; this ends up calling
		// set_pfnblock_flags_mask() to record the migration type
		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
		...
	}
}

Question:

In Linux, when the kernel is initialized, all pages are initially marked as the MIGRATE_MOVABLE type. So when are pages set to their proper type?

ChatGPT answer:

In Linux, all pages are marked as MIGRATE_MOVABLE when the kernel is initialized, because during the initialization phase the kernel does not yet know the specific use and attributes of each page. The concrete type of a page is determined dynamically at runtime and can be adjusted based on the state of the system.

In the Linux kernel, the type of page may change, and the specific timing and conditions depend on different situations. Here are some common page type changes:

  1. When allocating memory: when the kernel dynamically allocates memory through functions such as kmalloc, vmalloc, and so on, the kernel sets the page type based on the context and requirements of the allocation.

  2. Page migration: In situations of tight memory, the Linux kernel may perform page migration, moving pages from one type to another to meet specific memory needs.

  3. Memory Reclamation: When the system needs to free memory, the kernel may change the type of the page to the appropriate type to facilitate recycling or reallocation.

  4. Memory hot-plug: On systems that support memory hot-plug, the kernel may re-evaluate and update the type of pages when memory is added or removed.

Overall, the setting of page types is a dynamic process that depends on the operating status of the system and the management strategy of the kernel. This dynamic nature allows the kernel to flexibly manage memory as needed.

Next, let's consider how physical pages are added to the buddy system. Should pages be added one by one, or 2^n pages at a time?

In the free_low_memory_core_early() function, all memory blocks are traversed through the for_each_free_mem_range() function to find the start address and end address of the memory block.

<start_kernel()->mm_init()->mem_init()->free_all_bootmem()->free_low_memory_core_early()>

static unsigned long __init free_low_memory_core_early(void)
{
	...
	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
				NULL)
		count += __free_memory_core(start, end);
	...
}

Pass the memory block to the __free_pages_memory() function, which is defined as follows:

static void __init __free_pages_memory(unsigned long start, unsigned long end)
{
	int order;

	while (start < end) {
		// take the smaller of MAX_ORDER-1 and __ffs(start)
		// __ffs(start) computes the index of the lowest set bit of start
		// (note that __ffs(x) = ffs(x) - 1)
		// e.g. if start is 0x63300, the address is aligned to 0x100, so
		// __ffs() yields a suitable order of 8, since 2^8 = 0x100
		order = min(MAX_ORDER - 1UL, __ffs(start));

		while (start + (1UL << order) > end)
			order--;

		// -> __free_pages_boot_core()
		memblock_free_pages(pfn_to_page(start), start, order);

		start += (1UL << order);
	}
}

// once the order is known, the memory block can be added to the buddy
// system via the __free_pages_boot_core() function
static void __init __free_pages_boot_core(struct page *page, unsigned int order)
{
	...
	// core function of the buddy system: add the block of 2^order pages
	__free_pages(page, order);
}

3. Prerequisite knowledge: the relationship between zonelists

Since the kernel uses zones to manage memory within a node, a memory node may be divided into several different zones. The kernel uses the zonelist data structure to manage the zones of a memory node.

// zonelist is a list of all usable zones; the first zone is the page
// allocator's favorite and first choice, the others are fallbacks
struct zonelist {
	// each zoneref describes one zone
	struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
};

Related data structures:

// each zoneref describes one zone
struct zoneref {
	// the zone member points to the actual zone
	struct zone *zone;	/* Pointer to actual zone */
	// zone_idx is an index, obtained with the zone_idx() function
	// typically:
	// 0 is the lowest zone, e.g. ZONE_DMA32, for devices that can only
	//   access the lowest 4GB, such as 32-bit-only DMA devices
	// 1 is ZONE_NORMAL, physical memory beyond 4GB, linearly mapped into
	//   the kernel; if the system has less than 4GB of memory, this zone
	//   does not exist
	// 2 is ZONE_HIGHMEM, for high memory that cannot be linearly mapped
	//   into the kernel address space; note that 64-bit Linux does not
	//   have this zone
	int zone_idx;		/* zone_idx(zoneref->zone) */
};

// data structure describing a memory node
typedef struct pglist_data {
	...
	// there are two zonelists:
	// ZONELIST_FALLBACK points to the local zones, i.e. it includes the
	// fallback zones; ZONELIST_NOFALLBACK is used on NUMA systems and
	// points to the zones of remote memory nodes
	struct zonelist node_zonelists[MAX_ZONELISTS];
	...
}

We assume that there is only one memory node in the system and two zones, namely ZONE_DMA32 and ZONE_NORMAL. Then the relationship between zone type, _zonerefs[] array and zone_idx in zonelist is as shown in the figure below:

[Figure: relationship between zone types, the _zonerefs[] array, and zone_idx in the zonelist]

In this case, when allocating physical memory, the ZONE_NORMAL zone is tried first; that is, the zone with the larger zone_idx value is preferred.

4. Quick path to page allocation

The buddy system is the basic memory allocation system in the Linux kernel. The concept of the buddy system is not hard to understand, but allocating physical pages has always been the most complex part of memory management, involving intricate mechanisms such as page reclaim, memory compaction, and direct reclaim. The page allocator is the most basic allocator in Linux kernel memory management, built on the buddy system algorithm and the zone-based design. This section focuses on how contiguous physical memory is allocated when memory is sufficient.

The quick process of allocating memory is shown in the figure below:

[Figure: fast path of memory allocation]

The flow of the get_page_from_freelist() function is shown in the figure below:

[Figure: flow of the get_page_from_freelist() function]

To give readers a concrete picture, the process is explained below around the source code, following the flow charts:

alloc_pages->...->__alloc_pages_nodemask

struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
							nodemask_t *nodemask)
{
	struct page *page;
	// ALLOC_WMARK_LOW: allocation is allowed down to the low watermark
	unsigned int alloc_flags = ALLOC_WMARK_LOW;
	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
	// alloc_context holds the parameters used by the buddy allocator
	struct alloc_context ac = { };

	// the largest block the buddy system can allocate is 2^(MAX_ORDER-1) pages
	if (unlikely(order >= MAX_ORDER)) {
		WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
		return NULL;
	}

	gfp_mask &= gfp_allowed_mask;
	alloc_mask = gfp_mask;
	// prepare_alloc_pages() computes the relevant parameters and stores
	// them in the alloc_context structure
	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
		return NULL;

	// determine the preferred zone
	finalise_ac(gfp_mask, &ac);

	// an optimization against memory fragmentation
	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask);

	// try to allocate from the buddy system's free lists;
	// on success, return the page structure of the block's first page;
	// on failure, fall into the slow path, __alloc_pages_slowpath()
	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
	if (likely(page))
		goto out;
	...
	page = __alloc_pages_slowpath(alloc_mask, order, &ac);

out:
	...
	return page;
}

alloc_pages->...->__alloc_pages_nodemask->prepare_alloc_pages

// initializes the parameters used by the page allocator; they are stored
// in the alloc_context structure
static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
		int preferred_nid, nodemask_t *nodemask,
		struct alloc_context *ac, gfp_t *alloc_mask,
		unsigned int *alloc_flags)
{
	// gfp_zone() derives the highest usable zone index from the gfp mask
	ac->high_zoneidx = gfp_zone(gfp_mask);
	// node_zonelist() returns the zonelist of the preferred node preferred_nid;
	// a node usually has two zonelists: ZONELIST_FALLBACK for the local node
	// and ZONELIST_NOFALLBACK for remote nodes
	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
	ac->nodemask = nodemask;
	// derive the migration type from the gfp mask
	ac->migratetype = gfpflags_to_migratetype(gfp_mask);

	...

	// fault-injection hook
	if (should_fail_alloc_page(gfp_mask, order))
		return false;

	if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
		*alloc_flags |= ALLOC_CMA;

	return true;
}

alloc_pages->...->__alloc_pages_nodemask->get_page_from_freelist

// try to allocate physical pages from the buddy system's free lists
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
						const struct alloc_context *ac)
{
	struct zoneref *z;
	struct zone *zone;
	struct pglist_data *last_pgdat_dirty_limit = NULL;
	bool no_fallback;

retry:
	// the ALLOC_NOFRAGMENT flag asks the allocator to avoid fragmenting memory
	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
	// preferred_zoneref is the preferred (recommended) zone in the zonelist
	z = ac->preferred_zoneref;
	// walk all zones, starting from the preferred one
	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
								ac->nodemask) {
		...

		// special case on NUMA systems: when the zone is not on the local
		// node (i.e. it is remote), what matters is not fragmentation but
		// memory locality, because accessing the local node is much faster
		// than accessing a remote one
		if (no_fallback && nr_online_nodes > 1 &&
		    zone != ac->preferred_zoneref->zone) {
			...
		}

		// wmark_pages() computes the page count of a given watermark in the zone
		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
		// zone_watermark_fast() checks whether the zone's free pages satisfy
		// WMARK_LOW, and whether a multi-page request (order > 0) can be met
		if (!zone_watermark_fast(zone, order, mark,
				       ac_classzone_idx(ac), alloc_flags)) {
			// the current zone cannot satisfy the allocation request
			int ret;

			...

			// if node_reclaim_mode is 0, move on to the next zone or
			// memory node; otherwise, try reclaiming some memory in
			// this zone
			if (node_reclaim_mode == 0 ||
			    !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
				continue;

			// node_reclaim() tries to reclaim some memory
			ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
			...
		}

// about to allocate from this zone
try_this_zone:
		// rmqueue() is the core allocation function of the buddy system.
		// If no block of the requested order is free, memory is "cut" from
		// a larger block, like slicing a cake: e.g. if order 5 is requested
		// but the order-5 free list is empty while order 6 has a block,
		// the order-6 block is removed, one half is allocated, and the
		// remaining half is added to the order-5 free list
		page = rmqueue(ac->preferred_zoneref->zone, zone, order,
				gfp_mask, alloc_flags, ac->migratetype);
		if (page) {
			// after a successful allocation, set some page attributes,
			// perform the necessary checks, and return the page structure
			// of the allocated block
			prep_new_page(page, order, gfp_mask, alloc_flags);

			...

			return page;
		} else {
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
			/* Try again if zone has deferred pages */
			if (static_branch_unlikely(&deferred_pages)) {
				if (_deferred_grow_zone(zone, order))
					goto try_this_zone;
			}
#endif
		}
	}

	// if no zone could satisfy the request after walking all of them,
	// external fragmentation may be the cause; retry once without the
	// no-fragmentation constraint
	if (no_fallback) {
		alloc_flags &= ~ALLOC_NOFRAGMENT;
		goto retry;
	}

	return NULL;
}

Comments on the alloc_context data structure:

struct alloc_context {
	// points to the zonelist of the target memory node
	struct zonelist *zonelist;
	// mask of allowed memory nodes
	nodemask_t *nodemask;
	// zoneref of the preferred zone
	struct zoneref *preferred_zoneref;
	// migration type
	int migratetype;
	// highest zone index this allocation mask allows allocating from,
	// computed from the gfp mask
	enum zone_type high_zoneidx;
	// whether to spread dirty pages
	bool spread_dirty_pages;
};


Origin blog.csdn.net/qq_58538265/article/details/135170144