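1. Introduction
Although x86_64 virtual addresses are 64 bits wide, that address space is far larger than anything needed today, so current CPUs implement only part of it. Two virtual address modes are supported: 48-bit and 57-bit.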
Address mode | Size of each half | User address space | Kernel address space |
---|---|---|---|
32-bit (2G/2G split; Linux defaults to 3G/1G) | 2G | 0x00000000 - 0x7FFFFFFF | 0x80000000 - 0xFFFFFFFF |
64-bit (48-bit) | 128T | 0x00000000 00000000 - 0x00007FFF FFFFFFFF | 0xFFFF8000 00000000 - 0xFFFFFFFF FFFFFFFF |
64-bit (57-bit) | 64P | 0x00000000 00000000 - 0x00FFFFFF FFFFFFFF | 0xFF000000 00000000 - 0xFFFFFFFF FFFFFFFF |
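This article focuses on the kernel address space. The kernel document Documentation/x86/x86_64/mm.txt describes its layout in detail:
- Address space layout in 48-bit mode (4-level page tables)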
Start addr | Offset | End addr | Size | VM area description | Notes |
---|---|---|---|---|---|
0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm | User address space; each process's mm points to a different one |
0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | … huge, almost 64 bits wide hole of non-canonical virtual memory addresses up to the -128 TB starting offset of kernel mappings. | Huge hole |
- | - | - | - | Kernel-space virtual memory, shared between all processes: | The kernel address space starts here |
ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | … guard hole, also reserved for hypervisor | - |
ffff880000000000 | -120 TB | ffff887fffffffff | 0.5 TB | LDT remap for PTI | LDT: Local Descriptor Table; KPTI: kernel page-table isolation |
ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base) | Linear (direct) mapping region |
ffffc88000000000 | -55.5 TB | ffffc8ffffffffff | 0.5 TB | … unused hole | - |
ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base) | vmalloc and ioremap space |
ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | … unused hole | - |
ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base) | Where the struct page array is stored |
ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | … unused hole | - |
ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory | KASAN shadow memory |
- | - | - | - | Identical layout to the 56-bit one from here on: | Same as the 56-bit layout from here on |
fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | … unused hole | - |
- | - | - | - | vaddr_end for KASLR | - |
fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping | - |
fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | … unused hole | - |
ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks | - |
ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | … unused hole | - |
ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space | - |
ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | … unused hole | - |
ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0 | Kernel code region |
ffffffff80000000 | -2048 MB | - | - | - | - |
ffffffffa0000000 | -1536 MB | fffffffffeffffff | 1520 MB | module mapping space | Module loading region |
ffffffffff000000 | -16 MB | - | - | - | - |
FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset | - |
ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI | - |
ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | … unused hole | - |
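Notes on the key regions:
direct mapping
: The direct mapping covers all memory in the system up to the highest memory address (which means that in some cases it can also include PCI memory).
vmalloc space
: The vmalloc space is also populated lazily. It is synchronized into the page tables on demand by the page-fault handler, using `init_top_pgt` as the reference.
EFI region
: The EFI runtime services are mapped into the `efi_pgd` PGD, in a 64 GB virtual memory window (this size is arbitrary and can be raised later if needed). The mappings are not part of any other kernel PGD and are only available during EFI runtime calls.
KASLR
: Note that if `CONFIG_RANDOMIZE_MEMORY` is enabled, the direct mapping of all physical memory, the vmalloc/ioremap space and the virtual memory map are randomized. Their order is preserved, but their base is offset at boot time. Be very careful with KASLR when changing anything here. The KASLR address range must not overlap with any other region except the KASAN shadow area, which is safe because KASAN disables KASLR.
- Address space layout in 57-bit mode (5-level page tables)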
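2. Kernel page table initialization
2.0 Decompress stage
2.1 head_64.S and head64.c
early_top_pgt
Before jumping to start_kernel(), the kernel runs the code in `head_64.S` and `head64.c`. During this stage it uses a temporary page table, `early_top_pgt`, to translate virtual addresses to physical addresses: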
linux-source-4.15.0\arch\x86\kernel\head_64.S:
NEXT_PGD_PAGE(early_top_pgt) /* ------- PGD(L4) ------- */
.fill 511,8,0 // pgd entries 0-510: zero
#ifdef CONFIG_X86_5LEVEL // pgd entry 511: kernel image
.quad level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
#else
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
#endif
.fill PTI_USER_PGD_FILL,8,0 // PTI-related pgd entries: zero
#if defined(CONFIG_XEN_PV) || defined(CONFIG_XEN_PVH) /* index calculations */
PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE) // pgd index of the direct mapping
PGD_START_KERNEL = pgd_index(__START_KERNEL_map) // pgd index of the kernel image
#endif
L3_START_KERNEL = pud_index(__START_KERNEL_map) // pud index of the kernel image
#ifdef CONFIG_X86_5LEVEL
#define __PAGE_OFFSET_BASE _AC(0xff11000000000000, UL)
#else
#define __PAGE_OFFSET_BASE _AC(0xffff888000000000, UL) // virtual base of the direct (linear) mapping when KASLR is off
#endif
#define __START_KERNEL_map _AC(0xffffffff80000000, UL) // virtual base of the kernel image when KASLR is off
init_top_pgt
The initial contents of `init_top_pgt` are also defined in `head_64.S`:
NEXT_PGD_PAGE(init_top_pgt) /* ------- PGD(L4) ------- */
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC // 0 pgd entry: identity mapping
.org init_top_pgt + PGD_PAGE_OFFSET*8, 0
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC // x pgd entry: direct mapping
.org init_top_pgt + PGD_START_KERNEL*8, 0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC // 511 pgd entry: kernel image
.fill PTI_USER_PGD_FILL,8,0
NEXT_PAGE(level3_ident_pgt) /* ------- PUD(L3): identity mapping/direct mapping ------- */
.quad level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
.fill 511, 8, 0
NEXT_PAGE(level2_ident_pgt)
/*
* Since I easily can, map the first 1G.
* Don't set NX because code runs from these pages.
*
* Note: This sets _PAGE_GLOBAL despite whether
* the CPU supports it or it is enabled. But,
* the CPU should ignore the bit.
*/
PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD) /* ------- PMD(L2): identity mapping/direct mapping ------- */
// each PMD huge page is 2M; one full page of PMD entries (512) maps 1G in total
NEXT_PAGE(level3_kernel_pgt) /* ------- PUD(L3): kernel ------- */
.fill L3_START_KERNEL,8,0 // 0 - x pud entry: 0
/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
.quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC// 510 pud entry: kernel image
.quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC // 511 pud entry: fixmap
NEXT_PAGE(level2_kernel_pgt)
/*
* 512 MB kernel mapping. We spend a full page on this pagetable
* anyway.
*
* The kernel code+data+bss must not be bigger than that.
*
* (NOTE: at +512MB starts the module area, see MODULES_VADDR.
* If you want to increase this then increase MODULES_VADDR
* too.)
*
* This table is eventually used by the kernel during normal
* runtime. Care must be taken to clear out undesired bits
* later, like _PAGE_RW or _PAGE_GLOBAL in some cases.
*/
PMDS(0, __PAGE_KERNEL_LARGE_EXEC, /* ------- PMD(L2): kernel image ------- */
KERNEL_IMAGE_SIZE/PMD_SIZE) // each PMD huge page is 2M; 512M in total
NEXT_PAGE(level2_fixmap_pgt) /* ------- PMD(L2): fixmap ------- */
.fill (512 - 4 - FIXMAP_PMD_NUM),8,0
pgtno = 0
.rept (FIXMAP_PMD_NUM)
.quad level1_fixmap_pgt + (pgtno << PAGE_SHIFT) - __START_KERNEL_map \
+ _PAGE_TABLE_NOENC;
pgtno = pgtno + 1
.endr
/* 6 MB reserved space + a 2MB hole */
.fill 4,8,0
NEXT_PAGE(level1_fixmap_pgt) /* ------- PTE(L1): fixmap ------- */
.rept (FIXMAP_PMD_NUM)
.fill 512,8,0
.endr
#define KERNEL_IMAGE_SIZE (512 * 1024 * 1024) // size of the kernel image region
/* Automate the creation of 1 to 1 mapping pmd entries */ // definition of the PMDS() macro
#define PMDS(START, PERM, COUNT) \
i = 0 ; \
.rept (COUNT) ; \
.quad (START) + (i << PMD_SHIFT) + (PERM) ; \
i = i + 1 ; \
.endr
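A page table entry must hold a physical address, so `__START_KERNEL_map` is subtracted from the symbol's virtual address when the entry is built. Take this entry as an example:
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
where:
level3_ident_pgt - __START_KERNEL_map // physical address of the next-level page table
_KERNPG_TABLE_NOENC // attributes of this entry
Definitions of the entry attributes: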
#define _PAGE_TABLE_NOENC (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |\
_PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE_NOENC (_PAGE_PRESENT | _PAGE_RW | \
_PAGE_ACCESSED | _PAGE_DIRTY)
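In its initial state, `init_top_pgt` describes a page table that mainly maps four regions:
region | size | description |
---|---|---|
identity mapping | 1G | Virtual address equals physical address |
direct mapping | 1G | Linear mapping space; virtual addresses start at PAGE_OFFSET |
kernel image | 512M | Kernel image mapping space |
fixmap | - | Fixed mapping space |
Before jumping to start_kernel(), however, the kernel rebuilds `init_top_pgt`: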
linux-source-4.15.0\arch\x86\kernel\head64.c:
Ljump_to_C_code → initial_code → x86_64_start_kernel() → x86_64_start_reservations() → start_kernel()
asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
{
/* (1) Clear everything in init_top_pgt */
clear_page(init_top_pgt);
/* set init_top_pgt kernel high mapping*/
/* (2) Keep only the topmost mapping, the kernel mapping (kernel image & fixmap).
The low identity mapping and direct mapping are dropped and must be rebuilt in start_kernel()
*/
init_top_pgt[511] = early_top_pgt[511];
}
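2.2 start_kernel()
After the kernel jumps to start_kernel(), it eventually switches to the final page table `init_top_pgt`, also known as `swapper_pg_dir`. This is the kernel-space page table used at runtime: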
#define swapper_pg_dir init_top_pgt
struct mm_struct init_mm = {
.mm_rb = RB_ROOT,
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
.mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem),
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
.user_ns = &init_user_ns,
INIT_MM_CONTEXT(init_mm)
};
The switch to `init_top_pgt` happens in start_kernel() → setup_arch() → init_mem_mapping():
void __init init_mem_mapping(void)
{
load_cr3(swapper_pg_dir);
__flush_tlb_all();
}
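Before that point, start_kernel() has to build the final kernel mappings in `init_top_pgt`.
2.2.1 Physical memory (e820)
First the kernel needs the physical address layout of the system. On x86 this layout is called the e820 table.
At boot time the kernel queries the machine's memory through BIOS interrupt 0x15. Three functions are available: 88H (detects at most 64 MB of memory), E801H (returns the memory size) and E820H (returns the memory map). This memory map is called the E820 map.
start_kernel() → setup_arch() → e820__memory_setup():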
/*
* Calls e820__memory_setup_default() in essence to pick up the firmware/bootloader
* E820 map - with an optional platform quirk available for virtual platforms
* to override this method of boot environment processing:
*/
void __init e820__memory_setup(void)
{
char *who;
/* This is a firmware interface ABI - make sure we don't break it: */
BUILD_BUG_ON(sizeof(struct boot_e820_entry) != 20);
/* (1) Call e820__memory_setup_default() to read the e820 table passed in by the boot code */
who = x86_init.resources.memory_setup();
/* (2) Keep backup copies of the e820 table */
memcpy(e820_table_kexec, e820_table, sizeof(*e820_table_kexec));
memcpy(e820_table_firmware, e820_table, sizeof(*e820_table_firmware));
/* (3) Print the e820 table */
pr_info("e820: BIOS-provided physical RAM map:\n");
e820__print_table(who);
}
/* (3.1) Entry types that can appear in the e820 table */
static void __init e820_print_type(enum e820_type type)
{
switch (type) {
case E820_TYPE_RAM: /* Fall through: */
case E820_TYPE_RESERVED_KERN: pr_cont("usable"); break;
case E820_TYPE_RESERVED: pr_cont("reserved"); break;
case E820_TYPE_ACPI: pr_cont("ACPI data"); break;
case E820_TYPE_NVS: pr_cont("ACPI NVS"); break;
case E820_TYPE_UNUSABLE: pr_cont("unusable"); break;
case E820_TYPE_PMEM: /* Fall through: */
case E820_TYPE_PRAM: pr_cont("persistent (type %u)", type); break;
default: pr_cont("type %u", type); break;
}
}
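As shown above, the e820 table is simply an array that records the RAM, ACPI and reserved regions of the physical address space. Here is a sample e820 dump (Ubuntu 18.04, 4 GB of RAM in total):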
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009e7ff] usable // RAM below 1M
[ 0.000000] BIOS-e820: [mem 0x000000000009e800-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000dc000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bfecffff] usable // RAM, about 3G
[ 0.000000] BIOS-e820: [mem 0x00000000bfed0000-0x00000000bfefefff] ACPI data
[ 0.000000] BIOS-e820: [mem 0x00000000bfeff000-0x00000000bfefffff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x00000000bff00000-0x00000000bfffffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000f0000000-0x00000000f7ffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffe0000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000013fffffff] usable // RAM, 1G
Once it has the e820 table, the kernel first adjusts it to its own needs and finally uses it to initialize memblock:
start_kernel() → setup_arch():
void __init setup_arch(char **cmdline_p)
{
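/* (1) Obtain the e820 table */
e820__memory_setup();
/* (2) Reserve the setup data in the e820 table */
e820__reserve_setup_data();
/* (3) Apply any e820 customizations the user passed through early params */
e820__finish_early_params();
/* (4) Check and ensure that the kernel image range is RAM in the e820 table */
e820_add_kernel_range();
/* (5) Mark page 0 as reserved in the e820 table
and remove the BIOS area (640K-1M) in case it was reported as RAM
*/
trim_bios_range();
/* (6) Add all RAM entries of the e820 table to memblock */
e820__memblock_setup();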
/* preallocate 4k for mptable mpc */
e820__memblock_alloc_reserved_mpc_new();
e820__reserve_resources();
e820__register_nosave_regions(max_pfn);
e820__setup_pci_gap();
}
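2.2.2 Early memory allocator (memblock/bootmem)
During boot, before the buddy system is operational, the Linux kernel needs a temporary allocator to satisfy memory allocations. The earliest such allocator was `bootmem`; today `memblock` is the one in common use.
At its core `memblock` is also a set of region arrays, the most important being the `memory` and `reserved` arrays: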
struct memblock memblock __initdata_memblock = {
.memory.regions = memblock_memory_init_regions,
.memory.cnt = 1, /* empty dummy entry */
.memory.max = INIT_MEMBLOCK_REGIONS,
.memory.name = "memory",
.reserved.regions = memblock_reserved_init_regions,
.reserved.cnt = 1, /* empty dummy entry */
.reserved.max = INIT_MEMBLOCK_REGIONS,
.reserved.name = "reserved",
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
.physmem.regions = memblock_physmem_init_regions,
.physmem.cnt = 1, /* empty dummy entry */
.physmem.max = INIT_PHYSMEM_REGIONS,
.physmem.name = "physmem",
#endif
.bottom_up = false,
.current_limit = MEMBLOCK_ALLOC_ANYWHERE,
};
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
static struct memblock_region memblock_physmem_init_regions[INIT_PHYSMEM_REGIONS] __initdata_memblock;
#endif
#define INIT_MEMBLOCK_REGIONS 128
#define INIT_PHYSMEM_REGIONS 4
/* Definition of memblock flags. */
enum {
MEMBLOCK_NONE = 0x0, /* No special request */
MEMBLOCK_HOTPLUG = 0x1, /* hotpluggable region */
MEMBLOCK_MIRROR = 0x2, /* mirrored region */
MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
};
struct memblock_region {
phys_addr_t base; // start address
phys_addr_t size; // size
unsigned long flags; // region flags
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
#endif
};
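- 1. The `memblock` tables are statically initialized at compile time. During setup_arch() a number of `reserved` entries are created, and the RAM entries of the e820 table are converted into `memblock` `memory` entries:
start_kernel() → setup_arch():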
void __init setup_arch(char **cmdline_p)
{
/* (1.1) Add the physical range of the kernel text, data and bss to memblock's reserved regions so it is never handed out by the allocator */
memblock_reserve(__pa_symbol(_text),
(unsigned long)__bss_stop - (unsigned long)_text);
/* (1.2) Add physical page 0 to the reserved regions */
memblock_reserve(0, PAGE_SIZE);
/* (1.3) Reserve the physical memory of the ramdisk image in memblock */
early_reserve_initrd();
/* (1.4) Reserve the physical memory of the EFI regions in memblock */
if (efi_enabled(EFI_BOOT))
efi_memblock_x86_reserve_range();
/* after early param, so could get panic from serial */
/* (1.5) Reserve the physical memory of the setup_data regions in memblock */
memblock_x86_reserve_range_setup_data();
/*
* Define random base addresses for memory sections after max_pfn is
* defined and before each memory section base is used.
*/
/* (7) If RANDOMIZE_MEMORY is configured, randomize the bases page_offset_base, vmalloc_base and vmemmap_base */
kernel_randomize_memory();
/* (1.6) Reserve the physical memory of the ibft region in memblock */
reserve_ibft_region();
/*
* Need to conclude brk, before e820__memblock_setup()
* it could use memblock_find_in_range, could overlap with
* brk area.
*/
/* (1.7) Reserve the physical memory of the brk area in memblock */
reserve_brk();
/* (2.1) Limit the highest physical address memblock may currently allocate to ISA_END_ADDRESS, i.e. 1M (0x00100000) */
memblock_set_current_limit(ISA_END_ADDRESS);
/* (2.2) Add every usable RAM entry of the e820 table to memblock as a memory entry */
e820__memblock_setup();
/* (1.8) Reserve the physical memory of the BIOS regions in memblock */
reserve_bios_regions();
if (efi_enabled(EFI_MEMMAP)) {
efi_fake_memmap();
efi_find_mirror();
efi_esrt_init();
/*
* The EFI specification says that boot service code won't be
* called after ExitBootServices(). This is, in fact, a lie.
*/
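/* (1.9) Reserve the physical memory of the EFI boot services in memblock */
efi_reserve_boot_services();
}
/* preallocate 4k for mptable mpc */
/* (1.10) Pre-allocate 4k of physical memory from memblock and add it to the e820 reserved regions */
e820__memblock_alloc_reserved_mpc_new();
/* (1.11) Reserve the physical memory of the real mode area in memblock */
reserve_real_mode();
/* (1.12) Reserve the physical memory of the low memory ranges in memblock */
trim_platform_memory_ranges();
trim_low_memory_range();
/* (3.1) Create the page table mappings for memory */
init_mem_mapping();
/* (3.2) Raise the memblock allocation limit to the highest mapped address.
From now on memblock can allocate memory, and because the linear mapping is in place, a virtual address is available for it
*/
memblock_set_current_limit(get_max_mapped());
/* Allocate bigger log buffer */
/* (4.1) Allocate the log_buf memory from memblock */
setup_log_buf(1);
reserve_initrd();
/* (1.13) Reserve the physical memory of the crash kernel area in memblock */
reserve_crashkernel();
/* (5) Create the vmemmap region and the buddy zone structures */
x86_init.paging.pagetable_init();
/* (6) KASAN initialization */
kasan_init();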
}
↓
void __init e820__memblock_setup(void)
{
int i;
u64 end;
memblock_allow_resize();
/* (2.2.1) Walk the e820 table entry by entry */
for (i = 0; i < e820_table->nr_entries; i++) {
struct e820_entry *entry = &e820_table->entries[i];
end = entry->addr + entry->size;
if (end != (resource_size_t)end)
continue;
/* (2.2.2) Only RAM entries are used */
if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
continue;
/* (2.2.3) Add the entry to memblock's memory regions */
memblock_add(entry->addr, entry->size);
}
/* Throw away partial pages: */
memblock_trim_memory(PAGE_SIZE);
memblock_dump_all();
}
Here is a sample dump of the memblock tables (Ubuntu 18.04, 4 GB of RAM in total):
[ 0.000000] MEMBLOCK configuration:
[ 0.000000] memory size = 0x00000000fff6d800 reserved size = 0x00000000052abb0c
[ 0.000000] memory.cnt = 0x4
[ 0.000000] memory[0x0] [0x0000000000001000-0x000000000009dfff], 0x000000000009d000 bytes flags: 0x0
[ 0.000000] memory[0x1] [0x0000000000100000-0x00000000bfecffff], 0x00000000bfdd0000 bytes flags: 0x0
[ 0.000000] memory[0x2] [0x00000000bff00000-0x00000000bfffffff], 0x0000000000100000 bytes flags: 0x0
[ 0.000000] memory[0x3] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes flags: 0x0
[ 0.000000] reserved.cnt = 0x5
[ 0.000000] reserved[0x0] [0x0000000000000000-0x0000000000000fff], 0x0000000000001000 bytes flags: 0x0
[ 0.000000] reserved[0x1] [0x000000000009ed70-0x000000000009f86b], 0x0000000000000afc bytes flags: 0x0
[ 0.000000] reserved[0x2] [0x00000000000f6a70-0x00000000000f6a7f], 0x0000000000000010 bytes flags: 0x0
[ 0.000000] reserved[0x3] [0x00000000311e1000-0x00000000348e7fff], 0x0000000003707000 bytes flags: 0x0
[ 0.000000] reserved[0x4] [0x000000007b400000-0x000000007cfa2fff], 0x0000000001ba3000 bytes flags: 0x0
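- 2. Later in setup_arch(), `memblock` can be used to allocate and free memory:
After e820__memblock_setup(), `memblock` has memory to hand out, and physical memory can be allocated with `memblock_alloc()`:
memblock_alloc()
Alternatively, once init_mem_mapping() has created the page tables of the linear mapping (direct mapping), `memblock_virt_alloc()` can allocate physical memory and return the corresponding virtual address:
memblock_virt_alloc() → memblock_virt_alloc_try_nid() → memblock_virt_alloc_internal():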
static void * __init memblock_virt_alloc_internal(
phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr,
int nid)
{
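phys_addr_t alloc;
void *ptr;
ulong flags = choose_memblock_flags();
/* (1) Allocate a range of physical memory from memblock */
alloc = memblock_find_in_range_node(size, align, min_addr, max_addr,
nid, flags);
if (alloc && !memblock_reserve(alloc, size))
goto done;
/* ... fallback retries omitted in this excerpt ... */
return NULL;
done:
/* (2) Use the linear mapping to get the virtual address of the memory */
ptr = phys_to_virt(alloc);
return ptr;
}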
- 3. Later in initialization, once the `buddy` system has been set up, all free memory in `memblock` is released to `buddy`, which then takes over memory allocation:
start_kernel() → mm_init() → mem_init():
void __init mem_init(void)
{
pci_iommu_alloc();
/* clear_bss() already clear the empty_zero_page */
/* this will put all memory onto the freelists */
/* (1) Release all memory that `memblock` has not yet allocated to the `buddy` system */
free_all_bootmem();
after_bootmem = 1;
/*
* Must be done after boot memory is put on freelist, because here we
* might set fields in deferred struct pages that have not yet been
* initialized, and free_all_bootmem() initializes all the reserved
* deferred pages for us.
*/
register_page_bootmem_info();
/* Register memory areas for /proc/kcore */
kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR, PAGE_SIZE, KCORE_USER);
mem_init_print_info(NULL);
}
↓
free_all_bootmem() → free_low_memory_core_early():
static unsigned long __init free_low_memory_core_early(void)
{
unsigned long count = 0;
phys_addr_t start, end;
u64 i;
memblock_clear_hotplug(0, -1);
for_each_reserved_mem_region(i, &start, &end)
reserve_bootmem_region(start, end);
/*
* We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
* because in some case like Node0 doesn't have RAM installed
* low ram will be on Node1
*/
for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
NULL)
count += __free_memory_core(start, end);
return count;
}
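2.2.3 Creating the linear mapping (direct mapping)
As described above, the initial `identity mapping` and `direct mapping` in `init_top_pgt` have already been cleared before the jump to start_kernel(), so start_kernel() has to rebuild the `direct mapping`. This is done in init_mem_mapping().
This is the most critical moment of kernel page table initialization and the essence of the whole process, so the relevant background is covered in detail here.
- __pa(): obtaining a physical address
Two regions matter most in the 64-bit kernel address space:
1. The kernel image region (kernel mapping). It maps the kernel image (code + data + bss + brk) from physical address `phys_base` to virtual address `__START_KERNEL_map`. On 32-bit there is no `phys_base`; the kernel image starts close to physical address 0. On 64-bit, to support KASLR, the kernel image sits at a random physical address `phys_base`. In addition, 32-bit has no separate kernel image region: the image is mapped as part of the linear address space.
2. The linear address space (direct mapping). It maps all physical memory linearly at virtual address `PAGE_OFFSET`. The value of `PAGE_OFFSET` is either the fixed `0xffff888000000000` or, with KASLR enabled, the random address `page_offset_base`.
The kernel therefore has two linearly mapped regions, and kernel data may live in either the kernel image region or the linear address space. How is such a kernel virtual address converted to a physical address? The 64-bit __pa() function handles both regions: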
#define __pa(x) __phys_addr((unsigned long)(x))
#define __phys_addr(x) __phys_addr_nodebug(x)
static inline unsigned long __phys_addr_nodebug(unsigned long x)
{
unsigned long y = x - __START_KERNEL_map;
/* use the carry flag to determine if x was < __START_KERNEL_map */
/* (1) kernel image region: pa = va - __START_KERNEL_map + phys_base
linear address space: pa = va - PAGE_OFFSET
*/
x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
return x;
}
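For converting a physical address to a virtual address, however, the 64-bit __va() function makes no such distinction and always returns an address in the direct mapping:
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
region | __pa() | __va() |
---|---|---|
Kernel image region (kernel mapping) | pa = va - __START_KERNEL_map + phys_base | - |
Linear address space (direct mapping) | pa = va - PAGE_OFFSET | va = pa + PAGE_OFFSET |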
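- early_alloc_pgt_buf
At this point the code is still running on a temporary page table mapping:
1. The temporary mapping `early_top_pgt` is still active. So far the page table only maps the kernel image region (kernel mapping).
2. The kernel is about to switch to the final mapping `init_top_pgt`. `init_top_pgt` reuses the kernel image region already built in `early_top_pgt` and adds the linear address space (direct mapping).
3. While building the page tables of the direct mapping, new memory must be allocated for the p4d/pud/pmd/pte tables, and both its virtual and physical addresses must be obtainable. Where does this memory come from? The kernel cleverly reserves a small area inside the `brk` area of the kernel image region, set up by `early_alloc_pgt_buf()`, specifically for use at this critical point.
Space is reserved in the kernel brk at link time: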
#ifndef CONFIG_RANDOMIZE_MEMORY
#define INIT_PGD_PAGE_COUNT 6
#else
#define INIT_PGD_PAGE_COUNT 12
#endif
#define INIT_PGT_BUF_SIZE (INIT_PGD_PAGE_COUNT * PAGE_SIZE)
RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE);
linux-source-4.15.0\arch\x86\kernel\vmlinux.lds.S:
/* BSS */
. = ALIGN(PAGE_SIZE);
.bss : AT(ADDR(.bss) - LOAD_OFFSET) {
__bss_start = .;
*(.bss..page_aligned)
*(.bss)
. = ALIGN(PAGE_SIZE);
__bss_stop = .;
}
. = ALIGN(PAGE_SIZE);
.brk : AT(ADDR(.brk) - LOAD_OFFSET) {
__brk_base = .;
. += 64 * 1024; /* 64k alignment slop space */
/* (1) The areas reserved inside brk */
*(.brk_reservation) /* areas brk users have reserved */
__brk_limit = .;
}
During setup_arch() this memory is handed to `pgt_buf`:
start_kernel() → setup_arch() → early_alloc_pgt_buf():
void __init early_alloc_pgt_buf(void)
{
unsigned long tables = INIT_PGT_BUF_SIZE;
phys_addr_t base;
/* (1) Get the physical address of the pages reserved in brk */
base = __pa(extend_brk(tables, PAGE_SIZE));
/* (2) Store the corresponding page frame numbers in pgt_buf */
pgt_buf_start = base >> PAGE_SHIFT;
pgt_buf_end = pgt_buf_start;
pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
}
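While building the page tables of the direct mapping, init_mem_mapping() uses alloc_low_pages() to allocate memory from `pgt_buf` for the `p4d/pud/pmd/pte` tables, and uses `__pa()` to obtain their physical addresses.
One question remains open here: how can the `__va()` addresses used by alloc_low_pages() be accessed at this stage, when the direct mapping has not been built yet? Is it through a 1G `direct mapping` in `early_top_pgt`? A likely explanation is that the early page-fault handler (early_make_pgtable() in head64.c) is still installed until init_mem_mapping() finishes, and it maps direct-mapping addresses into `early_top_pgt` on demand.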
__ref void *alloc_low_pages(unsigned int num)
{
unsigned long pfn;
int i;
/* (1.1) Allocation path 1: take physical pages from the buddy system */
if (after_bootmem) {
unsigned int order;
order = get_order((unsigned long)num << PAGE_SHIFT);
return (void *)__get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
}
/* (1.2) Allocation path 2: take physical pages from memblock */
if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
unsigned long ret;
if (min_pfn_mapped >= max_pfn_mapped)
panic("alloc_low_pages: ran out of memory");
ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
max_pfn_mapped << PAGE_SHIFT,
PAGE_SIZE * num , PAGE_SIZE);
if (!ret)
panic("alloc_low_pages: can not alloc memory");
memblock_reserve(ret, PAGE_SIZE * num);
pfn = ret >> PAGE_SHIFT;
/* (1.3) Allocation path 3: take physical pages from pgt_buf */
} else {
pfn = pgt_buf_end;
pgt_buf_end += num;
printk(KERN_DEBUG "BRK [%#010lx, %#010lx] PGTABLE\n",
pfn << PAGE_SHIFT, (pgt_buf_end << PAGE_SHIFT) - 1);
}
/* (2) Zero the allocated pages */
for (i = 0; i < num; i++) {
void *adr;
adr = __va((pfn + i) << PAGE_SHIFT);
clear_page(adr);
}
/* (3) Convert the page frame number to a physical address, then return the corresponding virtual address */
return __va(pfn << PAGE_SHIFT);
}
- init_mem_mapping()
start_kernel() → setup_arch() → init_mem_mapping():
void __init init_mem_mapping(void)
{
unsigned long end;
pti_check_boottime_disable();
probe_page_size_mask();
setup_pcid();
#ifdef CONFIG_X86_64
end = max_pfn << PAGE_SHIFT;
#else
end = max_low_pfn << PAGE_SHIFT;
#endif
/* the ISA range is always mapped regardless of memory holes */
/* (1) Map the linear address range below 1M */
init_memory_mapping(0, ISA_END_ADDRESS);
/* Init the trampoline, possibly with KASLR memory offset */
init_trampoline();
/*
* If the allocation is in bottom-up direction, we setup direct mapping
* in bottom-up, otherwise we setup direct mapping in top-down.
*/
if (memblock_bottom_up()) {
unsigned long kernel_end = __pa_symbol(_end);
/*
* we need two separate calls here. This is because we want to
* allocate page tables above the kernel. So we first map
* [kernel_end, end) to make memory above the kernel be mapped
* as soon as possible. And then use page tables allocated above
* the kernel to map [ISA_END_ADDRESS, kernel_end).
*/
memory_map_bottom_up(kernel_end, end);
memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
} else {
/* (2) Map the linear address range above 1M.
Calls init_range_memory_mapping() → init_memory_mapping(),
using for_each_mem_pfn_range() to walk the regions in memblock.memory and build the direct mapping for each of them
*/
memory_map_top_down(ISA_END_ADDRESS, end);
}
#ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
/* can we preseve max_low_pfn ?*/
max_low_pfn = max_pfn;
}
#else
early_ioremap_page_table_range_init();
#endif
/* (3) Reload cr3 to switch to the `init_top_pgt` page table for good */
load_cr3(swapper_pg_dir);
__flush_tlb_all();
x86_init.hyper.init_mem_mapping();
early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}
The core function that builds the `direct mapping` page tables is init_memory_mapping():
unsigned long __ref init_memory_mapping(unsigned long start,
unsigned long end)
{
struct map_range mr[NR_RANGE_MR];
unsigned long ret = 0;
int nr_range, i;
pr_debug("init_memory_mapping: [mem %#010lx-%#010lx]\n",
start, end - 1);
memset(mr, 0, sizeof(mr));
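/* (1) Split the target range by alignment into chunks that are as large as possible.
The direct mapping is never torn down dynamically once created, so huge pages are used wherever possible:
pud huge page = 1G
pmd huge page = 2M
*/
nr_range = split_mem_range(mr, 0, start, end);
/* (2) Create the `p4d/pud/pmd/pte` page tables for each resulting physical range */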
for (i = 0; i < nr_range; i++)
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
mr[i].page_size_mask);
add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);
return ret >> PAGE_SHIFT;
}
↓
unsigned long __meminit
kernel_physical_mapping_init(unsigned long paddr_start,
unsigned long paddr_end,
unsigned long page_size_mask)
{
bool pgd_changed = false;
unsigned long vaddr, vaddr_start, vaddr_end, vaddr_next, paddr_last;
paddr_last = paddr_end;
/* (1) Compute the virtual address from the physical address:
va = pa + PAGE_OFFSET
This maps the physical address into the linear mapping region that starts at PAGE_OFFSET
*/
vaddr = (unsigned long)__va(paddr_start);
vaddr_end = (unsigned long)__va(paddr_end);
vaddr_start = vaddr;
/* (2) Create the `p4d/pud/pmd/pte` page table structures for each address in turn */
for (; vaddr < vaddr_end; vaddr = vaddr_next) {
/* (2.1) Get the pgd from init_mm, i.e. from swapper_pg_dir/init_top_pgt */
pgd_t *pgd = pgd_offset_k(vaddr);
p4d_t *p4d;
vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
if (pgd_val(*pgd)) {
p4d = (p4d_t *)pgd_page_vaddr(*pgd);
paddr_last = phys_p4d_init(p4d, __pa(vaddr),
__pa(vaddr_end),
page_size_mask);
continue;
}
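/* (2.2) Allocate the `p4d/pud/pmd/pte` pages from the pgt_buf set up by early_alloc_pgt_buf() described above.
These pages are already mapped, so this memory can be accessed normally
*/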
p4d = alloc_low_page();
paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
page_size_mask);
spin_lock(&init_mm.page_table_lock);
if (IS_ENABLED(CONFIG_X86_5LEVEL))
pgd_populate(&init_mm, pgd, p4d);
else
p4d_populate(&init_mm, p4d_offset(pgd, vaddr), (pud_t *) p4d);
spin_unlock(&init_mm.page_table_lock);
pgd_changed = true;
}
if (pgd_changed)
sync_global_pgds(vaddr_start, vaddr_end - 1);
return paddr_last;
}
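2.2.4 Creating the page struct area (vmemmap)
Besides the mappings above, the kernel has one more region with a fixed mapping: `vmemmap`. The `vmemmap` region stores `struct page`, the management structure of each physical page frame. The kernel spends considerable effort on managing physical page frames:
pa = pfn << PAGE_SHIFT // physical address from the page frame number
va = pa + PAGE_OFFSET // virtual address from the physical address
page = (vmemmap + (pfn)) // struct page from the page frame number, __pfn_to_page(pfn)
Memory models
1. `FLATMEM`. Originally the kernel kept the `struct page` structures in one contiguous array, `mem_map`. This is the FLATMEM (flat) model. When physical memory is not contiguous, however, this model wastes a lot of space.
2. `DISCONTIGMEM` and `SPARSEMEM`. To handle non-contiguous physical memory, later kernels store `struct page` in two-level arrays. The DISCONTIGMEM (discontiguous) and SPARSEMEM (sparse) models both follow this approach. Multi-level arrays save space, but every lookup needs several conversion steps, which adds overhead and makes the code harder to follow.
3. `SPARSEMEM_VMEMMAP`. Taking advantage of the abundant 64-bit virtual address space, the kernel maps the `struct page` structures allocated by SPARSEMEM into one contiguous virtual address range. The range has holes: where no physical memory is present, no memory is allocated to hold `struct page`. This saves memory while keeping the address arithmetic uniform.
`memory_model.h` shows how pfn and struct page are converted in each model: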
linux-source-4.15.0\include\asm-generic\memory_model.h:
/*
* supports 3 memory models.
*/
#if defined(CONFIG_FLATMEM)
#define __pfn_to_page(pfn) (mem_map + ((pfn) - ARCH_PFN_OFFSET))
#define __page_to_pfn(page) ((unsigned long)((page) - mem_map) + \
ARCH_PFN_OFFSET)
#elif defined(CONFIG_DISCONTIGMEM)
#define __pfn_to_page(pfn) \
({ unsigned long __pfn = (pfn); \
unsigned long __nid = arch_pfn_to_nid(__pfn); \
NODE_DATA(__nid)->node_mem_map + arch_local_page_offset(__pfn, __nid);\
})
#define __page_to_pfn(pg) \
({ const struct page *__pg = (pg); \
struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg)); \
(unsigned long)(__pg - __pgdat->node_mem_map) + \
__pgdat->node_start_pfn; \
})
#elif defined(CONFIG_SPARSEMEM_VMEMMAP)
/* memmap is virtually contiguous. */
#define __pfn_to_page(pfn) (vmemmap + (pfn))
#define __page_to_pfn(page) (unsigned long)((page) - vmemmap)
#elif defined(CONFIG_SPARSEMEM)
/*
* Note: section's mem_map is encoded to reflect its start_pfn.
* section[i].section_mem_map == mem_map's address - start_pfn;
*/
#define __page_to_pfn(pg) \
({ const struct page *__pg = (pg); \
int __sec = page_to_section(__pg); \
(unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); \
})
#define __pfn_to_page(pfn) \
({ unsigned long __pfn = (pfn); \
struct mem_section *__sec = __pfn_to_section(__pfn); \
__section_mem_map_addr(__sec) + __pfn; \
})
#endif /* CONFIG_FLATMEM/DISCONTIGMEM/SPARSEMEM */
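Creating the `vmemmap` region
SPARSEMEM divides memory into sections of 128M. Each section has a `mem_section` control structure, and `ms->section_mem_map` records the `struct page` storage of that section. All of this storage is mapped into `vmemmap`: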
# define SECTION_SIZE_BITS 27 /* matt - 128 is convenient right now */
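x86_64 normally uses the `SPARSEMEM_VMEMMAP` model, which requires a `vmemmap` mapping region so that the `struct page` of any pfn can be located. Once `vmemmap` has been created, it does not change dynamically.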
start_kernel() → setup_arch() → x86_init.paging.pagetable_init() → native_pagetable_init() → paging_init():
void __init paging_init(void)
{
/* (1) Walk the physical memory layout in memblock.memory and mark which mem_sections are present */
sparse_memory_present_with_active_regions(MAX_NUMNODES);
/* (2) For every mem_section that has physical memory,
allocate the storage for the section's `struct page` array using memblock_virt_alloc(),
and map it into `vmemmap` using sparse_mem_map_populate()
*/
sparse_init();
/*
* clear the default setting with node 0
* note: don't use nodes_clear here, that is really clearing when
* numa support is not compiled in, and later node_set_state
* will not set it back.
*/
node_clear_state(0, N_MEMORY);
if (N_MEMORY != N_NORMAL_MEMORY)
node_clear_state(0, N_NORMAL_MEMORY);
/* (3) Initialize the `struct page` structures:
zone_sizes_init() → free_area_init_nodes() → free_area_init_node() → free_area_init_core() → memmap_init() → memmap_init_zone() → __init_single_page():
*/
zone_sizes_init();
}
↓
void __init sparse_init(void)
{
unsigned long pnum;
struct page *map;
unsigned long *usemap;
unsigned long **usemap_map;
int size;
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
int size2;
struct page **map_map;
#endif
/* see include/linux/mmzone.h 'struct mem_section' definition */
BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section)));
/* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */
set_pageblock_order();
/*
* map is using big page (aka 2M in x86 64 bit)
* usemap is less one page (aka 24 bytes)
* so alloc 2M (with 2M align) and 24 bytes in turn will
* make next 2M slip to one more 2M later.
* then in big system, the memory will have a lot of holes...
* here try to allocate 2M pages continuously.
*
* powerpc need to call sparse_init_one_section right after each
* sparse_early_mem_map_alloc, so allocate usemap_map at first.
*/
/* (2.1) Allocate one `unsigned long *` per section,
holding the pointer to that section's usemap (its pageblock flags bitmap)
*/
size = sizeof(unsigned long *) * NR_MEM_SECTIONS;
usemap_map = memblock_virt_alloc(size, 0);
if (!usemap_map)
panic("can not allocate usemap_map\n");
alloc_usemap_and_memmap(sparse_early_usemaps_alloc_node,
(void *)usemap_map);
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
/* (2.2) Allocate one `struct page *` per section,
pointing at the space that stores the section's `struct page` array
*/
size2 = sizeof(struct page *) * NR_MEM_SECTIONS;
map_map = memblock_virt_alloc(size2, 0);
if (!map_map)
panic("can not allocate map_map\n");
alloc_usemap_and_memmap(sparse_early_mem_maps_alloc_node,
(void *)map_map);
#endif
/* (2.3) Walk every section that has memory present */
for_each_present_section_nr(0, pnum) {
/* (2.3.1) Skip the section if no usemap was allocated for it */
usemap = usemap_map[pnum];
if (!usemap)
continue;
/* (2.3.2) Get the memory that holds this section's `struct page` array */
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
map = map_map[pnum];
#else
map = sparse_early_mem_map_alloc(pnum);
#endif
if (!map)
continue;
/* (2.3.3) Record the `struct page` array, now mapped into `vmemmap`, in the mem_section */
sparse_init_one_section(__nr_to_section(pnum), pnum, map,
usemap);
}
vmemmap_populate_print_last();
/* (2.4) Free the temporary map_map and usemap_map arrays */
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
memblock_free_early(__pa(map_map), size2);
#endif
memblock_free_early(__pa(usemap_map), size);
}
References:
1. mm-mem_section
2. Linux内存模型
2.2.5 Initializing the regular memory allocator (buddy)
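Once the vmemmap region has been created, the buddy allocator is more or less ready to take over memory allocation.
For buddy, the two most important sets of lists are the free lists that pages are allocated from and the LRU lists that pages are reclaimed from. The buddy system deserves a dedicated write-up of its own, so it is not covered in depth here.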
- buddy zone initialization
start_kernel() → setup_arch() → x86_init.paging.pagetable_init() → native_pagetable_init() → paging_init() → zone_sizes_init()
- buddy zones take over from memblock
start_kernel() → mm_init() → mem_init() → free_all_bootmem() → free_low_memory_core_early()
- free lists (allocation)
struct zone {
/* free areas of different sizes */
struct free_area free_area[MAX_ORDER]; // free lists, used for page allocation
};
#define MAX_ORDER 11
struct free_area {
struct list_head free_list[MIGRATE_TYPES]; // each order is further split into one list per migrate type
unsigned long nr_free;
};
enum migratetype {
MIGRATE_UNMOVABLE,
MIGRATE_MOVABLE,
MIGRATE_RECLAIMABLE,
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
/*
* MIGRATE_CMA migration type is designed to mimic the way
* ZONE_MOVABLE works. Only movable pages can be allocated
* from MIGRATE_CMA pageblocks and page allocator never
* implicitly change migration type of MIGRATE_CMA pageblock.
*
* The way to use it is to change migratetype of a range of
* pageblocks to MIGRATE_CMA which can be done by
* __free_pageblock_cma() function. What is important though
* is that a range of pageblocks must be aligned to
* MAX_ORDER_NR_PAGES should biggest page be bigger then
* a single pageblock.
*/
MIGRATE_CMA,
#endif
#ifdef CONFIG_MEMORY_ISOLATION
MIGRATE_ISOLATE, /* can't allocate from here */
#endif
MIGRATE_TYPES
};
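To make the `free_area[MAX_ORDER]` indexing concrete, the following sketch shows how a request size maps to an order and how a block's buddy is found by flipping one bit of its pfn. It is a user-space model with hypothetical `model_*` names and an assumed 4 KiB page size, not the allocator's actual code path.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumption: 4 KiB pages */
#define MODEL_MAX_ORDER 11      /* orders 0..10, as in the excerpt above */

/*
 * Smallest order whose block (2^order pages) can hold `size` bytes.
 * Returns -1 when the request exceeds the largest buddy block.
 */
static int model_size_to_order(unsigned long size)
{
	unsigned long pages = (size + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
	int order = 0;

	while (order < MODEL_MAX_ORDER && (1UL << order) < pages)
		order++;
	return order < MODEL_MAX_ORDER ? order : -1;
}

/* Buddy of the block that starts at `pfn`: flip bit `order` of the pfn. */
static unsigned long model_buddy_pfn(unsigned long pfn, unsigned int order)
{
	/* A block of a given order always starts on an order-aligned pfn. */
	assert((pfn & ((1UL << order) - 1)) == 0);
	return pfn ^ (1UL << order);
}
```

With MAX_ORDER set to 11, the largest block is 2^10 pages, i.e. 4 MiB under the 4 KiB page assumption; anything larger cannot be served from a single free list entry.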
- LRU lists (reclaim)
struct zone {
struct pglist_data *zone_pgdat;
};
typedef struct pglist_data {
/* Fields commonly accessed by the page reclaim scanner */
struct lruvec lruvec;
} pg_data_t;
struct lruvec {
struct list_head lists[NR_LRU_LISTS]; // LRU lists, used for page reclaim
struct zone_reclaim_stat reclaim_stat;
/* Evictions & activations on the inactive file list */
atomic_long_t inactive_age;
/* Refaults at the time of last reclaim cycle */
unsigned long refaults;
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
#endif
};
enum lru_list {
LRU_INACTIVE_ANON = LRU_BASE,
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
LRU_UNEVICTABLE,
NR_LRU_LISTS
};
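The enum values are built so that a list index can be computed from two properties of a page: whether it is file-backed and whether it is active. A minimal sketch, in which the `LRU_BASE`, `LRU_ACTIVE` and `LRU_FILE` values mirror the kernel's definitions and `model_lru_index()` is a hypothetical helper:

```c
#include <assert.h>
#include <stdbool.h>

/* Offsets used to build the enum below. */
#define LRU_BASE   0
#define LRU_ACTIVE 1
#define LRU_FILE   2

enum lru_list {
	LRU_INACTIVE_ANON = LRU_BASE,
	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
	LRU_UNEVICTABLE,
	NR_LRU_LISTS
};

/* Four evictable lists plus the unevictable one. */
static_assert(NR_LRU_LISTS == 5, "unexpected number of LRU lists");

/* Index of the evictable list for a page with the given properties. */
static enum lru_list model_lru_index(bool file, bool active)
{
	int lru = LRU_BASE;

	if (file)
		lru += LRU_FILE;
	if (active)
		lru += LRU_ACTIVE;
	return (enum lru_list)lru;
}
```

The result can be used directly as an index into `lruvec.lists[]`.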
2.2.6 Non-contiguous memory area (vmalloc)
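The vmalloc range of kernel virtual addresses is mainly used to map physically scattered pages into one contiguous virtual range, which makes it an effective way for the kernel to put fragmented memory to use.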
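The data structures for this region resemble the user-space vma structures, with a few differences:
1. Each vmalloc area is managed by a vmap_area + vm_struct pair linked into the vmap_area_root red-black tree, whereas a user-space address range is described by a vm_area_struct linked into the task->mm->mm_rb red-black tree.
2. vmalloc allocates the physical pages and establishes the mapping at call time. User address space is populated lazily: allocation only creates the vma, and the physical pages are allocated, and their contents copied in, by the page-fault path on first use.
3. vmalloc allocates and frees in units of whole pages.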
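Point 3 can be illustrated with the rounding that a vmalloc-style request goes through. This is a sketch assuming 4 KiB pages; `model_page_align()` imitates what `PAGE_ALIGN()` does, and both helper names are hypothetical.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Round a byte count up to the next multiple of the page size. */
static unsigned long model_page_align(unsigned long size)
{
	return (size + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
}

/* Page frames consumed by a vmalloc-style request of `size` bytes. */
static unsigned long model_vmalloc_nr_pages(unsigned long size)
{
	assert(size > 0); /* a zero-sized request is rejected */
	return model_page_align(size) >> MODEL_PAGE_SHIFT;
}
```

Even a 1-byte request consumes a full page, which is why small allocations are better served by kmalloc().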
- vmalloc initialization
start_kernel() → mm_init() → vmalloc_init():
- vmalloc allocation
vmalloc() → __vmalloc_node_flags() → __vmalloc_node() → __vmalloc_node_range():
void *__vmalloc_node_range(unsigned long size, unsigned long align,
unsigned long start, unsigned long end, gfp_t gfp_mask,
pgprot_t prot, unsigned long vm_flags, int node,
const void *caller)
{
struct vm_struct *area;
void *addr;
unsigned long real_size = size;
size = PAGE_ALIGN(size);
if (!size || (size >> PAGE_SHIFT) > totalram_pages)
goto fail;
/* (1) Find a suitably sized range of virtual addresses via the `vmap_area_root` red-black tree
and create the matching `vmap_area + vm_struct` structures
*/
area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
vm_flags, start, end, node, gfp_mask, caller);
if (!area)
goto fail;
/* (2) Allocate enough physical page frames and map them to the virtual addresses */
addr = __vmalloc_area_node(area, gfp_mask, prot, node);
if (!addr)
return NULL;
/*
* In this function, newly allocated vm_struct has VM_UNINITIALIZED
* flag. It means that vm_struct is not fully initialized.
* Now, it is fully initialized, so remove this flag here.
*/
clear_vm_uninitialized_flag(area);
kmemleak_vmalloc(area, size, gfp_mask);
return addr;
fail:
warn_alloc(gfp_mask, NULL,
"vmalloc: allocation failure: %lu bytes", real_size);
return NULL;
}
↓
static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
pgprot_t prot, int node)
{
struct page **pages;
unsigned int nr_pages, array_size, i;
const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?
0 :
__GFP_HIGHMEM;
nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
array_size = (nr_pages * sizeof(struct page *));
/* Please note that the recursion is strictly bounded. */
if (array_size > PAGE_SIZE) {
pages = __vmalloc_node(array_size, 1, nested_gfp|highmem_mask,
PAGE_KERNEL, node, area->caller);
} else {
pages = kmalloc_node(array_size, nested_gfp, node);
}
if (!pages) {
remove_vm_area(area->addr);
kfree(area);
return NULL;
}
area->pages = pages;
area->nr_pages = nr_pages;
/* (2.1) Allocate the required page frames one page at a time */
for (i = 0; i < area->nr_pages; i++) {
struct page *page;
if (node == NUMA_NO_NODE)
page = alloc_page(alloc_mask|highmem_mask);
else
page = alloc_pages_node(node, alloc_mask|highmem_mask, 0);
if (unlikely(!page)) {
/* Successfully allocated i pages, free them in __vunmap() */
area->nr_pages = i;
goto fail;
}
area->pages[i] = page;
if (gfpflags_allow_blocking(gfp_mask|highmem_mask))
cond_resched();
}
/* (2.2) Map the allocated page frames to the virtual addresses */
if (map_vm_area(area, prot, pages))
goto fail;
return area->addr;
fail:
warn_alloc(gfp_mask, NULL,
"vmalloc: allocation failure, allocated %ld of %ld bytes",
(area->nr_pages*PAGE_SIZE), area->size);
vfree(area->addr);
return NULL;
}
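The `array_size > PAGE_SIZE` test above decides whether the `pages[]` array itself comes from kmalloc or from a nested vmalloc. The sketch below reproduces just that decision in user space, assuming 4 KiB pages; the `model_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

/* Size of the pointer array that describes an area of nr_pages pages. */
static size_t model_pages_array_size(unsigned long nr_pages)
{
	assert(nr_pages <= SIZE_MAX / sizeof(void *)); /* no overflow */
	return nr_pages * sizeof(void *);
}

/*
 * True when the array no longer fits in one page, the case in which the
 * excerpt above falls back to a nested __vmalloc_node() call.
 */
static bool model_array_needs_vmalloc(unsigned long nr_pages)
{
	return model_pages_array_size(nr_pages) > MODEL_PAGE_SIZE;
}
```

With 8-byte pointers the switch happens above 512 pages, i.e. for areas larger than 2 MiB. Each nested level needs roughly 1/512 of the pages of the level above, which is why the recursion noted in the source comment stays shallow.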
References:
1. Vmalloc实现原理
2.2.7 KASLR
KASLR (kernel address space layout randomization) mainly consists of the following:
- RANDOMIZE_BASE
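It does several things:
1. Randomizes the physical load address of the kernel image; the random base is the phys_base described earlier.
2. Randomizes the kernel image within the kernel virtual address space, i.e. the mapping at __START_KERNEL_map. In the 4.15 kernel, however, this part does not appear to be randomized.
3. Requires the kernel to be built as position-independent code.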
- RANDOMIZE_MEMORY
This is an x86-64-specific feature. When enabled, it randomizes the base of the direct-mapping region (page_offset_base), the vmalloc region (vmalloc_base), and the vmemmap region (vmemmap_base).
start_kernel() → setup_arch() → kernel_randomize_memory():
/* Default values */
unsigned long page_offset_base = __PAGE_OFFSET_BASE;
EXPORT_SYMBOL(page_offset_base);
unsigned long vmalloc_base = __VMALLOC_BASE;
EXPORT_SYMBOL(vmalloc_base);
unsigned long vmemmap_base = __VMEMMAP_BASE;
EXPORT_SYMBOL(vmemmap_base);
static __initdata struct kaslr_memory_region {
unsigned long *base;
unsigned long size_tb;
} kaslr_regions[] = {
{ &page_offset_base, 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ },
{ &vmalloc_base, VMALLOC_SIZE_TB },
{ &vmemmap_base, 0 },
};
void __init kernel_randomize_memory(void)
{
size_t i;
unsigned long vaddr = vaddr_start;
unsigned long rand, memory_tb;
struct rnd_state rand_state;
unsigned long remain_entropy;
unsigned long vmemmap_size;
/*
* These BUILD_BUG_ON checks ensure the memory layout is consistent
* with the vaddr_start/vaddr_end variables. These checks are very
* limited....
*/
BUILD_BUG_ON(vaddr_start >= vaddr_end);
BUILD_BUG_ON(vaddr_end != CPU_ENTRY_AREA_BASE);
BUILD_BUG_ON(vaddr_end > __START_KERNEL_map);
/* (1) Check whether KASLR and memory randomization are enabled */
if (!kaslr_memory_enabled())
return;
/*
* Update Physical memory mapping to available and
* add padding if needed (especially for memory hotplug support).
*/
BUG_ON(kaslr_regions[0].base != &page_offset_base);
memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
/* Adapt physical memory region size based on available memory */
if (memory_tb < kaslr_regions[0].size_tb)
kaslr_regions[0].size_tb = memory_tb;
/*
* Calculate the vmemmap region size in TBs, aligned to a TB
* boundary.
*/
vmemmap_size = (kaslr_regions[0].size_tb << (TB_SHIFT - PAGE_SHIFT)) *
sizeof(struct page);
kaslr_regions[2].size_tb = DIV_ROUND_UP(vmemmap_size, 1UL << TB_SHIFT);
/* Calculate entropy available between regions */
remain_entropy = vaddr_end - vaddr_start;
for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++)
remain_entropy -= get_padding(&kaslr_regions[i]);
prandom_seed_state(&rand_state, kaslr_get_random_long("Memory"));
/* (2) Compute the randomized base address of each region in turn */
for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++) {
unsigned long entropy;
/*
* Select a random virtual address using the extra entropy
* available.
*/
entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
prandom_bytes_state(&rand_state, &rand, sizeof(rand));
if (IS_ENABLED(CONFIG_X86_5LEVEL))
entropy = (rand % (entropy + 1)) & P4D_MASK;
else
entropy = (rand % (entropy + 1)) & PUD_MASK;
vaddr += entropy;
*kaslr_regions[i].base = vaddr;
/*
* Jump the region and add a minimum padding based on
* randomization alignment.
*/
vaddr += get_padding(&kaslr_regions[i]);
if (IS_ENABLED(CONFIG_X86_5LEVEL))
vaddr = round_up(vaddr + 1, P4D_SIZE);
else
vaddr = round_up(vaddr + 1, PUD_SIZE);
remain_entropy -= entropy;
}
}
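The placement loop above can be replayed in user space to see how the slack between `vaddr_start` and `vaddr_end` is shared among the regions. This is a simplified model of the 4-level (PUD-aligned) case only: the xorshift generator stands in for the kernel's prandom state, the bounds are passed in by the caller, and all `model_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TB_SHIFT  40
#define PUD_SHIFT 30
#define PUD_SIZE  (1ULL << PUD_SHIFT)
#define PUD_MASK  (~(PUD_SIZE - 1))

struct model_region {
	uint64_t base;    /* filled in by model_randomize() */
	uint64_t size_tb; /* region size in TB */
};

/* Small deterministic generator (xorshift64); not the kernel's PRNG. */
static uint64_t model_rand(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	return *state = x;
}

/*
 * Lay the regions out in order: each base receives a random, PUD-aligned
 * share of the slack that is still unassigned.
 */
static void model_randomize(struct model_region *r, size_t n,
			    uint64_t vaddr_start, uint64_t vaddr_end,
			    uint64_t seed)
{
	uint64_t vaddr = vaddr_start;
	uint64_t remain = vaddr_end - vaddr_start;
	uint64_t state = seed ? seed : 1; /* xorshift must not start at 0 */
	size_t i;

	for (i = 0; i < n; i++) {
		/* The regions themselves must fit into the range. */
		assert(remain >= (r[i].size_tb << TB_SHIFT));
		remain -= r[i].size_tb << TB_SHIFT;
	}

	for (i = 0; i < n; i++) {
		uint64_t entropy = remain / (n - i);

		entropy = (model_rand(&state) % (entropy + 1)) & PUD_MASK;
		vaddr += entropy;
		r[i].base = vaddr;

		/* Skip the region, then move to the next PUD boundary. */
		vaddr += r[i].size_tb << TB_SHIFT;
		vaddr = (vaddr + 1 + PUD_SIZE - 1) & PUD_MASK;
		remain -= entropy;
	}
}
```

Because each region may only claim `remain / (regions left)` of the slack, an early region cannot use up all the randomness and leave the later ones at fixed offsets.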
References:
1. Linux KASLR机制详解
2.2.8 Process creation (fork())
_do_fork() → copy_process() → copy_mm()
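The visible effect of copy_mm() in a plain fork() (no CLONE_VM) is that the child gets its own address space, so a write in the child never shows up in the parent. The user-space sketch below demonstrates that effect only; it does not show the kernel-side copy, and the function and variable names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int model_value = 1;

/*
 * Forks, lets the child overwrite model_value, and reports what each side
 * saw. Returns 0 on success, -1 if fork() or waitpid() failed.
 */
static int model_fork_isolation(int *parent_value, int *child_value)
{
	int status;
	pid_t pid;

	assert(parent_value && child_value);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: the write lands in the child's private copy. */
		model_value = 2;
		_exit(model_value);
	}

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;

	*child_value = WEXITSTATUS(status);
	*parent_value = model_value; /* untouched by the child's write */
	return 0;
}
```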
3. Debug interfaces
/sys/kernel/debug/page_tables:
linux-source-4.15.0\arch\x86\mm\debug_pagetables.c:
static int __init pt_dump_debug_init(void)
{
dir = debugfs_create_dir("page_tables", NULL);
if (!dir)
return -ENOMEM;
pe_knl = debugfs_create_file("kernel", 0400, dir, NULL,
&ptdump_fops);
if (!pe_knl)
goto err;
pe_curknl = debugfs_create_file("current_kernel", 0400,
dir, NULL, &ptdump_curknl_fops);
if (!pe_curknl)
goto err;
#ifdef CONFIG_PAGE_TABLE_ISOLATION
pe_curusr = debugfs_create_file("current_user", 0400,
dir, NULL, &ptdump_curusr_fops);
if (!pe_curusr)
goto err;
#endif
return 0;
err:
debugfs_remove_recursive(dir);
return -ENOMEM;
}
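The files created above are plain text and can be post-processed in user space. The sketch below parses the address range at the start of one line; it assumes each mapping line begins with `0xSTART-0xEND`, and the sample line in the usage note is illustrative rather than captured from a real system. `model_parse_range()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>

struct model_range {
	unsigned long long start;
	unsigned long long end;
};

/*
 * Parses the leading "0xSTART-0xEND" of one line.
 * Returns 0 on success, -1 if the line does not start with a valid range.
 */
static int model_parse_range(const char *line, struct model_range *out)
{
	assert(line && out);

	/* %llx accepts an optional 0x prefix on its own. */
	if (sscanf(line, "%llx-%llx", &out->start, &out->end) != 2)
		return -1;
	return out->end > out->start ? 0 : -1;
}
```

For example, a line such as `0xffffffff81000000-0xffffffff82000000 16M ro PSE GLB x pmd` would yield a 16 MiB range, while lines that do not start with a range are rejected. Reading the files themselves needs root and a mounted debugfs.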