Linux Kernel from Beginner to Giving Up: Process Virtual Memory (notes on Professional Linux Kernel Architecture, 《深入Linux内核架构》)

Layout of the Process Address Space

<mm_types.h>
struct mm_struct {
...
    unsigned long (*get_unmapped_area) (struct file *filp,unsigned long addr, unsigned long len,unsigned long pgoff, unsigned long flags);  
...    
    unsigned long mmap_base; /* base address of the mmap region */
    unsigned long task_size; /* length of the process's virtual address space */
...
    unsigned long start_code, end_code, start_data, end_data; /* executable code and initialized data regions; once the ELF binary has been mapped into the address space, the lengths of these regions no longer change */
    unsigned long start_brk, brk, start_stack;  /* heap region; when the heap is resized, only the brk value changes */
    unsigned long arg_start, arg_end, env_start, env_end;  /* argument list and environment, both located in the topmost area of the stack */
...  
}  
 

How the text segment is mapped into the virtual address space is defined by the ELF standard. Each architecture specifies a fixed start address:

  • IA-32: 0x0804 8000
  • AMD64: 0x0000 0000 0040 0000

IA-32 mmap layout: the idea is to limit the maximum stack size to a fixed value. Since the stack is then bounded, the region for memory mappings can start immediately below the end of the stack. In contrast to the classical approach, this region now grows top-down.

On AMD64 systems the classical layout is always used for the virtual address space, so there is no need to distinguish between the various options.

void arch_pick_mmap_layout(struct mm_struct *mm)
{
#ifdef CONFIG_IA32_EMULATION
    if (current_thread_info()->flags & _TIF_IA32)
        return ia32_pick_mmap_layout(mm); /* effectively an identical copy of arch_pick_mmap_layout for IA-32 systems */
#endif
    mm->mmap_base = TASK_UNMAPPED_BASE;
    if (current->flags & PF_RANDOMIZE) {
        /* Add 28bit randomness which is about 40bits of address space
           because mmap base has to be page aligned.
           or ~1/128 of the total user VM
           (total user address space is 47bits) */
          /*
          The randomly generated offset is initially 28 bits wide; because the
          mmap base must be page aligned, it is shifted left by PAGE_SHIFT (12),
          so the final offset spans about 40 bits of address space, roughly
          1/128 of the total user VM.
          */
        unsigned rnd = get_random_int() & 0xfffffff;
        mm->mmap_base += ((unsigned long)rnd) << PAGE_SHIFT;
    }
    mm->get_unmapped_area = arch_get_unmapped_area;
    mm->unmap_area = arch_unmap_area;
}
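
To see the resulting layout from userspace, a process can simply read its own /proc/self/maps. The small sketch below (plain C; nothing assumed beyond procfs being mounted) prints every mapping of the process, and the text segment, heap, mmap area and stack discussed here can be identified in the output.

#include <stdio.h>

int main(void)
{
    char line[512];
    FILE *f = fopen("/proc/self/maps", "r");   /* per-process mapping table exported by the kernel */

    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), f))       /* one line per region of this process */
        fputs(line, stdout);
    fclose(f);
    return 0;
}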

How Memory Mapping Works

The kernel must provide data structures that establish an association between the regions of the virtual address space and the places where the corresponding data are located.

The kernel uses the address_space data structure to provide a set of methods for reading data from the backing store.

Allocating and filling pages on demand is known as demand paging:

  • A process tries to access a memory address in its user address space, but the address cannot be resolved via the page tables (there is no associated page in physical memory).
  • The processor raises a page fault, which is delivered to the kernel.
  • The kernel checks the process address space data structures responsible for the faulting region, to find the appropriate backing store or to establish that the access was in fact invalid.
  • A physical page is allocated and filled with the required data from the backing store.
  • The physical page is incorporated into the user process's address space via the page tables, and the application resumes execution.

These operations are transparent to the user process; in other words, the process does not notice whether a page is actually resident in physical memory or first has to be brought in by demand paging.
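
A small userspace sketch of this transparency (names and sizes here are arbitrary): the anonymous mapping below reserves 64 MiB of virtual address space up front, but a physical page is only allocated when the loop first touches it, each first access going through the demand-paging path just described.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL << 20;   /* 64 MiB of virtual address space */
    size_t off;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* Each first write to a page raises a page fault; the kernel allocates
       and zero-fills a physical page and then resumes the process. */
    for (off = 0; off < len; off += 4096)
        p[off] = 1;
    munmap(p, len);
    return 0;
}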

Data Structures

struct mm_struct is important: as discussed above, it provides all the information needed to describe the process's layout in memory. In addition, it contains the members listed below, which manage all memory regions in the user process's virtual address space.

struct mm_struct {
    struct vm_area_struct * mmap;       /* list of virtual memory regions */
    struct rb_root mm_rb;
    struct vm_area_struct * mmap_cache; /* result of the last find_vma call */
}

Tree and Linked List

Each region is described by a vm_area_struct instance, and the regions of a process are sorted in two ways:

  • On a singly linked list (starting at mm_struct->mmap).
  • In a red-black tree whose root node is located in mm_rb.
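
The list makes it cheap to walk all regions in address order, while the tree gives fast lookup of the region containing a given address. A rough kernel-style sketch of the list traversal (illustrative only, assuming the usual <linux/mm.h> declarations):

/* Walk all regions of a process in ascending address order via the
 * singly linked list headed by mm->mmap. */
static void print_regions(struct mm_struct *mm)
{
    struct vm_area_struct *vma;

    for (vma = mm->mmap; vma; vma = vma->vm_next)
        printk(KERN_INFO "region %08lx-%08lx flags %lx\n",
               vma->vm_start, vma->vm_end, vma->vm_flags);
}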

Representation of Virtual Memory Regions

struct vm_area_struct {
    struct mm_struct * vm_mm;   /* The address space we belong to. */
    unsigned long vm_start;     /* Our start address within vm_mm. */
    unsigned long vm_end;       /* The first byte after our end address
                       within vm_mm. */

    /* linked list of VM areas per task, sorted by address */
    struct vm_area_struct *vm_next;

    pgprot_t vm_page_prot;      /* Access permissions of this VMA. */
    unsigned long vm_flags;     /* Flags, listed below. */

    struct rb_node vm_rb;

    /*
     * For areas with an address space and backing store,
     * linkage into the address_space->i_mmap prio tree, or
     * linkage to the list of like vmas hanging off its node, or
     * linkage of vma in the address_space->i_mmap_nonlinear list.
     */
    union {
        struct {
            struct list_head list;
            void *parent;   /* aligns with prio_tree_node parent */
            struct vm_area_struct *head;
        } vm_set;

        struct raw_prio_tree_node prio_tree_node;
    } shared;

    /*
     * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
     * list, after a COW of one of the file pages.  A MAP_SHARED vma
     * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
     * or brk vma (with NULL file) can only be in an anon_vma list.
     */
     /* anon_vma_node and anon_vma manage shared pages that originate from
        anonymous mappings */
    struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
    struct anon_vma *anon_vma;  /* Serialized by page_table_lock */

    /* Function pointers to deal with this struct. */
    struct vm_operations_struct * vm_ops;

    /* Information about our backing store: */
    unsigned long vm_pgoff;     /* Offset (within vm_file) in PAGE_SIZE units, *not* PAGE_CACHE_SIZE */
    struct file * vm_file;      /* File we map to (can be NULL). */
    void * vm_private_data;     /* was vm_pte (shared mem) */
    unsigned long vm_truncate_count;/* truncate_count or restart_addr */

#ifndef CONFIG_MMU
    atomic_t vm_usage;      /* refcount (VMAs shared if !MMU) */
#endif
#ifdef CONFIG_NUMA
    struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
#endif
};

Given an interval of a file, the kernel sometimes needs to know all the processes into which that interval is mapped. Such mappings are called shared mappings.

Priority Search Trees

Priority search trees are used to establish the association between a region of a file and all the virtual address spaces into which that region is mapped.

Each open file is represented by an instance of struct file, which contains a pointer to the address space object struct address_space. That object is the basis of the priority search tree; the association between file intervals and the address spaces they are mapped into is built on top of it.

<fs.h>  
struct address_space {    
    struct inode *host; /* owner: inode, block_device */  
...
    struct prio_tree_root i_mmap;      /* tree of private and shared mappings */
    struct list_head i_mmap_nonlinear; /* list of VM_NONLINEAR mappings */
...  
} 

<fs.h>  
struct file {  
...    
    struct address_space *f_mapping; 
...  
} 

<fs.h>  
struct inode {  
...    
    struct address_space *i_mapping; 
...  
}  
 

The priority tree is a basic element of the address space: it holds all the vm_area_struct instances describing the mappings of intervals of the file associated with the inode into various virtual address spaces.

Remember: a given vm_area_struct instance can be included in two data structures. One establishes the association between a region of the process virtual address space and the underlying file data; the other makes it possible to find all address spaces that map a given file interval.

Representation of the Priority Tree

The priority tree manages all vm_area_struct instances that represent particular intervals of a given file. This requires that the data structure can deal not only with overlapping intervals but also with intervals that are exactly identical.
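
As an illustration of how such a tree is consumed, reverse-mapping code of this kernel generation walks all regions that map a given range of file pages roughly as follows. This is only a sketch modelled on mm/rmap.c of the same era; the helper name vma_prio_tree_foreach and its exact signature should be treated as an assumption of these notes.

/* Visit every vm_area_struct that maps file pages [pgoff_start, pgoff_end]
 * of the file behind 'mapping' (sketch). */
static void for_each_mapper(struct address_space *mapping,
                            pgoff_t pgoff_start, pgoff_t pgoff_end)
{
    struct vm_area_struct *vma;
    struct prio_tree_iter iter;

    vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap,
                          pgoff_start, pgoff_end)
        printk(KERN_INFO "mapped by mm %p at %08lx\n",
               vma->vm_mm, vma->vm_start);
}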

Operations on Regions

  • If a new region immediately precedes or follows an existing region (or lies exactly between two existing regions), the kernel merges the data structures involved into a single one, provided that all regions involved have the same access permissions and map contiguous data from the same backing store.
  • If a deletion affects the start or end of a region, the existing data structure must be truncated accordingly.
  • If a region in the middle of an existing region is deleted, the existing data structure is shrunk to describe the first remaining part, and a new data structure is created for the part after the hole.

Associating Virtual Addresses with Regions

Given a virtual address, find_vma locates the first region that satisfies the condition addr < vm_end.

struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr)
{
    struct vm_area_struct *vma = NULL;

    if (mm) {
        /* Check the cache first. */
        /* (Cache hit rate is typically around 35%.) */
        vma = mm->mmap_cache;
        if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
            struct rb_node * rb_node;

            rb_node = mm->mm_rb.rb_node;
            vma = NULL;

            while (rb_node) {
                struct vm_area_struct * vma_tmp;

                vma_tmp = rb_entry(rb_node,
                        struct vm_area_struct, vm_rb);

                if (vma_tmp->vm_end > addr) {
                    vma = vma_tmp;
                    if (vma_tmp->vm_start <= addr)
                        break;
                    rb_node = rb_node->rb_left;
                } else
                    rb_node = rb_node->rb_right;
            }
            if (vma)
                mm->mmap_cache = vma;
        }
    }
    return vma;
}

find_vma_intersection is another helper built on find_vma; it checks whether the interval bounded by start_addr and end_addr overlaps an existing region, returning that region, or NULL if there is no overlap.

static inline struct vm_area_struct *find_vma_intersection(struct mm_struct *mm,
        unsigned long start_addr, unsigned long end_addr)
{
    struct vm_area_struct *vma = find_vma(mm, start_addr);

    if (vma && end_addr <= vma->vm_start)
        vma = NULL;
    return vma;
}

Merging Regions

When a new region is added to a process's address space, the kernel checks whether it can be merged with one or more existing regions.

vma_merge merges a new region with its neighbouring regions whenever that is possible.

struct vm_area_struct *vma_merge(struct mm_struct *mm,
            struct vm_area_struct *prev, unsigned long addr,
            unsigned long end, unsigned long vm_flags,
                struct anon_vma *anon_vma, struct file *file,
            pgoff_t pgoff, struct mempolicy *policy)
{
    pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
    struct vm_area_struct *area, *next;

    /*
     * We later require that vma->vm_flags == vm_flags,
     * so this tests vma->vm_flags & VM_SPECIAL, too.
     */
    if (vm_flags & VM_SPECIAL)
        return NULL;

    if (prev)
        next = prev->vm_next;
    else
        next = mm->mmap;
    area = next;
    if (next && next->vm_end == end)        /* cases 6, 7, 8 */
        next = next->vm_next;

    /*
     * Can it merge with the predecessor?
     */
    /* Two file mappings that are contiguous in the address space but not
       contiguous within the file cannot be merged either. */
    if (prev && prev->vm_end == addr &&
            mpol_equal(vma_policy(prev), policy) &&
            can_vma_merge_after(prev, vm_flags,
                        anon_vma, file, pgoff)) {
        /*
         * OK, it can.  Can we now merge in the successor as well?
         */
        if (next && end == next->vm_start &&
                mpol_equal(policy, vma_policy(next)) &&
                can_vma_merge_before(next, vm_flags,    
                    anon_vma, file, pgoff+pglen) && 
                is_mergeable_anon_vma(prev->anon_vma,
                              next->anon_vma)) {
                        /* If both the predecessor and the successor can be
                           merged with the new region, it must also be checked
                           that their anonymous mappings can be merged. */
                            /* cases 1, 6 */
            vma_adjust(prev, prev->vm_start,
                next->vm_end, prev->vm_pgoff, NULL);
        } else                  /* cases 2, 5, 7 */
            vma_adjust(prev, prev->vm_start,
                end, prev->vm_pgoff, NULL);
        return prev;
    }

    /*
     * Can this new request be merged in front of next?
     */
    if (next && end == next->vm_start &&
            mpol_equal(policy, vma_policy(next)) &&
            can_vma_merge_before(next, vm_flags,
                    anon_vma, file, pgoff+pglen)) {
        if (prev && addr < prev->vm_end)    /* case 4 */
            vma_adjust(prev, prev->vm_start,
                addr, prev->vm_pgoff, NULL);
        else                    /* cases 3, 8 */
            vma_adjust(area, addr, next->vm_end,
                next->vm_pgoff - pglen, NULL);
        return area;
    }

    return NULL;
}

Inserting Regions

insert_vm_struct is the standard kernel function for inserting a new region.

/* Insert vm structure into process list sorted by address
 * and into the inode's i_mmap tree.  If vm_file is non-NULL
 * then i_mmap_lock is taken here.
 */
int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
{
    struct vm_area_struct * __vma, * prev;
    struct rb_node ** rb_link, * rb_parent;

    /*
     * The vm_pgoff of a purely anonymous vma should be irrelevant
     * until its first write fault, when page's anon_vma and index
     * are set.  But now set the vm_pgoff it will almost certainly
     * end up with (unless mremap moves it elsewhere before that
     * first wfault), so /proc/pid/maps tells a consistent story.
     *
     * By setting it to reflect the virtual start address of the
     * vma, merges and splits can happen in a seamless way, just
     * using the existing file pgoff checks and manipulations.
     * Similarly in do_mmap_pgoff and in do_brk.
     */
    if (!vma->vm_file) {
        BUG_ON(vma->anon_vma);
        vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
    }
    /* find_vma_prepare obtains prev (the preceding region), rb_parent (the
       node under which the new region will be inserted) and rb_link (the
       leaf position for the region itself). */
    __vma = find_vma_prepare(mm, vma->vm_start, &prev, &rb_link, &rb_parent);
    if (__vma && __vma->vm_start < vma->vm_end)
        return -ENOMEM;
    if ((vma->vm_flags & VM_ACCOUNT) &&
         security_vm_enough_memory_mm(mm, vma_pages(vma)))
        return -ENOMEM;
    vma_link(mm, vma, prev, rb_link, rb_parent);
    return 0;
}

vma_link then integrates the new region into the process's existing data structures:

  • __vma_link_list puts the new region on the linear list of the process's regions; all it needs are the predecessor and successor regions found by find_vma_prepare.
  • __vma_link_rb links the new region into the red-black tree.
  • __anon_vma_link adds the vm_area_struct instance to the list for anonymous mappings.
  • __vma_link_file associates the region with the relevant address_space (in the case of a file mapping) and adds the region to the priority tree with vma_prio_tree_insert.

Creating Regions

arch_get_unmapped_area searches the process's virtual memory for a suitable free region.

unsigned long
arch_get_unmapped_area(struct file *filp, unsigned long addr,
        unsigned long len, unsigned long pgoff, unsigned long flags)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    unsigned long start_addr;

    if (len > TASK_SIZE)
        return -ENOMEM;

    if (flags & MAP_FIXED)
        return addr;

    if (addr) {
        addr = PAGE_ALIGN(addr);
        vma = find_vma(mm, addr);
        if (TASK_SIZE - len >= addr &&
            (!vma || addr + len <= vma->vm_start))
            return addr;
    }
    if (len > mm->cached_hole_size) {
            start_addr = addr = mm->free_area_cache;
    } else {
            start_addr = addr = TASK_UNMAPPED_BASE;
            mm->cached_hole_size = 0;
    }

full_search:
    for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
        /* At this point:  (!vma || addr < vma->vm_end). */
        if (TASK_SIZE - len < addr) {
            /*
             * Start a new search - just in case we missed
             * some holes.
             */
            if (start_addr != TASK_UNMAPPED_BASE) {
                addr = TASK_UNMAPPED_BASE;
                    start_addr = addr;
                mm->cached_hole_size = 0;
                goto full_search;
            }
            return -ENOMEM;
        }
        if (!vma || addr + len <= vma->vm_start) {
            /*
             * Remember the place where we stopped the search:
             */
            mm->free_area_cache = addr + len;
            return addr;
        }
        if (addr + mm->cached_hole_size < vma->vm_start)
                mm->cached_hole_size = vma->vm_start - addr;
        addr = vma->vm_end;
    }
}

Address Spaces

A memory mapping of a file can be regarded as a mapping between two different address spaces: on one side the virtual address space of the user process, on the other the address space spanned by the filesystem.

The vm_operations_struct structure is used to establish the link between the two. It provides an operation to read pages that have already been mapped into the virtual address space but whose contents have not yet been brought into physical memory.

The address_space structure, in turn, contains an address_space_operations structure that provides the operations associated with the address space itself.

mm/filemap.c  
struct vm_operations_struct generic_file_vm_ops = {
    .fault = filemap_fault,  
}; 

Memory Mappings

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
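
For reference, a minimal userspace use of this interface (the file name is an arbitrary example): map the first page of a file read-only and print its first bytes. The first access to the mapping goes through the fault path described below.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);   /* arbitrary example file */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("%.32s\n", p);   /* this access faults the page in from the file */
    munmap(p, 4096);
    close(fd);
    return 0;
}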

Creating a Mapping

p251

  • A new vm_area_struct instance is allocated and inserted into the process's list/tree data structures.
  • The mapping is created with the file-specific function file->f_op->mmap. Most filesystems use generic_file_mmap for this purpose; all it does is set the vm_ops member of the mapping to generic_file_vm_ops:
vma->vm_ops = &generic_file_vm_ops;

The definition of generic_file_vm_ops was given above. Its key element is filemap_fault, which is invoked when an application accesses the mapped region but the corresponding data are not yet in physical memory.

  • If VM_LOCKED is set, either explicitly via the flags argument of the system call or implicitly via the mlockall mechanism, the kernel calls make_pages_present to walk over the pages of the mapping one by one, triggering a page fault for each so that its data are read in.

Removing a Mapping

p254

Nonlinear Mappings

Nonlinear mappings are implemented by manipulating the process's page tables.

When pages of a nonlinear mapping are swapped out, the kernel must ensure that the original offsets are still in place when they are swapped back in. The information required for this is stored in the page table entries of the swapped-out pages and must be consulted when they are swapped in again. A userspace sketch of setting up such a mapping follows the list below.

  • The vm_area_struct instances of all nonlinear mappings that have been set up are kept on a list whose head is the i_mmap_nonlinear member of struct address_space. The individual vm_area_struct instances on this list can use shared.vm_set.list as the list element, because a nonlinear mapping region never appears in the standard priority tree.
  • The page table entries of such a region are filled with special entries. These look as if they refer to pages that are not present, but they contain additional information identifying them as page table entries of a nonlinear mapping. When a page described by such an entry is accessed, a page fault is raised and the correct page is read in. (pgoff_to_pte encodes a file offset, given as a page number, into a form that can be stored in a page table entry; pte_to_pgoff decodes a file offset stored in a page table entry; pte_file(pte) checks whether a given page table entry is used to represent a nonlinear mapping.)
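
From userspace, a nonlinear mapping is set up with the remap_file_pages system call. The sketch below (glibc wrapper assumed to be available; error handling kept minimal; "datafile" is a placeholder for a file at least two pages long) rebinds the first virtual page of a shared mapping to the second page of the file, so that the virtual order no longer matches the file order.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    int fd = open("datafile", O_RDWR);   /* placeholder file, >= 2 pages long */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    char *p = mmap(NULL, 2 * 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* Rebind the first virtual page to file page 1 (size = one page,
       pgoff = 1). The VMA becomes nonlinear and leaves the priority tree. */
    if (remap_file_pages(p, 4096, 0, 1, 0) != 0)
        perror("remap_file_pages");
    return 0;
}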

Reverse Mapping

The trick the kernel uses to implement reverse mapping is not to store a direct association between a page and its users, but only the association between a page and the region in which it lies. All other regions containing the page can then be found via the data structures mentioned above (the priority search tree, and the list of anonymous regions pointing to the same page in memory). This approach is also called object-based reverse mapping, because no direct link between page and user is stored; instead, another object is interposed between the two.

References:

1. http://www.wowotech.net/memory_management/reverse_mapping.html

2. https://www.ibm.com/developerworks/cn/linux/l-cn-pagerecycle/index.html

Establishing Reverse Mappings

Anonymous pages: page_add_new_anon_rmap / page_add_anon_rmap -> __page_set_anon_rmap

Pages from file-based mappings: void page_add_file_rmap(struct page *page)

Using Reverse Mappings

page_referenced is an important function: it counts the number of processes that have recently made active use of a given shared page.

Managing the Heap

The heap is grown and shrunk with the brk system call; only the brk value in mm_struct changes.
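
A minimal userspace sketch using the sbrk wrapper, which adjusts brk by a delta (the printed addresses are just for illustration):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *old_brk = sbrk(0);          /* current end of the heap, i.e. the brk value */

    if (sbrk(4096) == (void *) -1) {  /* ask the kernel to move brk up by one page */
        perror("sbrk");
        return 1;
    }
    printf("brk moved from %p to %p\n", old_brk, sbrk(0));
    sbrk(-4096);                      /* shrink the heap again */
    return 0;
}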

Handling Page Faults

do_page_fault

<arch/x86/mm/fault_32.c>
fastcall void __kprobes do_page_fault(struct pt_regs *regs,
                      unsigned long error_code)
{
    struct task_struct *tsk;
    struct mm_struct *mm;
    struct vm_area_struct * vma;
    unsigned long address;
    int write, si_code;
    int fault;

    /*
     * We can fault from pretty much anywhere, with unknown IRQ state.
     */
    trace_hardirqs_fixup();

    /* get the address */
        address = read_cr2(); /* the address that triggered the fault */

    tsk = current;

    si_code = SEGV_MAPERR;

    /*
     * We fault-in kernel-space virtual memory on-demand. The
     * 'reference' page table is init_mm.pgd.
     *
     * NOTE! We MUST NOT take any locks for this case. We may
     * be in an interrupt or a critical region, and should
     * only copy the information from the master page table,
     * nothing more.
     *
     * This verifies that the fault happens in kernel space
     * (error_code & 4) == 0, and that the fault was not a
     * protection error (error_code & 9) == 0.
     */
    if (unlikely(address >= TASK_SIZE)) {
        /* Fault in kernel mode that is not a protection error: vmalloc_fault
           synchronizes the page tables by copying the relevant entries from
           the init (reference) page tables into the current ones. */
        if (!(error_code & 0x0000000d) && vmalloc_fault(address) >= 0)
            return;
        if (notify_page_fault(regs))
            return;
        /*
         * Don't take the mm semaphore here. If we fixup a prefetch
         * fault we could otherwise deadlock.
         */
        /* If the fault was raised during an interrupt or in a kernel thread
           (which has no context of its own and therefore no separate
           mm_struct instance), jump straight to bad_area_nosemaphore. */
        goto bad_area_nosemaphore;
    }

    if (notify_page_fault(regs))
        return;

    /* It's safe to allow irq's after cr2 has been saved and the vmalloc
       fault has been handled. */
    if (regs->eflags & (X86_EFLAGS_IF|VM_MASK))
        local_irq_enable();

    mm = tsk->mm;

    /*
     * If we're in an interrupt, have no user context or are running in an
     * atomic region then we must not take the fault..
     */
    if (in_atomic() || !mm)
        goto bad_area_nosemaphore;

    /* When running in the kernel we expect faults to occur only to
     * addresses in user space.  All other faults represent errors in the
     * kernel and should generate an OOPS.  Unfortunately, in the case of an
     * erroneous fault occurring in a code path which already holds mmap_sem
     * we will deadlock attempting to validate the fault against the
     * address space.  Luckily the kernel only validly references user
     * space from well defined areas of code, which are listed in the
     * exceptions table.
     *
     * As the vast majority of faults will be valid we will only perform
     * the source reference check when there is a possibility of a deadlock.
     * Attempt to lock the address space, if we cannot we then validate the
     * source.  If this is invalid we can skip the address space check,
     * thus avoiding the deadlock.
     */
    if (!down_read_trylock(&mm->mmap_sem)) {
        if ((error_code & 4) == 0 &&
            !search_exception_tables(regs->eip))
            goto bad_area_nosemaphore;
        down_read(&mm->mmap_sem);
    }
    /* The fault did not occur during an interrupt and a valid context exists:
       check whether the process address space contains a region covering the
       faulting address. */
    vma = find_vma(mm, address);
    if (!vma)
        goto bad_area;
    if (vma->vm_start <= address)
        goto good_area;
    if (!(vma->vm_flags & VM_GROWSDOWN))
        goto bad_area;
    if (error_code & 4) {
        /*
         * Accessing the stack below %esp is always a bug.
         * The large cushion allows instructions like enter
         * and pusha to work.  ("enter $65535,$31" pushes
         * 32 pointers and then decrements %esp by 65535.)
         */
        if (address + 65536 + 32 * sizeof(unsigned long) < regs->esp)
            goto bad_area;
    }
    if (expand_stack(vma, address)) /* grow the stack */
        goto bad_area;
/*
 * Ok, we have a good vm_area for this memory access, so
 * we can handle it..
 */
good_area:
    si_code = SEGV_ACCERR;
    write = 0;
    switch (error_code & 3) {
        default:    /* 3: write, present */
                /* fall through */
        case 2:     /* write, not present */
            if (!(vma->vm_flags & VM_WRITE))
                goto bad_area;
            write++;
            break;
        case 1:     /* read, present */
            goto bad_area;
        case 0:     /* read, not present */
            if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
                goto bad_area;
    }

 survive:
    /*
     * If for any reason at all we couldn't handle the fault,
     * make sure we exit gracefully rather than endlessly redo
     * the fault.
     */
    /* handle_mm_fault corrects the page fault (demand paging, swap-in, ...).
       The return value contains VM_FAULT_MINOR if the data were already in
       memory and VM_FAULT_MAJOR if they had to be read from a block device. */
    fault = handle_mm_fault(mm, vma, address, write);
    if (unlikely(fault & VM_FAULT_ERROR)) {
        if (fault & VM_FAULT_OOM) /* out of memory */
            goto out_of_memory;
        else if (fault & VM_FAULT_SIGBUS) /* any other reason: deliver a signal to the process */
            goto do_sigbus;
        BUG();
    }
    if (fault & VM_FAULT_MAJOR)
        tsk->maj_flt++;
    else
        tsk->min_flt++;

    /*
     * Did it hit the DOS screen memory VA from vm86 mode?
     */
    if (regs->eflags & VM_MASK) {
        unsigned long bit = (address - 0xA0000) >> PAGE_SHIFT;
        if (bit < 32)
            tsk->thread.screen_bitmap |= 1 << bit;
    }
    up_read(&mm->mmap_sem);
    return;

/*
 * Something tried to access memory that isn't in our memory map..
 * Fix it, but check if it's kernel or user first..
 */
bad_area:
    up_read(&mm->mmap_sem);

bad_area_nosemaphore:
    /* User mode accesses just cause a SIGSEGV */
    if (error_code & 4) { /* an access in user mode results in SIGSEGV (segmentation fault) */
        /*
         * It's possible to have interrupts off here.
         */
        local_irq_enable();

        /* 
         * Valid to do another page fault here because this one came 
         * from user space.
         */
        if (is_prefetch(regs, address, error_code))
            return;

        if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
            printk_ratelimit()) {
            printk("%s%s[%d]: segfault at %08lx eip %08lx "
                "esp %08lx error %lx\n",
                task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
                tsk->comm, task_pid_nr(tsk), address, regs->eip,
                regs->esp, error_code);
        }
        tsk->thread.cr2 = address;
        /* Kernel addresses are always protection faults */
        tsk->thread.error_code = error_code | (address >= TASK_SIZE);
        tsk->thread.trap_no = 14;
        force_sig_info_fault(SIGSEGV, si_code, address, tsk);
        return;
    }

#ifdef CONFIG_X86_F00F_BUG
    /*
     * Pentium F0 0F C7 C8 bug workaround.
     */
    if (boot_cpu_data.f00f_bug) {
        unsigned long nr;
        
        nr = (address - idt_descr.address) >> 3;

        if (nr == 6) {
            do_invalid_op(regs, 0);
            return;
        }
    }
#endif

no_context:
    /* Are we prepared to handle this kernel fault?
       (kernel space: last attempt at an exception fixup) */
    if (fixup_exception(regs))
        return;

    /* 
     * Valid to do another page fault here, because if this fault
     * had been triggered by is_prefetch fixup_exception would have 
     * handled it.
     */
    if (is_prefetch(regs, address, error_code))
        return;

/*
 * Oops. The kernel tried to access some bad page. We'll have to
 * terminate things with extreme prejudice.
 */

    bust_spinlocks(1);
    //oops
    if (oops_may_print()) {
        __typeof__(pte_val(__pte(0))) page;

#ifdef CONFIG_X86_PAE
        if (error_code & 16) {
            pte_t *pte = lookup_address(address);

            if (pte && pte_present(*pte) && !pte_exec_kernel(*pte))
                printk(KERN_CRIT "kernel tried to execute "
                    "NX-protected page - exploit attempt? "
                    "(uid: %d)\n", current->uid);
        }
#endif
        if (address < PAGE_SIZE)
            printk(KERN_ALERT "BUG: unable to handle kernel NULL "
                    "pointer dereference");
        else
            printk(KERN_ALERT "BUG: unable to handle kernel paging"
                    " request");
        printk(" at virtual address %08lx\n",address);
        printk(KERN_ALERT "printing eip: %08lx ", regs->eip);

        page = read_cr3();
        page = ((__typeof__(page) *) __va(page))[address >> PGDIR_SHIFT];
#ifdef CONFIG_X86_PAE
        printk("*pdpt = %016Lx ", page);
        if ((page >> PAGE_SHIFT) < max_low_pfn
            && page & _PAGE_PRESENT) {
            page &= PAGE_MASK;
            page = ((__typeof__(page) *) __va(page))[(address >> PMD_SHIFT)
                                                     & (PTRS_PER_PMD - 1)];
            printk(KERN_CONT "*pde = %016Lx ", page);
            page &= ~_PAGE_NX;
        }
#else
        printk("*pde = %08lx ", page);
#endif

        /*
         * We must not directly access the pte in the highpte
         * case if the page table is located in highmem.
         * And let's rather not kmap-atomic the pte, just in case
         * it's allocated already.
         */
        if ((page >> PAGE_SHIFT) < max_low_pfn
            && (page & _PAGE_PRESENT)
            && !(page & _PAGE_PSE)) {
            page &= PAGE_MASK;
            page = ((__typeof__(page) *) __va(page))[(address >> PAGE_SHIFT)
                                                     & (PTRS_PER_PTE - 1)];
            printk("*pte = %0*Lx ", sizeof(page)*2, (u64)page);
        }

        printk("\n");
    }

    tsk->thread.cr2 = address;
    tsk->thread.trap_no = 14;
    tsk->thread.error_code = error_code;
    die("Oops", regs, error_code);
    bust_spinlocks(0);
    do_exit(SIGKILL);

/*
 * We ran out of memory, or some other thing happened to us that made
 * us unable to handle the page fault gracefully.
 */
out_of_memory:
    up_read(&mm->mmap_sem);
    if (is_global_init(tsk)) {
        yield();
        down_read(&mm->mmap_sem);
        goto survive;
    }
    printk("VM: killing process %s\n", tsk->comm);
    if (error_code & 4)
        do_group_exit(SIGKILL);
    goto no_context;

do_sigbus:
    up_read(&mm->mmap_sem);

    /* Kernel mode? Handle exceptions or die */
    if (!(error_code & 4))
        goto no_context;

    /* User space => ok to do another page fault */
    if (is_prefetch(regs, address, error_code))
        return;

    tsk->thread.cr2 = address;
    tsk->thread.error_code = error_code;
    tsk->thread.trap_no = 14;
    force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
}

Correcting Userspace Page Faults

Once the architecture-specific analysis of the page fault is complete and it is established that the fault was raised at a permitted address, the kernel must determine a suitable method to bring the required data into physical memory. This task is delegated to handle_mm_fault, which does not depend on the underlying architecture but is implemented system-independently within the memory management framework. The function makes sure that the page directory entries at all levels exist for the page table entry corresponding to the faulting address. handle_pte_fault then analyzes the cause of the page fault; pte is a pointer to the relevant page table entry.

<mm/memory.c>

static inline int handle_pte_fault(struct mm_struct *mm,
        struct vm_area_struct *vma, unsigned long address,
        pte_t *pte, pmd_t *pmd, int write_access)
{
    pte_t entry;
    spinlock_t *ptl;

    entry = *pte;
    if (!pte_present(entry)) { /* the page is not present in physical memory */
        if (pte_none(entry)) { /* no page table entry exists yet */
            if (vma->vm_ops) { /* file-based mapping: demand paging */
                if (vma->vm_ops->fault || vma->vm_ops->nopage)
                    return do_linear_fault(mm, vma, address,
                        pte, pmd, write_access, entry);
                if (unlikely(vma->vm_ops->nopfn))
                    return do_no_pfn(mm, vma, address, pte,
                             pmd, write_access);
            }
            return do_anonymous_page(mm, vma, address,
                         pte, pmd, write_access); /* anonymous page: demand allocation */
        }
        if (pte_file(entry))
            return do_nonlinear_fault(mm, vma, address,
                    pte, pmd, write_access, entry); /* swap in a nonlinear mapping */
        return do_swap_page(mm, vma, address,
                    pte, pmd, write_access, entry); /* swap in */
    }

    ptl = pte_lockptr(mm, pmd);
    spin_lock(ptl);
    if (unlikely(!pte_same(*pte, entry)))
        goto unlock;
    if (write_access) { /* the region grants write access but the hardware PTE does not: copy on write */
        if (!pte_write(entry))
            return do_wp_page(mm, vma, address,
                    pte, pmd, ptl, entry);
        entry = pte_mkdirty(entry);
    }
    entry = pte_mkyoung(entry);
    if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
        update_mmu_cache(vma, address, entry);
    } else {
        /*
         * This is needed only for protection faults but the arch code
         * is not yet telling us if this is a protection fault or not.
         * This still avoids useless tlb flushes for .text page faults
         * with threads.
         */
        if (write_access)
            flush_tlb_page(vma, address);
    }
unlock:
    pte_unmap_unlock(pte, ptl);
    return 0;
}

Demand Allocation and Demand Paging

static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
        unsigned long address, pmd_t *pmd,
        pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
{
    pte_t *page_table;
    spinlock_t *ptl;
    struct page *page;
    pte_t entry;
    int anon = 0;
    struct page *dirty_page = NULL;
    struct vm_fault vmf;
    int ret;
    int page_mkwrite = 0;

    vmf.virtual_address = (void __user *)(address & PAGE_MASK);
    vmf.pgoff = pgoff;
    vmf.flags = flags;
    vmf.page = NULL;

    BUG_ON(vma->vm_flags & VM_PFNMAP);
    /* Read the required data into the faulting page: using the information in
       the address_space object, the kernel reads the data from the backing
       store into a physical memory page. */
    if (likely(vma->vm_ops->fault)) {
        ret = vma->vm_ops->fault(vma, &vmf);
        if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
            return ret;
    } else {
        /* Legacy ->nopage path */
        ret = 0;
        vmf.page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
        /* no page was available -- either SIGBUS or OOM */
        if (unlikely(vmf.page == NOPAGE_SIGBUS))
            return VM_FAULT_SIGBUS;
        else if (unlikely(vmf.page == NOPAGE_OOM))
            return VM_FAULT_OOM;
    }

    /*
     * For consistency in subsequent calls, make the faulted page always
     * locked.
     */
    if (unlikely(!(ret & VM_FAULT_LOCKED)))
        lock_page(vmf.page);
    else
        VM_BUG_ON(!PageLocked(vmf.page));

    /*
     * Should we do an early C-O-W break?
     */
    page = vmf.page;
    if (flags & FAULT_FLAG_WRITE) {
        if (!(vma->vm_flags & VM_SHARED)) { /* private mapping */
            anon = 1;
            if (unlikely(anon_vma_prepare(vma))) {
                ret = VM_FAULT_OOM;
                goto out;
            }
            page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
                        vma, address);
            if (!page) {
                ret = VM_FAULT_OOM;
                goto out;
            }
            copy_user_highpage(page, vmf.page, address, vma);
        } else { /* shared mapping */
            /*
             * If the page will be shareable, see if the backing
             * address space wants to know that the page is about
             * to become writable
             */
            if (vma->vm_ops->page_mkwrite) {
                unlock_page(page);
                if (vma->vm_ops->page_mkwrite(vma, page) < 0) {
                    ret = VM_FAULT_SIGBUS;
                    anon = 1; /* no anon but release vmf.page */
                    goto out_unlocked;
                }
                lock_page(page);
                /*
                 * XXX: this is not quite right (racy vs
                 * invalidate) to unlock and relock the page
                 * like this, however a better fix requires
                 * reworking page_mkwrite locking API, which
                 * is better done later.
                 */
                if (!page->mapping) {
                    ret = 0;
                    anon = 1; /* no anon but release vmf.page */
                    goto out;
                }
                page_mkwrite = 1;
            }
        }

    }

    page_table = pte_offset_map_lock(mm, pmd, address, &ptl);

    /*
     * This silly early PAGE_DIRTY setting removes a race
     * due to the bad i386 page protection. But it's valid
     * for other architectures too.
     *
     * Note that if write_access is true, we either now have
     * an exclusive copy of the page, or this is a shared mapping,
     * so we can make it writable and dirty to avoid having to
     * handle that later.
     */
    /* Only go through if we didn't race with anybody else... */
    if (likely(pte_same(*page_table, orig_pte))) {
        flush_icache_page(vma, page);
        entry = mk_pte(page, vma->vm_page_prot);
        if (flags & FAULT_FLAG_WRITE)
            entry = maybe_mkwrite(pte_mkdirty(entry), vma);
        set_pte_at(mm, address, page_table, entry);
        if (anon) {
                        inc_mm_counter(mm, anon_rss);
                        lru_cache_add_active(page);
                        page_add_new_anon_rmap(page, vma, address); /* establish the reverse mapping for the anonymous page */
        } else {
            inc_mm_counter(mm, file_rss);
            page_add_file_rmap(page); /* establish the reverse mapping for the file-mapped page */
            if (flags & FAULT_FLAG_WRITE) {
                dirty_page = page;
                get_page(dirty_page);
            }
        }

        /* no need to invalidate: a not-present page won't be cached */
        update_mmu_cache(vma, address, entry);
    } else {
        if (anon)
            page_cache_release(page);
        else
            anon = 1; /* no anon but release faulted_page */
    }

    pte_unmap_unlock(page_table, ptl);

out:
    unlock_page(vmf.page);
out_unlocked:
    if (anon)
        page_cache_release(vmf.page);
    else if (dirty_page) {
        if (vma->vm_file)
            file_update_time(vma->vm_file);

        set_page_dirty_balance(dirty_page, page_mkwrite);
        put_page(dirty_page);
    }

    return ret;
}

Given the vm_area_struct of the region involved, how does the kernel choose the method used to read in the page? (A sketch of the resulting chain follows the list.)

  • The file object of the mapping is found via vm_area_struct->vm_file.
  • A pointer to the mapping itself is found in file->f_mapping.
  • Every address space has its own address space operations, from which the readpage method is selected.
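
Put together, the chain looks roughly like this. This is an illustrative sketch only, not actual kernel code; the member name a_ops and the readpage signature are assumed from this kernel generation, and filemap_fault performs the equivalent steps internally.

/* Sketch: from a VMA to the readpage operation of its backing file. */
static int read_backing_page(struct vm_area_struct *vma, struct page *page)
{
    struct file *file = vma->vm_file;                /* file object of the mapping */
    struct address_space *mapping = file->f_mapping; /* address space of the file  */

    return mapping->a_ops->readpage(file, page);     /* address space operation    */
}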

Anonymous Pages

Pages that have no file as backing store are mapped by do_anonymous_page. Apart from the fact that no data need to be read into the page, the procedure hardly differs from mapping file-based data: a new page is allocated in the highmem zone and zero-filled, then inserted into the process's page tables, and the caches/MMU are updated.

Copy on Write

First, vm_normal_page is called to find the struct page instance of the faulting page.

alloc_page_vma allocates a new page.

cow_user_page then copies the data of the faulting page into the new page.

Next, page_remove_rmap removes the reverse mapping to the old, read-only page.

page_add_anon_rmap inserts the new page into the reverse mapping data structures.
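
The observable effect can be demonstrated from userspace (a minimal sketch): after fork, parent and child initially share the same write-protected physical pages; the child's write below triggers the copy-on-write path and gives the child its own copy, so the parent's value stays unchanged.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 42;          /* page shared copy-on-write after fork */

int main(void)
{
    pid_t pid = fork();

    if (pid == 0) {             /* child: this write faults and triggers COW */
        value = 99;
        printf("child sees %d\n", value);
        return 0;
    }
    waitpid(pid, NULL, 0);
    printf("parent still sees %d\n", value);   /* prints 42 */
    return 0;
}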

Kernel Page Faults

A fault caused by accessing the vmalloc area is a legitimate reason for a page fault and must be corrected: changes in the vmalloc area are not propagated to a process's page tables until the corresponding page fault occurs, at which point the appropriate access information must be copied from the master page table.

For page faults that are not caused by access to the vmalloc area, the exception fixup mechanism is a last resort.

Whenever such a fault occurs, the cause of the fault and the address of the currently executing code are output. This enables the kernel to compile a list of all risky code blocks that may perform unauthorized memory accesses. This "exception table" is created when the kernel image is linked and is located between __start_exception_table and __end_exception_table in the binary. Each entry corresponds to an instance of struct exception_table_entry; although the structure is architecture-specific, on IA-32 it looks as follows:

<include/asm-x86/uaccess_32.h>  

struct exception_table_entry  {    
    unsigned long insn, fixup;  
}; 

fixup_exception is used to search the exception table:

arch/x86/mm/extable_32.c  
int fixup_exception(struct pt_regs *regs)  {    
    const struct exception_table_entry *fixup;  
 
    fixup = search_exception_tables(regs->eip);    
    if (fixup) {
        regs->eip = fixup->fixup;
        return 1;
    }  
 
    return 0;  
}  

Copying Data Between Kernel and Userspace

1. Check that the pointer refers to a location in user space.

2. Make sure the page is present in physical memory; otherwise call handle_mm_fault to read it in.

3. Use the exception fixup mechanism to recover from bad pointers.
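
These are the steps behind the copy_to_user / copy_from_user helpers. Below is a minimal sketch of how kernel code of this generation typically uses them; the surrounding function, its parameters and struct my_config are hypothetical placeholders.

#include <linux/errno.h>
#include <linux/uaccess.h>   /* copy_from_user */

struct my_config { int flags; int size; };   /* placeholder structure */

/* Hypothetical helper: copy a small configuration block from userspace. */
static int fetch_config(void __user *uptr, struct my_config *cfg)
{
    /* copy_from_user returns the number of bytes that could NOT be copied.
       Page faults on the user buffer are handled transparently via
       handle_mm_fault; a genuinely bad pointer is caught through the
       exception table and results in -EFAULT. */
    if (copy_from_user(cfg, uptr, sizeof(*cfg)))
        return -EFAULT;
    return 0;
}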

Reposted from www.cnblogs.com/r1ng0/p/10570546.html