Understanding the Linux Kernel, 3rd Edition -- Notes 2.pdf

Chapter 8. Memory Management

      8.1. Page Frame Management

         8.1.1. Page Descriptors

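                 State information of a page frame is kept in a page descriptor of type page.

                 All page descriptors are stored in the mem_map array.

                 virt_to_page(addr): yields the address of the page descriptor associated with the linear address addr.

                 pfn_to_page(pfn): yields the address of the page descriptor associated with the page frame having number pfn.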
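
                 A minimal user-space sketch of how these two macros relate to mem_map. It assumes a flat memory model, 4 KB pages, and a direct mapping that starts at 0xC0000000; the struct page layout and the sketch_ names are stand-ins for illustration, not the kernel's definitions.

```c
#include <assert.h>

#define PAGE_SHIFT   12             /* 4 KB pages (assumption) */
#define PAGE_OFFSET  0xC0000000UL   /* start of the direct mapping (assumption) */
#define NR_FRAMES    1024           /* toy amount of RAM: 4 MB */

struct page { unsigned long flags; int count; };  /* stand-in layout */

static struct page mem_map[NR_FRAMES];

/* Descriptor of page frame number pfn: a plain array index. */
static struct page *sketch_pfn_to_page(unsigned long pfn)
{
    assert(pfn < NR_FRAMES);
    return &mem_map[pfn];
}

/* Descriptor of the frame backing a directly mapped linear address. */
static struct page *sketch_virt_to_page(unsigned long addr)
{
    unsigned long phys = addr - PAGE_OFFSET;   /* linear -> physical */
    return sketch_pfn_to_page(phys >> PAGE_SHIFT);
}
```

                 Any linear address inside the same 4 KB page maps to the same descriptor, because the low 12 bits are discarded by the shift.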

 

       

           

            8.1.2. Non-Uniform Memory Access (NUMA)

 

                 The physical memory inside each node can be split into several zones, as we will see in the next

                 section. Each node has a descriptor of type pg_data_t.

 

          

 

                 8.1.3. Memory Zones

                      Linux 2.6 partitions the physical memory of every memory node
                      into three zones. In the 80x86 UMA architecture the zones are:

                      ZONE_DMA
                            Contains page frames of memory below 16 MB

                      ZONE_NORMAL
                            Contains page frames of memory at and above 16 MB and below 896 MB

                      ZONE_HIGHMEM
                            Contains page frames of memory at and above 896 MB

 

                      The ZONE_DMA and ZONE_NORMAL zones include the "normal" page frames that can be directly accessed

                      by the kernel through the linear mapping in the fourth gigabyte of the linear address space (see the

                      section "Kernel Page Tables" in Chapter 2). Conversely, the ZONE_HIGHMEM zone includes page frames

                      that cannot be directly accessed by the kernel through the linear mapping in the fourth gigabyte of

                      linear address space (see the section "Kernel Mappings of High-Memory Page Frames" later in this

                      chapter). The ZONE_HIGHMEM zone is always empty on 64-bit architectures.
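
                      The zone boundaries can be restated as a small classifier. This is only a sketch for the 32-bit 80x86 layout described here; the function name and the 64 GB upper bound (the PAE limit) are assumptions of the example.

```c
#include <assert.h>

#define MB (1024ULL * 1024ULL)

enum sketch_zone { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

/* Zone of the page frame that contains physical address phys. */
static enum sketch_zone sketch_zone_of(unsigned long long phys)
{
    assert(phys < 64ULL * 1024ULL * MB);  /* assumed PAE limit of 64 GB */
    if (phys < 16 * MB)
        return ZONE_DMA;        /* below 16 MB */
    if (phys < 896 * MB)
        return ZONE_NORMAL;     /* 16 MB up to, but excluding, 896 MB */
    return ZONE_HIGHMEM;        /* 896 MB and above */
}
```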

 

                      Each memory zone has its own descriptor of type zone. Its fields are shown in Table 8-4.

                

          

 

                

                 8.1.4. The Pool of Reserved Page Frames

                      The amount of reserved memory (in kilobytes) is stored in the min_free_kbytes variable.

                      Initially, min_free_kbytes cannot be lower than 128 or greater than 65,536.

 

                      The pages_min field of the zone descriptor stores the number of reserved page frames inside the zone. As we'll see in Chapter 17, this field also plays a role in the page frame reclaiming algorithm, together with the pages_low and pages_high fields. The pages_low field is always set to 5/4 of the value of pages_min, and pages_high is always set to 3/2 of the value of pages_min.
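
                      The relationship between the three watermarks is simple arithmetic. A sketch, with a hypothetical struct and function name, using integer division as a kernel would:

```c
#include <assert.h>

struct sketch_marks {
    unsigned long pages_min;
    unsigned long pages_low;
    unsigned long pages_high;
};

/* Derive pages_low and pages_high from pages_min. */
static struct sketch_marks sketch_watermarks(unsigned long pages_min)
{
    struct sketch_marks m;

    assert(pages_min > 0);
    m.pages_min  = pages_min;
    m.pages_low  = pages_min * 5 / 4;   /* 5/4 of pages_min */
    m.pages_high = pages_min * 3 / 2;   /* 3/2 of pages_min */
    return m;
}
```

                      For example, a zone with pages_min equal to 128 gets pages_low = 160 and pages_high = 192.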

                 8.1.5. The Zoned Page Frame Allocator

                      

 

                      8.1.5.1. Requesting and releasing page frames

                      alloc_pages(gfp_mask, order)
                      alloc_page(gfp_mask)
                            Macros used to request 2^order contiguous page frames (a single page frame in the case of alloc_page( )). They return the address of the descriptor of the first allocated page frame, or NULL if the allocation failed.

                      __get_free_pages(gfp_mask, order)
                      __get_free_page(gfp_mask)
                      get_zeroed_page(gfp_mask)
                      __get_dma_pages(gfp_mask, order)
                            Similar to alloc_pages( ), but they return the linear address of the first allocated page.

 

 

                      __free_pages(page, order)
                      __free_page(page)
                            __free_pages( ) checks the page descriptor pointed to by page; if the page frame is not reserved (i.e., if the PG_reserved flag is equal to 0), it decreases the count field of the descriptor. If count becomes 0, it assumes that 2^order contiguous page frames starting from the one corresponding to page are no longer used. In this case, the function releases the page frames as explained in a later section. __free_page( ) does the same for a single page frame.

                      free_pages(addr, order)
                      free_page(addr)
                            Similar to __free_pages( ) and __free_page( ), but they receive as an argument the linear address addr of the first page frame to be released.

 

                 8.1.6. Kernel Mappings of High-Memory Page Frames

                      The kernel uses three different mechanisms to map page frames in high memory; they are called

                      permanent kernel mapping, temporary kernel mapping, and noncontiguous memory allocation. In

                      this section, we'll cover the first two techniques; the third one is discussed in the section

                      "Noncontiguous Memory Area Management" later in this chapter

 

                      8.1.6.1. Permanent kernel mappings 

                      page_address( )
                            Returns the linear address associated with the page frame, or NULL if the page frame is in high memory and is not mapped.

                      kmap_high( )
                            Invoked by kmap( ) if the page frame really belongs to high memory.

                      kunmap( )
                            Destroys a permanent kernel mapping established previously by kmap( ).

               

                      8.1.6.2. Temporary kernel mappings

                       kmap_atomic( )

 

                 8.1.7. The Buddy System Algorithm

                      The technique adopted by Linux to solve the external fragmentation problem is based on the well-known buddy system algorithm. All free page frames are grouped into 11 lists of blocks that contain groups of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames, respectively. The largest request of 1024 page frames corresponds to a chunk of 4 MB of contiguous RAM. The physical address of the first page frame of a block is a multiple of the group size. For example, the initial address of a 16-page-frame block is a multiple of 16 x 2^12 (2^12 = 4,096, which is the regular page size).
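
                      The size and alignment rule can be checked with a few lines of arithmetic. A sketch assuming 4 KB pages and orders 0 through 10; the function names are made up for the example.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* 4 KB pages */
#define MAX_ORDER  11   /* orders 0..10, i.e. 1..1024 page frames */

/* Size in bytes of a block of 2^order page frames. */
static unsigned long sketch_block_bytes(unsigned int order)
{
    assert(order < MAX_ORDER);
    return 1UL << (order + PAGE_SHIFT);
}

/* Nonzero if phys_addr may be the first address of a block of that order. */
static int sketch_block_aligned(unsigned long phys_addr, unsigned int order)
{
    return (phys_addr & (sketch_block_bytes(order) - 1)) == 0;
}
```

                      A 16-page-frame block has order 4, so its first address must be a multiple of 65,536; the largest block (order 10) spans 4 MB.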

 

                      8.1.7.1. Data structures

                      1:   zone->zone_mem_map: pointer to the first page descriptor of the zone.

                      2:   An array consisting of eleven elements of type free_area, one element for each group size. The array is stored in the free_area field of the zone descriptor: zone->free_area[k].

 

                            8.1.7.2. Allocating a block

                            The __rmqueue( ) function is used to find a free block in a zone.

 

                      8.1.7.3. Freeing a block

                            The __free_pages_bulk( ) / __free_one_page( ) function implements the buddy system strategy for freeing page frames.
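
                            A toy model of the coalescing step performed when a block is freed. It is a sketch under simplifying assumptions: a single 1024-page zone, and an array (free_order, hypothetical) standing in for the free_area lists. The buddy of a block is found by flipping the bit that corresponds to the block size in its page index.

```c
#include <assert.h>

#define MAX_ORDER 11
#define NR_PAGES  1024   /* toy zone: exactly one maximal block */

/* free_order[i] is the order of the free block starting at page index i,
 * or -1 if no free block starts there. */
static int free_order[NR_PAGES];

static void sketch_zone_init(void)
{
    for (int i = 0; i < NR_PAGES; i++)
        free_order[i] = -1;
}

/* Free the block of 2^order pages at page_idx, merging it with free
 * buddies. Returns the order of the block finally recorded as free. */
static int sketch_free_block(unsigned long page_idx, int order)
{
    assert(order >= 0 && order < MAX_ORDER);
    assert(page_idx < NR_PAGES);
    assert((page_idx & ((1UL << order) - 1)) == 0);  /* block alignment */

    while (order < MAX_ORDER - 1) {
        unsigned long buddy_idx = page_idx ^ (1UL << order);

        if (free_order[buddy_idx] != order)
            break;                    /* buddy busy or split: stop merging */
        free_order[buddy_idx] = -1;   /* take the buddy off its list */
        page_idx &= buddy_idx;        /* merged block starts at the lower index */
        order++;
    }
    free_order[page_idx] = order;
    return order;
}
```

                            Freeing pages 0 and 1 one after the other yields a single order-1 block at index 0; freeing the order-1 block at index 2 then merges everything into an order-2 block.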

 

                 8.1.8. The Per-CPU Page Frame Cache

                      The main data structure implementing the per-CPU page frame cache is an array of per_cpu_pageset

                      data structures stored in the pageset field of the memory zone descriptor. The array includes one

                      element for each CPU; this element, in turn, consists of two per_cpu_pages descriptors, one for the

                      hot cache and the other for the cold cache. The fields of the per_cpu_pages descriptor are listed in Table 8-7.

                      Table 8-7. The fields of the per_cpu_pages descriptor

                      Type               Name     Description
                      int                count    Number of page frames in the cache
                      int                low      Low watermark for cache replenishing
                      int                high     High watermark for cache depletion
                      int                batch    Number of page frames to be added or subtracted from the cache
                      struct list_head   list     List of descriptors of the page frames included in the cache

 

                            8.1.8.1. Allocating page frames through the per-CPU page frame caches

                            buffered_rmqueue( )

                    8.1.8.2. Releasing page frames to the per-CPU page frame caches

                                   free_hot_cold_page( )
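
                      How the count, low, high, and batch fields interact can be sketched as follows. This is a toy model, not the kernel code: the buddy system is reduced to a counter (buddy_free_frames, hypothetical), and the struct and function names are invented for the example.

```c
#include <assert.h>

struct sketch_pcp { int count, low, high, batch; };

/* Toy stand-in for the buddy system: a counter of free page frames. */
static int buddy_free_frames = 1000;

/* Allocation path: refill from the buddy system when the cache runs low.
 * Returns 0 on success, -1 if no page frame is available. */
static int sketch_pcp_alloc(struct sketch_pcp *pcp)
{
    assert(pcp->batch > 0);
    if (pcp->count <= pcp->low) {
        int n = pcp->batch < buddy_free_frames ? pcp->batch : buddy_free_frames;
        buddy_free_frames -= n;
        pcp->count += n;
    }
    if (pcp->count == 0)
        return -1;
    pcp->count--;
    return 0;
}

/* Release path: hand a batch back when the cache has grown too large. */
static void sketch_pcp_free(struct sketch_pcp *pcp)
{
    assert(pcp->batch > 0);
    if (pcp->count >= pcp->high) {
        int n = pcp->batch < pcp->count ? pcp->batch : pcp->count;
        pcp->count -= n;
        buddy_free_frames += n;
    }
    pcp->count++;
}
```

                      The buddy system is touched only when a watermark is crossed; all other requests are served from the per-CPU list alone.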

 

                 8.1.9. The Zone Allocator

                            __alloc_pages( ) --> zone_watermark_ok( )

                            __free_pages( ) --> __free_one_page( )

                     

           8.2. Memory Area Management

       

                 8.2.1. The Slab Allocator

                 Figure 8-3. The slab allocator components

                

                 8.2.2. Cache Descriptor

                 1:   Each cache is described by a structure of type kmem_cache_t (e.g., kmem_cache).

                    Table 8-8. The fields of the kmem_cache_t descriptor

                 Type                      Name           Description
                 struct array_cache *[]    array          Per-CPU array of pointers to local caches of free objects (see the section "Local Caches of Free Slab Objects" later in this chapter).
                 unsigned int              batchcount     Number of objects to be transferred in bulk to or from the local caches.
                 unsigned int              limit          Maximum number of free objects in the local caches. This is tunable.
                 struct kmem_list3         lists          See next table.
                 unsigned int              objsize        Size of the objects included in the cache.
                 unsigned int              flags          Set of flags that describes permanent properties of the cache.
                 unsigned int              num            Number of objects packed into a single slab. (All slabs of the cache have the same size.)
                 unsigned int              free_limit     Upper limit of free objects in the whole slab cache.
                 spinlock_t                spinlock       Cache spin lock.
                 unsigned int              gfporder       Logarithm of the number of contiguous page frames included in a single slab.
                 unsigned int              gfpflags       Set of flags passed to the buddy system function when allocating page frames.
                 size_t                    colour         Number of colors for the slabs (see the section "Slab Coloring" later in this chapter).
                 unsigned int              colour_off     Basic alignment offset in the slabs.
                 unsigned int              colour_next    Color to use for the next allocated slab.
                 kmem_cache_t *            slabp_cache    Pointer to the general slab cache containing the slab descriptors (NULL if internal slab descriptors are used; see next section).
                 unsigned int              slab_size      The size of a single slab.
                 unsigned int              dflags         Set of flags that describe dynamic properties of the cache.
                 void *                    ctor           Pointer to constructor method associated with the cache.
                 void *                    dtor           Pointer to destructor method associated with the cache.
                 const char *              name           Character array storing the name of the cache.
                 struct list_head          next           Pointers for the doubly linked list of cache descriptors.

 

                           

 

                 The CFLGS_OFF_SLAB flag in the flags field of the cache descriptor is set to one if the slab descriptor is stored outside the slab; it is set to zero otherwise.

            2:   The lists field of the kmem_cache_t descriptor

        

        

                 8.2.3. Slab Descriptor

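                            kmem_cache->flags:
                            The CFLGS_OFF_SLAB flag in the flags field of the cache descriptor is set to one if the slab descriptor is stored outside the slab.

                            External slab descriptor
                                  Stored outside the slab, in one of the general caches.

                            Internal slab descriptor
                                  Stored inside the slab, at the beginning of the first page frame assigned to the slab.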

                                                               

     

                            Figure 8-4. Relationship between cache and slab descriptors

 

                  

 

           8.2.4. General and Specific Caches

 

                 General caches are:

                      1:   A first cache called kmem_cache whose objects are the cache descriptors of the remaining

                            caches used by the kernel. The cache_cache variable contains the descriptor of this special cache.

 

                      2:        Several additional caches contain general purpose memory areas. The range of the memory

                            area sizes typically includes 13 geometrically distributed sizes. A table called malloc_sizes

                            (whose elements are of type cache_sizes) points to 26 cache descriptors associated with

                            memory areas of size 32, 64, 128, 256, 512, 1,024, 2,048, 4,096, 8,192, 16,384, 32,768,

                            65,536, and 131,072 bytes. For each size, there are two caches: one suitable for ISA DMA

                            allocations and the other for normal allocations.
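
                      Choosing a general cache for a request amounts to finding the smallest listed size that fits. A sketch of that lookup, using only the 13 sizes given above; the array and function names are hypothetical and do not reproduce the layout of malloc_sizes.

```c
#include <assert.h>
#include <stddef.h>

/* The 13 general-cache object sizes listed above, in bytes. */
static const size_t sketch_sizes[] = {
    32, 64, 128, 256, 512, 1024, 2048, 4096,
    8192, 16384, 32768, 65536, 131072
};

#define NR_SIZES (sizeof(sketch_sizes) / sizeof(sketch_sizes[0]))

/* Object size of the smallest general cache that can hold `size` bytes,
 * or 0 if the request exceeds the largest cache. */
static size_t sketch_general_cache_size(size_t size)
{
    assert(size > 0);
    for (size_t i = 0; i < NR_SIZES; i++)
        if (size <= sketch_sizes[i])
            return sketch_sizes[i];
    return 0;
}
```

                      A request for 100 bytes is served from the 128-byte cache, so up to 28 bytes are lost to internal fragmentation.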

                 Specific caches:

              Specific caches are created by the kmem_cache_create( ) function. Depending on the parameters,

                      the function first determines the best way to handle the new cache (for instance, whether to include

                      the slab descriptor inside or outside of the slab). It then allocates a cache descriptor for the new

                      cache from the cache_cache general cache and inserts the descriptor in the cache_chain list of cache

                      descriptors (the insertion is done after having acquired the cache_chain_sem semaphore that

                      protects the list from concurrent accesses).

                     

                      It is also possible to destroy a cache and remove it from the cache_chain list by invoking

                kmem_cache_destroy( ). This function is mostly useful to modules that create their own caches when

                      loaded and destroy them when unloaded. To avoid wasting memory space, the kernel must destroy

                      all slabs before destroying the cache itself. The kmem_cache_shrink( ) function destroys all the slabs

                      in a cache by invoking slab_destroy( ) iteratively (see the later section "Releasing a Slab from a

                      Cache").

 

                      The names of all general and specific caches can be obtained at runtime by reading /proc/slabinfo;

                      this file also specifies the number of free objects and the number of allocated objects in each cache.

                 8.2.5. Interfacing the Slab Allocator with the Zoned Page Frame Allocator

                            kmem_getpages( )

                    kmem_freepages( )

            8.2.6. Allocating a Slab to a Cache

                            cache_grow( )

            8.2.7. Releasing a Slab from a Cache

                            slab_destroy( )

            8.2.8. Object Descriptor

                      Internal object descriptors

                      External object descriptors

 

                      The first object descriptor in the array describes the first object in the slab, and so on. An object

                      descriptor is simply an unsigned short integer, which is meaningful only when the object is free. It

                      contains the index of the next free object in the slab, thus implementing a simple list of free objects

                      inside the slab. The object descriptor of the last element in the free object list is marked by the

                      conventional value BUFCTL_END (0xffff).
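
                      The free list threaded through the object descriptors can be modeled in a few lines. This is a sketch: the struct below, with its free and inuse fields and a fixed number of objects, is an assumption of the example rather than the kernel's slab descriptor.

```c
#include <assert.h>

#define BUFCTL_END 0xffff
#define NUM_OBJS   4

typedef unsigned short kmem_bufctl_t;

struct sketch_slab {
    kmem_bufctl_t bufctl[NUM_OBJS];  /* one object descriptor per object */
    kmem_bufctl_t free;              /* index of the first free object */
    unsigned int inuse;              /* number of objects currently in use */
};

/* All objects free: each descriptor points to the next object. */
static void sketch_slab_init(struct sketch_slab *s)
{
    for (int i = 0; i < NUM_OBJS; i++)
        s->bufctl[i] = (i + 1 < NUM_OBJS) ? (kmem_bufctl_t)(i + 1) : BUFCTL_END;
    s->free = 0;
    s->inuse = 0;
}

/* Pop the first free object; returns its index, or -1 if the slab is full. */
static int sketch_slab_get(struct sketch_slab *s)
{
    int idx;

    if (s->free == BUFCTL_END)
        return -1;
    idx = s->free;
    s->free = s->bufctl[idx];   /* descriptor holds the next free index */
    s->inuse++;
    return idx;
}

/* Push object idx back onto the head of the free list. */
static void sketch_slab_put(struct sketch_slab *s, int idx)
{
    assert(idx >= 0 && idx < NUM_OBJS);
    s->bufctl[idx] = s->free;
    s->free = (kmem_bufctl_t)idx;
    s->inuse--;
}
```

                      Because a descriptor is read only while its object is free, the list costs two bytes per object and no extra bookkeeping for objects in use.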

 

                      Figure 8-5. Relationships between slab and object descriptors

   

             

 

                 8.2.9. Aligning Objects in Memory

 

                 8.2.10. Slab Coloring

                 Objects that have the same offset within different slabs are likely to end up mapped in the same cache line. The cache hardware might therefore waste memory cycles transferring two objects from the same cache line back and forth to different RAM locations, while other cache lines go underutilized.

                 The slab allocator tries to reduce this effect with a policy called slab coloring: different arbitrary values called colors are assigned to the slabs.

                 Figure 8-6. Slab with color col and alignment aln
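
                 A sketch of how colors translate into byte offsets. It assumes that the number of usable colors is the unused space in the slab divided by the alignment aln (at least one), and that consecutive slabs take consecutive colors with wraparound; these rules and all names below are assumptions of the example, not taken from these notes.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_cache {
    size_t colour;             /* number of colors */
    unsigned int colour_off;   /* alignment aln, in bytes */
    unsigned int colour_next;  /* color for the next allocated slab */
};

/* Number of colors that fit in free_bytes of unused slab space. */
static size_t sketch_nr_colours(size_t free_bytes, unsigned int aln)
{
    size_t n;

    assert(aln > 0);
    n = free_bytes / aln;
    return n ? n : 1;          /* every slab has at least color 0 */
}

/* Byte offset applied to the next slab; advances colour_next cyclically. */
static size_t sketch_next_offset(struct sketch_cache *c)
{
    size_t offset = (size_t)c->colour_next * c->colour_off;

    assert(c->colour > 0);
    if (++c->colour_next >= c->colour)
        c->colour_next = 0;
    return offset;
}
```

                 With 100 unused bytes and a 32-byte alignment there are three colors, so successive slabs start their objects at offsets 0, 32, 64, and then 0 again.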

 

        

 

                 8.2.11. Local Caches of Free Slab Objects

                      Each cache of the slab allocator includes a per-CPU data structure consisting of a small array of pointers to freed objects, called the slab local cache. The slab data structures get involved only when the local cache underflows or overflows.

 

                      kmem_cache->array

                 Table 8-11. The fields of the array_cache structure

 

                 Type            Name          Description
                 unsigned int    avail         Number of pointers to available objects in the local cache. The field also acts as the index of the first free slot in the cache.
                 unsigned int    limit         Size of the local cache, that is, the maximum number of pointers in the local cache.
                 unsigned int    batchcount    Chunk size for local cache refill or emptying
                 unsigned int    touched       Flag set to 1 if the local cache has been recently used
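
                 The double role of avail (count and index of the first free slot) makes the local cache a simple stack. A sketch with a fixed-size pointer array; the entry array and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SLOTS 4

struct sketch_array_cache {
    unsigned int avail;          /* number of pointers, and first free slot */
    unsigned int limit;          /* maximum number of pointers */
    void *entry[SKETCH_SLOTS];   /* pointers to free objects */
};

/* Take the most recently freed object, or NULL on underflow. */
static void *sketch_ac_get(struct sketch_array_cache *ac)
{
    if (ac->avail == 0)
        return NULL;             /* the slab lists would be consulted here */
    return ac->entry[--ac->avail];
}

/* Store a freed object; returns 0 on success, -1 on overflow. */
static int sketch_ac_put(struct sketch_array_cache *ac, void *obj)
{
    assert(ac->limit <= SKETCH_SLOTS);
    if (ac->avail >= ac->limit)
        return -1;               /* a batch would be flushed to the slabs here */
    ac->entry[ac->avail++] = obj;
    return 0;
}
```

                 The last object freed is the first one handed out again, which favors objects that are still likely to be in the hardware cache.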

 

          

                 8.2.12. Allocating a Slab Object

                 kmem_cache_alloc( )

            -->cache_alloc_refill( )

 

            8.2.13. Freeing a Slab Object

                 kmem_cache_free( )

            -->cache_flusharray( )

 

            8.2.14. General Purpose Objects

                 kmalloc( )

            kfree()

 

                 8.2.15. Memory Pools

                 Memory pools should not be confused with the reserved page frames described in the earlier section "The Pool of Reserved Page Frames": those page frames can be used only to satisfy atomic memory allocation requests issued by interrupt handlers or inside critical regions.

                 A memory pool, instead, is a reserve of dynamic memory that can be used only by a specific kernel component, namely the "owner" of the pool.

                 A memory pool is described by a mempool_t object.

 

                 Table 8-12. The fields of the mempool_t object

                 Type                 Name         Description
                 spinlock_t           lock         Spin lock protecting the object fields
                 int                  min_nr       Minimum number of elements in the memory pool
                 int                  curr_nr      Current number of elements in the memory pool
                 void **              elements     Pointer to an array of pointers to the reserved elements
                 void *               pool_data    Private data available to the pool's owner
                 mempool_alloc_t *    alloc        Method to allocate an element
                 mempool_free_t *     free         Method to free an element
                 wait_queue_head_t    wait         Wait queue used when the memory pool is empty

 

                 mempool_create( )

            mempool_destroy( )

 

            mempool_alloc( )

                 mempool_free( )
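
                 The idea behind the allocation and release paths can be sketched in user space: the reserve is used only when the underlying allocator fails, and freed elements refill the reserve before going back to the allocator. This is a simplified model (no locking, no sleeping on the wait queue); the sketch_ names and the toy allocator that fails on demand are assumptions of the example.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_mempool {
    int min_nr;                  /* size of the reserve */
    int curr_nr;                 /* elements currently in the reserve */
    void **elements;             /* the reserved elements */
    void *pool_data;             /* private data of the owner */
    void *(*alloc)(void *pool_data);
    void (*free)(void *element, void *pool_data);
};

/* Toy underlying allocator: fails when *(int *)pool_data is nonzero. */
static void *sketch_alloc(void *pool_data)
{
    return *(int *)pool_data ? NULL : malloc(64);
}

static void sketch_free(void *element, void *pool_data)
{
    (void)pool_data;
    free(element);
}

/* Try the underlying allocator first; fall back on the reserve. */
static void *sketch_mempool_alloc(struct sketch_mempool *p)
{
    void *e = p->alloc(p->pool_data);

    if (e != NULL)
        return e;
    if (p->curr_nr > 0)
        return p->elements[--p->curr_nr];
    return NULL;   /* a real pool could sleep on its wait queue instead */
}

/* Refill the reserve first; release the element only if it is full. */
static void sketch_mempool_free(struct sketch_mempool *p, void *e)
{
    assert(e != NULL);
    if (p->curr_nr < p->min_nr)
        p->elements[p->curr_nr++] = e;
    else
        p->free(e, p->pool_data);
}
```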

 

        8.3. Noncontiguous Memory Area Management

            It makes sense to consider an allocation scheme based on noncontiguous page frames accessed through contiguous linear addresses. The main advantage of this scheme is that it avoids external fragmentation.

            8.3.1. Linear Addresses of Noncontiguous Memory Areas

           

            Figure 8-7. The linear address interval starting from PAGE_OFFSET

           

          

 

                 1:   The beginning of the area includes the linear addresses that map the first 896 MB of RAM (see

                      the section "Process Page Tables" in Chapter 2); the linear address that corresponds to the end

                      of the directly mapped physical memory is stored in the high_memory variable.

 

                 2:   The end of the area contains the fix-mapped linear addresses (see the section "Fix-Mapped

                      Linear Addresses" in Chapter 2).

 

                 3:   The remaining linear addresses can be used for noncontiguous memory areas. A safety interval

                      of size 8 MB (macro VMALLOC_OFFSET) is inserted between the end of the physical memory

                      mapping and the first memory area; its purpose is to "capture" out-of-bounds memory

                      accesses. For the same reason, additional safety intervals of size 4 KB are inserted to separate

                      noncontiguous memory areas.
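
                 The effect of the two safety intervals on the address layout can be sketched with plain arithmetic. The rounding rule below (end of the direct mapping plus at least 8 MB, rounded up to an 8 MB boundary) is an assumption of this example, as are the function names; only the 8 MB and 4 KB sizes come from the text above.

```c
#include <assert.h>

#define VMALLOC_OFFSET (8UL * 1024 * 1024)   /* 8 MB safety interval */
#define GUARD_SIZE     4096UL                /* 4 KB between areas */

/* First linear address usable for noncontiguous areas, given the end of
 * the directly mapped physical memory (assumed rounding rule). */
static unsigned long sketch_vmalloc_start(unsigned long high_memory)
{
    return (high_memory + 2 * VMALLOC_OFFSET - 1) & ~(VMALLOC_OFFSET - 1);
}

/* Linear address range consumed by an area of `size` bytes, including
 * the safety interval that follows it. */
static unsigned long sketch_area_span(unsigned long size)
{
    assert(size > 0 && size % 4096 == 0);   /* whole pages only */
    return size + GUARD_SIZE;
}
```

                 With 896 MB directly mapped from 0xC0000000, high_memory is 0xF8000000 and the first area can start at 0xF8800000. Each area then consumes its own size plus one unmapped 4 KB page.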

 

           8.3.2. Descriptors of Noncontiguous Memory Areas

                

                 Each noncontiguous memory area is associated with a descriptor of type vm_struct

 

            Table 8-13. The fields of the vm_struct descriptor

 

            Type                  Name         Description
            void *                addr         Linear address of the first memory cell of the area
            unsigned long         size         Size of the area plus 4,096 (inter-area safety interval)
            unsigned long         flags        Type of memory mapped by the noncontiguous memory area
            struct page **        pages        Pointer to array of nr_pages pointers to page descriptors
            unsigned int          nr_pages     Number of pages filled by the area
            unsigned long         phys_addr    Set to 0 unless the area has been created to map the I/O shared memory of a hardware device
            struct vm_struct *    next         Pointer to next vm_struct structure

 

        1:  These descriptors are inserted in a simple list by means of the next field; the address of the first element of the list is stored in the vmlist variable.

        2:  The flags field identifies the type of memory mapped by the area:

                  VM_ALLOC for pages obtained by means of vmalloc( ),

                  VM_MAP for already allocated pages mapped by means of vmap() (see the next section), and

                  VM_IOREMAP for on-board memory of hardware devices mapped by means of ioremap( ) (see Chapter 13).

 

        3:  The get_vm_area( ) function looks for a free range of linear addresses between VMALLOC_START and VMALLOC_END.

 

 

        8.3.3. Allocating a Noncontiguous Memory Area

                 vmalloc( )

                 The last crucial step consists of fiddling with the page table entries used by the kernel to

                 indicate that each page frame allocated to the noncontiguous memory area is now associated with a

                 linear address included in the interval of contiguous linear addresses yielded by vmalloc( ). This is

                 what map_vm_area( ) does.

 

 

        8.3.4. Releasing a Noncontiguous Memory Area

 

                 vfree( )

 

            -->remove_vm_area( )

 

 

Chapter 9. Process Address Space

 

        9.1. The Process's Address Space

            The kernel represents intervals of linear addresses by means of resources called memory regions.

        Table 9-1. System calls related to memory region creation and deletion

            System call           Description
            brk( )                Changes the heap size of the process
            execve( )             Loads a new executable file, thus changing the process address space
            _exit( )              Terminates the current process and destroys its address space
            fork( )               Creates a new process, and thus a new address space
            mmap( ), mmap2( )     Creates a memory mapping for a file, thus enlarging the process address space
            mremap( )             Expands or shrinks a memory region
            remap_file_pages( )   Creates a non-linear mapping for a file (see Chapter 16)
            munmap( )             Destroys a memory mapping for a file, thus contracting the process address space
            shmat( )              Attaches a shared memory region
            shmdt( )              Detaches a shared memory region

 

        9.2. The Memory Descriptor

            mm_struct

 

            Table 9-2. The fields of the memory descriptor

 

            Type                       Field                Description
            struct vm_area_struct *    mmap                 Pointer to the head of the list of memory region objects
            struct rb_root             mm_rb                Pointer to the root of the red-black tree of memory region objects
            struct vm_area_struct *    mmap_cache           Pointer to the last referenced memory region object
            unsigned long (*)( )       get_unmapped_area    Method that searches an available linear address interval in the process address space
            void (*)( )                unmap_area           Method invoked when releasing a linear address interval
            unsigned long              mmap_base            Identifies the linear address of the first allocated anonymous memory region or file memory mapping (see the section "Program Segments and Process Memory Regions" in Chapter 20)
            unsigned long              free_area_cache      Address from which the kernel will look for a free interval of linear addresses in the process address space
            pgd_t *                    pgd                  Pointer to the Page Global Directory
            atomic_t                   mm_users             Secondary usage counter
            atomic_t                   mm_count             Main usage counter
            struct rw_semaphore        mmap_sem             Memory regions' read/write semaphore
            spinlock_t                 page_table_lock      Memory regions' and Page Tables' spin lock
            struct list_head           mmlist               Pointers to adjacent elements in the list of memory descriptors
            unsigned long              start_code           Initial address of executable code
            unsigned long              end_code             Final address of executable code
            unsigned long              start_data           Initial address of initialized data
            unsigned long              end_data             Final address of initialized data
            unsigned long              start_brk            Initial address of the heap
            unsigned long              brk                  Current final address of the heap
            unsigned long              start_stack          Initial address of User Mode stack
            unsigned long              arg_start            Initial address of command-line arguments
            unsigned long              arg_end              Final address of command-line arguments
            unsigned long              env_start            Initial address of environment variables
            unsigned long              env_end              Final address of environment variables
            unsigned long              rss                  Number of page frames allocated to the process
            unsigned long              anon_rss             Number of page frames assigned to anonymous memory mappings
            unsigned long              total_vm             Size of the process address space (number of pages)
            unsigned long              locked_vm            Number of "locked" pages that cannot be swapped out (see Chapter 17)
            unsigned long              shared_vm            Number of pages in shared file memory mappings
            unsigned long              exec_vm              Number of pages in executable memory mappings
            unsigned long              stack_vm             Number of pages in the User Mode stack
            unsigned long              reserved_vm          Number of pages in reserved or special memory regions
            unsigned long              def_flags            Default access flags of the memory regions
            unsigned long              nr_ptes              Number of Page Tables of this process
            unsigned long []           saved_auxv           Used when starting the execution of an ELF program (see Chapter 20)
            unsigned int               dumpable             Flag that specifies whether the process can produce a core dump of the memory
            cpumask_t                  cpu_vm_mask          Bit mask for lazy TLB switches (see Chapter 2)
            mm_context_t               context              Pointer to table for architecture-specific information (e.g., LDT's address in 80x86 platforms)
            unsigned long              swap_token_time      When this process will become eligible for having the swap token (see the section "The Swap Token" in Chapter 17)
            char                       recent_pagein        Flag set if a major Page Fault has recently occurred
            int                        core_waiters         Number of lightweight processes that are dumping the contents of the process address space to a core file (see the section "Deleting a Process Address Space" later in this chapter)
            struct completion *        core_startup_done    Pointer to a completion used when creating a core file (see the section "Completions" in Chapter 5)
            struct completion          core_done            Completion used when creating a core file
            rwlock_t                   ioctx_list_lock      Lock used to protect the list of asynchronous I/O contexts (see Chapter 16)
            struct kioctx *            ioctx_list           List of asynchronous I/O contexts (see Chapter 16)
            struct kioctx              default_kioctx       Default asynchronous I/O context (see Chapter 16)
            unsigned long              hiwater_rss          Maximum number of page frames ever owned by the process
            unsigned long              hiwater_vm           Maximum number of pages ever included in the memory regions of the process

 

                 The mm_alloc( ) function is invoked to get a new memory descriptor.

                 The mmput( ) function decrements the mm_users field of a memory descriptor.

 

                 9.2.1. Memory Descriptor of Kernel Threads

 

           9.3. Memory Regions

 

                 Linux implements a memory region by means of an object of type vm_area_struct;

                 Table 9-3. The fields of the memory region object

 

                 Type                                 Field                        Description     

 

                 struct mm_struct *          vm_mm               Pointer to the memory descriptor that owns the region

                 unsigned long           vm_start            First linear address inside the region

                 unsigned long           vm_end          First linear address after the region

                 struct vm_area_struct *     vm_next         Next region in the process list.

            pgprot_t                vm_page_prot            Access permissions for the page frames of the region

                 unsigned long           vm_flags            Flags of the region

                 struct rb_node          vm_rb               Data for the red-black tree (see later in this chapter).

                 union                   shared              Links to the data structures used for reverse mapping

                                                                   (see the section "Reverse Mapping for Mapped Pages" in Chapter 17).

 

                 struct list_head             anon_vma_node            Pointers for the list of anonymous memory regions

                 struct anon_vma *              anon_vma              Pointer to the anon_vma data structure

                 struct  vm_operations_struct*   vm_ops              Pointer to the methods of the memory region

                 unsigned long           vm_pgoff            Offset in mapped file (see Chapter 16). For anonymous pages,

                                                                   it is either zero or equal to vm_start/PAGE_SIZE (see Chapter 17).

                 struct file *               vm_file            Pointer to the file object of the mapped file, if any.

                 void *              vm_private_data     Pointer to private data of the memory region.

                 unsigned long                     vm_truncate_count        Used when releasing a linear address interval in a non-linear file memory mapping.

 

                 9.3.1. Memory Region Data Structures

                      Figure 9-2. Descriptors related to the address space of a process

             

     

 

                 9.3.2. Memory Region Access Rights

                 vm_area_struct->vm_flags

 

                 9.3.3. Memory Region Handling

 

                      9.3.3.1. Finding the closest region to a given address: find_vma( )

                      The find_vma( ) function acts on two parameters: the address mm of a process memory descriptor and a linear address addr.

                      It locates the first memory region whose vm_end field is greater than addr and returns the address of its descriptor
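A minimal user-space sketch of the lookup rule just stated, assuming the regions are kept in a list sorted by increasing address. The names toy_vma and toy_find_vma are hypothetical stand-ins, not the kernel's definitions; the real find_vma( ) also relies on the red-black tree of the memory descriptor rather than a plain list walk.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for vm_area_struct: only the fields the lookup needs. */
struct toy_vma {
    unsigned long vm_start;   /* first linear address inside the region */
    unsigned long vm_end;     /* first linear address after the region */
    struct toy_vma *vm_next;  /* next region, in increasing address order */
};

/* Return the first region whose vm_end is greater than addr, or NULL.
 * The returned region does not necessarily contain addr: addr may lie
 * in the hole before vm_start. */
struct toy_vma *toy_find_vma(struct toy_vma *head, unsigned long addr)
{
    struct toy_vma *vma;

    for (vma = head; vma != NULL; vma = vma->vm_next) {
        assert(vma->vm_start < vma->vm_end);  /* a region is never empty */
        if (vma->vm_end > addr)
            return vma;
    }
    return NULL;
}
```

Because only vm_end is compared, a caller that needs to know whether addr really belongs to the region must also check that vm_start is not greater than addr.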

 

                      9.3.3.2. Finding a region that overlaps a given interval: find_vma_intersection( )

                            The find_vma_intersection( ) function finds the first memory region that overlaps a given linear

                      address interval; the mm parameter points to the memory descriptor of the process, while the

                      start_addr and end_addr linear addresses specify the interval
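The overlap test can be sketched on top of the same kind of lookup: find the first region ending after start_addr, then reject it if it begins at or after end_addr. This is a user-space illustration under the assumption of a sorted list; toy_vma and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for vm_area_struct. */
struct toy_vma {
    unsigned long vm_start;   /* first linear address inside the region */
    unsigned long vm_end;     /* first linear address after the region */
    struct toy_vma *vm_next;  /* next region, in increasing address order */
};

/* First region whose vm_end is greater than addr, or NULL. */
static struct toy_vma *toy_first_ending_after(struct toy_vma *head,
                                              unsigned long addr)
{
    struct toy_vma *vma;

    for (vma = head; vma != NULL; vma = vma->vm_next)
        if (vma->vm_end > addr)
            return vma;
    return NULL;
}

/* First region overlapping [start_addr, end_addr), or NULL if none does. */
struct toy_vma *toy_find_vma_intersection(struct toy_vma *head,
                                          unsigned long start_addr,
                                          unsigned long end_addr)
{
    struct toy_vma *vma;

    assert(start_addr < end_addr);
    vma = toy_first_ending_after(head, start_addr);
    /* The candidate overlaps only if it starts before the interval ends. */
    if (vma != NULL && end_addr <= vma->vm_start)
        vma = NULL;
    return vma;
}
```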

 

                      9.3.3.3. Finding a free interval: get_unmapped_area( )

                            The get_unmapped_area( ) function searches the process address space to find an available linear address interval.

                      The function invokes one of two methods, depending on whether the linear address interval should be used for a file memory

                      mapping or for an anonymous memory mapping.

 

                      In the former case, the function executes the get_unmapped_area file operation; this is discussed in Chapter 16.

                      In the latter case, the function executes the get_unmapped_area method of the memory descriptor

                      In turn, this method is implemented by either the arch_get_unmapped_area( ) function, or the

                      arch_get_unmapped_area_topdown( ) function, according to the memory region layout of the process.
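The two-level dispatch described above can be pictured with function pointers. This is only a sketch: the toy_ types, the two placement policies, and the TOY_MMAP_BASE and TOY_MMAP_TOP constants are hypothetical, and each policy pretends the address space is empty instead of scanning the existing regions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MMAP_BASE 0x40000000UL  /* hypothetical start of bottom-up searches */
#define TOY_MMAP_TOP  0xb0000000UL  /* hypothetical ceiling of top-down searches */

typedef unsigned long (*toy_get_area_fn)(unsigned long len);

/* A file may supply its own placement method (file memory mapping). */
struct toy_file { toy_get_area_fn get_unmapped_area; };
/* The memory descriptor always supplies one (anonymous memory mapping). */
struct toy_mm { toy_get_area_fn get_unmapped_area; };

/* Bottom-up layout: grow from a fixed base toward higher addresses. */
unsigned long toy_area_bottom_up(unsigned long len)
{
    (void)len;
    return TOY_MMAP_BASE;
}

/* Top-down layout: place the interval just below a fixed ceiling. */
unsigned long toy_area_topdown(unsigned long len)
{
    return TOY_MMAP_TOP - len;
}

/* Prefer the file's method when there is one, else use the descriptor's. */
unsigned long toy_get_unmapped_area(const struct toy_mm *mm,
                                    const struct toy_file *file,
                                    unsigned long len)
{
    assert(mm != NULL && mm->get_unmapped_area != NULL);
    if (file != NULL && file->get_unmapped_area != NULL)
        return file->get_unmapped_area(len);
    return mm->get_unmapped_area(len);
}
```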

 

 

                      9.3.3.4. Inserting a region in the memory descriptor list: insert_vm_struct( )

                            insert_vm_struct( ) inserts a vm_area_struct structure in the memory region object list and red-black

                      tree of a memory descriptor.

 

                     9.3.4. Allocating a Linear Address Interval

                     

                      The do_mmap( ) function creates and initializes a new memory region for the current process.

 

                 9.3.5. Releasing a Linear Address Interval

 

                      When the kernel must delete a linear address interval from the address space of the current process,

                      it uses the do_munmap( ) function

 

                      9.3.5.1. The do_munmap( ) function

                            split_vma( )--->detach_vmas_to_be_unmapped( )--->unmap_region( )

 

                    9.3.5.2. The split_vma( ) function

 

                            9.3.5.3. The unmap_region( ) function

 

           9.4. Page Fault Exception Handler // key topic

            Figure 9-4. Overall scheme for the Page Fault handler

 

          

 

 

                 9.4.5. Handling Noncontiguous Memory Area Accesses

 

           9.5. Creating and Deleting a Process Address Space

 

            9.5.1. Creating a Process Address Space

                 9.5.2. Deleting a Process Address Space

 

        9.6. Managing the Heap

 

 

Chapter 10. System Calls

 

        10.1. POSIX APIs and System Calls

            An application programmer interface (API) is a function definition that specifies how to obtain a given service.

                 A system call is an explicit request to the kernel made via a software interrupt.
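The difference between the two can be seen from User Mode. In the sketch below, getpid( ) is the API (a C library function), while syscall(SYS_getpid) issues the request to the kernel directly by system call number. It assumes Linux with glibc; raw_getpid is a hypothetical helper name, and the instruction actually used to enter the kernel depends on the architecture and library.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for the caller's PID through the generic syscall()
 * entry point instead of the getpid() wrapper routine. */
pid_t raw_getpid(void)
{
    long ret = syscall(SYS_getpid);

    assert(ret > 0);  /* getpid cannot fail */
    return (pid_t)ret;
}
```

Both paths should report the same PID, since the wrapper ultimately asks the same kernel service routine.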

          

           10.2. System Call Handler and Service Routines

            Figure 10-1. Invoking a system call

                

              

 

           10.3. Entering and Exiting a System Call

 

                 Applications can invoke a system call in two different ways:

                 By executing the int $0x80 assembly language instruction; in older versions of the Linux

                 kernel, this was the only way to switch from User Mode to Kernel Mode

 

                 By executing the sysenter assembly language instruction, introduced in the Intel Pentium II

                 microprocessors; this instruction is now supported by the Linux 2.6 kernel.

 

 

                 The kernel can exit from a system call, switching the CPU back to User Mode, in two ways:

                 By executing the iret assembly language instruction.

 

                 By executing the sysexit assembly language instruction, which was introduced in the Intel

                 Pentium II microprocessors together with the sysenter instruction.

 

                 10.3.1. Issuing a System Call via the int $0x80 Instruction

                      10.3.1.1. The system_call( ) function

                      10.3.1.2. Exiting from the system call

 

                 10.3.2. Issuing a System Call via the sysenter Instruction

 

                      10.3.2.1. The sysenter instruction

                      10.3.2.2. The vsyscall page

                            10.3.2.3. Entering the system call

                      10.3.2.4. Exiting from the system call

                      10.3.2.5. The sysexit instruction

 

 

        10.4. Parameter Passing

            10.4.1. Verifying the Parameters

                 access_ok( )

            verify_area( )

            10.4.2. Accessing the Process Address Space

                 get_user( )/put_user( ).

                 copy_from_user/copy_to_user

 

            10.4.3. Dynamic Address Checking: The Fix-up Code

                 1.

                 The kernel attempts to address a page belonging to the process address space, but either the

                 corresponding page frame does not exist or the kernel tries to write a read-only page. In these

                 cases, the handler must allocate and initialize a new page frame (see the sections "Demand

                 Paging" and "Copy On Write" in Chapter 9).

                     2.

                 The kernel addresses a page belonging to its address space, but the corresponding Page Table

                 entry has not yet been initialized (see the section "Handling Noncontiguous Memory Area

                 Accesses" in Chapter 9). In this case, the kernel must properly set up some entries in the Page

                 Tables of the current process.

                     3.

                 Some kernel functions include a programming bug that causes the exception to be raised when

                 that program is executed; alternatively, the exception might be caused by a transient hardware

                 error. When this occurs, the handler must perform a kernel oops (see the section "Handling a

                 Faulty Address Inside the Address Space" in Chapter 9).

                     4.

                 The case introduced in this chapter: a system call service routine attempts to read or write into

                 a memory area whose address has been passed as a system call parameter, but that address

                 does not belong to the process address space.

 

                 10.4.4. The Exception Tables

                 exception_table_entry->insn   

           The linear address of an instruction that accesses the process address space

                 exception_table_entry->fixup

           The address of the assembly language code to be invoked when a Page Fault exception

                 triggered by the instruction located at insn occurs

                 search_exception_tables( )
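A sketch of the lookup idea: given the address of the faulting instruction, find the matching entry and return its fixup address. It assumes the table is sorted by insn so that a binary search applies; toy_extable_entry and toy_search_extable are hypothetical names, and returning 0 for "no entry" is a simplification made here.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for exception_table_entry. */
struct toy_extable_entry {
    unsigned long insn;   /* address of an instruction that accesses user space */
    unsigned long fixup;  /* address of the code to run if that instruction faults */
};

/* Binary search over a table sorted by ascending insn.
 * Returns the fixup address, or 0 if addr has no entry, which means the
 * fault was not caused by an anticipated user-space access. */
unsigned long toy_search_extable(const struct toy_extable_entry *table,
                                 size_t n, unsigned long addr)
{
    size_t lo = 0, hi = n;

    assert(table != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].insn == addr)
            return table[mid].fixup;
        if (table[mid].insn < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```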

                 10.4.5. Generating the Exception Tables and the Fixup Code

 

        10.5. Kernel Wrapper Routines

 

 

Chapter 11. Signals

 

       11.1. The Role of Signals                          

                

           Table 11-1. The first 31 signals in Linux/i386

 

           #Signal name   Default action  Comment                                      POSIX

           1 SIGHUP        Terminate         Hang up controlling terminal or process         Yes

           2 SIGINT        Terminate         Interrupt from keyboard                       Yes

           3 SIGQUIT       Dump              Quit from keyboard                        Yes

           4 SIGILL           Dump               Illegal instruction                            Yes

           5 SIGTRAP       Dump              Breakpoint for debugging                      No

           6 SIGABRT       Dump              Abnormal termination                      Yes

           6 SIGIOT        Dump              Equivalent to SIGABRT                No

           7 SIGBUS         Dump               Bus error                                No

           8 SIGFPE          Dump              Floating-point exception                   Yes

           9 SIGKILL         Terminate         Forced-process termination                   Yes

           10 SIGUSR1            Terminate         Available to processes                     Yes

           11 SIGSEGV            Dump              Invalid memory reference                     Yes

           12 SIGUSR2      Terminate         Available to processes                     Yes

           13 SIGPIPE      Terminate         Write to pipe with no readers                 Yes

           14 SIGALRM      Terminate         Real-time clock                            Yes

           15 SIGTERM      Terminate         Process termination                        Yes

           16 SIGSTKFLT        Terminate         Coprocessor stack error                  No

           17 SIGCHLD      Ignore             Child process stopped or terminated,

                                              or got signal if traced                     Yes

           18 SIGCONT           Continue           Resume execution, if stopped                Yes

           19 SIGSTOP      Stop               Stop process execution                   Yes

           20 SIGTSTP            Stop                Stop process issued from tty            Yes

           21 SIGTTIN       Stop                Background process requires input               Yes

           22 SIGTTOU      Stop                Background process requires output        Yes

           23 SIGURG        Ignore              Urgent condition on socket                     No

           24 SIGXCPU            Dump              CPU time limit exceeded                  No

           25 SIGXFSZ      Dump              File size limit exceeded                    No

           26 SIGVTALRM        Terminate         Virtual timer clock                          No

           27 SIGPROF      Terminate         Profile timer clock                          No

           28 SIGWINCH     Ignore              Window resizing                        No

           29 SIGIO        Terminate         I/O now possible                            No

           29 SIGPOLL      Terminate         Equivalent to SIGIO                  No

           30 SIGPWR       Terminate         Power supply failure                             No

           31 SIGSYS       Dump              Bad system call                        No

           31 SIGUNUSED        Dump              Equivalent to SIGSYS                 No

 

           Table 11-2. The most significant system calls related to signals

         System call                                  Description

           kill( )                 Send a signal to a thread group

           tkill( )                               Send a signal to a process

           tgkill( )                              Send a signal to a process in a specific thread group

           sigaction( )                    Change the action associated with a signal

           signal( )                   Similar to sigaction( )

           sigpending( )               Check whether there are pending signals

           sigprocmask( )                   Modify the set of blocked signals

           sigsuspend( )                  Wait for a signal

          

           rt_sigaction( )             Change the action associated with a real-time signal

        rt_sigpending( )                Check whether there are pending real-time signals

        rt_sigprocmask( )               Modify the set of blocked real-time signals

        rt_sigqueueinfo( )              Send a real-time signal to a thread group

        rt_sigsuspend( )                Wait for a real-time signal

        rt_sigtimedwait( )              Similar to rt_sigsuspend( )

 

           11.1.1. Actions Performed upon Delivering a Signal

                 There are three ways in which a process can respond to a signal:

                 1. Explicitly ignore the signal.

                

                 2. Execute the default action associated with the signal (see Table 11-1). This action, which is

                 predefined by the kernel, depends on the signal type and may be any one of the following:

          

                 Terminate

                 The process is terminated (killed).

                 Dump

                 The process is terminated (killed) and a core file containing its execution context is

                 created, if possible; this file may be used for debug purposes

                 Ignore

                 The signal is ignored

                 Stop

                 The process is stopped, i.e., put in the TASK_STOPPED state (see the section "Process State" in Chapter 3).

                 Continue

                 If the process was stopped (TASK_STOPPED), it is put into the TASK_RUNNING state.

 

                 3. Catch the signal by invoking a corresponding signal-handler function.
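The third option, catching a signal, looks like this from User Mode. A minimal sketch assuming a single-threaded POSIX process in which the signal is not blocked; catch_once and on_signal are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t caught_signo;

/* Signal handler: only records which signal arrived. */
static void on_signal(int signo)
{
    caught_signo = signo;
}

/* Install a handler for sig, raise sig once, restore the previous action,
 * and return the signal number seen by the handler (0 if it never ran,
 * -1 if the handler could not be installed). */
int catch_once(int sig)
{
    struct sigaction act, old;

    assert(sig != SIGKILL && sig != SIGSTOP);  /* these cannot be caught */
    memset(&act, 0, sizeof(act));
    act.sa_handler = on_signal;
    sigemptyset(&act.sa_mask);
    caught_signo = 0;
    if (sigaction(sig, &act, &old) != 0)
        return -1;
    raise(sig);                  /* the handler runs before raise() returns */
    sigaction(sig, &old, NULL);
    return caught_signo;
}
```

Setting act.sa_handler to SIG_IGN or SIG_DFL instead selects the first or second option.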

          

           11.1.2. POSIX Signals and Multithreaded Applications

 

           11.1.3. Data Structures Associated with Signals

                 The fields of the process descriptor related to signal handling are listed in Table 11-3.

                

          

                 Table 11-3. Process descriptor fields related to signal handling

              Type                                Name                        Description

                 struct signal_struct *      signal              Pointer to the process's signal descriptor

                 struct sighand_struct*          sighand            Pointer to the process's signal handler descriptor

                 sigset_t                       blocked                  Mask of blocked signals

                 sigset_t                       real_blocked           Temporary mask of blocked signals (used by the rt_sigtimedwait( ) system call)

                 struct sigpending           pending         Data structure storing the private pending signals

                 unsigned long           sas_ss_sp           Address of alternative signal handler stack

            size_t              sas_ss_size         Size of alternative signal handler stack

                 int (*) (void *)            notifier            Pointer to a function used by a device driver to block some signals of the process

            void *              notifier_data       Pointer to data that might be used by the notifier function (previous field of table)

            sigset_t *              notifier_mask       Bit mask of signals blocked by a device driver through a notifier function

 

                 11.1.3.1. The signal descriptor and the signal handler descriptor

                      Table 11-4. The fields of the signal descriptor related to signal handling

                   Type                         Name               Description

                atomic_t            count           Usage counter of the signal descriptor

                atomic_t            live            Number of live processes in the thread group

                wait_queue_head_t       wait_chldexit   Wait queue for the processes sleeping in a wait4( ) system call

                struct task_struct*     curr_target     Descriptor of the last process in the thread group that received a signal

                struct sigpending       shared_pending Data structure storing the shared pending signals

                int             group_exit_code Process termination code for the thread group

                struct task_struct *    group_exit_task Used when killing a whole thread group

                int             notify_count        Used when killing a whole thread group

                int             group_stop_count    Used when stopping a whole thread group

                unsigned int            flags           Flags used when delivering signals that modify the status of the process

                

                      Table 11-5. The fields of the signal handler descriptor

 

                   Type                        Name                 Description

                atomic_t            count           Usage counter of the signal handler descriptor

                struct k_sigaction[64]  action      Array of structures specifying the actions to be performed upon delivering the signals

                spinlock_t          siglock     Spin lock protecting both the signal descriptor and the signal handler descriptor

                 11.1.3.2. The sigaction data structure

                      sigaction->

                      sa_handler

                      This field specifies the type of action to be performed; its value can be a pointer to the signal

                      handler, SIG_DFL (that is, the value 0) to specify that the default action is performed, or

                SIG_IGN (that is, the value 1) to specify that the signal is ignored.

 

                      sa_flags

                      This set of flags specifies how the signal must be handled; some of them are listed in Table 11-6.

 

                      sa_mask

                      This sigset_t variable specifies the signals to be masked when running the signal handler.

                     

                      Table 11-6. Flags specifying how to handle a signal

                      Flag Name              Description

                      SA_NOCLDSTOP            Applies only to SIGCHLD; do not send SIGCHLD to the parent when the process is stopped

                      SA_NOCLDWAIT            Applies only to SIGCHLD; do not create a zombie when the process terminates

                      SA_SIGINFO                 Provide additional information to the signal handler (see the later section "Changing a Signal Action")

                      SA_ONSTACK               Use an alternative stack for the signal handler (see the later section "Catching the Signal")

                      SA_RESTART                Interrupted system calls are automatically restarted (see the later section "Reexecution of System Calls")

                      SA_NODEFER/SA_NOMASK   Do not mask the signal while executing the signal handler

                      SA_RESETHAND/SA_ONESHOT   Reset to default action after executing the signal handler
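The effect of one of these flags can be observed from User Mode. The sketch below installs a handler with SA_RESETHAND, delivers the signal once, and then queries the current action; it assumes a single-threaded POSIX process, and resethand_resets and count_hit are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t hits;

/* Signal handler: counts deliveries. */
static void count_hit(int signo)
{
    (void)signo;
    hits++;
}

/* Returns 1 if, after a single delivery, the action for sig is back to
 * SIG_DFL; 0 if not; -1 on error. The action is left at SIG_DFL. */
int resethand_resets(int sig)
{
    struct sigaction act, now;

    assert(sig != SIGKILL && sig != SIGSTOP);  /* these cannot be caught */
    memset(&act, 0, sizeof(act));
    act.sa_handler = count_hit;
    act.sa_flags = SA_RESETHAND;   /* one-shot handler */
    sigemptyset(&act.sa_mask);
    hits = 0;
    if (sigaction(sig, &act, NULL) != 0)
        return -1;
    raise(sig);                    /* first and only handled delivery */
    if (sigaction(sig, NULL, &now) != 0)
        return -1;
    return hits == 1 && now.sa_handler == SIG_DFL;
}
```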

 

                 11.1.3.3. The pending signal queues

                      struct sigpending {

                            struct list_head list;

                            sigset_t signal;

                            };

                      The signal field is a bit mask specifying the pending signals, while the list field is the head of a

                      doubly linked list containing sigqueue data structures; the fields of this structure are shown in Table 11-7.

                      Table 11-7. The fields of the sigqueue data structure

                   Type                         Name          Description

                struct  list_head       list        Links for the pending signal queue's list

                spinlock_t *            lock        Pointer to the siglock field in the signal handler descriptor corresponding to the pending signal

                int             flags       Flags of the sigqueue data structure

                siginfo_t           info        Describes the event that raised the signal

                struct user_struct *        user       Pointer to the per-user data structure of the process's owner (see the

                                                        section "The clone( ), fork( ), and vfork( ) System Calls" in Chapter 3)

 

                      siginfo_t->

 

                      si_signo

                            The signal number

              si_errno

                            The error code of the instruction that caused the signal to be raised, or 0 if there was no error

              si_code

                      A code identifying who raised the signal (see Table 11-8)

                      _sifields

                      A union storing information depending on the type of signal. For instance, the siginfo_t data

                      structure relative to an occurrence of the SIGKILL signal records the PID and the UID of the

                      sender process here; conversely, the data structure relative to an occurrence of the SIGSEGV

                      signal stores the memory address whose access caused the signal to be raised
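From User Mode, the same information reaches a handler installed with SA_SIGINFO. The sketch below sends a signal to the calling process with kill( ) and checks the fields the handler receives; it assumes a single-threaded POSIX process in which the signal is not blocked, and sender_is_self and on_info are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static volatile sig_atomic_t seen_signo;
static volatile sig_atomic_t seen_code;
static volatile sig_atomic_t seen_pid_matches;

/* Three-argument handler: receives the siginfo_t describing the event. */
static void on_info(int signo, siginfo_t *info, void *ucontext)
{
    (void)signo;
    (void)ucontext;
    seen_signo = info->si_signo;
    seen_code = info->si_code;
    seen_pid_matches = (info->si_pid == getpid());
}

/* Send sig to ourselves with kill() and report whether the handler saw
 * the expected signal number, SI_USER as the code, and our own PID as
 * the sender. Returns 1 or 0, or -1 if the handler cannot be installed. */
int sender_is_self(int sig)
{
    struct sigaction act, old;

    assert(sig != SIGKILL && sig != SIGSTOP);  /* these cannot be caught */
    memset(&act, 0, sizeof(act));
    act.sa_sigaction = on_info;
    act.sa_flags = SA_SIGINFO;
    sigemptyset(&act.sa_mask);
    seen_signo = 0;
    seen_code = 0;
    seen_pid_matches = 0;
    if (sigaction(sig, &act, &old) != 0)
        return -1;
    kill(getpid(), sig);
    sigaction(sig, &old, NULL);
    return seen_signo == sig && seen_code == SI_USER && seen_pid_matches;
}
```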

 

           11.1.4. Operations on Signal Data Structures

                      set is a pointer to a sigset_t variable, nsig is the number of a signal, and mask is an unsigned long bit mask.

 

                      sigemptyset(set) and sigfillset(set)

                       Sets the bits in the sigset_t variable to 0 or 1, respectively

 

                      sigaddset(set,nsig) and sigdelset(set,nsig)

                      Sets the bit of the sigset_t variable corresponding to signal nsig to 1 or 0, respectively. In

                      practice, sigaddset( ) reduces to:
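The statement itself is missing from these notes. A minimal sketch of the bit manipulation involved, assuming a 64-signal set stored as two 32-bit words (as on 80x86) with signal nsig kept in bit nsig-1; the toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NSIG 64

/* Toy stand-in for sigset_t: two 32-bit words. */
typedef struct {
    uint32_t sig[TOY_NSIG / 32];
} toy_sigset_t;

/* Set the bit corresponding to signal nsig. */
void toy_sigaddset(toy_sigset_t *set, int nsig)
{
    assert(nsig >= 1 && nsig <= TOY_NSIG);
    set->sig[(nsig - 1) / 32] |= UINT32_C(1) << ((nsig - 1) % 32);
}

/* Clear the bit corresponding to signal nsig. */
void toy_sigdelset(toy_sigset_t *set, int nsig)
{
    assert(nsig >= 1 && nsig <= TOY_NSIG);
    set->sig[(nsig - 1) / 32] &= ~(UINT32_C(1) << ((nsig - 1) % 32));
}

/* Return the value (0 or 1) of the bit corresponding to signal nsig. */
int toy_sigismember(const toy_sigset_t *set, int nsig)
{
    assert(nsig >= 1 && nsig <= TOY_NSIG);
    return (int)((set->sig[(nsig - 1) / 32] >> ((nsig - 1) % 32)) & 1u);
}
```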

 

                      sigaddsetmask(set,mask) and sigdelsetmask(set,mask)

              Sets to 1 or 0, respectively, all the bits of the sigset_t variable whose corresponding bits in mask are set.

                      They can be used only with signals that are between 1 and 32. The corresponding

                      functions reduce to:
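The statements are missing from these notes. Since only signals 1 to 32 are involved, both operations touch just the first word of the set. A sketch under the same assumption of two 32-bit words; the toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for sigset_t: two 32-bit words. */
typedef struct {
    uint32_t sig[2];
} toy_sigset_t;

/* Turn on every bit of the first word that is on in mask. */
void toy_sigaddsetmask(toy_sigset_t *set, uint32_t mask)
{
    assert(set != NULL);
    set->sig[0] |= mask;
}

/* Turn off every bit of the first word that is on in mask. */
void toy_sigdelsetmask(toy_sigset_t *set, uint32_t mask)
{
    assert(set != NULL);
    set->sig[0] &= ~mask;
}
```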

 

              sigismember(set,nsig)

              Returns the value of the bit of the sigset_t variable corresponding to the signal nsig

 

              sigmask(nsig)

                      Yields the bit index of the signal nsig. In other words, if the kernel needs to set, clear, or test

                      a bit in an element of sigset_t that corresponds to a particular signal, it can derive the proper

                      bit through this macro

 

                      sigandsets(d,s1,s2), sigorsets(d,s1,s2), and signandsets(d,s1,s2)

                      Performs a logical AND, a logical OR, and an AND-NOT (s1 AND NOT s2, which the kernel sources name "nand"), respectively, between the sigset_t

                      variables to which s1 and s2 point; the result is stored in the sigset_t variable to which d points.

 

                      sigtestsetmask(set,mask)

                      Returns the value 1 if any of the bits in the sigset_t variable that correspond to the bits set to

                      1 in mask is set; it returns 0 otherwise. It can be used only with signals that have a number

                      between 1 and 32.

 

                      siginitset(set,mask)

                      Initializes the low bits of the sigset_t variable corresponding to signals between 1 and 32

                      with the bits contained in mask, and clears the bits corresponding to signals between 33 and 63.

 

                      siginitsetinv(set,mask)

                      Initializes the low bits of the sigset_t variable corresponding to signals between 1 and 32

                      with the complement of the bits contained in mask, and sets the bits corresponding to signals

                      between 33 and 63.

 

              signal_pending(p)

                      Returns the value 1 (true) if the process identified by the *p process descriptor has nonblocked

                      pending signals, and returns the value 0 (false) if it doesn't. The function is implemented as a

                      simple check on the TIF_SIGPENDING flag of the process.

 

                      recalc_sigpending_tsk(t) and recalc_sigpending( )

                      The first function checks whether there are pending signals either for the process identified by

                      the process descriptor at *t (by looking at the t->pending.signal field) or for the thread

                      group to which the process belongs (by looking at the t->signal->shared_pending.signal

                      field). The function then sets accordingly the TIF_SIGPENDING flag in t->thread_info->flags.

                      The recalc_sigpending( ) function is equivalent to recalc_sigpending_tsk(current).
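The recomputation can be modeled with plain bit masks. This is a toy model, not the kernel code: toy_task and its fields are hypothetical, each 64-bit mask keeps signal n in bit n-1, sigpending_flag stands in for TIF_SIGPENDING, and blocked signals are assumed not to count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy process descriptor holding only the signal-related state. */
struct toy_task {
    uint64_t blocked;         /* mask of blocked signals */
    uint64_t pending;         /* private pending signals */
    uint64_t shared_pending;  /* pending signals of the whole thread group */
    int sigpending_flag;      /* stands in for TIF_SIGPENDING */
};

/* Set the flag if any private or shared pending signal is not blocked,
 * and clear it otherwise. */
void toy_recalc_sigpending_tsk(struct toy_task *t)
{
    uint64_t deliverable;

    assert(t != NULL);
    deliverable = (t->pending | t->shared_pending) & ~t->blocked;
    t->sigpending_flag = (deliverable != 0);
}

/* Counterpart of signal_pending(p): a simple check on the flag. */
int toy_signal_pending(const struct toy_task *t)
{
    assert(t != NULL);
    return t->sigpending_flag;
}
```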

 

              rm_from_queue(mask,q)

                      Removes from the pending signal queue q the pending signals corresponding to the bit mask mask.

 

              flush_sigqueue(q)

                      Removes from the pending signal queue q all pending signals.

 

              flush_signals(t)

                      Deletes all signals sent to the process identified by the process descriptor at *t. This is done

                       by clearing the TIF_SIGPENDING flag in t->thread_info->flags and invoking twice

                flush_sigqueue( ) on the t->pending and t->signal->shared_pending queues.

 

      11.2. Generating a Signal

 

         Table 11-9. Kernel functions that generate a signal for a process

              Name                                      Description

        send_sig( )                 Sends a signal to a single process

        send_sig_info( )                Like send_sig( ), with extended information in a siginfo_t structure

        force_sig( )                    Sends a signal that cannot be explicitly ignored or blocked by the process

        force_sig_info( )               Like force_sig( ), with extended information in a siginfo_t structure

        force_sig_specific()            Like force_sig( ), but optimized for SIGSTOP and SIGKILL signals

           sys_tkill( )                         System call handler of tkill( ) (see the later section "System Calls Related to Signal Handling")

           sys_tgkill( )                        System call handler of tgkill( )

 

        Table 11-10. Kernel functions that generate a signal for a thread group

 

        Name                                      Description

        send_group_sig_info()           Sends a signal to a single thread group identified by the process descriptor of one of its members

        kill_pg( )                  Sends a signal to all thread groups in a process group (see the section "Process Management" in Chapter 1)

        kill_pg_info( )             Like kill_pg( ), with extended information in a siginfo_t structure

        kill_proc( )                    Sends a signal to a single thread group identified by the PID of one of its members

        kill_proc_info( )               Like kill_proc( ), with extended information in a siginfo_t structure

        sys_kill( )                          System call handler of kill( ) (see the later section "System Calls Related to Signal Handling")

           sys_rt_sigqueueinfo( )               System call handler of rt_sigqueueinfo( )

       

        11.2.1. The specific_send_sig_info( ) Function

 

           11.2.2. The send_signal( ) Function  

       

        11.2.3. The group_send_sig_info( ) Function

 

    11.3. Delivering a Signal

        To handle the nonblocked pending signals, the kernel invokes the do_signal( ) function

        Then do_signal( ) loads the ka local variable with the address of the k_sigaction data structure of

           the signal to be handled:

       ka = &current->sighand->action[signr-1];

           Depending on the contents, three kinds of actions may be performed: ignoring the signal, executing

           a default action, or executing a signal handler.
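The three-way decision depends only on the handler value stored in the action. A small sketch using the user-space SIG_IGN and SIG_DFL constants; the enum, toy_classify, and toy_noop_handler are hypothetical names.

```c
#include <assert.h>
#include <signal.h>

enum toy_disposition { TOY_IGNORE, TOY_DEFAULT, TOY_HANDLER };

/* Placeholder handler so there is a real function address to classify. */
void toy_noop_handler(int signo)
{
    (void)signo;
}

/* Decide which of the three kinds of action applies to a handler value. */
enum toy_disposition toy_classify(void (*handler)(int))
{
    assert(handler != SIG_ERR);   /* SIG_ERR is an error value, not an action */
    if (handler == SIG_IGN)
        return TOY_IGNORE;        /* ignore the signal */
    if (handler == SIG_DFL)
        return TOY_DEFAULT;       /* execute the default action */
    return TOY_HANDLER;           /* execute the signal handler */
}
```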

 

        11.3.1. Executing the Default Action for the Signal

        1: SIGSTOP

        2: dump // worth studying further

        The signals whose default action is "dump" may create a core file in the process working directory;

           this file lists the complete contents of the process's address space and CPU registers

   

        11.3.2. Catching the Signal

        handle_signal( ):

        Figure 11-2. Catching a signal

     

 

                 11.3.2.1. Setting up the frame

 

                 11.3.2.2. Evaluating the signal flags

 

                 11.3.2.3. Starting the signal handler

 

                 11.3.2.4. Terminating the signal handler

     

           11.3.3. Reexecution of System Calls

 

                 11.3.3.1. Restarting a system call interrupted by a non-caught signal

 

                     11.3.3.2. Restarting a system call for a caught signal

 

       11.4. System Calls Related to Signal Handling

 

                     11.4.1. The kill( ) System Call

                      kill(pid,sig)

            11.4.2. The tkill( ) and tgkill( ) System Calls

                      tkill( ) and tgkill( )

 

            11.4.3. Changing a Signal Action

 

                sigaction(sig,act,oact)

 

            11.4.4. Examining the Pending Blocked Signals

 

                sigpending( )

 

            11.4.5. Modifying the Set of Blocked Signals

 

                sigprocmask( )
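A user-space sketch tying sigprocmask( ) and sigpending( ) together: a blocked signal stays pending and is delivered only once it is unblocked. It assumes a single-threaded POSIX process; blocked_signal_waits and mark_delivered are hypothetical names, and the function leaves sig unblocked on return.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t delivered;

/* Signal handler: records that the signal was delivered. */
static void mark_delivered(int signo)
{
    (void)signo;
    delivered = 1;
}

/* Returns 1 if sig stayed pending while blocked and its handler ran only
 * after unblocking; 0 otherwise; -1 if the handler cannot be installed. */
int blocked_signal_waits(int sig)
{
    struct sigaction act, old_act;
    sigset_t block, pending;
    int was_pending, ran_early;

    assert(sig != SIGKILL && sig != SIGSTOP);  /* these cannot be blocked */
    memset(&act, 0, sizeof(act));
    act.sa_handler = mark_delivered;
    sigemptyset(&act.sa_mask);
    delivered = 0;
    if (sigaction(sig, &act, &old_act) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, sig);
    sigprocmask(SIG_BLOCK, &block, NULL);    /* add sig to the blocked set */

    raise(sig);                              /* generated but not delivered */
    sigpending(&pending);
    was_pending = sigismember(&pending, sig);
    ran_early = delivered;

    sigprocmask(SIG_UNBLOCK, &block, NULL);  /* delivery happens here */
    sigaction(sig, &old_act, NULL);
    return was_pending == 1 && !ran_early && delivered == 1;
}
```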

            11.4.6. Suspending the Process

                      sigsuspend( )   

                     

 

 

 

 

 

 

 


Reposted from blog.csdn.net/u011961033/article/details/83088743