Linux kernel - process

Processes and threads

A process is an abstraction of a running program, including:

  • an (independent) address space
  • one or more threads
  • open files (represented by file descriptors, fd)
  • sockets
  • semaphores
  • shared memory areas
  • timers
  • signal handlers
  • other resource and status information

All of these are recorded in the process control block (PCB). In Linux, the PCB is struct task_struct.

process resources

Looking in the /proc/<pid> directory, we can see information about the process with process number <pid>.

If a process wants to view its own information, it can access the /proc/self directory.

                +-------------------------------------------------------------------+
                | dr-x------    2 tavi tavi 0  2021 03 14 12:34 .                   |
                | dr-xr-xr-x    6 tavi tavi 0  2021 03 14 12:34 ..                  |
                | lrwx------    1 tavi tavi 64 2021 03 14 12:34 0 -> /dev/pts/4     |
           +--->| lrwx------    1 tavi tavi 64 2021 03 14 12:34 1 -> /dev/pts/4     |
           |    | lrwx------    1 tavi tavi 64 2021 03 14 12:34 2 -> /dev/pts/4     |
           |    | lr-x------    1 tavi tavi 64 2021 03 14 12:34 3 -> /proc/18312/fd |
           |    +-------------------------------------------------------------------+
           |                 +----------------------------------------------------------------+
           |                 | 08048000-0804c000 r-xp 00000000 08:02 16875609 /bin/cat        |
$ ls -1 /proc/self/          | 0804c000-0804d000 rw-p 00003000 08:02 16875609 /bin/cat        |
cmdline    |                 | 0804d000-0806e000 rw-p 0804d000 00:00 0 [heap]                 |
cwd        |                 | ...                                                            |
environ    |    +----------->| b7f46000-b7f49000 rw-p b7f46000 00:00 0                        |
exe        |    |            | b7f59000-b7f5b000 rw-p b7f59000 00:00 0                        |
fd --------+    |            | b7f5b000-b7f77000 r-xp 00000000 08:02 11601524 /lib/ld-2.7.so  |
fdinfo          |            | b7f77000-b7f79000 rw-p 0001b000 08:02 11601524 /lib/ld-2.7.so  |
maps -----------+            | bfa05000-bfa1a000 rw-p bffeb000 00:00 0 [stack]                |
mem                          | ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]                 |
root                         +----------------------------------------------------------------+
stat                 +----------------------------+
statm                |  Name: cat                 |
status ------+       |  State: R (running)        |
task         |       |  Tgid: 18205               |
wchan        +------>|  Pid: 18205                |
                     |  PPid: 18133               |
                     |  Uid: 1000 1000 1000 1000  |
                     |  Gid: 1000 1000 1000 1000  |
                     +----------------------------+
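
For example, a process can read its own /proc/self/status to inspect the same information shown above. A minimal userspace sketch (error handling kept short):

#include <stdio.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");

    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);    /* Name, State, Tgid, Pid, PPid, Uid, Gid, ... */
    fclose(f);
    return 0;
}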

thread

Threads are the basic unit that the kernel schedules to run on a CPU. A thread has the following properties:

  • Each thread has its own stack and its own register values (which record where its execution has reached)
  • Threads run in the context of a process, and the threads in a process share its resources.
  • The kernel schedules threads rather than processes. The kernel is also not aware of user-mode threads (such as goroutines in Go).
  • In a classic thread implementation, thread information is kept in a separate data structure (a linked-list node), which is then linked to the process's data structure. Windows implements threads this way, as shown in the figure below:
    (figure: a process control block holding a linked list of threads)
    We can see that the process control block contains a thread linked list, and each list element (thread) points back to the process to which it belongs.

Linux implements threads differently. The basic unit for both processes and threads is called a task, so the data structure describing both is struct task_struct. The structure does not embed the resources themselves; instead it holds pointers to the corresponding resource structures.

As shown in the figure below, if two threads belong to the same process (i.e. they share the same thread group ID, TGID), they point to the same data structures describing their resources (such as open files, address space, namespaces). If two threads do not belong to the same process, the resource structures they point to are different.

For the thread group leader (and thus for any single-threaded process) the PID and TGID are equal; the other threads of a process share the leader's TGID but each has its own PID.
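
This can be observed from user space with getpid() and the gettid system call. A minimal sketch (compile with -pthread; gettid is issued via syscall() for portability to older glibc):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

static void *worker(void *arg)
{
    /* same TGID (getpid()), but a different kernel task id (tid) */
    printf("thread: pid (TGID) = %d, tid = %ld\n",
           getpid(), (long)syscall(SYS_gettid));
    return NULL;
}

int main(void)
{
    pthread_t t;

    printf("main:   pid (TGID) = %d, tid = %ld\n",
           getpid(), (long)syscall(SYS_gettid));
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}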

(figure: two task_structs in the same thread group pointing to shared resource structures)

The clone() system call

In Linux, creating a new thread (pthread_create()) and creating a new process (fork()) both end up using the clone() system call:

int clone(int (*fn)(void *_Nullable), void *stack, int flags,
                 void *_Nullable arg, ...  /* pid_t *_Nullable parent_tid,
                                              void *_Nullable tls,
                                              pid_t *_Nullable child_tid */ );

It allows the caller to decide which resources are shared with the parent, mainly by passing a bit mask of flags to clone():

  • CLONE_FILES - share the file descriptor table with the parent
  • CLONE_VM - share the address space with the parent
  • CLONE_FS - share filesystem information (root directory, current working directory) with the parent
  • CLONE_NEWNS - do not share the mount namespace with the parent; create a new one
  • CLONE_NEWIPC - do not share the IPC namespace (System V IPC objects, POSIX message queues, etc.) with the parent; create a new one
  • CLONE_NEWNET - do not share the network namespace with the parent; create a new one

For example, passing CLONE_FILES | CLONE_VM | CLONE_FS essentially creates a thread; omitting them creates a separate process.
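
The following is a minimal sketch of that idea (not how glibc actually implements pthread_create(), which also passes CLONE_THREAD, CLONE_SIGHAND, CLONE_SETTLS and more): clone() creates a thread-like child that shares memory, files and filesystem information with its parent.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

static int child_fn(void *arg)
{
    printf("child sees the parent's resources: %s\n", (const char *)arg);
    return 0;
}

int main(void)
{
    char *stack = malloc(STACK_SIZE);
    if (!stack)
        return 1;

    /* the stack grows down, so pass the top of the allocated region;
     * SIGCHLD lets the parent reap the child with waitpid() */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_VM | CLONE_FILES | CLONE_FS | SIGCHLD,
                      "shared VM, files, fs");
    if (pid == -1) {
        perror("clone");
        return 1;
    }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}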

Namespaces and container technology

Container technology relies mainly on cgroups and namespaces to isolate resources. For example, without namespaces every process on the machine is visible in /proc; with a PID namespace, processes running in one container are not visible to (and cannot be killed from) other containers.

/*
 * A structure to contain pointers to all per-process
 * namespaces - fs (mount), uts, network, sysvipc, etc.
 *
 * The pid namespace is an exception -- it's accessed using
 * task_active_pid_ns.  The pid namespace here is the
 * namespace that children will use.
 *
 * 'count' is the number of tasks holding a reference.
 * The count for each namespace, then, will be the number
 * of nsproxies pointing to it, not the number of tasks.
 *
 * The nsproxy is shared by tasks which share all namespaces.
 * As soon as a single namespace is cloned or unshared, the
 * nsproxy is copied.
 */
struct nsproxy {
	atomic_t count;
	struct uts_namespace *uts_ns;
	struct ipc_namespace *ipc_ns;
	struct mnt_namespace *mnt_ns;
	struct pid_namespace *pid_ns_for_children;
	struct net 	     *net_ns;
	struct time_namespace *time_ns;
	struct time_namespace *time_ns_for_children;
	struct cgroup_namespace *cgroup_ns;
};

In the process control block:

struct task_struct {
	... ...
	struct fs_struct *fs;
	struct files_struct *files;
	struct nsproxy *nsproxy;	/* pointer to the task's namespaces */
	... ...
};

The struct nsproxy shown above groups the pointers to the different kinds of namespaces, which is how the separation of the different resource types is implemented.

Currently the supported namespace types include IPC, network (isolation of the network stack; see Docker networking), cgroup (isolating the view of the cgroup hierarchy, which in turn limits resource usage such as CPU share and memory upper bounds), mount (isolating the view of the filesystem), PID (processes in different namespaces can have the same PID), UTS, and time.
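
As a small illustration of namespaces (a sketch, not Docker's actual mechanism; it needs root or CAP_SYS_ADMIN), the following program moves itself into a new UTS namespace and changes the hostname there without affecting the host:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/utsname.h>

int main(void)
{
    struct utsname uts;

    if (unshare(CLONE_NEWUTS) == -1) {          /* enter a new UTS namespace */
        perror("unshare");
        return 1;
    }
    if (sethostname("ns-demo", strlen("ns-demo")) == -1) {
        perror("sethostname");
        return 1;
    }
    uname(&uts);
    printf("hostname inside the new UTS namespace: %s\n", uts.nodename);
    return 0;                                   /* the host's hostname is unchanged */
}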

access current process

Accessing the current process is a frequent operation, such as:

  • Opening a file requires access to the current process's file descriptor table
  • Accessing virtual memory requires the page table of the current process
  • More than 90% of system calls need to access the process control block
  • The kernel provides the current macro, a pointer to the struct task_struct of the current process. For example, current->pid gives the PID of the current process and current->comm gives its name.

As shown in the figure below, to support fast access to the process control block on multi-core systems, each CPU core keeps a variable that stores a pointer to the control block of the task currently running on it.
(figure: per-CPU pointers to the struct task_struct of the currently running task)
Another way to access the structure is the current macro. The following code (an older 32-bit x86 implementation) shows how the current macro reaches the process control block via struct thread_info:

/* how to get the current stack pointer from C */
register unsigned long current_stack_pointer asm("esp") __attribute_used__;

/* how to get the thread information struct from C */
static inline struct thread_info *current_thread_info(void)
{
   return (struct thread_info *)(current_stack_pointer & ~(THREAD_SIZE - 1));
}

#define current current_thread_info()->task
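
For comparison, on modern x86_64 kernels current is no longer derived from the stack pointer but read from a per-CPU variable (a sketch based on arch/x86/include/asm/current.h; details differ between kernel versions):

DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
	return this_cpu_read_stable(current_task);
}

#define current get_current()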

Process context switch

The following figure shows how the Linux kernel performs a process context switch:
(figure: process context switch between two threads)
T0 here refers to thread 0, and T1 refers to thread 1.

In the process shown above, when a user thread makes a system call it first enters kernel mode and saves the user-mode CPU context on its own kernel stack. It can then call schedule() to voluntarily give up the CPU; a context switch is performed and another thread continues to run.
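
Roughly, such a voluntary switch follows this call path inside the kernel (simplified; exact function names vary between versions):

/*
 *  system call that needs to wait
 *    -> schedule()                   pick the next task to run
 *       -> context_switch()
 *          -> switch_mm_irqs_off()   switch the address space (page tables)
 *          -> switch_to()            switch kernel stack and registers
 */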

Blocking and waking up tasks (threads)

Task status

The following figure shows the task state transition logic.

(figure: task state transition diagram)
The differences between TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE are as follows:

In the TASK_INTERRUPTIBLE state, the thread is waiting for some condition to be met, but it can be woken up by a signal.
When a thread enters the TASK_INTERRUPTIBLE state, it is placed on an interruptible wait queue until the condition is met.
If the thread receives a signal, such as the SIGINT sent by Ctrl+C, it is woken up from its sleep and can then decide how to handle the signal (for example, by terminating the process).

In the TASK_UNINTERRUPTIBLE state, the thread is also waiting for a condition to be met, but it cannot be interrupted by a signal.
When a thread enters the TASK_UNINTERRUPTIBLE state, it is placed on an uninterruptible wait queue.
This state is usually used for critical operations, such as filesystem writes; even if the thread receives a signal, it is not woken up, which preserves the integrity of the critical operation.

Block the current thread

Blocking the current thread is important for performance: while the current thread waits for an I/O operation, other threads can run.

To block the current thread, the following steps are needed (see the sketch after this list):

  • Set the current thread's state to TASK_UNINTERRUPTIBLE or TASK_INTERRUPTIBLE
  • Add the thread to a waiting queue
  • Ask the Linux scheduler for a runnable thread
  • Switch context to that thread and start executing it
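
A minimal kernel-style sketch of this pattern using the classic wait-queue API (my_wq and condition are illustrative names, not real kernel symbols; wait_event_interruptible() wraps the same steps):

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static int condition;			/* set to non-zero by the waker */

static void wait_for_condition(void)
{
	DEFINE_WAIT(wait);

	while (!condition) {
		/* add ourselves to the wait queue and set TASK_INTERRUPTIBLE */
		prepare_to_wait(&my_wq, &wait, TASK_INTERRUPTIBLE);
		if (!condition)
			schedule();	/* give the CPU to another task until woken */
	}
	finish_wait(&my_wq, &wait);	/* back to TASK_RUNNING, removed from the queue */
}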

Wake up a task

We can call the wake_up function to wake up a thread; it mainly does the following:

  • Select a thread from the waiting queue
  • Set the thread state to runnable (TASK_RUNNING in Linux)
  • Put the thread on the scheduler's run queue
  • On an SMP system there is more to consider: each CPU has its own run queue, so things such as load balancing and processor affinity must be taken into account.

The relevant kernel code:
#define wake_up(x)                        __wake_up(x, TASK_NORMAL, 1, NULL)

/**
 * __wake_up - wake up threads blocked on a waitqueue.
 * @wq_head: the waitqueue
 * @mode: which threads
 * @nr_exclusive: how many wake-one or wake-many threads to wake up
 * @key: is directly passed to the wakeup function
 *
 * If this function wakes up a task, it executes a full memory barrier before
 * accessing the task state.
 */
void __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
               int nr_exclusive, void *key)
{
    __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}

static void __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode,
                  int nr_exclusive, int wake_flags, void *key)
{
  unsigned long flags;
  wait_queue_entry_t bookmark;

  bookmark.flags = 0;
  bookmark.private = NULL;
  bookmark.func = NULL;
  INIT_LIST_HEAD(&bookmark.entry);

  do {
          spin_lock_irqsave(&wq_head->lock, flags);
          nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                                          wake_flags, key, &bookmark);
          spin_unlock_irqrestore(&wq_head->lock, flags);
  } while (bookmark.flags & WQ_FLAG_BOOKMARK);
}

/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake all the non-exclusive tasks and one exclusive task.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
                            int nr_exclusive, int wake_flags, void *key,
                  wait_queue_entry_t *bookmark)
{
    wait_queue_entry_t *curr, *next;
    int cnt = 0;

    lockdep_assert_held(&wq_head->lock);

    if (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
          curr = list_next_entry(bookmark, entry);

          list_del(&bookmark->entry);
          bookmark->flags = 0;
    } else
          curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);

    if (&curr->entry == &wq_head->head)
          return nr_exclusive;

    list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
          unsigned flags = curr->flags;
          int ret;

          if (flags & WQ_FLAG_BOOKMARK)
                  continue;

          ret = curr->func(curr, mode, wake_flags, key);
          if (ret < 0)
                  break;
          if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                  break;

          if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
                          (&next->entry != &wq_head->head)) {
                  bookmark->flags = WQ_FLAG_BOOKMARK;
                  list_add_tail(&bookmark->entry, &next->entry);
                  break;
          }
    }

    return nr_exclusive;
}

int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key)
{
    int ret = default_wake_function(wq_entry, mode, sync, key);

    if (ret)
        list_del_init_careful(&wq_entry->entry);

    return ret;
}

int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
                    void *key)
{
    WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);
    return try_to_wake_up(curr->private, mode, wake_flags);
}

/**
 * try_to_wake_up - wake up a thread
 * @p: the thread to be awakened
 * @state: the mask of task states that can be woken
 * @wake_flags: wake modifier flags (WF_*)
 *
 * Conceptually does:
 *
 *   If (@state & @p->state) @p->state = TASK_RUNNING.
 *
 * If the task was not queued/runnable, also place it back on a runqueue.
 *
 * This function is atomic against schedule() which would dequeue the task.
 *
 * It issues a full memory barrier before accessing @p->state, see the comment
 * with set_current_state().
 *
 * Uses p->pi_lock to serialize against concurrent wake-ups.
 *
 * Relies on p->pi_lock stabilizing:
 *  - p->sched_class
 *  - p->cpus_ptr
 *  - p->sched_task_group
 * in order to do migration, see its use of select_task_rq()/set_task_cpu().
 *
 * Tries really hard to only take one task_rq(p)->lock for performance.
 * Takes rq->lock in:
 *  - ttwu_runnable()    -- old rq, unavoidable, see comment there;
 *  - ttwu_queue()       -- new rq, for enqueue of the task;
 *  - psi_ttwu_dequeue() -- much sadness :-( accounting will kill us.
 *
 * As a consequence we race really badly with just about everything. See the
 * many memory barriers and their comments for details.
 *
 * Return: %true if @p->state changes (an actual wakeup was done),
 *           %false otherwise.
 */
 static int
 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
     ...

Preempt tasks

Non-preemptive mode kernel

  • At each timer interrupt, the kernel checks whether the time slice of the current process has been used up.
  • If it has, a flag is set in the interrupt context.
  • Just before interrupt handling finishes, the kernel checks this flag and, if it is set, calls schedule().
  • A task running in the kernel (for example, executing a system call) cannot be preempted in this model, so there are no preemption-related synchronization problems.

Preemptive mode kernel

In this case, a task can be preempted by another thread even while it is executing a system call. Handling preemption requires special synchronization primitives: preempt_disable() and preempt_enable().

Disabling preemption and spinlocks: To simplify handling in preemptible kernels, and because a synchronization mechanism is needed on multiprocessor (SMP) systems anyway, the kernel automatically disables preemption while a spinlock is held. A spinlock is a locking mechanism in which a thread that cannot acquire the lock keeps spinning and waiting rather than giving up the CPU. To avoid race conditions, the kernel disables preemption so that the currently executing task is not preempted while it holds the spinlock.

Setting flags and re-enabling preemption: If, while preemption is disabled, a condition arises that requires preempting the current task (for example, its time slice is exhausted), a flag is set. The kernel checks this flag when preemption is re-enabled, for example when a spinlock is released with spin_unlock(); if preemption is needed, the scheduler is called to select a new task. In other words, the kernel checks whether it should switch to another task at the moment the spinlock is unlocked, ensuring fair scheduling and allocation of time slices.
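
A small sketch of the primitives mentioned above (kernel code; the comments describe the guarantees, not a specific driver):

#include <linux/preempt.h>

static void touch_per_cpu_data(void)
{
	preempt_disable();	/* from here on we cannot be preempted or migrated */

	/* safely access per-CPU data of the current CPU here */

	preempt_enable();	/* if a reschedule became pending meanwhile,
				 * the scheduler may be invoked right here */
}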

process context

We say that the kernel is running in process context when it is executing a system call.

In the process context, we can use the current macro to access information about the current process.

In process context we can sleep (block waiting for a specific condition).

In process context we can access user space (unless we are running in a kernel thread, which has no user address space).

kernel thread

The kernel core or a device driver sometimes needs to perform operations that block (that is, wait for certain conditions to be met), such as waiting for data from a hardware device or waiting for another kernel thread to finish. Because such operations can block, the kernel needs a mechanism for running them in tasks that are allowed to sleep.

Kernel threads are a special class of tasks that use no user-space resources: they have no user address space, open no user-space files, and make no user-space-related system calls. Kernel threads work entirely in kernel space and are mainly used for tasks internal to the kernel.
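
A minimal sketch of a kernel thread using the kthread API (my_thread_fn and my_task are illustrative names; creation and termination would normally live in a module's init/exit paths):

#include <linux/kthread.h>
#include <linux/delay.h>

static struct task_struct *my_task;

static int my_thread_fn(void *data)
{
	/* runs entirely in kernel space; no user address space attached */
	while (!kthread_should_stop()) {
		/* do some periodic work; sleeping/blocking is allowed here */
		msleep(1000);
	}
	return 0;
}

/* creation (e.g. in module init):    my_task = kthread_run(my_thread_fn, NULL, "my_kthread"); */
/* termination (e.g. in module exit): kthread_stop(my_task); */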

Source: blog.csdn.net/weixin_43466027/article/details/132926750