Linux kernel scheduling policy, priority, scheduling class

1. Scheduling policies supported by the Linux kernel

  • First-in first-out scheduling (SCHED_FIFO), with no time slice.
  • Round-robin scheduling (SCHED_RR), with time slices.
  • Deadline scheduling (SCHED_DEADLINE).

These scheduling policies are used to schedule real-time processes.

Normal (non-real-time) processes use the following scheduling policies:

  • Standard round-robin time-sharing (SCHED_NORMAL) and SCHED_BATCH schedule ordinary non-real-time processes.
  • Idle (SCHED_IDLE) is for very-low-priority background tasks that should run only when nothing else needs the CPU (see the sketch after this list).
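
As an illustration (not from the original post), here is a minimal user-space C sketch that requests SCHED_FIFO with sched_setscheduler(); note that SCHED_DEADLINE cannot be set this way and requires sched_setattr() instead.

#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* Real-time priority 1..99; a larger value means higher priority. */
	struct sched_param sp = { .sched_priority = 50 };

	/* Request SCHED_FIFO for the calling process (needs root or CAP_SYS_NICE). */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("current policy: %d (SCHED_FIFO = %d)\n",
	       sched_getscheduler(0), SCHED_FIFO);
	return 0;
}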

2. Process priority

Deadline processes have a higher priority than real-time processes, and real-time processes have higher priority than normal processes.

A deadline process has priority -1. The real-time priority of a real-time process ranges from 1 to 99, and the larger the value, the higher the priority. The static priority of a normal process ranges from 100 to 139, and the smaller the value, the higher the priority. You can change a normal process's priority by modifying its nice value: static priority = 120 + nice.

prio is the scheduling priority used by the scheduler; the smaller the value, the higher the priority. In most cases it equals normal_prio.

priority      deadline process   normal process                                real-time process
prio          normal_prio        normal_prio                                   normal_prio
static_prio   0                  120 + nice (smaller value = higher priority)  0
normal_prio   -1                 static_prio                                   99 - rt_priority
rt_priority   0                  0                                             1 to 99 (larger value = higher priority)

In the task_struct structure, the four priority-related members are as follows:

// include/linux/sched.h
int          prio;
int          static_prio;
int          normal_prio;
unsigned int rt_priority;
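
The mapping in the table above is computed roughly as follows; this is a simplified sketch based on normal_prio() in kernel/sched/core.c (the exact code differs between kernel versions). MAX_DL_PRIO is 0 and MAX_RT_PRIO is 100.

static inline int __normal_prio(struct task_struct *p)
{
	return p->static_prio;				/* 120 + nice for a normal process */
}

static inline int normal_prio(struct task_struct *p)
{
	int prio;

	if (task_has_dl_policy(p))
		prio = MAX_DL_PRIO - 1;			/* -1 for a deadline process */
	else if (task_has_rt_policy(p))
		prio = MAX_RT_PRIO - 1 - p->rt_priority; /* 99 - rt_priority */
	else
		prio = __normal_prio(p);
	return prio;
}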

3. Fair scheduling (CFS) and the other scheduling classes

3.1. Scheduling class

The Linux kernel has five scheduling classes (sched_class):

  • dl_sched_class: deadline scheduling class.
  • rt_sched_class: real-time scheduling class.
  • stop_sched_class: stop scheduling class.
  • idle_sched_class: idle scheduling class.
  • fair_sched_class: fair (CFS) scheduling class.

Each scheduling class implements its own scheduling policy. Mainly to make it easy to add new scheduling policies, the Linux kernel abstracts the scheduling class sched_class. The scheduling classes are declared as follows (kernel/sched/sched.h):

extern const struct sched_class dl_sched_class;
extern const struct sched_class rt_sched_class;
extern const struct sched_class stop_sched_class;
extern const struct sched_class idle_sched_class;
extern const struct sched_class fair_sched_class;

  • The stop_sched_class scheduling class can preempt all other processes, and no other process can preempt it. The stop scheduling class exists to stop the processor so it can do more urgent work; only the migration thread belongs to this class, and each CPU has one migration thread (named migration/<cpu_id>).
  • The dl_sched_class scheduling class uses a red-black tree to sort processes by absolute deadline in ascending order, and each time it schedules, it picks the process with the smallest (earliest) absolute deadline.
  • rt_sched_class maintains one queue per scheduling priority and uses a bitmap (DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1)) to quickly find the first non-empty queue; see the sketch below.
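
For reference, the per-priority queue structure looks like this (kernel/sched/sched.h; the layout may differ slightly between kernel versions):

struct rt_prio_array {
	DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */
	struct list_head queue[MAX_RT_PRIO];
};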

Scheduling classes ranked from highest to lowest priority: stop scheduling > deadline scheduling > real-time scheduling > fair scheduling > idle scheduling.

The SCHED_RR time slice (default 100 ms) can be changed via /proc/sys/kernel/sched_rr_timeslice_ms.

3.2. Fair scheduling CFS

The fair scheduling class uses the completely fair scheduling (CFS) algorithm, which abandons fixed time slices and instead introduces the virtual runtime (vruntime). The vruntime is calculated as follows:

Virtual runtime (vruntime) = actual runtime × nice-0 weight (1024) / process weight.

(kernel/sched/core.c)

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

The completely fair scheduling algorithm uses a red-black tree to sort processes in ascending order of virtual running time, and selects the process with the smallest virtual running time each time it is scheduled.

Process time slice = scheduling period * process weight / sum of weights of all processes in the run queue.
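
To make the two formulas concrete, here is a small user-space C sketch (not kernel code) that reproduces the arithmetic using an excerpt of the weight table above; the 24 ms scheduling period and the three-task run queue are made-up inputs.

#include <stdio.h>

/* Excerpt of the kernel's sched_prio_to_weight[] for nice levels -5..5. */
static const int weight_of_nice[] = {
	/* -5..-1 */ 3121, 2501, 1991, 1586, 1277,
	/*  0     */ 1024,
	/*  1..5  */ 820, 655, 526, 423, 335,
};
#define WEIGHT(nice) (weight_of_nice[(nice) + 5])

int main(void)
{
	const long long sched_period_ns = 24000000LL;	/* assumed 24 ms scheduling period */
	int nices[] = { 0, 0, 5 };			/* three runnable tasks */
	long long sum = 0;
	int i;

	for (i = 0; i < 3; i++)
		sum += WEIGHT(nices[i]);

	for (i = 0; i < 3; i++) {
		int w = WEIGHT(nices[i]);
		/* time slice = scheduling period * weight / total weight */
		long long slice_ns = sched_period_ns * w / sum;
		/* vruntime advances by: actual runtime * nice-0 weight / weight */
		long long vrun_per_ms = 1000000LL * 1024 / w;

		printf("task %d (nice %2d): time slice %lld ns, vruntime per 1 ms of runtime %lld ns\n",
		       i, nices[i], slice_ns, vrun_per_ms);
	}
	return 0;
}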

3.3. Run queue

Each processor has its own run queue. The structure is struct rq, and the per-CPU global variable is defined as follows:

(kernel/sched/core.c)

DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

struct rq describes the ready queue. One is created for each CPU, and each CPU's runnable processes are queued and scheduled on its local run queue.
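
A heavily abridged sketch of the struct rq fields most relevant here; the real definition in kernel/sched/sched.h has many more members and changes between kernel versions.

struct rq {
	raw_spinlock_t		lock;		/* protects the run queue */
	unsigned int		nr_running;	/* number of runnable tasks */
	struct cfs_rq		cfs;		/* fair (CFS) run queue */
	struct rt_rq		rt;		/* real-time run queue */
	struct dl_rq		dl;		/* deadline run queue */
	struct task_struct	*curr;		/* currently running task */
	struct task_struct	*idle;		/* per-CPU idle task */
	struct task_struct	*stop;		/* per-CPU stop/migration task */
	u64			clock;		/* run-queue clock */
	u64			nr_switches;	/* number of context switches */
	/* ... many more fields ... */
};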

3.4. Scheduling process

The function for active (voluntary) scheduling is schedule(), which delegates the main work to __schedule().

(kernel/sched/core.c)

asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;	// get the current process

	sched_submit_work(tsk);	// avoid a deadlock when the process is about to sleep
	do {
		preempt_disable();	// disable kernel preemption
		__schedule(false);	// core function that does the actual scheduling
		sched_preempt_enable_no_resched();	// re-enable kernel preemption without rescheduling
	} while (need_resched());
	sched_update_worker(tsk);
}
EXPORT_SYMBOL(schedule);

The main processing in __schedule() is as follows:

  1. Call pick_next_task() to pick the next process.
  2. Call context_switch() to switch processes.

(kernel/sched/core.c)
/*
 * __schedule() is the main scheduler function.
 *
 * The main means of driving the scheduler and thus entering this function are:
 *
 *   1. Explicit blocking: mutex, semaphore, waitqueue, etc.
 *
 *   2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
 *      paths. For example, see arch/x86/entry_64.S.
 *
 *      To drive preemption between tasks, the scheduler sets the flag in timer
 *      interrupt handler scheduler_tick().
 *
 *   3. Wakeups don't really cause entry into schedule(). They add a
 *      task to the run-queue and that's it.
 *
 *      Now, if the new task added to the run-queue preempts the current
 *      task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 *      called on the nearest possible occasion:
 *
 *       - If the kernel is preemptible (CONFIG_PREEMPTION=y):
 *
 *         - in syscall or exception context, at the next outmost
 *           preempt_enable(). (this might be as soon as the wake_up()'s
 *           spin_unlock()!)
 *
 *         - in IRQ context, return from interrupt-handler to
 *           preemptible context
 *
 *       - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
 *         then at the next:
 *
 *          - cond_resched() call
 *          - explicit schedule() call
 *          - return from syscall or exception to user-space
 *          - return from interrupt-handler to user-space
 *
 * WARNING: must be called with preemption disabled!
 */
static void __sched notrace __schedule(bool preempt)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;

	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;

	schedule_debug(prev, preempt);

	if (sched_feat(HRTICK))
		hrtick_clear(rq);

	local_irq_disable();
	rcu_note_context_switch(preempt);

	/*
	 * Make sure that signal_pending_state()->signal_pending() below
	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
	 * done by the caller to avoid the race with signal_wake_up().
	 *
	 * The membarrier system call requires a full memory barrier
	 * after coming from user-space, before storing to rq->curr.
	 */
	rq_lock(rq, &rf);
	smp_mb__after_spinlock();

	/* Promote REQ to ACT */
	rq->clock_update_flags <<= 1;
	update_rq_clock(rq);

	switch_count = &prev->nivcsw;
	if (!preempt && prev->state) {
		if (signal_pending_state(prev->state, prev)) {
			prev->state = TASK_RUNNING;
		} else {
			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

			if (prev->in_iowait) {
				atomic_inc(&rq->nr_iowait);
				delayacct_blkio_start();
			}
		}
		switch_count = &prev->nvcsw;
	}

	next = pick_next_task(rq, prev, &rf);
	clear_tsk_need_resched(prev);
	clear_preempt_need_resched();

	if (likely(prev != next)) {
		rq->nr_switches++;
		/*
		 * RCU users of rcu_dereference(rq->curr) may not see
		 * changes to task_struct made by pick_next_task().
		 */
		RCU_INIT_POINTER(rq->curr, next);
		/*
		 * The membarrier system call requires each architecture
		 * to have a full memory barrier after updating
		 * rq->curr, before returning to user-space.
		 *
		 * Here are the schemes providing that barrier on the
		 * various architectures:
		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
		 * - finish_lock_switch() for weakly-ordered
		 *   architectures where spin_unlock is a full barrier,
		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
		 *   is a RELEASE barrier),
		 */
		++*switch_count;

		trace_sched_switch(preempt, prev, next);

		/* Also unlocks the rq: */
		rq = context_switch(rq, prev, next, &rf);
	} else {
		rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
		rq_unlock_irq(rq, &rf);
	}

	balance_callback(rq);
}


/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	prepare_task_switch(rq, prev, next);

	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_start_context_switch(prev);

	/*
	 * kernel -> kernel   lazy + transfer active
	 *   user -> kernel   lazy + mmgrab() active
	 *
	 * kernel ->   user   switch + mmdrop() active
	 *   user ->   user   switch
	 */
	if (!next->mm) {                                // to kernel
		enter_lazy_tlb(prev->active_mm, next);

		next->active_mm = prev->active_mm;
		if (prev->mm)                           // from user
			mmgrab(prev->active_mm);
		else
			prev->active_mm = NULL;
	} else {                                        // to user
		membarrier_switch_mm(rq, prev->active_mm, next->mm);
		/*
		 * sys_membarrier() requires an smp_mb() between setting
		 * rq->curr / membarrier_switch_mm() and returning to userspace.
		 *
		 * The below provides this either through switch_mm(), or in
		 * case 'prev->active_mm == next->mm' through
		 * finish_task_switch()'s mmdrop().
		 */
		switch_mm_irqs_off(prev->active_mm, next->mm, next);

		if (!prev->mm) {                        // from kernel
			/* will mmdrop() in finish_task_switch(). */
			rq->prev_mm = prev->active_mm;
			prev->active_mm = NULL;
		}
	}

	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

	prepare_lock_switch(rq, next, rf);

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);
	barrier();

	return finish_task_switch(prev);
}

(1) Switching the user virtual address space. The ARM64 architecture uses the generic switch_mm_irqs_off, which by default falls back to switch_mm; the kernel source is as follows:

// include/linux/mmu_context.h

/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_MMU_CONTEXT_H
#define _LINUX_MMU_CONTEXT_H

#include <asm/mmu_context.h>

struct mm_struct;

void use_mm(struct mm_struct *mm);
void unuse_mm(struct mm_struct *mm);

/* Architectures that care about IRQ state in switch_mm can override this. */
#ifndef switch_mm_irqs_off
# define switch_mm_irqs_off switch_mm
#endif

#endif

On ARM64, the switch_mm function is implemented as follows:

// arch/arm64/include/asm/mmu_context.h

static inline void __switch_mm(struct mm_struct *next)
{
	unsigned int cpu = smp_processor_id();

	/*
	 * init_mm.pgd does not contain any user mappings and it is always
	 * active for kernel addresses in TTBR1. Just set the reserved TTBR0.
	 */
	if (next == &init_mm) {
		cpu_set_reserved_ttbr0();
		return;
	}

	check_and_switch_context(next, cpu);
}

static inline void
switch_mm(struct mm_struct *prev, struct mm_struct *next,
	  struct task_struct *tsk)
{
	if (prev != next)
		__switch_mm(next);

	/*
	 * Update the saved TTBR0_EL1 of the scheduled-in task as the previous
	 * value may have not been initialised yet (activate_mm caller) or the
	 * ASID has changed since the last run (following the context switch
	 * of another thread of the same process).
	 */
	update_saved_ttbr0(tsk, next);
}

(2) To switch registers, the macro switch_to delegates this work to the function __switch_to:

// include/asm-generic/switch_to.h

#ifndef __ASM_GENERIC_SWITCH_TO_H
#define __ASM_GENERIC_SWITCH_TO_H

#include <linux/thread_info.h>

/*
 * Context switching is now performed out-of-line in switch_to.S
 */
extern struct task_struct *__switch_to(struct task_struct *,
				       struct task_struct *);

#define switch_to(prev, next, last)					\
	do {								\
		((last) = __switch_to((prev), (next)));			\
	} while (0)

#endif /* __ASM_GENERIC_SWITCH_TO_H */

(arch/arm64/kernel/process.c)

/*
 * Thread switching.
 */
__notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
				struct task_struct *next)
{
	struct task_struct *last;

	fpsimd_thread_switch(next);
	tls_thread_switch(next);
	hw_breakpoint_thread_switch(next);
	contextidr_thread_switch(next);
	entry_task_switch(next);
	uao_thread_switch(next);
	ptrauth_thread_switch(next);
	ssbs_thread_switch(next);

	/*
	 * Complete any pending TLB or cache maintenance on this CPU in case
	 * the thread migrates to a different CPU.
	 * This full barrier is also required by the membarrier system
	 * call.
	 */
	dsb(ish);

	/* the actual thread switch */
	last = cpu_switch_to(prev, next);

	return last;
}

3.5. Scheduling timing

A process is scheduled at the following times:

  • The process actively calls the schedule() function.
  • Periodic (tick) scheduling preempts the current process and forces it to give up the processor.
  • When a process is woken up, the woken process may preempt the current process.
  • When a new process is created, the new process may preempt the current process.

(1) Active scheduling:

A process running in user mode cannot call the schedule() function directly; it can only enter kernel mode through a system call. If the system call has to wait for a resource, such as a mutex or a semaphore, the process state is set to a sleeping state and schedule() is then called to pick another process to run.
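
A sketch of this classic wait pattern using the standard wait-queue API; my_wq, data_ready and wait_for_data() are hypothetical names, not from the original post.

#include <linux/wait.h>
#include <linux/sched/signal.h>
#include <linux/errno.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static int data_ready;

static int wait_for_data(void)
{
	DEFINE_WAIT(wait);

	for (;;) {
		/* Set the task state to TASK_INTERRUPTIBLE and queue it on my_wq. */
		prepare_to_wait(&my_wq, &wait, TASK_INTERRUPTIBLE);
		if (data_ready)
			break;
		if (signal_pending(current)) {
			finish_wait(&my_wq, &wait);
			return -ERESTARTSYS;
		}
		schedule();		/* give up the CPU until woken up */
	}
	finish_wait(&my_wq, &wait);
	return 0;
}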

(2) Periodic scheduling:

Periodic scheduling is driven by scheduler_tick(), which calls the task_tick() method of the scheduling class that the current process belongs to.
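
An abridged sketch of scheduler_tick() (kernel/sched/core.c; bookkeeping omitted, details vary by kernel version) showing the call into the scheduling class:

void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;
	struct rq_flags rf;

	sched_clock_tick();

	rq_lock(rq, &rf);
	update_rq_clock(rq);
	/* ... load tracking and other bookkeeping omitted ... */
	curr->sched_class->task_tick(rq, curr, 0);	/* e.g. task_tick_fair() for CFS */
	rq_unlock(rq, &rf);

	/* ... trigger_load_balance() etc. omitted ... */
}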

4. RCU mechanism and memory barrier

(1) RCU (read-copy update) is a synchronization mechanism in the Linux kernel.

  • A writer modifies an object by first making a copy, then updating the copy, and finally replacing the old object with the new one. While the writer is working on the copy, readers can still run and read the old data.
  • Before deleting the old object, the writer must wait until all readers that may still be accessing it have finished; the time spent waiting for all readers to finish is called the grace period.
  • RCU readers have essentially no synchronization overhead: they do not need to acquire locks, execute delayed (atomic) instructions, or execute memory barriers. Writers, on the other hand, have a large synchronization overhead: they must copy and modify the object, defer freeing the old one, and use locks to synchronize with other writers. In a sense this is also a drawback of RCU (see the usage sketch below).
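
A sketch of typical RCU usage with the standard kernel primitives; gbl_ptr, gbl_lock, my_data and the helper functions are hypothetical names, not from the original post.

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct my_data {
	int value;
};

static struct my_data __rcu *gbl_ptr;
static DEFINE_SPINLOCK(gbl_lock);		/* serializes writers only */

/* Reader: no locks, no explicit barriers. */
static int read_value(void)
{
	struct my_data *p;
	int v = 0;

	rcu_read_lock();
	p = rcu_dereference(gbl_ptr);
	if (p)
		v = p->value;
	rcu_read_unlock();
	return v;
}

/* Writer: copy, update the copy, publish, wait for the grace period, free. */
static int update_value(int new_value)
{
	struct my_data *new_obj, *old_obj;

	new_obj = kmalloc(sizeof(*new_obj), GFP_KERNEL);
	if (!new_obj)
		return -ENOMEM;

	spin_lock(&gbl_lock);
	old_obj = rcu_dereference_protected(gbl_ptr, lockdep_is_held(&gbl_lock));
	if (old_obj)
		*new_obj = *old_obj;			/* copy */
	new_obj->value = new_value;			/* update the copy */
	rcu_assign_pointer(gbl_ptr, new_obj);		/* replace old with new */
	spin_unlock(&gbl_lock);

	synchronize_rcu();				/* wait for the grace period */
	kfree(old_obj);					/* kfree(NULL) is a no-op */
	return 0;
}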

(2) Memory barrier: a mechanism that enforces the ordering of memory accesses; it is used to prevent problematic reordering of instructions by the compiler and the processor.

Compilers reorder and optimize instructions so the compiled program runs faster on the CPU, and sometimes the optimized result does not match what the program requires. In addition, modern CPUs use superscalar architectures with multi-issue, out-of-order execution, so many instructions can execute in parallel within one clock cycle.

  • Memory barriers can be divided into two types: compiler memory barriers and CPU memory barriers.
  • The kernel supports three types of memory barriers: memory-mapped I/O write barriers, compiler barriers, and processor barriers. A brief usage sketch follows this list.
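
A sketch of the classic producer/consumer flag pattern with the kernel's SMP barrier macros; data and ready are hypothetical shared variables. barrier() alone would only stop the compiler from reordering; smp_wmb()/smp_rmb() also order the accesses on the processor.

static int data;
static int ready;

/* Producer (runs on one CPU). */
static void publish(int v)
{
	data = v;
	smp_wmb();			/* make data visible before the flag */
	WRITE_ONCE(ready, 1);
}

/* Consumer (runs on another CPU). */
static int consume(void)
{
	while (!READ_ONCE(ready))
		cpu_relax();
	smp_rmb();			/* do not read data before seeing the flag */
	return data;
}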

Origin blog.csdn.net/Long_xu/article/details/129355104