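Task Sleep and Wakeup

I. The Problem

A task is fundamentally in one of two states: runnable or not runnable. Running tasks do the actual work of the kernel, while tasks that are not runnable are how synchronization between tasks is implemented. Moving a task between the runnable and non-runnable states is therefore one of the basic jobs of the kernel scheduler.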

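II. When and How the State Is Set

1. Deactivating a task

In the scheduler code, the two basic operations that make a thread active or inactive are activate_task and deactivate_task. These two functions do the substantive work of putting a thread on the run queue or taking it off. That is different from set_current_state, which only changes a flag. For example, after a thread uses set_current_state to mark itself TASK_INTERRUPTIBLE, it keeps running until schedule is called in that thread.

Now suppose a task decides that it cannot continue without some condition or resource. It can simply use set_current_state to put itself into a non-RUNNING state and then call schedule. schedule gives special treatment to the state of the task that called it and makes the substantive decision, so this is the most important point at which a state set with set_current_state takes effect.

Here is how schedule examines the state of the calling thread: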

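switch_count = &prev->nivcsw;
/*
 * TASK_RUNNING is defined as 0 (#define TASK_RUNNING 0); every non-zero
 * value means the task is not runnable, so it may really have to be
 * removed from the run queue. PREEMPT_ACTIVE marks a kernel-mode
 * preemption: the preempted task did not call schedule() itself, it was
 * rescheduled passively when an interrupt or exception occurred. Such a
 * task skips this block and stays on the run queue. The distinction also
 * feeds the statistics: a preemption is counted in nivcsw, anything else
 * is a voluntary switch counted in nvcsw.
 */
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
  switch_count = &prev->nvcsw;
  /*
   * If the task can be woken by a signal and a signal is already
   * pending, wake it immediately, otherwise the signal could be lost.
   * This can happen on SMP: a task on another CPU sends this task a
   * signal and, seeing that it is still running, does not wake it. This
   * task then marks itself interruptible and prepares to sleep, so the
   * check has to be repeated here or the signal would never wake it.
   * (It is not clear to me when this could arise on a single CPU;
   * perhaps some interrupt or exception sends the current task a signal
   * such as SIGSEGV, although that should not happen in a normal kernel.)
   */
  if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
    unlikely(signal_pending(prev))))
   prev->state = TASK_RUNNING; /* the task is not put to sleep; it keeps running */
  else {
   if (prev->state == TASK_UNINTERRUPTIBLE)
    rq->nr_uninterruptible++;
   deactivate_task(prev, rq); /* the substantive step: the thread really leaves the run queue */
  }
}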
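
The three outcomes of that check can be put into a small user-space model. This is only a sketch under stated assumptions: model_task, sched_action and schedule_state_check are hypothetical names, and the state constants are stand-ins chosen for illustration rather than taken from kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in state values for the model; TASK_RUNNING must be 0. */
enum { TASK_RUNNING = 0, TASK_INTERRUPTIBLE = 1, TASK_UNINTERRUPTIBLE = 2 };

/* What the state check at the top of schedule() decides for prev. */
enum sched_action {
    STAY_ON_RUNQUEUE,  /* state is TASK_RUNNING, or this is a kernel preemption */
    WOKEN_BY_SIGNAL,   /* state reset to TASK_RUNNING because a signal is pending */
    DEQUEUED           /* deactivate_task() would be called */
};

/* Hypothetical, much reduced task descriptor. */
struct model_task {
    long state;
    bool signal_pending;
};

/* Model of the check: preempt_active stands for the PREEMPT_ACTIVE bit. */
enum sched_action schedule_state_check(struct model_task *prev, bool preempt_active)
{
    assert(prev != NULL);
    if (prev->state == TASK_RUNNING || preempt_active)
        return STAY_ON_RUNQUEUE;
    if ((prev->state & TASK_INTERRUPTIBLE) && prev->signal_pending) {
        prev->state = TASK_RUNNING;  /* do not sleep, the signal must be handled */
        return WOKEN_BY_SIGNAL;
    }
    return DEQUEUED;
}
```

In this model an uninterruptible task is dequeued even with a signal pending, and a task that was preempted after setting its state but before calling schedule stays on the run queue.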

deactivate_task--->>>dequeue_task

static void dequeue_task(struct task_struct *p, struct prio_array *array)
{
  array->nr_active--;
  /*
   * This is the substantive removal. All task selection goes through
   * this structure; the call takes p off the list it is queued on via
   * run_list, so a later walk of that list no longer finds it.
   */
  list_del(&p->run_list);
  if (list_empty(array->queue + p->prio))
    __clear_bit(p->prio, array->bitmap);
}

While we are here, a look at the priority array:

struct prio_array {
unsigned int nr_active;
DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
struct list_head queue[MAX_PRIO];
};

#define MAX_USER_RT_PRIO 100
#define MAX_RT_PRIO MAX_USER_RT_PRIO

#define MAX_PRIO (MAX_RT_PRIO + 40)

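In other words there are 140 priority levels in total. The first 100 (0 to 99) are for real-time tasks and the remaining 40 (100 to 139) are for ordinary tasks. In the scheduler shown here, ordinary tasks also have their own per-priority list heads in the same prio_array. Later kernels schedule ordinary tasks with the CFS red-black tree instead, which needs no priority queues for them.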
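
To make the bitmap-plus-queues idea concrete, here is a minimal user-space sketch. It is not the kernel's code: the names model_task, model_prio_array, model_enqueue, model_dequeue and model_pick_next are hypothetical, the bitmap is one byte per priority for readability, and the list heads are plain sentinel nodes instead of struct list_head.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_PRIO 140

/* Hypothetical task: prev/next stand in for the kernel's run_list. */
struct model_task {
    int prio;
    struct model_task *prev, *next;
};

struct model_prio_array {
    unsigned int nr_active;
    unsigned char bitmap[MODEL_MAX_PRIO];     /* 1 if the queue is non-empty */
    struct model_task queue[MODEL_MAX_PRIO];  /* sentinel list heads */
};

void prio_array_init(struct model_prio_array *a)
{
    a->nr_active = 0;
    for (int i = 0; i < MODEL_MAX_PRIO; i++) {
        a->bitmap[i] = 0;
        a->queue[i].prio = i;
        a->queue[i].prev = a->queue[i].next = &a->queue[i];
    }
}

/* Append p to the tail of its priority queue and mark the queue non-empty. */
void model_enqueue(struct model_prio_array *a, struct model_task *p)
{
    assert(p->prio >= 0 && p->prio < MODEL_MAX_PRIO);
    struct model_task *head = &a->queue[p->prio];
    p->prev = head->prev;
    p->next = head;
    head->prev->next = p;
    head->prev = p;
    a->bitmap[p->prio] = 1;
    a->nr_active++;
}

/* Unlink p; clear the bitmap entry if its queue became empty. */
void model_dequeue(struct model_prio_array *a, struct model_task *p)
{
    struct model_task *head = &a->queue[p->prio];
    p->prev->next = p->next;
    p->next->prev = p->prev;
    p->prev = p->next = p;
    if (head->next == head)
        a->bitmap[p->prio] = 0;
    a->nr_active--;
}

/* Lowest index wins, because a smaller number is a higher priority. */
struct model_task *model_pick_next(struct model_prio_array *a)
{
    for (int i = 0; i < MODEL_MAX_PRIO; i++)
        if (a->bitmap[i])
            return a->queue[i].next;
    return NULL;
}
```

The linear scan in model_pick_next is the part the real bitmap speeds up: finding the first set bit replaces walking all 140 queues.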

2. Activating a task

A task is activated through the activate_task interface, which ends up in enqueue_task:

static void enqueue_task(struct task_struct *p, struct prio_array *array)
{
  sched_info_queued(p);
  /* every task on the run queue is linked in through its run_list member */
  list_add_tail(&p->run_list, array->queue + p->prio);
  __set_bit(p->prio, array->bitmap); /* mark this priority as non-empty in the bitmap */
  array->nr_active++;
  /* remembered here and used later by deactivate_task: dequeue_task(p, p->array); */
  p->array = array;
}

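Most activations happen in try_to_wake_up. The meaning of its sync parameter is not very obvious. According to the comments, sync set to 1 means that the newly woken thread should not preempt the current thread, and Understanding the Linux Kernel describes it the same way: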

A flag (sync) that forbids the awakened process to preempt the process currently running on the local CPU

Because sync is 0 in most cases, a newly woken task normally goes through the preemption check:

if (!sync || cpu != this_cpu) {
  if (TASK_PREEMPTS_CURR(p, rq))
    resched_task(rq->curr);
}

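resched_task does not actually do anything substantive. It only sets a flag saying that a preemption is needed at some later time. When exactly does the preemption happen? That is not fixed either. Mostly it happens on return from an interrupt or exception. If it does not happen there, preemption was most likely disabled with preempt_disable. But every disable is paired with an enable, and preempt_enable checks the flag again and calls the scheduler if the thread is marked as needing a reschedule: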

#define preempt_enable() \
do { \
  preempt_enable_no_resched(); \
  barrier(); \
  preempt_check_resched(); \
} while (0)

#define preempt_check_resched() \
do { \
  if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
    preempt_schedule(); \
} while (0)
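
The deferral can be modelled in a few lines of user-space C. This is a sketch, not kernel code: model_cpu and the model_* functions are hypothetical names, and the check on the preemption counter that the kernel performs inside preempt_schedule is folded into model_preempt_enable for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for the model. */
struct model_cpu {
    int preempt_count;   /* nesting depth of preempt_disable() */
    bool need_resched;   /* stands in for TIF_NEED_RESCHED */
    int schedule_calls;  /* how many times the scheduler ran */
};

/* Like resched_task(): only records that a reschedule is wanted. */
void model_resched_task(struct model_cpu *c)
{
    c->need_resched = true;
}

void model_schedule(struct model_cpu *c)
{
    c->need_resched = false;
    c->schedule_calls++;
}

void model_preempt_disable(struct model_cpu *c)
{
    c->preempt_count++;
}

/* The outermost enable is where a pending reschedule finally happens. */
void model_preempt_enable(struct model_cpu *c)
{
    assert(c->preempt_count > 0);
    c->preempt_count--;
    if (c->preempt_count == 0 && c->need_resched)
        model_schedule(c);
}
```

With two nested disables, a reschedule requested in between runs only when the second enable brings the counter back to zero.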

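Even if no interrupt or exception occurs in the meantime, remember that entering the kernel from user mode is itself a kind of exception or interrupt, so the same check is made on the way back to user mode and the reschedule happens there. From linux-2.6.21\arch\i386\kernel\entry.S: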

ENTRY(resume_userspace)
  DISABLE_INTERRUPTS(CLBR_ANY) # make sure we don't miss an interrupt
     # setting need_resched or sigpending
     # between sampling and the iret
movl TI_flags(%ebp), %ecx
andl $_TIF_WORK_MASK, %ecx # is there any work to be done on
     # int/exception return?
jne work_pending
jmp restore_all
END(ret_from_exception)

/* work to do on interrupt/exception return */
#define _TIF_WORK_MASK \
(0x0000FFFF & ~(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
_TIF_SECCOMP | _TIF_SYSCALL_EMU))

In other words, apart from the flags listed above, any other flag set in the low 16 bits causes a jump to work_pending before the return to user mode:
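
The mask arithmetic is easy to check in plain C. The sketch below uses hypothetical M_-prefixed names, and the bit positions are illustrative assumptions; the authoritative values are in the architecture's thread_info.h for the kernel version in question.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values, assumed for this sketch only. */
#define M_TIF_SYSCALL_TRACE (1u << 0)
#define M_TIF_SIGPENDING    (1u << 2)
#define M_TIF_NEED_RESCHED  (1u << 3)
#define M_TIF_SYSCALL_EMU   (1u << 6)
#define M_TIF_SYSCALL_AUDIT (1u << 7)
#define M_TIF_SECCOMP       (1u << 8)

/* Same shape as _TIF_WORK_MASK: low 16 bits minus the syscall-only flags. */
#define M_TIF_WORK_MASK \
    (0x0000FFFFu & ~(M_TIF_SYSCALL_TRACE | M_TIF_SYSCALL_AUDIT | \
                     M_TIF_SECCOMP | M_TIF_SYSCALL_EMU))

/* Mirrors "andl $_TIF_WORK_MASK, %ecx; jne work_pending". */
bool jumps_to_work_pending(unsigned int flags)
{
    return (flags & M_TIF_WORK_MASK) != 0;
}

/* Mirrors "testb $_TIF_NEED_RESCHED, %cl; jz work_notifysig". */
bool work_pending_calls_schedule(unsigned int flags)
{
    assert(jumps_to_work_pending(flags)); /* only reached through work_pending */
    return (flags & M_TIF_NEED_RESCHED) != 0;
}
```

Under these assumed values a thread with only a tracing or seccomp flag returns straight to user mode, while a pending signal or a reschedule request diverts it to work_pending.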

work_pending:
testb $_TIF_NEED_RESCHED, %cl
jz work_notifysig
work_resched:
call schedule

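This is where the reschedule happens. So once a reschedule has been requested, it may take place after an interrupt or exception in the kernel, or at a preempt_enable in the kernel, and at the latest on return to user mode.

3. Choosing the next task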

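array = rq->active;
/*
 * If the active array is empty, swap the active and expired arrays. This
 * mainly serves time-sharing: in this scheduler an ordinary task that
 * has used up its time slice is moved to the expired array.
 */
if (unlikely(!array->nr_active)) {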
  /*
   * Switch the active and expired arrays.
   */
  schedstat_inc(rq, sched_switch);
  rq->active = rq->expired;
  rq->expired = array;
  array = rq->active;
  rq->expired_timestamp = 0;
  rq->best_expired_prio = MAX_PRIO;
}

idx = sched_find_first_bit(array->bitmap); /* index of the highest-priority non-empty queue */
queue = array->queue + idx; /* list head for that priority */
next = list_entry(queue->next, struct task_struct, run_list); /* first task on the list, reached through run_list */
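
The array swap itself is nothing more than an exchange of two pointers. A minimal sketch, with hypothetical names (model_array, model_rq, model_select_array) and everything except the task counter left out:

```c
#include <assert.h>

/* Hypothetical reduced priority array: only the task count matters here. */
struct model_array {
    unsigned int nr_active;
};

struct model_rq {
    struct model_array *active;
    struct model_array *expired;
    struct model_array arrays[2];
};

/* Return the array to pick from, swapping first if active has run dry. */
struct model_array *model_select_array(struct model_rq *rq)
{
    assert(rq->active && rq->expired);
    struct model_array *array = rq->active;
    if (array->nr_active == 0) {
        rq->active = rq->expired;
        rq->expired = array;
        array = rq->active;
    }
    return array;
}
```

No task is moved or copied during the swap, which is why it costs the same no matter how many tasks are waiting.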

4. The 2.6.37 scheduler

/*
* Pick up the highest-prio task:
*/
static inline struct task_struct *
pick_next_task(struct rq *rq)
{
const struct sched_class *class;
struct task_struct *p;

/*
* Optimization: we know that if all tasks are in
* the fair class we can call that function directly:
*/
if (likely(rq->nr_running == rq->cfs.nr_running)) { /* only CFS tasks are runnable, the common case on desktops, so call fair_sched_class directly */
  p = fair_sched_class.pick_next_task(rq);
  if (likely(p))
   return p;
}

for_each_class(class) { /* otherwise ask each class in priority order, so real-time tasks are always picked first */
  p = class->pick_next_task(rq);
  if (p)
   return p;
}

BUG(); /* the idle class will always have a runnable task */
}

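The classes are walked in priority order. Note that this is a loop: if the first class returns NULL, the second one is asked, so a higher-priority class never needs to call a lower-priority one itself.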

#define sched_class_highest (&stop_sched_class)
#define for_each_class(class) \
for (class = sched_class_highest; class; class = class->next)
That is, stop_sched_class is the highest-priority scheduling class.

static const struct sched_class stop_sched_class = {
.next = &rt_sched_class,

static const struct sched_class rt_sched_class = {
.next = &fair_sched_class,

static const struct sched_class fair_sched_class = {
.next = &idle_sched_class,

static const struct sched_class idle_sched_class = {
/* .next is NULL */
/* no enqueue/yield_task for idle tasks */

This is a statically built linked list.

5. Time slices and real-time tasks
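
The same chain can be reproduced in user space to see the fall-through behaviour. This is a sketch under assumptions: model_rq, model_class and the pick_* helpers are hypothetical, and each class reports only whether it has a task instead of returning a task_struct.

```c
#include <assert.h>
#include <stddef.h>

enum class_id { CLASS_NONE = 0, CLASS_STOP, CLASS_RT, CLASS_FAIR, CLASS_IDLE };

/* Hypothetical run queue: just a task count per class. */
struct model_rq {
    int nr_stop, nr_rt, nr_fair;
};

struct model_class {
    const struct model_class *next;  /* next lower-priority class */
    enum class_id (*pick)(const struct model_rq *rq);
};

static enum class_id pick_stop(const struct model_rq *rq) { return rq->nr_stop ? CLASS_STOP : CLASS_NONE; }
static enum class_id pick_rt(const struct model_rq *rq)   { return rq->nr_rt ? CLASS_RT : CLASS_NONE; }
static enum class_id pick_fair(const struct model_rq *rq) { return rq->nr_fair ? CLASS_FAIR : CLASS_NONE; }
static enum class_id pick_idle(const struct model_rq *rq) { (void)rq; return CLASS_IDLE; }

/* Static chain, defined lowest priority first so each can point at the next. */
static const struct model_class idle_class = { NULL, pick_idle };
static const struct model_class fair_class = { &idle_class, pick_fair };
static const struct model_class rt_class   = { &fair_class, pick_rt };
static const struct model_class stop_class = { &rt_class, pick_stop };

/* Walk the chain from the highest class; the first class with a task wins. */
enum class_id model_pick_next_task(const struct model_rq *rq)
{
    for (const struct model_class *c = &stop_class; c; c = c->next) {
        enum class_id picked = c->pick(rq);
        if (picked != CLASS_NONE)
            return picked;
    }
    assert(0 && "the idle class always has a runnable task");
    return CLASS_NONE;
}
```

Because the loop does the fall-through, no pick function has to know which class comes after it.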

static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
  update_curr_rt(rq);

  watchdog(rq, p);

  /*
   * RR tasks need a special form of timeslice management.
   * FIFO tasks have no timeslices.
   */
  if (p->policy != SCHED_RR)
    return;

  if (--p->rt.time_slice)
    return;

  p->rt.time_slice = DEF_TIMESLICE;

  /*
   * Requeue to the end of queue if we are not the only element
   * on the queue:
   */
  if (p->rt.run_list.prev != p->rt.run_list.next) {
    requeue_task_rt(rq, p, 0);
    set_tsk_need_resched(p);
  }
}

Because the fair class allocates the CPU by time, when a timer interrupt arrives it makes its decision in units of time:

static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
unsigned long ideal_runtime, delta_exec;

ideal_runtime = sched_slice(cfs_rq, curr);
/* real time the task has run since it was last picked, from the accumulated runtime counters */
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
if (delta_exec > ideal_runtime) {
  resched_task(rq_of(cfs_rq)->curr);
  /*
   * The current task ran long enough, ensure it doesn't get
   * re-elected due to buddy favours.
   */
  clear_buddies(cfs_rq, curr);
  return;
}
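
The comparison in check_preempt_tick comes down to one subtraction. The sketch below isolates it; model_entity and model_should_preempt are hypothetical names, and ideal_runtime_ns is passed in as a plain number because the computation done by sched_slice is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced scheduling entity, times in nanoseconds. */
struct model_entity {
    uint64_t sum_exec_runtime;       /* total time spent on the CPU */
    uint64_t prev_sum_exec_runtime;  /* the total when it was last picked */
};

/* True when the current run has exceeded its ideal share. */
bool model_should_preempt(const struct model_entity *curr, uint64_t ideal_runtime_ns)
{
    assert(curr->sum_exec_runtime >= curr->prev_sum_exec_runtime);
    uint64_t delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
    return delta_exec > ideal_runtime_ns;
}
```

A task that has run exactly its ideal share is not yet preempted, since the comparison is strict.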

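A real-time task, on the other hand, cannot be preempted by ordinary tasks, so its round-robin is driven by counting timer ticks: each tick decrements time_slice, and the slice is over when the counter reaches zero, as task_tick_rt does. The round-robin applies only among tasks of the same priority. Only after a SCHED_RR thread has used up its time slice does it yield to other real-time tasks of that priority, including real-time tasks that are not SCHED_RR.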

/*
 * default timeslice is 100 msecs (used only for SCHED_RR tasks).
 * Timeslices get refilled after they expire.
 */
#define DEF_TIMESLICE (100 * HZ / 1000) /* a count of timer ticks, not a frequency: HZ ticks per second, so this is 100 ms worth of ticks */
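
A short user-space sketch shows both the arithmetic behind DEF_TIMESLICE and the tick countdown. The names are hypothetical (model_rr_task, model_task_tick_rr, def_timeslice_ticks), and HZ = 250 is only an assumed example value.

```c
#include <assert.h>

#define MODEL_HZ 250                                  /* assumed tick rate */
#define MODEL_DEF_TIMESLICE (100 * MODEL_HZ / 1000)   /* ticks in 100 ms */

/* Ticks that make up 100 ms for a given tick rate. */
unsigned int def_timeslice_ticks(unsigned int hz)
{
    return 100 * hz / 1000;
}

/* Hypothetical reduced SCHED_RR task. */
struct model_rr_task {
    unsigned int time_slice;  /* ticks left in the current slice */
    int requeues;             /* times the task went to the back of its queue */
};

/*
 * One timer tick for a SCHED_RR task. Returns 1 if the slice expired on
 * this tick, 0 otherwise. alone_on_queue models the "only element on the
 * queue" test: a lone task gets a fresh slice but is not requeued.
 */
int model_task_tick_rr(struct model_rr_task *p, int alone_on_queue)
{
    assert(p->time_slice > 0);
    if (--p->time_slice)
        return 0;
    p->time_slice = MODEL_DEF_TIMESLICE;
    if (!alone_on_queue)
        p->requeues++;
    return 1;
}
```

At the assumed 250 ticks per second a slice is 25 ticks, so the 25th tick is the one that refills the slice and sends the task to the back of its queue.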
