Linux Kernel: The Clock Interrupt

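Among all external interrupts, the clock interrupt plays a special role, one that goes far beyond simple timekeeping. Of course, timekeeping alone is already important enough. To take just one example, without correct time relationships the make tool you use to rebuild the kernel would no longer work properly, because make relies on timestamps to decide whether something needs to be recompiled and relinked. But the importance of the clock interrupt goes well beyond that.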

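As we saw in the post on interrupts, every time the kernel finishes servicing an interrupt (or a system call or exception) and is about to return to user space, it checks whether rescheduling is needed and, if so, performs a process switch. In fact, scheduling can only happen while the CPU is running in the kernel. In the posts on processes, the reader will see that scheduling occurs in two situations. One is voluntary: a process makes a system call such as sleep, or it enters the kernel through some other system call, is blocked for some reason and has to wait, and so voluntarily lets the kernel schedule another process. The other is forced: when a process has run continuously for longer than a certain limit, the kernel forcibly schedules other processes to run. Without a clock, the kernel would lose both the basis and the opportunity for time-based forced scheduling, and could only rely on the good manners of each process. Imagine a process caught in an infinite loop in user space, with no system call inside the loop and no device interrupt occurring. Without the clock interrupt, the whole system would spin in place and get nothing done, because no scheduling would ever take place while the process clinging to the CPU stays stuck in its loop. Taking a step back, even if we had other criteria (process priority, for example) for deciding whether to schedule, an interrupt, exception, or system call would still be needed to bring the CPU into the kernel before scheduling could happen. And the only event that can be predicted to occur within a bounded time is the clock interrupt. So for a time-sharing system such as Linux, the clock interrupt is a precondition for staying alive; no wonder people call it the heartbeat.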

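During initialization, once the infrastructure for external interrupts (that is, the IRQ queues) and the scheduling mechanism have both been initialized, it is the clock interrupt's turn. Here is the relevant fragment of start_kernel in init/main.c: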

	trap_init();
	init_IRQ();
	sched_init();
	time_init();

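This ordering also shows how closely the clock interrupt and scheduling are tied together. As mentioned before, once clock interrupts start arriving, scheduling may be required at any time, so the scheduling mechanism must be initialized first. The code of time_init is in arch/i386/kernel/time.c: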


void __init time_init(void)
{
	extern int x86_udelay_tsc;
	
	xtime.tv_sec = get_cmos_time();
	xtime.tv_usec = 0;

/*
 * If we have APM enabled or the CPU clock speed is variable
 * (CPU stops clock on HLT or slows clock to save power)
 * then the TSC timestamps may diverge by up to 1 jiffy from
 * 'real time' but nothing will break.
 * The most frequent case is that the CPU is "woken" from a halt
 * state by the timer interrupt itself, so we get 0 error. In the
 * rare cases where a driver would "wake" the CPU and request a
 * timestamp, the maximum error is < 1 jiffy. But timestamps are
 * still perfectly ordered.
 * Note that the TSC counter will be reset if APM suspends
 * to disk; this won't break the kernel, though, 'cuz we're
 * smart.  See arch/i386/kernel/apm.c.
 */
 	/*
 	 *	Firstly we have to do a CPU check for chips with
 	 * 	a potentially buggy TSC. At this point we haven't run
 	 *	the ident/bugs checks so we must run this hook as it
 	 *	may turn off the TSC flag.
 	 *
 	 *	NOTE: this doesnt yet handle SMP 486 machines where only
 	 *	some CPU's have a TSC. Thats never worked and nobody has
 	 *	moaned if you have the only one in the world - you fix it!
 	 */
 
 	dodgy_tsc();
 	
	if (cpu_has_tsc) {
		unsigned long tsc_quotient = calibrate_tsc();
		if (tsc_quotient) {
			fast_gettimeoffset_quotient = tsc_quotient;
			use_tsc = 1;
			/*
			 *	We could be more selective here I suspect
			 *	and just enable this for the next intel chips ?
			 */
			x86_udelay_tsc = 1;
#ifndef do_gettimeoffset
			do_gettimeoffset = do_fast_gettimeoffset;
#endif
			do_get_fast_time = do_gettimeofday;

			/* report CPU clock rate in Hz.
			 * The formula is (10^6 * 2^32) / (2^32 * 1 / (clocks/us)) =
			 * clock/second. Our precision is about 100 ppm.
			 */
			{	unsigned long eax=0, edx=1000;
				__asm__("divl %2"
		       		:"=a" (cpu_khz), "=d" (edx)
        	       		:"r" (tsc_quotient),
	                	"0" (eax), "1" (edx));
				printk("Detected %lu.%03lu MHz processor.\n", cpu_khz / 1000, cpu_khz % 1000);
			}
		}
	}

#ifdef CONFIG_VISWS
	printk("Starting Cobalt Timer system clock\n");

	/* Set the countdown value */
	co_cpu_write(CO_CPU_TIMEVAL, CO_TIME_HZ/HZ);

	/* Start the timer */
	co_cpu_write(CO_CPU_CTRL, co_cpu_read(CO_CPU_CTRL) | CO_CTRL_TIMERUN);

	/* Enable (unmask) the timer interrupt */
	co_cpu_write(CO_CPU_CTRL, co_cpu_read(CO_CPU_CTRL) & ~CO_CTRL_TIMEMASK);

	/* Wire cpu IDT entry to s/w handler (and Cobalt APIC to IDT) */
	setup_irq(CO_IRQ_TIMER, &irq0);
#else
	setup_irq(0, &irq0);
#endif
}

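When we speak of the system clock, we actually mean one of two global variables in the kernel. One is the data structure xtime, of type struct timeval, defined as follows: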

struct timeval {
	time_t		tv_sec;		/* seconds */
	suseconds_t	tv_usec;	/* microseconds */
};

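This structure records absolute time, measured from a fixed moment in the past. Its value comes from a CMOS chip in the computer, commonly called the real-time clock. The chip is battery powered, so it keeps correct time even when the machine is switched off. The statement xtime.tv_sec = get_cmos_time() at the top of time_init reads the current time from the CMOS clock chip into xtime, with a precision of one second. The clock interrupt itself, however, is generated by a different chip.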
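To make the step from CMOS registers to xtime.tv_sec more concrete, here is a minimal sketch under stated assumptions. It assumes the clock chip reports its fields in packed BCD (common for PC real-time clocks, but not stated in the text above), and the helper names bcd_to_bin and date_to_seconds are hypothetical; this is not the code of get_cmos_time itself.

```c
#include <stdint.h>

/* Convert one packed-BCD byte (for example 0x59) to binary (59).
 * Assumption: the RTC is running in BCD mode. */
unsigned int bcd_to_bin(uint8_t bcd)
{
    return (bcd & 0x0f) + (bcd >> 4) * 10u;
}

/* Seconds since 1970-01-01 00:00:00 for a Gregorian calendar date.
 * The month is shifted so that February comes last in the "year",
 * which puts the leap day at the end and keeps the formula simple. */
unsigned long date_to_seconds(unsigned int year, unsigned int mon,
                              unsigned int day, unsigned int hour,
                              unsigned int min, unsigned int sec)
{
    unsigned long days;

    if (mon <= 2) {         /* Jan, Feb -> months 11, 12 of previous year */
        mon += 10;
        year -= 1;
    } else {                /* Mar..Dec -> months 1..10 */
        mon -= 2;
    }
    days = (unsigned long)(year / 4 - year / 100 + year / 400
                           + 367 * mon / 12 + day)
           + year * 365UL - 719499UL;
    return ((days * 24 + hour) * 60 + min) * 60 + sec;
}
```

Feeding the decoded RTC fields into date_to_seconds yields a value with one-second resolution, which is why tv_usec simply starts out at 0.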

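The other global variable is an unsigned integer called jiffies, which counts the clock interrupts that have occurred since boot. The length of one jiffy is the period of the clock interrupt, sometimes also called a tick, and is determined by the system constant HZ, defined in include/asm-i386/param.h. As the reader will see later, jiffies is far more important inside the kernel than xtime, and is used all the time.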
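The relationship between HZ, the tick length, and the range of a tick counter can be worked through in a few lines. The following is an illustrative sketch, not kernel code: it assumes HZ is 100 and models the counter as a 32-bit value, and the function names are hypothetical.

```c
#include <stdint.h>

#define HZ 100  /* assumed ticks per second */

/* Length of a number of ticks in milliseconds (10 ms per tick at HZ=100). */
uint32_t ticks_to_ms(uint32_t ticks)
{
    return ticks * (1000u / HZ);
}

/* Whole days until a 32-bit tick counter wraps around. */
uint32_t wrap_days(void)
{
    return (uint32_t)((UINT64_C(1) << 32) / HZ / 86400);
}

/* Nonzero if tick value a is later than tick value b. Comparing the
 * signed difference keeps the answer right across a wrap, as long as
 * the two values are less than half the counter range apart. */
int tick_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}
```

Under these assumptions a 32-bit counter wraps after roughly 497 days, which is why comparisons of tick values are better done on differences than with a plain a > b.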

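Many factors in the system affect the timing accuracy of the clock interrupt, so a number of measures are needed to correct for them. Newer i386 CPUs (mainly the Pentium and later) also provide a special 64-bit register called the time stamp counter, or TSC. This counter counts the clock pulses that drive the CPU; for example, if the CPU clock runs at 500 MHz, the TSC has a resolution of 2 ns. Because the TSC is 64 bits wide, it would have to run continuously for more than a thousand years before overflowing. Clearly, TSC readings can be used to improve the precision of the clock. Here, however, we are not concerned with time precision, so we skip the related parts of the code and focus only on the essentials.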
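The TSC figures above, and the inline-assembly division in time_init, can be checked with ordinary arithmetic. This sketch follows the formula in the time_init comment, on the assumption that tsc_quotient equals 2^32 divided by the number of TSC clocks per microsecond; the function names are hypothetical.

```c
#include <stdint.h>

/* Equivalent of the "divl" in time_init: the 64-bit value with
 * edx = 1000 and eax = 0 is 1000 * 2^32, and dividing it by
 * tsc_quotient gives the CPU clock in kHz. Returns 0 if calibration
 * failed (quotient of 0), mirroring the if (tsc_quotient) guard. */
uint32_t khz_from_quotient(uint32_t tsc_quotient)
{
    if (tsc_quotient == 0)
        return 0;
    return (uint32_t)((UINT64_C(1000) << 32) / tsc_quotient);
}

/* Resolution of the TSC in nanoseconds for a given clock in MHz. */
double tsc_resolution_ns(double mhz)
{
    return 1000.0 / mhz;
}

/* Years of continuous running before a 64-bit counter overflows. */
double tsc_wrap_years(double mhz)
{
    double seconds = 18446744073709551616.0 / (mhz * 1e6); /* 2^64 / Hz */
    return seconds / (365.25 * 86400.0);
}
```

For a 500 MHz clock the quotient is 2^32 / 500 = 8589934 (truncated), which maps back to 500000 kHz, and the overflow time comes out at a little under 1200 years.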

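The reader has already met setup_irq in the post on interrupts and may want to look back at it. Its first argument is the interrupt request number, which is 0 for the clock interrupt. The second argument is a pointer to irq0, an irqaction data structure that is also defined in time.c: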

static struct irqaction irq0  = { timer_interrupt, SA_INTERRUPT, 0, "timer", NULL, NULL};

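So the service routine for the clock interrupt is timer_interrupt. Interrupt request 0 is dedicated to the clock, because the SA_SHIRQ bit in irq0.flags is 0, and interrupts are not allowed while timer_interrupt is running, because the SA_INTERRUPT bit is 1. The code of timer_interrupt is in the same file: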


/*
 * This is the same as the above, except we _also_ save the current
 * Time Stamp Counter value at the time of the timer interrupt, so that
 * we later on can estimate the time of day more exactly.
 */
static void timer_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	int count;

	/*
	 * Here we are in the timer irq handler. We just have irqs locally
	 * disabled but we don't know if the timer_bh is running on the other
	 * CPU. We need to avoid to SMP race with it. NOTE: we don' t need
	 * the irq version of write_lock because as just said we have irq
	 * locally disabled. -arca
	 */
	write_lock(&xtime_lock);

	if (use_tsc)
	{
		/*
		 * It is important that these two operations happen almost at
		 * the same time. We do the RDTSC stuff first, since it's
		 * faster. To avoid any inconsistencies, we need interrupts
		 * disabled locally.
		 */

		/*
		 * Interrupts are just disabled locally since the timer irq
		 * has the SA_INTERRUPT flag set. -arca
		 */
	
		/* read Pentium cycle counter */

		rdtscl(last_tsc_low);

		spin_lock(&i8253_lock);
		outb_p(0x00, 0x43);     /* latch the count ASAP */

		count = inb_p(0x40);    /* read the latched count */
		count |= inb(0x40) << 8;
		spin_unlock(&i8253_lock);

		count = ((LATCH-1) - count) * TICK_SIZE;
		delay_at_last_interrupt = (count + LATCH/2) / LATCH;
	}
 
	do_timer_interrupt(irq, NULL, regs);

	write_unlock(&xtime_lock);

}

Here we are not concerned with multiprocessor (SMP) systems, nor with time precision, so all that really remains is the call to do_timer_interrupt:

timer_interrupt=>do_timer_interrupt


/*
 * timer_interrupt() needs to keep up the real-time clock,
 * as well as call the "do_timer()" routine every clocktick
 */
static inline void do_timer_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
#ifdef CONFIG_X86_IO_APIC
	if (timer_ack) {
		/*
		 * Subtle, when I/O APICs are used we have to ack timer IRQ
		 * manually to reset the IRR bit for do_slow_gettimeoffset().
		 * This will also deassert NMI lines for the watchdog if run
		 * on an 82489DX-based system.
		 */
		spin_lock(&i8259A_lock);
		outb(0x0c, 0x20);
		/* Ack the IRQ; AEOI will end it automatically. */
		inb(0x20);
		spin_unlock(&i8259A_lock);
	}
#endif

#ifdef CONFIG_VISWS
	/* Clear the interrupt */
	co_cpu_write(CO_CPU_STAT,co_cpu_read(CO_CPU_STAT) & ~CO_STAT_TIMEINTR);
#endif
	do_timer(regs);
/*
 * In the SMP case we use the local APIC timer interrupt to do the
 * profiling, except when we simulate SMP mode on a uniprocessor
 * system, in that case we have to call the local interrupt handler.
 */
#ifndef CONFIG_X86_LOCAL_APIC
	if (!user_mode(regs))
		x86_do_profile(regs->eip);
#else
	if (!smp_found_config)
		smp_local_timer_interrupt(regs);
#endif

	/*
	 * If we have an externally synchronized Linux clock, then update
	 * CMOS clock accordingly every ~11 minutes. Set_rtc_mmss() has to be
	 * called as close as possible to 500 ms before the new second starts.
	 */
	if ((time_status & STA_UNSYNC) == 0 &&
	    xtime.tv_sec > last_rtc_update + 660 &&
	    xtime.tv_usec >= 500000 - ((unsigned) tick) / 2 &&
	    xtime.tv_usec <= 500000 + ((unsigned) tick) / 2) {
		if (set_rtc_mmss(xtime.tv_sec) == 0)
			last_rtc_update = xtime.tv_sec;
		else
			last_rtc_update = xtime.tv_sec - 600; /* do it again in 60 s */
	}
	    
#ifdef CONFIG_MCA
	if( MCA_bus ) {
		/* The PS/2 uses level-triggered interrupts.  You can't
		turn them off, nor would you want to (any attempt to
		enable edge-triggered interrupts usually gets intercepted by a
		special hardware circuit).  Hence we have to acknowledge
		the timer interrupt.  Through some incredibly stupid
		design idea, the reset for IRQ 0 is done by setting the
		high bit of the PPI port B (0x61).  Note that some PS/2s,
		notably the 55SX, work fine if this is removed.  */

		irq = inb_p( 0x61 );	/* read the current state */
		outb_p( irq|0x80, 0x61 );	/* reset the IRQ */
	}
#endif
}

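Likewise, we are not concerned here with the special handling needed when an APIC is used on SMP systems, nor with the special cases of the SGI workstation (the CONFIG_VISWS block) and the PS/2 Micro Channel (the CONFIG_MCA block). We also skip the part that deals with clock precision (the block that periodically writes the time back to the CMOS clock through set_rtc_mmss).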

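That leaves only two things: do_timer and x86_do_profile. The purpose of x86_do_profile is to accumulate profiling statistics, which is not our focus either. So in the end only do_timer remains, and it is in kernel/timer.c: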

timer_interrupt=>do_timer_interrupt=>do_timer


void do_timer(struct pt_regs *regs)
{
	(*(unsigned long *)&jiffies)++;
#ifndef CONFIG_SMP
	/* SMP process accounting uses the local APIC timer */

	update_process_times(user_mode(regs));
#endif
	mark_bh(TIMER_BH);
	if (TQ_ACTIVE(tq_timer))
		mark_bh(TQUEUE_BH);
}

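The first statement increments jiffies. Why write it in this odd way instead of a plain jiffies++? The explanation given is that the author wants the increment carried out in a single instruction, so that it is an atomic operation. gcc translates this statement into one INC instruction on the memory location. With jiffies++, it might instead be compiled into a MOV of jiffies into register EAX, an increment, and a MOV back. The two cost almost the same number of CPU clock cycles, but the former cannot be split by an interrupt. Two caveats are worth adding. First, what the cast does at the language level is strip the volatile qualifier from jiffies, which is what frees gcc to use the memory-operand form. Second, a single instruction without a LOCK prefix is atomic only with respect to interrupts on the same CPU, not with respect to other processors.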

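The function update_process_times has to do with process scheduling, and we will come back to it in the posts on scheduling. Its name already suggests what it does: it updates the time-related variables of the current process, partly for accounting and partly for scheduling. Operating on these timekeeping and accounting variables can be regarded as the top half of the clock interrupt. The two mark_bh calls that follow, however, arrange for the clock interrupt's bottom half and its second job, and those consume far more effort.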
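As a rough illustration of how per-tick accounting can drive forced scheduling, consider the following sketch. It is a simplified model, not the kernel's update_process_times: the structure task_model and the function tick_account are hypothetical, and only the idea of a per-process tick budget that triggers a reschedule request is taken from the discussion above.

```c
/* Simplified stand-in for the scheduling-related fields of a process. */
struct task_model {
    long counter;         /* ticks left in the current time slice */
    int need_resched;     /* set when the scheduler should run */
    unsigned long utime;  /* ticks spent in user mode */
    unsigned long stime;  /* ticks spent in kernel mode */
};

/* Called once per clock tick for the process that was running. */
void tick_account(struct task_model *p, int user_mode)
{
    /* Accounting side: charge the tick to user or system time. */
    if (user_mode)
        p->utime++;
    else
        p->stime++;

    /* Scheduling side: use up the time slice, and once it is gone
     * ask for a reschedule on the way back to user space. */
    if (--p->counter <= 0) {
        p->counter = 0;
        p->need_resched = 1;
    }
}
```

A process stuck in a user-space loop never calls into the kernel voluntarily, but in this model it still loses the CPU after counter ticks, because the tick handler sets need_resched on its behalf.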

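Earlier posts introduced the bottom half of an interrupt service routine, the bh. Before returning from an interrupt, the CPU always checks whether work is waiting in some bh queue. The first mark_bh call in do_timer puts bh_task_vec[TIMER_BH] on the tasklet_hi_vec queue, so that before the interrupt returns the CPU executes timer_bh, the function associated with TIMER_BH. That association is set up in advance by three important lines in sched_init in kernel/sched.c: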

	init_bh(TIMER_BH, timer_bh);
	init_bh(TQUEUE_BH, tqueue_bh);
	init_bh(IMMEDIATE_BH, immediate_bh);

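Three bh are initialized here. The first is obviously the one to be executed before every clock interrupt ends. It carries out operations that logically belong to clock interrupt service but are not so urgent, or that can be done in a more relaxed environment (with interrupts enabled). Its function is timer_bh. TQUEUE_BH and IMMEDIATE_BH, in turn, are two important pieces of kernel infrastructure.

We said before that the Linux kernel allows at most 32 bh. The reader may already be wondering whether 32 is enough, and what to do if more are needed. More importantly, in practice there is often a need to attach operations dynamically to an existing interrupt service, so that they are hooked at run time onto some interrupt or even some other kind of event. Suppose, for example, that we are writing a driver for an external device that requires its status register to be read every 20 ms, some computation to be done on what was read, and the result to be written to its control register to drive a stepper motor, and that the device cannot generate interrupts. Since control of this device is entirely periodic, it does not need its own interrupt anyway. All we need is a way to hook into the system clock interrupt. As noted earlier, the frequency of the Linux system clock is determined by the constant HZ, usually defined as 100, which means one clock interrupt every 10 ms, and the required 20 ms is an exact multiple of that. So we can write a routine that is called once on every clock interrupt and keeps a counter: when the count is even it samples the data, and when it is odd it computes and writes the output. That solves the problem.

But how do we get the clock interrupt to call the routine every time? TQUEUE_BH exists for exactly this purpose. The global variable tq_timer points to a queue. To have the system call some function (in kernel space, of course) on each clock interrupt, hook it onto this queue. The TQ_ACTIVE(tq_timer) test in do_timer checks whether the queue is empty. If it is not, mark_bh also puts bh_task_vec[TQUEUE_BH] on the tasklet_hi_vec queue, so that when the kernel runs the bh it goes through tqueue_bh and calls every function in the queue. TQUEUE_BH is thus a very important piece of infrastructure. Besides the tq_timer queue hooked to the clock, there are other bh with their own queues, IMMEDIATE_BH among them. We will cover the details in the series on processes and device drivers. If the top half timer_interrupt and the bottom half timer_bh are the clock interrupt's regular job, then running tqueue_bh is its second job.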
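The stepper-motor scenario can be modelled in user space to show the shape of the idea. The sketch below is an illustration only: tick_queue, tick_queue_add, tick_queue_run, and the stepper names are all hypothetical, and the kernel's real task-queue interface differs in detail (for instance in how entries are queued and removed).

```c
#include <stddef.h>

/* One entry in a queue of routines to be called on every tick. */
struct tick_task {
    void (*routine)(void *);
    void *data;
    struct tick_task *next;
};

static struct tick_task *tick_queue;  /* plays the role of tq_timer */

/* Hook a routine onto the per-tick queue. */
void tick_queue_add(struct tick_task *t)
{
    t->next = tick_queue;
    tick_queue = t;
}

/* What the queue's bottom half would do: call every queued routine. */
void tick_queue_run(void)
{
    for (struct tick_task *t = tick_queue; t != NULL; t = t->next)
        t->routine(t->data);
}

/* State for the imaginary stepper-motor device. */
struct stepper_state {
    unsigned long calls;
    int samples;   /* times the status register would have been read */
    int outputs;   /* times the control register would have been written */
};

/* Called every 10 ms tick; a full sample/output cycle takes 20 ms. */
void stepper_tick(void *data)
{
    struct stepper_state *s = data;

    if (s->calls % 2 == 0)
        s->samples++;   /* even tick: sample the device */
    else
        s->outputs++;   /* odd tick: compute and drive the motor */
    s->calls++;
}
```

In this model the driver needs no interrupt of its own: it registers stepper_tick once, and each simulated tick advances the sample/output cycle by one step.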

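With these preparations made, the top half of clock interrupt service is done. But as the reader saw in the post on interrupts, on its way back and before leaving do_IRQ, the CPU first turns into do_softirq to take care of the bottom half and the second job. In our scenario timer_bh is certain to run, while tqueue_bh runs only if the tq_timer queue is not empty. The reader may ask: since timer_bh always runs, why not simply execute it inside do_timer instead of going to all this trouble? First, as we have seen, interrupts are disabled during the whole of timer_interrupt (see the SA_INTERRUPT flag above), whereas timer_bh has no such strict requirement. Second, the code of do_IRQ shows that executions of a specific interrupt service routine and executions of do_IRQ are not in one-to-one correspondence. The specific service routine is executed inside a loop, while do_softirq is executed only once. So when several interrupts arrive back to back on the same interrupt channel, the execution of do_softirq, and hence of timer_bh, is deferred and merged.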
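The deferral and merging described above can be illustrated with a small model. This is a deliberate simplification: it uses a 32-bit pending mask rather than the tasklet queues that the text says this kernel uses, and all names (model_init_bh, model_mark_bh, model_run_bh, model_timer_bh) are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define NR_BH 32  /* the text gives 32 as the number of possible bh */

static void (*bh_handlers[NR_BH])(void);
static uint32_t bh_pending;

int timer_bh_runs;  /* counts executions of the model bottom half */

void model_timer_bh(void)
{
    timer_bh_runs++;
}

/* Register the function for a bottom-half slot. */
void model_init_bh(int nr, void (*fn)(void))
{
    bh_handlers[nr] = fn;
}

/* Top half: just note that the bottom half has work to do. */
void model_mark_bh(int nr)
{
    bh_pending |= UINT32_C(1) << nr;
}

/* On the way out of interrupt handling: run each marked bh once. */
void model_run_bh(void)
{
    uint32_t active = bh_pending;

    bh_pending = 0;
    for (int nr = 0; nr < NR_BH; nr++)
        if ((active & (UINT32_C(1) << nr)) && bh_handlers[nr] != NULL)
            bh_handlers[nr]();
}
```

Marking the same slot three times before model_run_bh is called still results in a single execution, which is the merging effect: the bottom half must therefore work out for itself how many ticks have elapsed.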

The code of timer_bh, the function associated with TIMER_BH, is as follows:

void timer_bh(void)
{
	update_times();
	run_timer_list();
}

First, update_times:

timer_bh=>update_times


static inline void update_times(void)
{
	unsigned long ticks;

	/*
	 * update_times() is run from the raw timer_bh handler so we
	 * just know that the irqs are locally enabled and so we don't
	 * need to save/restore the flags of the local CPU here. -arca
	 */
	write_lock_irq(&xtime_lock);

	ticks = jiffies - wall_jiffies;
	if (ticks) {
		wall_jiffies += ticks;
		update_wall_time(ticks);
	}
	write_unlock_irq(&xtime_lock);
	calc_load(ticks);
}

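Two things are done here. The first is update_wall_time, which maintains the value in xtime, the so-called real-time clock or wall clock, including counting, carrying, and corrections made for the sake of precision. It consists mainly of numerical computation, so we will not go into it. Like jiffies, wall_jiffies is a global variable. It holds the jiffies value that corresponds to the current contents of xtime, that is, the point on the time axis up to which the wall clock has been brought up to date.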
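The catch-up logic in update_times, where the wall clock is advanced by however many ticks have accumulated, can be sketched as follows. This is a minimal model assuming HZ is 100 and ignoring all precision corrections; struct wall_clock and wall_clock_catch_up are hypothetical names, not kernel interfaces.

```c
#define HZ 100                           /* assumed ticks per second */
#define USEC_PER_TICK (1000000L / HZ)    /* 10000 us per tick */

struct wall_clock {
    long tv_sec;
    long tv_usec;
    unsigned long wall_ticks;  /* counterpart of wall_jiffies */
};

/* Advance the wall clock to match the tick counter value now_ticks. */
void wall_clock_catch_up(struct wall_clock *w, unsigned long now_ticks)
{
    /* Several ticks may have accumulated since the last run. */
    unsigned long ticks = now_ticks - w->wall_ticks;

    w->wall_ticks += ticks;
    while (ticks--) {
        w->tv_usec += USEC_PER_TICK;
        if (w->tv_usec >= 1000000L) {    /* carry into seconds */
            w->tv_usec -= 1000000L;
            w->tv_sec++;
        }
    }
}
```

Because the function works from the difference between the tick counter and wall_ticks, it gives the same result whether it is called once per tick or once after several merged ticks.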

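The second is calc_load, which computes and accumulates statistics on CPU load. Every 5 seconds the kernel recomputes the average number of processes that were runnable over the past 1, 5, and 15 minutes, as an indicator of how heavily the system is loaded. Since this too is mainly numerical computation, we will not go into it either.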
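For readers curious about the arithmetic, load averages of this kind are typically kept as exponentially decaying averages in fixed point. The sketch below assumes 11 fractional bits and decay constants equal to 2048 times e^(-5/60), e^(-5/300), and e^(-5/900); these constants and the function load_step are illustrative assumptions rather than a quotation of calc_load.

```c
#define FSHIFT   11                 /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)    /* 1.0 in fixed point (2048) */
#define EXP_1    1884UL             /* 2048 * e^(-5s/1min), rounded */
#define EXP_5    2014UL             /* 2048 * e^(-5s/5min), rounded */
#define EXP_15   2037UL             /* 2048 * e^(-5s/15min), rounded */

/* One 5-second update step: blend the old average with the current
 * number of runnable processes, weighted by the decay factor. */
unsigned long load_step(unsigned long load, unsigned long decay,
                        unsigned long n_running)
{
    unsigned long n = n_running * FIXED_1;

    return (load * decay + n * (FIXED_1 - decay)) >> FSHIFT;
}
```

Starting from an idle system with one process becoming runnable, the 1-minute figure moves about 8 percent of the way toward 1.0 on each step, while the 15-minute figure moves by only about half a percent, which is what makes it the smoother of the three.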

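After update_times returns comes run_timer_list, the main part of timer_bh. It checks the timers that have been set up in the system, and if a timer has expired it executes the function scheduled for it (this is that timer's bh function). We will describe how timers are set up in the posts on processes and process scheduling, and come back to read the code of run_timer_list then.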

Each timer is represented by a timer_list data structure, defined as follows:

struct timer_list {
	struct list_head list;
	unsigned long expires;
	unsigned long data;
	void (*function)(unsigned long);
};

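This is a data structure meant to be linked into a list. The length of the list is dynamic and unbounded, so there is no limit on the number of timers that can be set in the system (early implementations used an array and were limited by its size). Each timer has an expiry time, expires. The function pointer function points to the bh function to be executed at expiry, and it can take one argument, data (early implementations could not pass an argument). As stated above, interrupts are enabled while bh functions are executing.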
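To show how a list of such structures might be used, here is a small user-space model. It is a sketch only: it keeps timers on a plain singly linked list and scans it linearly, whereas the kernel's actual organization of timers is more elaborate, and the names model_timer, model_add_timer, model_run_timers, and record_fire are hypothetical.

```c
#include <stddef.h>

/* Simplified counterpart of struct timer_list. */
struct model_timer {
    struct model_timer *next;
    unsigned long expires;            /* tick count at which to fire */
    unsigned long data;               /* argument for the function */
    void (*function)(unsigned long);
};

static struct model_timer *timer_head;

unsigned long last_fired_data;  /* lets a caller observe what fired */

/* Example expiry function: remember the argument it was given. */
void record_fire(unsigned long data)
{
    last_fired_data = data;
}

/* Add a timer; the list can grow without a fixed limit. */
void model_add_timer(struct model_timer *t)
{
    t->next = timer_head;
    timer_head = t;
}

/* Unlink and fire every timer whose expiry has been reached.
 * Returns the number of timers fired. */
int model_run_timers(unsigned long now)
{
    int fired = 0;
    struct model_timer **pp = &timer_head;

    while (*pp != NULL) {
        struct model_timer *t = *pp;

        if ((long)(now - t->expires) >= 0) {  /* wrap-safe comparison */
            *pp = t->next;      /* take it off the list first */
            t->next = NULL;
            t->function(t->data);
            fired++;
        } else {
            pp = &t->next;
        }
    }
    return fired;
}
```

Each timer fires at most once in this model; a periodic activity would re-add its timer from within the expiry function with a new expires value.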

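As we can see, over the whole course of clock interrupt service most of the work is done in the bottom half, that is, in bh functions. Only a small number of critical operations are actually executed with interrupts disabled, while the bulk of the work is pushed, as far as possible, into a more relaxed environment: with interrupts enabled and with some leeway in timing. Only in this way can the impact on the system be kept to a minimum. On the one hand, this should be a guiding principle of systems programming, and of device drivers in particular. On the other hand, it places high demands on designers and developers, because deciding whether an operation must run in the top half, and whether it must run with interrupts disabled, requires a deep understanding of the system.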
