Thoughts on the first classic RCU implementation, based on kernel 2.5.43

I have been studying the RCU mechanism recently and wanted to understand it starting from its historical origin (tracing things back to the source often brings unexpected rewards; at the very least you can follow the thinking of the original authors through the code's evolution).

It turns out plenty of people online have had the same idea. I am particularly grateful for this article:

http://www.wowotech.net/kernel_synchronization/Linux-2-5-43-RCU.html

The article above is detailed and covers the material thoroughly; interested readers can go straight to it.

What follows are just some of my own thoughts.

Assume a 4-core system. RCU processing then shows up in the following places across the whole system (the four cores run concurrently, so the per-row alignment below is purely cosmetic, and call_rcu() is not necessarily invoked on every core):

cpu0            cpu1            cpu2            cpu3
tick interrupt  tick interrupt  tick interrupt  tick interrupt
  -->> rcu_pending
  -->> rcu_check_callbacks
tasklet         tasklet         tasklet         tasklet
  -->> rcu_process_callbacks

call_rcu(x)     call_rcu(x)     call_rcu(x)     call_rcu(x)
  -->> list_add_tail(&head->list, &RCU_nxtlist(cpu))
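For reference, the producer side in this version is tiny. Reconstructed from the 2.5.43 source as I remember it (treat it as a sketch rather than a verbatim quote), call_rcu() just records the callback and appends it to the calling CPU's nxtlist; everything else happens later in the per-CPU tasklet:

/*
 * Sketch of 2.5.43's call_rcu() (reconstructed, not a verbatim quote).
 * Interrupts are disabled so the list update cannot race with the RCU
 * tasklet running on this same CPU.
 */
void call_rcu(struct rcu_head *head, void (*func)(void *arg), void *arg)
{
	int cpu;
	unsigned long flags;

	head->func = func;
	head->arg = arg;
	local_irq_save(flags);
	cpu = smp_processor_id();
	list_add_tail(&head->list, &RCU_nxtlist(cpu));
	local_irq_restore(flags);
}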

Understanding the data structures

struct rcu_ctrlblk {
	spinlock_t	mutex;		/* Guard this struct                  */
	long		curbatch;	/* Current batch number.	      */
	long		maxbatch;	/* Max requested batch number.        */
	unsigned long	rcu_cpu_mask; 	/* CPUs that need to switch in order  */
					/* for current batch to proceed.      */
};
struct rcu_data {
	long		qsctr;		 /* User-mode/idle loop etc. */
        long            last_qsctr;	 /* value of qsctr at beginning */
                                         /* of rcu grace period */
        long  	       	batch;           /* Batch # for current RCU batch */
        struct list_head  nxtlist;
        struct list_head  curlist;
} ____cacheline_aligned_in_smp;
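All the RCU_xxx(cpu) accessors that appear in the snippets below are thin per-CPU wrappers around an NR_CPUS-sized array of rcu_data. Roughly (reconstructed from the 2.5.43 header, so take the exact spelling with a grain of salt):

/* Per-CPU RCU state; one entry per possible CPU. */
static struct rcu_data rcu_data[NR_CPUS] __cacheline_aligned;

#define RCU_qsctr(cpu)		(rcu_data[(cpu)].qsctr)
#define RCU_last_qsctr(cpu)	(rcu_data[(cpu)].last_qsctr)
#define RCU_batch(cpu)		(rcu_data[(cpu)].batch)
#define RCU_nxtlist(cpu)	(rcu_data[(cpu)].nxtlist)
#define RCU_curlist(cpu)	(rcu_data[(cpu)].curlist)

RCU_tasklet(cpu) similarly names the per-CPU tasklet that runs rcu_process_callbacks().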

Personally I think of rcu_ctrlblk as something like an orchestra conductor directing the individual players (the per-CPU rcu_data); each player has to watch the conductor's beat (the tick interrupt) and keep moving forward.

Each bit of rcu_cpu_mask stands for one rcu_data, i.e. one CPU: the bit stays set while that CPU still has to pass through a quiescent state before the current batch can complete.

curbatch is the batch currently being processed; it advances by one each time a grace period completes, and it is compared against each rcu_data's batch to decide when that CPU's callbacks may run.

My reading is that, until the grace period expires, a CPU's rcu_data batch is always exactly curbatch + 1.

maxbatch, too, is never more than curbatch + 1.

batch and maxbatch may therefore be equal. In the code below, when this CPU requests a new grace period it calls rcu_start_batch() to update maxbatch.

/*
		 * start the next batch of callbacks
		 */
		spin_lock(&rcu_ctrlblk.mutex);
		RCU_batch(cpu) = rcu_ctrlblk.curbatch + 1;
		rcu_start_batch(RCU_batch(cpu));
		spin_unlock(&rcu_ctrlblk.mutex);

static void rcu_start_batch(long newbatch)
{
	/* Remember the highest batch number requested so far. */
	if (rcu_batch_before(rcu_ctrlblk.maxbatch, newbatch)) {
		rcu_ctrlblk.maxbatch = newbatch;
	}
	/*
	 * Nothing to start if every requested batch has already completed,
	 * or if a grace period is still in flight (some CPUs have not yet
	 * cleared their bit in rcu_cpu_mask).
	 */
	if (rcu_batch_before(rcu_ctrlblk.maxbatch, rcu_ctrlblk.curbatch) ||
	    (rcu_ctrlblk.rcu_cpu_mask != 0)) {
		return;
	}
	/* Start a new grace period: wait for every online CPU. */
	rcu_ctrlblk.rcu_cpu_mask = cpu_online_map;
}
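To tie rcu_start_batch() back to the conductor picture: each time the per-CPU tasklet runs, it also checks whether this CPU has passed through a quiescent state (its qsctr has moved) since the grace period began; if so, it clears its bit in rcu_cpu_mask, and the last CPU to do so advances curbatch and tries to start the next batch. A simplified sketch, condensed from 2.5.43's rcu_check_quiescent_state() (the real code also tracks last_qsctr with an RCU_QSCTR_INVALID sentinel and re-checks the bit under the lock):

/*
 * Simplified sketch of the quiescent-state check run from the per-CPU
 * tasklet (condensed; not the verbatim 2.5.43 code).
 */
static void rcu_check_quiescent_state(void)
{
	int cpu = smp_processor_id();

	/* The current grace period is not waiting on this CPU. */
	if (!test_bit(cpu, &rcu_ctrlblk.rcu_cpu_mask))
		return;

	/* qsctr has not moved: no quiescent state seen yet. */
	if (RCU_qsctr(cpu) == RCU_last_qsctr(cpu))
		return;

	spin_lock(&rcu_ctrlblk.mutex);
	clear_bit(cpu, &rcu_ctrlblk.rcu_cpu_mask);
	if (rcu_ctrlblk.rcu_cpu_mask == 0) {
		/* Last CPU to report: this grace period is over. */
		rcu_ctrlblk.curbatch++;
		/* Start the next grace period if one was requested. */
		rcu_start_batch(rcu_ctrlblk.maxbatch);
	}
	spin_unlock(&rcu_ctrlblk.mutex);
}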

rcu_data

The batch value on each core may well differ. How large can the difference get? Is it bounded by the number of cores?

New callbacks are only ever appended to the nxtlist list; the callbacks on curlist are executed once their grace period has expired (see the sketch below).
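Each run of the per-CPU tasklet shuffles the two lists roughly as follows (a condensed sketch of 2.5.43's rcu_process_callbacks() with the irq-disable details trimmed; the batch-request fragment in the middle is the code already quoted above):

/*
 * Condensed sketch of the tasklet body; the real 2.5.43 code also
 * disables interrupts around the nxtlist splice.
 */
static void rcu_process_callbacks(unsigned long unused)
{
	int cpu = smp_processor_id();
	LIST_HEAD(list);

	/* 1. curlist's grace period has ended (curbatch moved past its
	 *    batch number): take those callbacks off for invocation. */
	if (!list_empty(&RCU_curlist(cpu)) &&
	    rcu_batch_after(rcu_ctrlblk.curbatch, RCU_batch(cpu))) {
		list_splice(&RCU_curlist(cpu), &list);
		INIT_LIST_HEAD(&RCU_curlist(cpu));
	}

	/* 2. If curlist is now empty, promote nxtlist to curlist and
	 *    request a new batch (curbatch + 1) -- the fragment quoted earlier. */
	if (!list_empty(&RCU_nxtlist(cpu)) && list_empty(&RCU_curlist(cpu))) {
		list_splice(&RCU_nxtlist(cpu), &RCU_curlist(cpu));
		INIT_LIST_HEAD(&RCU_nxtlist(cpu));

		spin_lock(&rcu_ctrlblk.mutex);
		RCU_batch(cpu) = rcu_ctrlblk.curbatch + 1;
		rcu_start_batch(RCU_batch(cpu));
		spin_unlock(&rcu_ctrlblk.mutex);
	}

	/* 3. Report a quiescent state for this CPU if one has occurred. */
	rcu_check_quiescent_state();

	/* 4. Finally invoke the callbacks whose grace period has ended. */
	if (!list_empty(&list))
		rcu_do_batch(&list);
}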

It would be even better if this could be verified with an actual test.

Bug: the condition hardirq_count() <= 1 should be hardirq_count() <= (1 << HARDIRQ_SHIFT); it was fixed in 2.5.45. Because rcu_check_callbacks() runs inside the timer's interrupt handler, hardirq_count() is already at least (1 << HARDIRQ_SHIFT) at that point, so the test hardirq_count() <= 1 can never be true and an idle CPU never has its qsctr bumped from the tick. The corrected test accepts exactly one level of hardirq nesting, i.e. the tick interrupting the idle loop, and still rejects a tick that interrupted some other interrupt.

void rcu_check_callbacks(int cpu, int user)
{
	if (user || 
	    (idle_cpu(cpu) && !in_softirq() && hardirq_count() <= 1))
		RCU_qsctr(cpu)++;
	tasklet_schedule(&RCU_tasklet(cpu));
}
ChangeLog-2.5.45:
<[email protected]>
	[PATCH] RCU idle detection fix
	
	Patch from Dipankar Sarma <[email protected]>
	
	There is a check in RCU for idle CPUs which signifies quiescent state
	(and hence no reference to RCU protected data) which was broken when
	interrupt counters were changed to use thread_info->preempt_count.
	
	Martin's 32 CPU machine with many idle CPUs was not completing any RCU
	grace period because RCU was forever waiting for idle CPUs to context
	switch.  Had the idle check worked, this would not have happened.  With
	no RCU happening, the dentries were getting "freed" (dentry stats
	showing that) but not getting returned to slab.  This would not show up
	in systems that are generally busy as context switches then would
	happen in all CPUs and the per-CPU quiescent state counter would get
	incremented during context switch.
patch-2.5.45:  
 void rcu_check_callbacks(int cpu, int user)
 {
 	if (user || 
-	    (idle_cpu(cpu) && !in_softirq() && hardirq_count() <= 1))
+	    (idle_cpu(cpu) && !in_softirq() && 
+				hardirq_count() <= (1 << HARDIRQ_SHIFT)))
 		RCU_qsctr(cpu)++;
 	tasklet_schedule(&RCU_tasklet(cpu));
 }
 

A few more related links:

Linux 2.6.11: the classic RCU implementation

http://www.wowotech.net/kernel_synchronization/linux2-6-11-RCU.html

The RCU author's classic page:

http://www2.rdrop.com/users/paulmck/RCU/

RCU-related articles published by the author:

https://lwn.net/Kernel/Index/#Read-copy-update

Reposted from blog.csdn.net/dean_gdp/article/details/85834531