The principle and implementation of the Kafka time wheel

Foreword

As a distributed stream processing platform that handles a large number of requests in real time, Kafka needs a well-designed timer to manage asynchronous tasks. Based on the source code of Kafka 1.1.0, this article introduces the time wheel, the basic data structure behind Kafka's timer, and explains its principle and implementation.

PS: For more front-line technical articles in a variety of formats, follow "360 Cloud Computing".

Simple time wheel

The simple time wheel is a circular list of timed-task buckets (called buckets below). Let u be the time unit. A time wheel of size n has n buckets and can hold timed tasks whose expiration times fall within a time interval of n * u. (Note: u and n keep this meaning below.)

Each bucket holds the timed tasks that fall into its time range. At the start, the first bucket holds tasks in [0, u), the second holds tasks in [u, 2u), ..., and the nth holds tasks in [u * (n - 1), u * n). Every time unit u, the timer ticks, moves to the next bucket, and expires all the timed tasks in it. The timer therefore never inserts a task into the bucket for the current time, because such a task has already expired; it runs the expired task immediately. The emptied bucket becomes available for the next round, so if the current bucket corresponds to time t, it becomes the bucket for [t + u * n, t + (n + 1) * u) after a tick.

In essence, the time wheel is a hash table: a task's expiration time is hashed to a bucket position. Each bucket is a doubly linked list, so inserting or deleting a timed task in the time wheel is O(1). By contrast, timers based on priority queues, such as java.util.concurrent.DelayQueue and java.util.Timer, take O(log n) to insert or delete.
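The hash-table view can be made concrete with a minimal sketch. This is not Kafka's code: the class SimpleWheel and its methods add, tick, and bucketIndex are hypothetical names chosen for illustration, and tasks are reduced to a name plus an expiration time.

```scala
import scala.collection.mutable

// A toy single-level time wheel. All names here are illustrative, not Kafka's.
final class SimpleWheel(u: Long, n: Int, startMs: Long = 0L) {
  // One bucket per time unit; each bucket stores (expirationMs, task name) pairs
  private val buckets = Array.fill(n)(mutable.ListBuffer.empty[(Long, String)])
  private var currentTime = startMs - (startMs % u)

  // The "hash function": expiration time -> bucket index
  def bucketIndex(expirationMs: Long): Int = ((expirationMs / u) % n).toInt

  // O(1) insert. Returns false when the task is already due or does not fit (overflow).
  def add(expirationMs: Long, name: String): Boolean =
    if (expirationMs < currentTime + u || expirationMs >= currentTime + u * n) false
    else { buckets(bucketIndex(expirationMs)) += ((expirationMs, name)); true }

  // Advance by one time unit and return the tasks of the bucket that just became current
  def tick(): List[String] = {
    currentTime += u
    val bucket = buckets(bucketIndex(currentTime))
    val expired = bucket.map(_._2).toList
    bucket.clear()
    expired
  }
}
```

With u = 1 and n = 4, a task expiring at time 2 lands in bucket 2 and is returned by the second call to tick, while a task expiring at time 4 is rejected as an overflow.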

Hierarchical time wheel

The main drawback of the simple time wheel is that it assumes every timer request falls within n * u of the current time. A request beyond this interval is an overflow and cannot be placed in the wheel. The hierarchical time wheel handles such overflows by organizing several time wheels in levels. The lowest level has the finest resolution, and the resolution becomes coarser at each higher level. Resolution here means the size of the time unit: if a wheel has time unit u and size n, the wheel one level up has time unit n * u.

For example, let u = 1 and n = 3, and let the starting time be c. The buckets at each level are then:

level	buckets	resolution
1	[c,c] [c+1,c+1] [c+2,c+2]	1
2	[c,c+2] [c+3,c+5] [c+6,c+8]	3
3	[c,c+8] [c+9,c+17] [c+18,c+26]	9

PS: The table uses the closed-interval notation from the code comments, whereas the earlier description used half-open intervals. The two are equivalent; only the notation differs.

At time c+1, the buckets [c,c], [c,c+2], and [c,c+8] have expired. Afterwards:

The clock of level 1 moves to c+1, and a new bucket [c+3,c+3] is created;
The clocks of levels 2 and 3 stay at c, because they advance in units of 3 and 9 respectively, so no new buckets are created at those levels.

At this time, the buckets at each level are:

level	buckets	resolution
1	[c+1,c+1] [c+2,c+2] [c+3,c+3]	1
2	[c,c+2] [c+3,c+5] [c+6,c+8]	3
3	[c,c+8] [c+9,c+17] [c+18,c+26]	9

Note that the bucket [c,c+2] at level 2 will not receive any tasks. The time is now c+1, so only tasks expiring at c+1 or c+2 could be assigned to it, and the level 1 buckets [c+1,c+1] and [c+2,c+2] receive those tasks first. Similarly, [c,c+8] at level 3 will not receive any tasks, because its range is covered by the buckets at level 2.

For a single-level time wheel, inserting or deleting a timed task is O(1). For a hierarchical time wheel with m wheels, insertion is O(m), because a task is passed upward through at most m wheels. Compared with the number of requests in the system, m is usually very small. Deletion remains O(1).
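To see how small m stays, here is a rough sketch of how many levels a given delay needs. It simplifies the real behavior: it assumes every wheel's clock sits at the same aligned start time, measures the delay from that time, and ignores the fact that the first bucket counts as expired. The object WheelLevels and its functions are hypothetical names.

```scala
object WheelLevels {
  // Tick size of the wheel at `level` (1-based): u, u * n, u * n * n, ...
  def tickOf(u: Long, n: Int, level: Int): Long =
    (1 until level).foldLeft(u)((tick, _) => tick * n)

  // Lowest level whose span (tick * n) can hold a task `delay` time units in the future
  def levelFor(u: Long, n: Int, delay: Long): Int = {
    var level = 1
    while (delay >= tickOf(u, n, level) * n) level += 1
    level
  }
}
```

For u = 1 and n = 3 the tick sizes are 1, 3, and 9, matching the resolution column of the table above. Each extra level multiplies the covered span by n, so the number of levels grows only logarithmically with the longest delay.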

A clock is a typical three-level time wheel. The second hand can show 0 to 59 seconds; beyond 59 seconds the minute hand is needed, and beyond that the hour hand. The total range that can be shown is 0 to 43199 seconds, with a resolution of 1 second. From the second hand to the minute hand to the hour hand, the resolution becomes coarser. The second hand has a resolution of 1 second and 60 divisions, so the minute hand has a resolution of 1 * 60 = 60 seconds. Similarly, the hour hand has a resolution of 3600 seconds.
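The clock arithmetic can be checked with a few lines. This is only an illustration of the analogy; ClockWheel and hands are made-up names, and the dial is assumed to be a 12-hour one.

```scala
object ClockWheel {
  // Split a second count into (hour, minute, second) hand positions on a 12-hour dial
  def hands(totalSeconds: Long): (Long, Long, Long) = {
    val second = totalSeconds % 60          // level 1: resolution 1 s, 60 buckets
    val minute = (totalSeconds / 60) % 60   // level 2: resolution 60 s, 60 buckets
    val hour = (totalSeconds / 3600) % 12   // level 3: resolution 3600 s, 12 buckets
    (hour, minute, second)
  }
}
```

The largest representable value, 43199 seconds, maps to (11, 59, 59); one second later all three hands wrap around to (0, 0, 0), which is the overflow case of this three-level wheel.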

Implementation of TimingWheel

With the concept of a hierarchical time wheel in mind, the implementation is easy to read. Kafka's time wheel is the TimingWheel class in the kafka.utils.timer package.

Internal fields

Name	Type	Description
tickMs	Long	Time unit u
wheelSize	Int	Number of buckets n
startMs	Long	Start timestamp in milliseconds
taskCounter	AtomicInteger	The number of tasks, that is, the total number of nodes in all buckets
queue	DelayQueue[TimerTaskList]	Delay queue from the Java standard library

The following private fields are derived from the constructor parameters above. They are declared private[this], so they are accessible only from within the same instance, not from other classes in the package.

  // Total time span of this wheel, which is also the tickMs of the next wheel up
  private[this] val interval = tickMs * wheelSize
  // Create wheelSize buckets (linked lists of timed tasks)
  private[this] val buckets = Array.tabulate[TimerTaskList](wheelSize) { _ => new TimerTaskList(taskCounter) }

  // Round down so that the starting timestamp is a multiple of tickMs
  private[this] var currentTime = startMs - (startMs % tickMs)

  // The next wheel up, which holds tasks that do not fit within interval
  @volatile private[this] var overflowWheel: TimingWheel = null

The next wheel up is created by addOverflowWheel:

  private[this] def addOverflowWheel(): Unit = {
    synchronized {
      if (overflowWheel == null) { // second check of the double-checked locking
        overflowWheel = new TimingWheel(
          // Only tickMs is not passed through unchanged from the lower wheel, because the
          // higher wheel has a coarser time unit (lower resolution).
          // Think of the clock again: the hour hand's tickMs is 60 times the minute hand's.
          tickMs = interval,
          wheelSize = wheelSize,
          startMs = currentTime,
          taskCounter = taskCounter,
          queue
        )
      }
    }
  }

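Adding a timed task

In Kafka, a timed task is abstracted as the TimerTaskEntry class, and a bucket (a linked list of timed tasks) is abstracted as the TimerTaskList class, whose instances are always named bucket in the code. The bucket implements the java.util.concurrent.Delayed interface:

  def getDelay(unit: TimeUnit): Long = {
    unit.convert(max(getExpiration - Time.SYSTEM.hiResClockMs, 0), TimeUnit.MILLISECONDS)
  }

A bucket can therefore be added to a delay queue. When poll is called, the delay queue calls getDelay on its elements to decide whether one can be popped. Now look at the actual implementation of add: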

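  def add(timerTaskEntry: TimerTaskEntry): Boolean = {
    // Expiration timestamp of the timed task
    val expiration = timerTaskEntry.expirationMs

    if (timerTaskEntry.cancelled) {
      // The TimerTask bound to the entry called cancel(), which removed the entry from its list
      false
    } else if (expiration < currentTime + tickMs) {
      // The expiration falls within the first bucket, so the task has already expired
      // and does not need to be added to the wheel
      false
    } else if (expiration < currentTime + interval) {
      // The expiration is within the range this wheel can represent, so add it to a bucket.
      // Note that with this algorithm the first usable bucket covers [c+u, c+u*2),
      // because anything in [c, c+u) counts as expired.
      // The first bucket's index in buckets is not necessarily 0: the array only serves
      // as storage for a circular queue, so the starting index does not matter.
      val virtualId = expiration / tickMs
      val bucket = buckets((virtualId % wheelSize.toLong).toInt)
      bucket.add(timerTaskEntry)

      // Set the expiration time, rounded down here as well to a multiple of tickMs
      if (bucket.setExpiration(virtualId * tickMs)) { // true only if the new expiration differs from the old one
        // Because of the rounding, all entries in a bucket share one expiration time, so this
        // block runs only when the first entry is added to the bucket.
        // Each bucket is therefore offered to queue only once until it is flushed;
        // queue holds every bucket that contains timed tasks.
        // DelayQueue detects when a bucket expires, and iterating over the bucket yields all its entries.
        queue.offer(bucket)
      }
      true
    } else {
      // The expiration is beyond the range of this wheel (overflow), so a higher wheel is needed
      if (overflowWheel == null) addOverflowWheel() // first check of the double-checked locking
      overflowWheel.add(timerTaskEntry) // the higher wheel may overflow too, so even higher wheels may be created recursively
    }
  }

As shown above, the role of the DelayQueue object queue is to hold the buckets that contain timed tasks. The buckets can come from wheels at different levels, and all levels share this one queue.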
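The interaction between a bucket and java.util.concurrent.DelayQueue can be tried in isolation. The sketch below is an assumption-laden stand-in, not Kafka's TimerTaskList: ToyBucket and DelayQueueDemo are hypothetical names, and the wall clock is read with System.currentTimeMillis rather than Kafka's Time.SYSTEM.

```scala
import java.util.concurrent.{DelayQueue, Delayed, TimeUnit}

// A stand-in for a bucket: it becomes available once its expiration time has passed
final class ToyBucket(val expirationMs: Long) extends Delayed {
  override def getDelay(unit: TimeUnit): Long =
    unit.convert(math.max(expirationMs - System.currentTimeMillis(), 0L), TimeUnit.MILLISECONDS)

  // DelayQueue orders its elements with compareTo, so the earliest bucket is at the head
  override def compareTo(other: Delayed): Int = other match {
    case b: ToyBucket => java.lang.Long.compare(expirationMs, b.expirationMs)
    case _ =>
      java.lang.Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS))
  }
}

object DelayQueueDemo {
  // Offer one far-future bucket and one already-expired bucket, then poll without blocking.
  // Returns the polled bucket's expiration relative to "now", if any bucket was due.
  def pollOnce(): Option[Long] = {
    val queue = new DelayQueue[ToyBucket]()
    val now = System.currentTimeMillis()
    queue.offer(new ToyBucket(now + 60000L)) // not due for a minute
    queue.offer(new ToyBucket(now - 1L))     // already due
    Option(queue.poll()).map(_.expirationMs - now)
  }
}
```

Only the bucket that is already due comes back from poll; the other one stays in the queue until its delay reaches zero. This is the mechanism that lets the timer sleep until the next non-empty bucket expires instead of ticking through empty buckets.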

TimingWheel itself does not tick from bucket to bucket; it relies on the DelayQueue to drive the passage of time. Suppose M timed tasks are spread over N buckets, where M >= N. Inserting them all costs O(M + N * log N) in total, because only the N buckets enter the delay queue. If every task were stored in the delay queue directly, the total cost would be O(M * log M), so Kafka's time wheel optimization is worthwhile.
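The saving comes from the fact that setExpiration reports a change only for the first task placed in a bucket. A small sketch of that idea follows, under the assumption that a bucket's expiration is kept in an AtomicLong and compared on update; OfferCount, Bucket, and offersFor are illustrative names, and overflow handling is left out.

```scala
import java.util.concurrent.atomic.AtomicLong

object OfferCount {
  // A bucket reduced to its expiration field
  final class Bucket {
    private val expiration = new AtomicLong(-1L)
    // Returns true only when the stored expiration actually changes
    def setExpiration(expirationMs: Long): Boolean =
      expiration.getAndSet(expirationMs) != expirationMs
  }

  // Place tasks with the given expiration times into buckets and count
  // how many times a bucket would have to be offered to the delay queue
  def offersFor(expirations: Seq[Long], tickMs: Long, wheelSize: Int): Int = {
    val buckets = Array.fill(wheelSize)(new Bucket)
    expirations.count { e =>
      val virtualId = e / tickMs
      buckets((virtualId % wheelSize).toInt).setExpiration(virtualId * tickMs)
    }
  }
}
```

With tickMs = 10 and wheelSize = 4, the five expirations 10, 11, 12, 25, and 27 fall into two buckets, so only two offers reach the delay queue (M = 5, N = 2).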

Advancement of the time wheel

  def advanceClock(timeMs: Long): Unit = {
    if (timeMs >= currentTime + tickMs) { // timeMs is past the range of the current bucket
      currentTime = timeMs - (timeMs % tickMs) // update the current time; the former first bucket is no longer valid

      // If a higher wheel exists, advance it as well
      if (overflowWheel != null) overflowWheel.advanceClock(currentTime)
    }
  }

Advancing only updates currentTime. This field determines whether a bucket counts as expired; see the implementation of the add method above.
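The effect on several levels at once can be observed with a cut-down wheel that tracks nothing but time. This sketch departs from Kafka in one labeled way: the higher wheels are created eagerly through a hypothetical levels parameter rather than lazily on overflow. ClockOnlyWheel and currentTimes are illustrative names.

```scala
// A wheel that only keeps its clock; buckets and tasks are omitted
final class ClockOnlyWheel(tickMs: Long, wheelSize: Int, startMs: Long, levels: Int) {
  private val interval = tickMs * wheelSize
  private var current = startMs - (startMs % tickMs)
  // Assumption for illustration: higher wheels exist from the start
  private val overflow: Option[ClockOnlyWheel] =
    if (levels > 1) Some(new ClockOnlyWheel(interval, wheelSize, current, levels - 1)) else None

  def advanceClock(timeMs: Long): Unit =
    if (timeMs >= current + tickMs) {
      current = timeMs - (timeMs % tickMs)       // round down to a multiple of tickMs
      overflow.foreach(_.advanceClock(current))  // higher wheels move only if a full tick has passed
    }

  // The clock of this wheel followed by the clocks of the higher wheels
  def currentTimes: List[Long] = current :: overflow.map(_.currentTimes).getOrElse(Nil)
}
```

For tickMs = 1, wheelSize = 3, and three levels starting at 0, advancing to time 1 yields clocks (1, 0, 0), which matches the earlier example where only level 1 moves. Advancing to time 10 yields (10, 9, 9), since levels 2 and 3 round down to multiples of 3 and 9.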

The role of the time wheel in Kafka's management of timed tasks

Kafka uses the DelayedOperationPurgatory class (hereinafter purgatory) in the kafka.server package to manage asynchronous tasks, that is, DelayedOperation instances. Each time Kafka receives a request, it starts an asynchronous task. If the task cannot be completed immediately (for example, a Produce request with acks set to all), it is handed to the purgatory for storage, that is, inserted into the internal time wheel. The purgatory runs an ExpiredOperationReaper background thread to detect and handle expired asynchronous tasks. The thread function repeatedly calls the advanceClock method of the internal timer object timeoutTimer to move time forward. If a task has expired, it is removed from the timer and its callback is executed:

  private class ExpiredOperationReaper extends ShutdownableThread(/* ... */) {
    // doWork is called in a loop by the thread function, that is, by the run method of the base class Thread
    override def doWork() {
      advanceClock(200L) // 200 ms
    }
  }

Purgatory uses the SystemTimer class under the kafka.utils.timer package as a timer:

  def apply[T <: DelayedOperation](purgatoryName: String, /* ... */): DelayedOperationPurgatory[T] = {
    val timer = new SystemTimer(purgatoryName)
    new DelayedOperationPurgatory[T](purgatoryName, timer, /* ... */)
  }

In SystemTimer, the key field is the timingWheel:

  // The delay queue provided by the java.util.concurrent package
  private[this] val delayQueue = new DelayQueue[TimerTaskList]()
  private[this] val taskCounter = new AtomicInteger(0)
  // The time wheel that Kafka implements itself in the kafka.utils.timer package
  private[this] val timingWheel = new TimingWheel(
    tickMs = tickMs,
    wheelSize = wheelSize,
    startMs = startMs,
    taskCounter = taskCounter,
    delayQueue
  )

The advanceClock method actually calls the timingWheel.advanceClock method:

  def advanceClock(timeoutMs: Long): Boolean = {
    // Wait up to timeoutMs milliseconds on the delay queue and take out an expired bucket if there is one
    var bucket = delayQueue.poll(timeoutMs, TimeUnit.MILLISECONDS)
    if (bucket != null) { // there is an expired bucket
      writeLock.lock()
      try {
        while (bucket != null) {
          // Advance the time wheel, which may recursively advance the higher wheels; currentTime is updated
          timingWheel.advanceClock(bucket.getExpiration())
          bucket.flush(reinsert)
          // poll() without a timeout does not block, so this loop takes out
          // every bucket that has expired by now
          bucket = delayQueue.poll()
        }
      } finally {
        writeLock.unlock()
      }
      true
    } else {
      false
    }
  }

As shown, when the SystemTimer object calls advanceClock to move time forward, it actually takes every bucket that expired during that period out of the delay queue and then flushes each one:

  // Remove all task entries and apply the supplied function to each of them
  def flush(f: (TimerTaskEntry)=>Unit): Unit = {
    synchronized {
      // Walk the whole bucket (linked list) and remove every entry
      var head = root.next
      while (head ne root) {
        remove(head)
        f(head)
        head = root.next
      }
      expiration.set(-1L)
    }
  }

Note that the reinsert function is passed to flush:

private[this] val reinsert = (timerTaskEntry: TimerTaskEntry) => addTimerTaskEntry(timerTaskEntry)

The question is: why reinsert an entry right after removing it? If the bucket that was taken out belongs to a higher-level time wheel, its tasks may not have expired yet, because the resolution of the higher-level wheel is too coarse. Consider a two-level time wheel (unit: milliseconds):

level	buckets
1	[0,1) [1,2)
2	[0,2) [2,4)

In the initial state, a task with a delay of 3 is added to [2,4) at level 2. After advanceClock(2) is called, the time wheel becomes:

level	buckets
1	[2,3) [3,4)
2	[2,4) [4,6)

The level 2 bucket [2,4) is taken out, and the task with a delay of 3 is removed from it. Calling reinsert at this point adds the task to [3,4) at level 1 instead of treating it as expired right away. The demotion from a higher-level time wheel to a lower-level one is hidden in this inconspicuous bucket.flush(reinsert).
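The demotion can be reproduced with a small function that applies the same comparisons as add, level by level. It is a simplified sketch: it assumes every wheel's clock equals the time last passed to advanceClock, rounded down to that wheel's tick, and it models no buckets at all. Demotion and placement are hypothetical names.

```scala
object Demotion {
  // Returns 0 if the task counts as expired, otherwise the 1-based level that would accept it
  def placement(expirationMs: Long, nowMs: Long, tickMs: Long, wheelSize: Int): Int = {
    def loop(tick: Long, level: Int): Int = {
      val current = nowMs - (nowMs % tick)                          // this wheel's clock
      if (expirationMs < current + tick) 0                          // within the first bucket: expired
      else if (expirationMs < current + tick * wheelSize) level     // fits in this wheel
      else loop(tick * wheelSize, level + 1)                        // overflow: try the next wheel up
    }
    loop(tickMs, 1)
  }
}
```

With tickMs = 1 and wheelSize = 2, a task expiring at 3 is placed at level 2 when the clock is at 0, at level 1 once the clock has advanced to 2, and counts as expired when the clock reaches 3. That is the path the task in the example takes through reinsert.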

Summary

This article described the concepts of the simple time wheel and the hierarchical time wheel, and then used the source code to explain why and how Kafka implements a hierarchical time wheel. With a large number of requests, each corresponding to a timed task, a large number of insert and delete operations are needed, so a multi-level time wheel is used to reduce the time complexity of insertion and deletion. To avoid reinventing the wheel, Kafka still uses the delay queue from the Java standard library to advance the time wheel. Beyond the time wheel itself, Kafka's implementation offers another lesson: optimize where performance matters, and for operations that are not performance-sensitive, use a ready-made component rather than building your own.

If you have any suggestions or questions, you can leave a message below.
