Kafka's two-level scheduling for distributed coordination and task assignment (Golang version)

background

Two-level coordination scheduling architecture based on Kafka message queue

To coordinate the work of its internal consumers and the Kafka Connect workers, Kafka implements a coordination (rebalance) protocol. The main work is split into two steps:

  1. Each worker (consumer or connect worker) reports its own metadata, such as its topic offsets, to the Kafka broker, which completes the leader/follower election
  2. The leader worker fetches the partition and member information stored in Kafka and performs a second round of allocation, combining it with the specific business to achieve a balanced load

In terms of functionality this amounts to two-level scheduling: the first level is responsible for the leader election, and the second level is performed by the leader worker, which assigns tasks to each member node.

The main purpose here is to learn from this architectural design idea, even though the scenarios where it applies are quite limited.

Distributed coordination design based on message queue

Design of the first-level coordinator: the first-level coordinator refers to the Coordinator component. It records the metadata of the members and performs the leader election, for example deciding who is the leader based on the size of the offset.

Design of the second-level coordinator: the second-level coordinator refers to the task-allocation logic on the leader. The leader worker node obtains all tasks and node information, allocates the tasks with an algorithm appropriate to the business, and finally broadcasts the result back to the message queue.

This split is worth learning from. In Kafka's scenario it would be quite troublesome for the broker to implement unified scheduling for every different business, so the assignment of concrete tasks is moved out of the core architecture: the broker side is only responsible for the generic layer, i.e. the leader election, while business-specific allocation is separated from the main architecture and implemented by the specific services themselves.

Code

core design

Following this design, we abstract the core components: MemoryQueue, Worker, Coordinator, GroupRequest, GroupResponse, Task, and Assignment.

MemoryQueue: simulates the message queue and distributes messages, acting as the Kafka broker
Worker: executes tasks and runs the business-specific second-level coordination algorithm
Coordinator: the coordinator inside the message queue, used for Leader/Follower election
Task: a task to be assigned within a group
Assignment: the task assignment result built from the task and node information
GroupRequest: a request to join the cluster
GroupResponse: the response information sent back to the workers

MemoryQueue

core data structure

// MemoryQueue is an in-memory message queue
type MemoryQueue struct {
	done             chan struct{}
	queue            chan interface{}
	wg               sync.WaitGroup
	coordinator      map[string]*Coordinator
	worker           map[string]*Worker
}

The coordinator map identifies the Coordinator of each group; a dedicated coordinator (dispatcher) is created for every group.
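The constructor and the queue's run loop are not part of the original listing; the test section below uses NewMemoryQueue, Start, send, addWorker and Stop, and the dispatch code below relies on Notify and the wait group. The bodies below are a minimal sketch of what these could look like (synchronization around the maps is omitted for brevity), not the original implementation:

// NewMemoryQueue builds a queue with a buffered event channel of the given size
func NewMemoryQueue(size int) *MemoryQueue {
	return &MemoryQueue{
		done:        make(chan struct{}),
		queue:       make(chan interface{}, size),
		coordinator: make(map[string]*Coordinator),
		worker:      make(map[string]*Worker),
	}
}

// send enqueues an event; every event is matched by a wg.Done in handleEvent
func (mq *MemoryQueue) send(event interface{}) {
	mq.wg.Add(1)
	mq.queue <- event
}

// addWorker registers a worker so that Notify can broadcast events to it
func (mq *MemoryQueue) addWorker(id string, w *Worker) {
	mq.worker[id] = w
}

// Notify broadcasts an event (GroupResponse, Assignment, ...) to all workers
func (mq *MemoryQueue) Notify(event interface{}) {
	for _, w := range mq.worker {
		w.Execute(event)
	}
}

// Start consumes events in the background until Stop is called
func (mq *MemoryQueue) Start() {
	go func() {
		for {
			select {
			case event := <-mq.queue:
				mq.handleEvent(event)
			case <-mq.done:
				return
			}
		}
	}()
}

// Stop waits for in-flight events and shuts the loop down
func (mq *MemoryQueue) Stop() {
	mq.wg.Wait()
	close(mq.done)
}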

Handling a node's request to join the cluster

MemoryQueue receives events and dispatches them by type. A GroupRequest event is routed to handleGroupRequest, which first obtains the Coordinator of the corresponding group, then builds a GroupResponse from the current state via buildGroupResponse and sends it back to the message queue.

Event dispatching

func (mq *MemoryQueue) handleEvent(event interface{}) {
	switch e := event.(type) {
	case GroupRequest:
		mq.handleGroupRequest(&e)
	case Task:
		mq.handleTask(&e)
	default:
		mq.Notify(event)
	}
	mq.wg.Done()
}
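handleTask is referenced in the switch above but is not shown elsewhere; presumably it simply records the task on the group's coordinator so that it can be handed out in the next assignment. A sketch under that assumption, using the Task fields (Name, Group) that appear in the test section:

// handleTask registers the task with the group's coordinator; it will be
// distributed the next time the leader performs an assignment
func (mq *MemoryQueue) handleTask(task *Task) {
	coordinator := mq.getGroupCoordinator(task.Group)
	coordinator.Tasks = append(coordinator.Tasks, task.Name)
}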

Join Group request processing

Here the coordinator calls its own getLeaderID method to elect a leader node based on the information of each member currently in the group.

// getGroupCoordinator gets the coordinator of the specified group, creating it if necessary
func (mq *MemoryQueue) getGroupCoordinator(group string) *Coordinator {
	coordinator, ok := mq.coordinator[group]
	if ok {
		return coordinator
	}
	coordinator = NewCoordinator(group)
	mq.coordinator[group] = coordinator
	return coordinator
}

func (mq *MemoryQueue) handleGroupRequest(request *GroupRequest) {
	coordinator := mq.getGroupCoordinator(request.Group)
	exist := coordinator.addMember(request.ID, &request.Metadata)
	// if the worker has already joined this group, do nothing
	if exist {
		return
	}
	// build the response from the current group information
	groupResponse := mq.buildGroupResponse(coordinator)
	mq.send(groupResponse)
}

func (mq *MemoryQueue) buildGroupResponse(coordinator *Coordinator) GroupResponse {
	return GroupResponse{
		Tasks:       coordinator.Tasks,
		Group:       coordinator.Group,
		Members:     coordinator.AllMembers(),
		LeaderID:    coordinator.getLeaderID(),
		Generation:  coordinator.Generation,
		Coordinator: coordinator,
	}
}

Coordinator

core data structure

// Coordinator is the per-group coordinator
type Coordinator struct {
	Group      string
	Generation int
	Members    map[string]*Metadata
	Tasks      []string
	Heartbeats map[string]int64
}

The Coordinator stores each worker node's metadata in Members, Tasks holds all tasks of the current group, Heartbeats records the workers' heartbeat information, and Generation is a generation counter that is incremented every time the membership changes.
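NewCoordinator, addMember and AllMembers are called by the MemoryQueue above, and the worker heartbeat shown later would call a heartbeat method; their bodies are not part of the original listing. The sketch below is an assumption consistent with the behavior described here (Generation is bumped when a new member joins):

// NewCoordinator creates the coordinator for a group
func NewCoordinator(group string) *Coordinator {
	return &Coordinator{
		Group:      group,
		Members:    make(map[string]*Metadata),
		Heartbeats: make(map[string]int64),
	}
}

// addMember registers a worker's metadata and reports whether it was already a
// member; every new member bumps the generation counter
func (c *Coordinator) addMember(id string, metadata *Metadata) bool {
	if _, ok := c.Members[id]; ok {
		return true
	}
	c.Members[id] = metadata
	c.Generation++
	return false
}

// AllMembers returns the IDs of all current members
func (c *Coordinator) AllMembers() []string {
	members := make([]string, 0, len(c.Members))
	for id := range c.Members {
		members = append(members, id)
	}
	return members
}

// heartbeat records the latest heartbeat timestamp reported by a worker
func (c *Coordinator) heartbeat(id string, ts int64) {
	c.Heartbeats[id] = ts
}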

Election of Leader by offset

The master node is elected based on the worker metadata stored in the coordinator.

// getLeaderID determines the leader node based on the current member information
func (c *Coordinator) getLeaderID() string {
	leaderID, maxOffset := "", 0
	// the decision is made by offset size: the member with the largest offset becomes leader; in practice this could be more complex
	for wid, metadata := range c.Members {
		if leaderID == "" || metadata.offset() > maxOffset {
			leaderID = wid
			maxOffset = metadata.offset()
		}
	}
	return leaderID
}
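getLeaderID relies on a Metadata type that exposes the worker's offset; the type is not shown in the article, so the struct below is an assumption for illustration (only the offset is actually needed for the election):

// Metadata carries the state a worker reports when joining, mimicking a consumer offset
type Metadata struct {
	ID     string
	Offset int
}

// offset returns the reported offset used for leader election
func (m *Metadata) offset() int {
	return m.Offset
}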

Worker

core data structure

// Worker is a worker node
type Worker struct {
	ID          string
	Group       string
	Tasks       string
	done        chan struct{}
	queue       *MemoryQueue
	Coordinator *Coordinator
}

The worker node holds a reference to the Coordinator, which is later used to send heartbeat information to it.
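NewWorker, start and stop are used in the test section below, but their bodies are not shown. A sketch of what they likely do, assuming start reports the given offset as this worker's metadata in a GroupRequest (the GroupRequest field names come from handleGroupRequest above, the rest is an assumption):

// NewWorker creates a worker bound to the queue
func NewWorker(id, group string, queue *MemoryQueue) *Worker {
	return &Worker{
		ID:    id,
		Group: group,
		done:  make(chan struct{}),
		queue: queue,
	}
}

// start joins the group, reporting the given offset as this worker's metadata
func (w *Worker) start(offset int) {
	w.queue.send(GroupRequest{
		ID:       w.ID,
		Group:    w.Group,
		Metadata: Metadata{ID: w.ID, Offset: offset},
	})
}

// stop signals background goroutines such as the heartbeat to exit; in the full
// implementation it presumably also removes the worker from the coordinator,
// which is what triggers the later generations in the test output
func (w *Worker) stop() {
	close(w.done)
}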

Dispatching request messages

The worker receives different event types and processes them by type. handleGroupResponse receives the response from the server-side Coordinator, which contains the leader node and the task information; the leader worker then performs the second-level assignment. handleAssign processes the task information assigned to this worker once the assignment is done.

// Execute handles the events delivered to this worker and runs the assigned tasks
func (w *Worker) Execute(event interface{}) {
	switch e := event.(type) {
	case GroupResponse:
		w.handleGroupResponse(&e)
	case Assignment:
		w.handleAssign(&e)
	}
}
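handleAssign is referenced above but not listed. The Assignment fields can be inferred from the literal built in onLeaderJoin below (LeaderID, Generation and a result map), and the log lines in the test output suggest the worker joins its tasks with "||" before printing them; the sketch below is based on those inferences and uses the standard fmt and strings packages:

// Assignment is the task assignment broadcast by the leader
// (fields inferred from the literal in onLeaderJoin)
type Assignment struct {
	LeaderID   string
	Generation int
	result     map[string][]string
}

// handleAssign picks out this worker's share of the tasks and records it
func (w *Worker) handleAssign(assign *Assignment) {
	tasks := assign.result[w.ID]
	w.Tasks = strings.Join(tasks, "||")
	fmt.Printf("Generation [%d] worker [%s]  run tasks: [%s]\n",
		assign.Generation, w.ID, w.Tasks)
}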

Handling the GroupResponse according to the node's role

The GroupResponse divides the nodes into two roles: Leader and Follower. After receiving a GroupResponse, the Leader node must go on to assign the tasks, while a Follower only needs to listen for events and send heartbeats.

func (w *Worker) handleGroupResponse(response *GroupResponse) {
	if w.isLeader(response.LeaderID) {
		w.onLeaderJoin(response)
	} else {
		w.onFollowerJoin(response)
	}
}

Follower node

Follower node sends heartbeat

// onFollowerJoin the current role is follower
func (w *Worker) onFollowerJoin(response *GroupResponse) {
	w.Coordinator = response.Coordinator
	go w.heartbeat()
}
// heartbeat sends heartbeats periodically
func (w *Worker) heartbeat() {
	// timer := time.NewTimer(time.Second)
	// for {
	// 	select {
	// 	case <-timer.C:
	// 		w.Coordinator.heartbeat(w.ID, time.Now().Unix())
	// 		timer.Reset(time.Second)
	// 	case <-w.done:
	// 		return
	// 	}
	// }
}

Leader node

Here I split the scheduling assignment into two steps: 1) shard the tasks according to the number of nodes and the number of tasks; 2) assign the shards to each node, and finally send the result back to the queue.

// onLeaderJoin the current role is leader: perform the task assignment and send it to the mq
func (w *Worker) onLeaderJoin(response *GroupResponse) {
	fmt.Printf("Generation [%d] leaderID [%s]\n", response.Generation, w.ID)
	w.Coordinator = response.Coordinator
	go w.heartbeat()
	// shard the tasks
	taskSlice := w.performAssign(response)

	// assign the shards to each worker
	memberTasks, index := make(map[string][]string), 0
	for _, name := range response.Members {
		memberTasks[name] = taskSlice[index]
		index++
	}

	// broadcast the assignment
	assign := Assignment{LeaderID: w.ID, Generation: response.Generation, result: memberTasks}
	w.queue.send(assign)
}

// performAssign shards the tasks according to the current number of members and tasks
func (w *Worker) performAssign(response *GroupResponse) [][]string {
	perWorker := len(response.Tasks) / len(response.Members)
	leftOver := len(response.Tasks) - len(response.Members)*perWorker

	result := make([][]string, len(response.Members))

	taskIndex, memberTaskCount := 0, 0
	for index := range result {
		if index < leftOver {
			memberTaskCount = perWorker + 1
		} else {
			memberTaskCount = perWorker
		}
		for i := 0; i < memberTaskCount; i++ {
			result[index] = append(result[index], response.Tasks[taskIndex])
			taskIndex++
		}
	}
	return result
}

Test Data

Start a queue, then add tasks and workers, and observe the allocation results

	// build the queue
	queue := NewMemoryQueue(10)
	queue.Start()

	// send the tasks
	queue.send(Task{Name: "test1", Group: "test"})
	queue.send(Task{Name: "test2", Group: "test"})
	queue.send(Task{Name: "test3", Group: "test"})
	queue.send(Task{Name: "test4", Group: "test"})
	queue.send(Task{Name: "test5", Group: "test"})

	// start the workers, giving each one a different offset to check that the leader is elected correctly
	workerOne := NewWorker("test-1", "test", queue)
	workerOne.start(1)
	queue.addWorker(workerOne.ID, workerOne)

	workerTwo := NewWorker("test-2", "test", queue)
	workerTwo.start(2)
	queue.addWorker(workerTwo.ID, workerTwo)

	workerThree := NewWorker("test-3", "test", queue)
	workerThree.start(3)
	queue.addWorker(workerThree.ID, workerThree)

	time.Sleep(time.Second)
	workerThree.stop()
	time.Sleep(time.Second)
	workerTwo.stop()
	time.Sleep(time.Second)
	workerOne.stop()

	queue.Stop()

Running result: first, test-3 ends up as the leader according to the offsets; the assignment result shows two nodes with two tasks each and one node with one task; then, as workers exit, the tasks are reassigned.

Generation [1] leaderID [test-1]
Generation [2] leaderID [test-2]
Generation [3] leaderID [test-3]
Generation [1] worker [test-1]  run tasks: [test1||test2||test3||test4||test5]
Generation [1] worker [test-2]  run tasks: []
Generation [1] worker [test-3]  run tasks: []
Generation [2] worker [test-1]  run tasks: [test1||test2||test3]
Generation [2] worker [test-2]  run tasks: [test4||test5]
Generation [2] worker [test-3]  run tasks: []
Generation [3] worker [test-1]  run tasks: [test1||test2]
Generation [3] worker [test-2]  run tasks: [test3||test4]
Generation [3] worker [test-3]  run tasks: [test5]
Generation [4] leaderID [test-2]
Generation [4] worker [test-1]  run tasks: [test1||test2||test3]
Generation [4] worker [test-2]  run tasks: [test4||test5]
Generation [5] leaderID [test-1]
Generation [5] worker [test-1]  run tasks: [test1||test2||test3||test4||test5]

Summary

In practice, in a distributed scenario this kind of Leader/Follower election would more often be built on CP-model systems such as consul, etcd or zk. The design in this article is closely tied to Kafka's own business scenario; if there is time later, we will continue to look at other designs. That is all for the design borrowed from kafka connect.

To be continued. Follow the official account: commoner code farmer

More content can be found at www.sreguide.com
