CLP Jinxin technical practice | Flink multithreading for dynamic load balancing of heterogeneous clusters

Introduction: Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Starting from a real case and drawing on the author's hands-on experience, this article shares how custom multithreading in Flink can provide dynamic load balancing for a heterogeneous cluster when the cluster cannot balance load on its own.

● 1. Preface

● 2. Problems and solutions

 ● 2.1 The problem

 ● 2.2 Analysis

 ● 2.3 Solution

● 3. Technical Architecture

● 4. Results

● 5. Conclusion

Preface

In real-time computing scenarios, jobs often need to call services on heterogeneous clusters in real time. When those services cannot balance load on their own because of differences in machine configuration, node load, and so on, dynamic load balancing can be implemented on the client side with Flink's custom multithreading.

Here is an example:

Feature production tasks such as text content recognition, image content recognition, and image OCR all need to interact with heterogeneous, GPU-based clusters. If the machines in a GPU cluster cannot be configured uniformly, load imbalance occurs: some nodes process requests quickly while others process them slowly, and the slow nodes often produce a large number of timeout exceptions, which in turn causes back pressure across the entire job.

Its flow chart is as follows:

Leveraging Flink's inherent distributed advantages, the task calls the model service through Thrift RPC, obtains the result in real time, and then writes it to the feature engineering platform, building out the entire feature-generation link.

[Figure: overall feature production link, from Flink through Thrift RPC to the model service and feature engineering]

Problems and Solutions

The problem:

The nodes that host the model service are completely independent of each other, and each node updates its status in ZooKeeper when it goes online or offline. Each subTask of the Flink job registers a Watch to obtain the latest list of available nodes, and for every record that flows into the subTask, one node must be selected to perform inference on that record.
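As a rough illustration (not the project's actual code), node discovery can be sketched with the plain ZooKeeper client. The znode path /model/nodes is an assumption; connection handling and retries are omitted.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Keeps an up-to-date list of available model nodes registered in ZooKeeper. */
public class NodeDiscovery implements Watcher {

    // Hypothetical path; the real deployment may use a different layout.
    private static final String NODE_PATH = "/model/nodes";

    private final ZooKeeper zk;
    private final CopyOnWriteArrayList<String> liveNodes = new CopyOnWriteArrayList<>();

    public NodeDiscovery(String zkQuorum) throws Exception {
        // Sketch only: connection-state handling is omitted.
        this.zk = new ZooKeeper(zkQuorum, 30_000, this);
        refresh();
    }

    /** Re-reads the children and re-registers the watch (ZooKeeper watches are one-shot). */
    private void refresh() throws Exception {
        List<String> children = zk.getChildren(NODE_PATH, this);
        liveNodes.clear();
        liveNodes.addAll(children);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                refresh();
            } catch (Exception e) {
                // In real code: log and retry with backoff.
            }
        }
    }

    public List<String> currentNodes() {
        return liveNodes;
    }
}
```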

In the early stage we used a Random strategy to select nodes, but in practice we found that if the performance of one model node on the server side degrades, inference latency rises with it, which can eventually cause back pressure in the Flink job. From the server's point of view, many model nodes are far from fully loaded, yet from the client's point of view the server appears under-powered and the total QPS it can handle is very low.

In addition, we used a synchronous strategy when communicating with the model service. Jobs whose inference takes a long time and whose QPS is high therefore need a large enough parallelism to keep up with the incoming data, yet the resource utilization of these jobs is low, which was another major pain point in the production environment.
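A minimal sketch of this early approach, assuming a hypothetical blocking client wrapper (ModelClient and its predict method are illustrative, not the project's real code): every record picks a node at random and the subTask thread blocks on the synchronous call.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Early approach: pick a node at random and call it synchronously (blocking the subTask). */
public class RandomSyncInference extends RichMapFunction<String, String> {

    private transient NodeDiscovery discovery;   // from the previous sketch
    private transient ModelClient client;        // hypothetical blocking RPC client wrapper

    @Override
    public void open(Configuration parameters) throws Exception {
        discovery = new NodeDiscovery("zk-host:2181");   // assumed ZooKeeper quorum
        client = new ModelClient();
    }

    @Override
    public String map(String record) throws Exception {
        List<String> nodes = discovery.currentNodes();
        // Random strategy: every record picks a node uniformly at random.
        String node = nodes.get(ThreadLocalRandom.current().nextInt(nodes.size()));
        // Synchronous RPC: the subTask thread blocks until this node answers,
        // so one slow node stalls the whole subTask and back pressure builds up.
        return client.predict(node, record);
    }
}
```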

[Figure: request distribution from Flink subTasks to the model service nodes]

Analysis:

Based on the above figure, we make the following analysis:

In the ideal case, assume the server side has Node 1, Node 2, and Node 3, all with the same performance, and each node processes 32 requests in parallel. If each record takes 800 ms to process, then each node can handle 40 records/s, and the three nodes at full load can handle 120 records/s.

In a real production environment, the processing capability of each server node differs, mainly for three reasons:

 ● GPU physical machines come in a variety of specifications with widely varying performance, and it is difficult to guarantee during deployment that all nodes land on the same batch of machines;

 ● Multiple model services mixed on the same machine interfere with each other;

 ● Network or disk failures on the machines hosting some nodes also cause differences.

For example, suppose Node 1 and Node 2 are deployed on high-performance machines with a node parallelism of 32 and a per-record processing time of 800 ms, while Node 3 is deployed on a low-performance machine with the same parallelism of 32 but a per-record processing time of 2400 ms; Node 3's capacity is therefore about 13.3 records/s. If nodes are still selected at random and a total of 40 records arrive per second, Node 3 has already hit its bottleneck. Any additional records routed to Node 3 can only wait in the server-side queue, and once the queue is full new connections are rejected.
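To make the arithmetic explicit: each node's capacity is its parallelism divided by its per-record latency, and under the Random strategy each node receives roughly an equal share of the traffic.

$$
\text{capacity} = \frac{\text{parallelism}}{\text{latency}},\qquad
\text{Node 1, 2: } \frac{32}{0.8\,\text{s}} = 40\ \text{records/s},\qquad
\text{Node 3: } \frac{32}{2.4\,\text{s}} \approx 13.3\ \text{records/s}
$$

$$
\text{share per node under Random} \approx \frac{40\ \text{records/s}}{3} \approx 13.3\ \text{records/s} \;\ge\; \text{Node 3's capacity}
$$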

As the job keeps running, Node 3's waiting queue grows, and the time from when the client sends a request to when the result comes back grows with it. Once a subTask selects this node, it has to wait a long time for the request to complete. If upstream data keeps flowing in during that wait, the subTask's InputChannel buffers are gradually exhausted, then the shared Buffer Pool fills up, and the subTask gets stuck. Because the upstream operator uses Rebalance partitioning, the entire Flink job eventually stalls.

[Figure: a slow node exhausts a subTask's buffers and back pressure propagates upstream]

This is a typical barrel (weakest-plank) effect: in real application scenarios, once a single node performs poorly or fails, the stability of the entire job is affected.

Solution:

After analyzing the root cause, we proposed assigning a weight to each node: the model nodes periodically report their weight to ZooKeeper, and the client distributes traffic to each node according to its weight. The idea is sound and achieved some results in practice, but it has a drawback: the traffic of each node jitters constantly, the jitter frequency is positively correlated with the weight-reporting interval, and the monitoring data shows that the total amount of processed data also becomes unstable.
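For illustration, a weight-based strategy of this kind boils down to weighted random selection; the sketch below assumes the client already holds the latest node-to-weight map reported through ZooKeeper.

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/** Weight-based selection: nodes with a higher reported weight receive proportionally more traffic. */
public class WeightedNodeSelector {

    /** @param weights node address -> latest weight reported to ZooKeeper (assumed layout) */
    public static String pick(Map<String, Integer> weights) {
        int total = weights.values().stream().mapToInt(Integer::intValue).sum();
        if (total <= 0) {
            // Degenerate case: fall back to any node.
            return weights.keySet().iterator().next();
        }
        int r = ThreadLocalRandom.current().nextInt(total);
        int cumulative = 0;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            cumulative += e.getValue();
            if (r < cumulative) {
                return e.getKey();
            }
        }
        return weights.keySet().iterator().next();
    }
}
```

Because the weights are only refreshed at the reporting interval, the traffic assigned to each node keeps oscillating around the latest reported values, which is exactly the jitter described above.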

[Figures: monitoring charts showing per-node traffic jitter and unstable total throughput under the weight-based strategy]

After the problem was initially alleviated, we began to think about whether there was a better way to solve it. We knew that a slow model node causes a subTask to get stuck, and a subTask essentially occupies a slot, that is, a single thread; so could multithreading solve the problem? With this in mind we tried two other approaches: Async I/O and custom multithreading.

■ Async I/O solution

Flink introduced Async I/O in version 1.2, mainly so that network latency does not become a system bottleneck when interacting with external systems. Through the exposed API we can set the capacity, which can be loosely understood as the maximum number of concurrent asynchronous requests within a slot. During the test we prepared three nodes: one took 2000 ms per request, the other two took 500 ms.
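A minimal sketch of how this is wired up with Flink's Async I/O API; the AsyncModelClient and its predict method are placeholders, not the project's actual code, and the timeout and capacity values are examples only.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class AsyncInferenceJob {

    /** Issues the RPC asynchronously so the subTask thread is not blocked per request. */
    public static class AsyncModelRequest extends RichAsyncFunction<String, String> {
        private transient AsyncModelClient client; // hypothetical non-blocking client

        @Override
        public void open(Configuration parameters) {
            client = new AsyncModelClient();
        }

        @Override
        public void asyncInvoke(String record, ResultFuture<String> resultFuture) {
            // The node is still chosen randomly inside the client in this test.
            CompletableFuture<String> future = client.predict(record);
            future.whenComplete((result, error) -> {
                if (error != null) {
                    resultFuture.completeExceptionally(error);
                } else {
                    resultFuture.complete(Collections.singleton(result));
                }
            });
        }
    }

    public static DataStream<String> apply(DataStream<String> input) {
        // timeout = 10 s, capacity = 100 in-flight requests per subTask (example values)
        return AsyncDataStream.unorderedWait(
                input, new AsyncModelRequest(), 10_000, TimeUnit.MILLISECONDS, 100);
    }
}
```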

The job ran normally for a few minutes after startup, then the server-side queue of the slower node grew longer and longer, eventually stabilizing at around 150, while the timeout failure rate also began to climb. So although this approach solves the back pressure and resource utilization problems caused by mixing fast and slow nodes, it still does not solve the traffic distribution problem.

[Figures: server-side queue length and timeout failure rate during the Async I/O test]

■ Multi-threaded solution

We implement a producer-consumer model inside each slot and create the same number of consumer threads as there are model nodes, so that each thread requests a fixed node. Even if that node gets stuck or processes slowly, it only affects the current thread and has limited impact on the subTask as a whole.

[Figure: producer-consumer model inside a slot, with one consumer thread per model node]

As shown in the figure above, a slot contains multiple threads, and a problem with the Service node behind an individual thread does not affect consumption by the other threads. This naturally yields an adaptive traffic distribution strategy: each thread corresponds to one server Pod, and because a thread simply blocks on its own slow node, slow nodes consume less and fast nodes consume more.

Flink's minimum resource granularity is the slot; by further splitting a slot into several threads, we increase the concurrency within each slot, which reduces the total number of slots required and improves resource utilization while using fewer resources.

Technical Architecture

In the Flink job, the multi-threaded solution is used to solve the load balancing problem of the RPC communication, and the programming model needs to be modified accordingly. The specific modification is as follows:

[Figures: programming model of the multi-threaded solution (producer-consumer inside each slot)]
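To make the modification concrete, here is a minimal sketch (not the original code shown in the figures) of the kind of operator involved: a RichFlatMapFunction puts incoming records into a bounded queue, starts one consumer thread per model node, and drains finished results on the task thread. ModelClient, the node list, and the queue sizes are illustrative assumptions.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of the multi-threaded operator: one consumer thread per model node.
 * A slow node only slows down its own thread; the other threads keep draining the queue.
 */
public class MultiThreadedInference extends RichFlatMapFunction<String, String> {

    private final List<String> nodes; // fixed node list for the sketch; in practice it comes from ZooKeeper

    private transient BlockingQueue<String> inputQueue;   // producer: flatMap; consumers: worker threads
    private transient BlockingQueue<String> outputQueue;  // results written by worker threads
    private transient List<Thread> workers;
    private transient volatile boolean running;

    public MultiThreadedInference(List<String> nodes) {
        this.nodes = nodes;
    }

    @Override
    public void open(Configuration parameters) {
        inputQueue = new ArrayBlockingQueue<>(1000);  // bounded: preserves natural back pressure
        outputQueue = new LinkedBlockingQueue<>();
        workers = new ArrayList<>();
        running = true;
        for (String node : nodes) {
            Thread t = new Thread(() -> {
                ModelClient client = new ModelClient(); // hypothetical blocking RPC client
                while (running || !inputQueue.isEmpty()) {
                    try {
                        String record = inputQueue.poll(100, TimeUnit.MILLISECONDS);
                        if (record != null) {
                            // Each thread always talks to its own fixed node.
                            outputQueue.put(client.predict(node, record));
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }, "model-worker-" + node);
            t.setDaemon(true);
            t.start();
            workers.add(t);
        }
    }

    @Override
    public void flatMap(String record, Collector<String> out) throws Exception {
        inputQueue.put(record); // blocks when the queue is full, so back pressure still works
        // Emit finished results on the task thread (the Collector is not thread-safe).
        String result;
        while ((result = outputQueue.poll()) != null) {
            out.collect(result);
        }
    }

    @Override
    public void close() throws Exception {
        running = false;
        for (Thread t : workers) {
            t.join(10_000);
        }
        // A real implementation also needs to flush any results still sitting in outputQueue.
    }
}
```

In practice the thread count, queue sizes, timeout handling, and how leftover results are flushed on shutdown all need care; the point of the sketch is only that each node gets its own blocking consumer, so traffic adapts to each node's speed automatically.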

Results

Comparing the data before and after, the multi-threaded solution has a clearly visible effect. The server-side processing latency indicator in the figure below shows that the node ending in 125.172 takes about 1.7 s per request, and the traffic distribution indicator shows it is allocated about 5 records/s; the node ending in 160.25 takes about 0.14 s per request and is allocated about 58 records/s, which is broadly in line with expectations.

[Figures: per-node server-side processing latency and allocated traffic after the change]

At the same time, to ensure the stability of the whole service, we added several monitoring metrics, such as cache queue length, model failure rate, end-to-end link latency, and feature-write failure rate.
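As an illustration (the metric names are made up for the example), such indicators can be exposed through Flink's metric API in the operator's open() method:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Gauge;

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Minimal example of exposing custom metrics like the ones listed above from a Flink operator. */
public abstract class MonitoredInference extends RichFlatMapFunction<String, String> {

    protected transient BlockingQueue<String> inputQueue;
    protected transient Counter modelFailures;

    @Override
    public void open(Configuration parameters) {
        inputQueue = new ArrayBlockingQueue<>(1000);
        // Gauge: current cache (input queue) length, scraped by the configured metric reporter.
        getRuntimeContext().getMetricGroup()
                .gauge("cacheQueueLength", (Gauge<Integer>) () -> inputQueue.size());
        // Counter: incremented whenever a model call fails or times out.
        modelFailures = getRuntimeContext().getMetricGroup().counter("modelFailureCount");
    }
}
```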

[Figure: monitoring dashboard for the added metrics]

Conclusion

This article has described how, when Flink calls a heterogeneous cluster whose server side cannot distribute traffic itself, the client can achieve dynamic load balancing through multithreading, allowing the server side to mix high- and low-spec machines and improving machine utilization. It should be pointed out that the multi-threaded operators used here are stateless; stateful operators require additional consideration. In addition, if the server side is a component that can distribute traffic across its own nodes, the Async I/O solution is a reasonable choice.

Source: blog.csdn.net/zhongdianjinxin/article/details/132278317