When traffic bursts and services are overloaded, how should distributed services set their rate limiting thresholds?

1. Problems faced

In a large cloud-based, distributed, service-oriented system, tens of thousands of service nodes form a large network with complex call chains. Services on the same node compete for resources, and upstream and downstream nodes both compete with one another and pass load pressure along the chain. Every service node, and every service on it, therefore needs traffic protection.

Traffic protection techniques such as rate limiting and circuit breaking are relatively mature. In practice, however, the relevant parameters are often set too loosely to provide adequate protection when external traffic surges, so resource contention on service nodes and transaction timeouts still occur frequently.

How to evaluate and set a reasonable rate limiting threshold for each service has therefore become a hard problem for system maintainers. This research attempts to find a reasonable solution.

2. Solution process

Common service rate limiting indicators

Rate limiting rejects, according to rules, part of the request traffic that exceeds a preset threshold, so that a traffic surge does not exhaust the key resources of the service node or trigger secondary problems.

For service nodes, the most commonly used rate limiting indicators are the number of requests processed per second (TPS) and the maximum concurrency.

At the business level, service nodes are required to reach a certain processing capacity, which is usually described as the number of requests that can be processed per unit of time. If the business has specified a requests-per-second (TPS) requirement, the rate limiting threshold can be set accordingly.

From the perspective of system stability and availability, the best rate limiting indicator is the maximum concurrency. Limiting the maximum concurrency of a service guarantees that it never has too many requests in flight consuming resources at the same time.

In theory, the two indicators can be roughly converted with the following formula:

Concurrency = TPS × average service response time
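As a worked example of this conversion (the relation is commonly known as Little's law), the following sketch converts in both directions. The function names and the sample figures are illustrative assumptions, not values from the article.

```python
def concurrency_from_tps(tps, avg_response_seconds):
    """Approximate number of in-flight requests for a given throughput."""
    return tps * avg_response_seconds


def tps_from_concurrency(concurrency, avg_response_seconds):
    """Approximate throughput that a given concurrency limit allows."""
    return concurrency / avg_response_seconds


# Hypothetical service: 500 requests per second, 0.2 s average response time.
print(concurrency_from_tps(500, 0.2))   # roughly 100 concurrent requests
print(tps_from_concurrency(100, 0.2))   # roughly 500 requests per second
```

The conversion is only approximate because the average response time itself tends to rise as the node approaches saturation.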

This article focuses on how to evaluate the rate limiting threshold for the maximum concurrency indicator.
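To make the maximum concurrency indicator concrete, here is a minimal sketch of a concurrency limiter that rejects requests once a fixed number are already in flight. The class name `ConcurrencyLimiter`, the `handle` helper, and the `"rejected"` return value are assumptions made for illustration; the article does not describe a specific implementation.

```python
import threading


class ConcurrencyLimiter:
    """Allows at most `max_concurrency` requests to be processed at once."""

    def __init__(self, max_concurrency):
        self._slots = threading.BoundedSemaphore(max_concurrency)

    def try_acquire(self):
        # Non-blocking: returns False immediately when the limit is reached.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def handle(limiter, work):
    """Run `work()` if a slot is free, otherwise reject the request."""
    if not limiter.try_acquire():
        return "rejected"
    try:
        return work()
    finally:
        limiter.release()
```

Rejecting immediately instead of queueing keeps excess requests from holding memory and connections while they wait, which is the point of protecting the node's key resources.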

Evaluation of the overall node rate limiting threshold

1. Theoretical analysis

The rate limiting parameter is the indicator threshold that triggers rate limiting. If the threshold is set too high, rate limiting is not triggered when traffic is heavy, so the application server becomes overloaded, performance deteriorates, and services time out. If it is set too low, resource usage stays at a low level and server resources are wasted. Evaluating the rate limiting threshold of a service node means evaluating how many requests the node can handle at the same time, that is, its maximum concurrency. For a service node, the maximum concurrency is determined by the size of the service thread pool.

The service thread pool size is therefore the overall rate limiting threshold of the service node. It determines the node's processing capacity, resource usage, and operational stability. For such a critical parameter, evaluation by stress testing is recommended.

During the stress test, look for the critical point at which resource usage is high but service performance has not yet dropped significantly. The maximum concurrency at this critical point is the optimal service thread pool size, and the TPS measured there is the corresponding throughput limit.

2. Stress testing method

To get a reliable evaluation, the stress test environment should match production as closely as possible. The hardware configuration of the service node should be the same as in production, and the transaction volume mix of the services on the node should also match production. If testing cost is a concern, choose the services that account for a large share of transaction volume as the test sample.

For service nodes that are already running in production, it is better still to record real traffic from production peak periods and replay it in the test.

During the stress test, gradually increase request concurrency until server resource usage reaches a high level (for example, CPU at about 80%). The service concurrency at that point can be used as the node's concurrency rate limiting threshold, that is, the service thread pool size, as shown in the figure below.

[Figure: CPU usage and average response time versus request concurrency during the node stress test]
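The selection of the critical point can be sketched as follows, assuming the stress test produced one measurement per concurrency level. The `Sample` type, the function name `find_threshold`, and the numbers are hypothetical; only the 80% CPU target comes from the text.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    concurrency: int        # concurrent requests applied in this test step
    cpu_percent: float      # CPU usage observed at that concurrency
    avg_response_ms: float  # average response time observed


def find_threshold(samples, cpu_target=80.0):
    """Return the lowest tested concurrency whose CPU usage reaches the target.

    Returns None if no test step reached the target, meaning the ramp-up
    should continue to higher concurrency.
    """
    for sample in sorted(samples, key=lambda s: s.concurrency):
        if sample.cpu_percent >= cpu_target:
            return sample.concurrency
    return None


# Made-up measurements for illustration only.
samples = [
    Sample(50, 35.0, 20.0),
    Sample(100, 58.0, 21.0),
    Sample(150, 81.0, 23.0),
    Sample(200, 93.0, 48.0),
]
print(find_threshold(samples))  # 150
```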

3. Things to note

During the stress test, also watch whether the average response time of the service deteriorates. In the figure above, the average response time curve stays fairly flat until concurrency reaches the rate limiting threshold, which is the ideal situation.

The average response time may also deteriorate while the utilization of hardware resources such as CPU and memory is still low. This indicates another bottleneck, such as inefficient database access or object locks added to the program for thread safety, which needs to be investigated. Resolve the bottleneck, then continue the stress test.
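A simple check for this situation might look like the sketch below. It flags the first concurrency level at which response time has degraded well beyond the baseline while CPU usage is still below the target. The function name and the factor of 2 for "deteriorated" are assumptions for illustration, not values from the article.

```python
def find_hidden_bottleneck(samples, cpu_target=80.0, rt_factor=2.0):
    """Detect response time degradation that is not explained by CPU load.

    `samples` is a list of (concurrency, cpu_percent, avg_response_ms) tuples.
    The lowest tested concurrency is taken as the response time baseline.
    Returns the first concurrency where response time exceeds
    `rt_factor` times the baseline while CPU is below `cpu_target`, or None.
    """
    if not samples:
        return None
    ordered = sorted(samples)  # tuples sort by concurrency first
    baseline_rt = ordered[0][2]
    for concurrency, cpu_percent, avg_response_ms in ordered:
        if cpu_percent < cpu_target and avg_response_ms > rt_factor * baseline_rt:
            return concurrency
    return None
```

A non-None result suggests looking at databases, locks, or downstream calls before raising concurrency further.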

Evaluation of individual service rate limiting thresholds

1. Theoretical analysis

The overall rate limiting threshold of the service node protects the node's resources from overload to a certain extent. However, resources are limited and each node runs multiple services, which compete with one another for those resources. Service-level rate limiting must therefore also be considered, to prevent a few services from occupying too many resources. To limit the resources a service uses, its concurrency is the most effective rate limiting indicator.

Can the concurrency of each service at the critical point of the stress test in the previous section be used as that service's rate limiting threshold? In practice, the answer is no. The share of transaction volume of each service on the node changes constantly, so a threshold taken from a single traffic mix is too one-sided and lacks flexibility.

To evaluate the rate limiting threshold of each service, each service needs to be stress tested separately. First, evaluate the share of resources each service is allowed to use; this share is the resource usage target that defines the critical point in that service's stress test.

The resource share a service is allowed to occupy depends on its importance. For example, on a quick payment service node, the payment service is allowed to consume most of the node's resources, while query and maintenance services are allowed fewer resources.

2. Stress testing method

Set the maximum resource usage of each service according to its importance, and stress test each service separately. During each test, gradually increase request concurrency until server resource usage reaches the target level for that service (for example, CPU at about 50%). The service concurrency at that point can be used as the concurrency rate limiting threshold of that service, as shown in the figure below.

[Figure: resource usage versus request concurrency during the stress test of a single service]

3. Historical data analysis and prediction method

In practice, each service node hosts dozens to hundreds of services. Comprehensive stress testing of all of them costs too much manpower and time, so generally only a small number of key services can be stress tested.

However, the main goal of service rate limiting is to limit the concurrency of non-critical services and guarantee resources for key services. Setting rate limiting thresholds only for key services does not achieve this goal. Deriving service thresholds from stress tests therefore runs into a conflict between cost and benefit, and another approach is needed.

We extracted the maximum daily concurrency of each service over a past period from production monitoring data and analyzed it. The historical maximum concurrency turned out to be a useful reference for evaluating the service rate limiting threshold, but the following problems have to be solved:

Problem 1: Because the load balancing algorithm is random, the loads of different nodes in the same service cluster are not equal. The maximum concurrency therefore contains a degree of randomness and occasionally shows spikes.

In addition, some system failures slow down or freeze service processing, which also causes abnormal growth in service concurrency. None of this abnormal data should be used as a basis for evaluating the rate limiting threshold.

Problem 2: As the business is promoted, the transaction volume of a service keeps growing, and future maximum concurrency may quickly exceed the historical maximum.

To address these problems, the historical service concurrency data can be processed to make the rate limiting threshold more reasonable:

1. Denoise the historical data to eliminate abnormal data caused by randomness or faults

Because the promotional activities of some businesses are cyclical, such as the annual 618 and Double 11 shopping festivals, more than one year of historical data is needed as the basis for analysis. Assuming the data follow a normal distribution, the vast majority of the data (99.73%) lies within three standard deviations of the mean, so data points more than three standard deviations from the mean are removed as spikes.
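The three-standard-deviation rule can be sketched with the Python standard library as follows. The function name is hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the article does not say which estimator was used.

```python
from statistics import mean, pstdev


def remove_spikes(daily_max_concurrency, k=3.0):
    """Drop values more than `k` standard deviations from the mean."""
    mu = mean(daily_max_concurrency)
    sigma = pstdev(daily_max_concurrency)
    if sigma == 0:
        return list(daily_max_concurrency)  # all values equal, nothing to drop
    return [v for v in daily_max_concurrency if abs(v - mu) <= k * sigma]


# Made-up history: 30 ordinary days and one day inflated by a fault.
history = [100] * 30 + [1000]
print(remove_spikes(history) == [100] * 30)  # True
```

The rule needs a reasonably long history to work. With ten or fewer data points, no value can lie more than three population standard deviations from the mean, so nothing would ever be removed.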

2. Forecast the trend of the historical data to reserve reasonable room for future business growth

Considering the data volume and the computing cost, we chose a typical time series analysis model, the autoregressive moving average (ARMA) model. ARMA combines autoregression (AR) and moving average (MA), so its predictions reflect both the trend and the volatility of the time series. The figure below shows the trend and confidence interval of the concurrency of one service over the next three months, as predicted by the ARMA model (confidence level set to 0.99). The upper limit of the confidence interval can be used as the rate limiting threshold of the service.

[Figure: ARMA forecast and 0.99 confidence interval of a service's concurrency over the next three months]
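The article does not show how the ARMA forecast was computed. As a dependency-free illustration of the overall idea (fit a time series model, forecast ahead, and take the upper bound of the forecast interval as the threshold), the sketch below fits a much simpler AR(1) model by least squares. This is a stand-in, not ARMA: it has no moving average term, and the function names are hypothetical. A real implementation would more likely fit an ARMA model with a time series library.

```python
import math
from statistics import NormalDist, mean


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1] + e[t]; returns (c, phi, sigma)."""
    if len(series) < 4:
        raise ValueError("need at least 4 observations")
    prev, nxt = series[:-1], series[1:]
    mean_prev, mean_next = mean(prev), mean(nxt)
    sxx = sum((a - mean_prev) ** 2 for a in prev)
    sxy = sum((a - mean_prev) * (b - mean_next) for a, b in zip(prev, nxt))
    phi = sxy / sxx if sxx else 0.0
    c = mean_next - phi * mean_prev
    residuals = [b - (c + phi * a) for a, b in zip(prev, nxt)]
    # Residual standard deviation with two estimated parameters (c and phi).
    sigma = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 2))
    return c, phi, sigma


def forecast_upper_bounds(series, steps, confidence=0.99):
    """Upper bound of the two-sided forecast interval for the next `steps` points."""
    c, phi, sigma = fit_ar1(series)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 2.576 for 0.99
    bounds, point, var_factor = [], series[-1], 0.0
    for h in range(steps):
        point = c + phi * point        # recursive point forecast
        var_factor += phi ** (2 * h)   # forecast error variance grows with horizon
        bounds.append(point + z * sigma * math.sqrt(var_factor))
    return bounds
```

Under these assumptions, a suggested threshold for the forecast horizon would be `math.ceil(max(forecast_upper_bounds(cleaned_history, steps)))`, where `cleaned_history` is the denoised daily maximum concurrency.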

3. Practical situation

From the theoretical analysis and the solution discussion above, we conclude that the overall rate limiting threshold of a service node should be evaluated by stress testing, while the rate limiting threshold of a single service should be derived from historical concurrency data through denoising and trend prediction, which produces recommended values at low cost.

During the research project, we developed a traffic protection parameter management and control system. It obtains the maximum daily concurrency of each service over the past year from historical production monitoring data, applies the denoising and trend prediction described above to that data, and derives suggested values for the service rate limiting parameters. It also provides users with query and active push services.

[Figure: traffic protection parameter management and control system]

The traffic protection parameter management and control system has recommended rate limiting parameters for dozens of business systems. Users report that the recommended rate limiting thresholds are more reasonable, guard more effectively against the risks of sudden traffic increases, and help keep service nodes running stably.

Origin blog.csdn.net/LinkSLA/article/details/132335219