Proficient in distributed systems, but not in service governance?

At four o'clock in the morning, I was woken up by the company's monitoring alert: a batch task in the production environment had failed. I got up immediately to deal with the fault, but it still took quite a while to resolve. The failed job was a data-verification batch task that checks whether the data produced by the previous batch task is correct. Fortunately, the earlier core tasks had already completed, so the trading system in production was not affected. Why mention the trading system here? Because it is the entry point of the entire system's business flow; if the trading system fails, the company loses revenue directly. The topic of this article is service governance, and the ultimate goal of service governance is to keep the system providing "7 * 24" uninterrupted service.

1 Monitoring and Alerts

The company's production alert was very precise: it reached the system's direct maintainer and said exactly which batch task had failed. This alert is triggered by monitoring the task execution results in the batch-task middleware. In general, what types of alerts are there? Let's look at the figure below:

(Figure: common types of monitoring alerts)

1.1 Batch efficiency

In most cases, batch tasks do not block business traffic, so monitoring is optional. When a batch task does block business traffic, it must be monitored. Here are two business scenarios:

  • A domain name system needs to compare DNS records with database records to find dirty data and run transaction compensation; during this period, customers querying domain information may read dirty data
  • Banks do not allow real-time transactions while the day-end batch is running, which conflicts with "7 * 24" uninterrupted service

In these scenarios, batch processing efficiency is a very important monitoring indicator, and a timeout threshold must be configured and monitored.

1.2 Traffic Monitoring

Commonly used traffic metrics are shown below:

(Figure: commonly used traffic metrics)

We need to pay attention to a few points in traffic monitoring:

  • Different systems use different metrics. For Redis, QPS is a suitable metric; for a trading system, TPS is more appropriate
  • Configure appropriate monitoring thresholds based on testing and estimates of business volume
  • The thresholds need to account for bursts, such as flash sales and coupon-grabbing events

1.3 Exception Monitoring

Exception monitoring is very important. In a production environment it is hard to guarantee that a program never throws exceptions, so configuring reasonable exception alerts is crucial for locating and fixing problems quickly. For example, the batch alert mentioned at the beginning contained the exception information, which let me locate the problem quickly. Exception monitoring should cover the following:

  • Client read timeouts; when they occur, find the cause on the server side as soon as possible
  • A threshold on client response time, such as 1 second, with an alarm triggered when it is exceeded
  • Business exceptions, such as failure response codes

1.4 Resource Utilization

When configuring system resources in a production environment, you generally need a forecast of resource usage: for example, at the current growth rate, how long until Redis runs out of memory, or how long until the database runs out of disk. System resources need a usage threshold, for example 70%; an alarm is triggered when it is exceeded, because once resource usage approaches saturation, processing efficiency drops sharply. When setting the threshold, you must also account for sudden increases in traffic and business and reserve extra resources in advance. For core services, rate limiting should be in place to prevent a traffic surge from overwhelming the system.
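
As a rough illustration of this kind of forecast, here is a minimal sketch; the capacity, growth rate, and the 70% threshold are assumed example values, not measurements from the original text:

```java
public class ResourceForecast {
    public static void main(String[] args) {
        // Assumed example values: current Redis memory usage and observed daily growth
        double capacityGb = 64.0;        // total memory available
        double usedGb = 38.0;            // current usage
        double dailyGrowthGb = 1.5;      // observed average growth per day
        double alarmThreshold = 0.70;    // alarm when usage exceeds 70% of capacity

        double usageRatio = usedGb / capacityGb;
        double daysUntilAlarm = (capacityGb * alarmThreshold - usedGb) / dailyGrowthGb;
        double daysUntilFull = (capacityGb - usedGb) / dailyGrowthGb;

        System.out.printf("Current usage: %.0f%%%n", usageRatio * 100);
        System.out.printf("Days until the 70%% alarm threshold: %.1f%n", Math.max(daysUntilAlarm, 0));
        System.out.printf("Days until memory is exhausted: %.1f%n", daysUntilFull);
    }
}
```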

1.5 Request Latency

Request latency is not an easy metric to collect. The following figure shows an e-commerce shopping system:

(Figure: an e-commerce system in which a composite service calls the order, inventory, and account services)

In this diagram, we assume the composite service calls the underlying order, inventory, and account services concurrently. After the client sends a request, the composite service takes 2 seconds to process it and the account service takes 3 seconds, so the client's read timeout should be configured to at least 5 seconds. The monitoring system needs a threshold: for example, if more than 100 requests within 1 minute have a latency greater than 2 seconds, an alarm is triggered so that maintenance staff can investigate. The client-side read timeout should not be set too large either: if the latency is caused by a server fault, the client must fail fast, otherwise resources cannot be released and system performance degrades badly.
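
For the fail-fast point, here is a minimal sketch using Java's built-in HttpClient; the endpoint URL and the 2-second/5-second timeouts are illustrative assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class FailFastClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))   // fail fast if the connection cannot be established
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://composite-service.example.com/order"))  // hypothetical endpoint
                .timeout(Duration.ofSeconds(5))          // read timeout: give up if no response within 5 seconds
                .GET()
                .build();

        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("status = " + response.statusCode());
        } catch (HttpTimeoutException e) {
            // Fail fast: release the caller's resources and let monitoring report the latency problem.
            System.err.println("request timed out: " + e.getMessage());
        }
    }
}
```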

1.6 Monitoring Precautions

Monitoring exists so that system maintainers can discover production problems quickly and locate their cause. However, a monitoring system itself has several aspects to consider:

  • Choose the sampling frequency of the metrics according to the monitoring goal; an unnecessarily high frequency increases monitoring cost.
  • Monitoring coverage: ideally all core system metrics are covered.
  • Monitoring effectiveness: more metrics are not always better. Too many alarms add extra work to judge which ones matter and cause alert fatigue among developers.
  • Alarm timeliness: for non-real-time systems such as batch jobs, real-time alarms may not be needed. Record the event and raise the alarm later, for example at 8 o'clock in the morning, so the person responsible handles it after arriving at the office.
  • To avoid being misled by long-tail effects, it is best not to rely on averages. As shown below:

(Figure: long-tail latency distribution)

Out of 10 requests, 9 have a latency of 1 second and 1 has a latency of 10 seconds, so the average is not very meaningful. Instead, group requests by latency interval, for example the number of requests completed within 1 second, within 1-2 seconds, and within 2-3 seconds, and configure the monitoring thresholds so that the interval boundaries grow exponentially.
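
Here is a minimal sketch of this kind of interval grouping with exponentially growing bucket boundaries; the boundaries and the sample latencies are illustrative assumptions:

```java
import java.util.Arrays;

public class LatencyBuckets {
    public static void main(String[] args) {
        // Exponentially growing bucket upper bounds, in milliseconds: <=1s, <=2s, <=4s, <=8s, >8s
        long[] boundsMs = {1000, 2000, 4000, 8000};
        long[] counts = new long[boundsMs.length + 1];

        // Sample latencies: nine requests around 1 second and one 10-second outlier (the long tail)
        long[] latenciesMs = {900, 950, 1000, 980, 920, 990, 1000, 960, 940, 10000};

        for (long latency : latenciesMs) {
            int i = 0;
            while (i < boundsMs.length && latency > boundsMs[i]) {
                i++;
            }
            counts[i]++;
        }

        double avg = Arrays.stream(latenciesMs).average().orElse(0);
        System.out.printf("average = %.0f ms (hides the outlier)%n", avg);
        for (int i = 0; i < counts.length; i++) {
            String label = i < boundsMs.length
                    ? "<= " + boundsMs[i] + " ms"
                    : "> " + boundsMs[boundsMs.length - 1] + " ms";
            System.out.println(label + ": " + counts[i] + " requests");
        }
    }
}
```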

2 Fault management

2.1 Common fault causes

There are various reasons for failures, but the common ones are as follows:

  • Problems introduced by release upgrades
  • Hardware resource failure
  • System overload
  • Deliberate attacks
  • Failure of basic (infrastructure) services

2.2 Coping strategies

To deal with failures, we take two steps:

  • Resolve the fault immediately; for example, if it was caused by bad data, fix the problem data first.
  • Then find the root cause, using logs or the call-chain tracing system to locate the problem and fix it properly.

2.2.1 Software upgrade failure

Some faults caused by an upgrade are exposed soon after going live; others surface only after a long time, for example when some business code paths had never been executed before. The first case can be caught and resolved with a grayscale (canary) release. The second case is hard to avoid completely; we can only maximize test case coverage.

2.2.2 Hardware resource failure

These failures fall into two main categories:

  • Hardware resources are overloaded, such as insufficient memory
  • Hardware resource aging

For the first type, monitoring and alerting usually notify the person in charge, and the main remedies are adding resources and finding and optimizing the programs that consume the most. For the second type, operations staff need to record and monitor hardware resources and replace aging hardware in time.

2.3 System overload

System overload may be caused by a sudden traffic spike such as a flash sale, or by business growth gradually exceeding the system's capacity; it can be handled by adding resources or by limiting traffic.

2.4 Malicious attacks

There are many types of malicious attacks, such as DDoS attacks, malware, and browser-based attacks. There are also many ways to defend against them, such as encrypting request messages, introducing a professional network security firewall, running regular security scans, and deploying core services on non-default ports.

2.5 Basic software failure

As shown in the figure below, every component other than the business services is basic software, and its high availability needs to be considered.

(Figure: system architecture; every component other than the business services is basic software)

3 Release Management

A release usually refers to a software or hardware upgrade, including business system version upgrades, basic software upgrades, and hardware environment upgrades. From a programmer's perspective, the upgrades discussed in this article are upgrades of the business system.

3.1 Release process

In general, the business system upgrade process is as follows:

(Figure: business system release process)

The release is published to the production environment, and it is considered successful only after verification shows no problems.

3.2 Release quality

When upgrading software, release quality is very important. To ensure release quality, you need to pay attention to the following issues.

3.2.1 Checklist

To ensure release quality, maintain a checklist before each release and have the development team confirm every item on it. Only after the whole list has been confirmed should you build and release. Here are some typical items:

  • Are the SQL scripts to be executed online correct?
  • Are all configuration items in the production configuration file complete?
  • Have the externally dependent services already been released and verified?
  • Has routing access been opened for the new machines?
  • When multiple services are released, is the release order clear?
  • What is the plan if a failure occurs after going live?

3.2.2 Grayscale (Canary) Release

A grayscale release is a release method that transitions smoothly between "black" (the old version) and "white" (the new version). As shown below:

(Figure: grayscale / canary release)

The upgrade uses the canary deployment approach: first, one server is released and upgraded as the canary. After this server has run in production without problems, the remaining servers are upgraded; if a problem appears, roll back.

3.2.3 Blue-Green Deployment

The blue-green deployment method is as follows:

(Figure: blue-green deployment)

Before the upgrade, client requests are routed to the green service. After the upgraded release is deployed, requests are switched to the blue system through the load balancer, while the green system stays online for the time being. If production verification finds no problems, the green system is taken offline; otherwise traffic is switched back to it. The difference from canary deployment is that a canary release does not need extra machines, while blue-green deployment effectively adds a whole new set of machines and therefore costs additional resources.

3.2.4 A/B Testing

A/B testing means releasing multiple versions into the production environment at the same time, mainly to compare how the different versions perform. For example, the page styles or the operation flows differ, and users' preferences determine which version becomes the final one. As shown below:

(Figure: A/B test with three deployed versions)

Services of three colors are deployed, and each client request is routed to the service of the matching color. Unlike a grayscale release, every version in an A/B test has already been verified to work correctly.

3.2.5 Configuration Changes

We often write configuration into the code base, for example in a yaml file. With this approach, every configuration change requires releasing a new version. If the configuration changes frequently, the following two approaches can be considered:

  • Introduce a configuration center
  • Store the configuration in an external system

4 Capacity Management

Section 2.3 mentioned system failures caused by overload. Capacity management is an important part of keeping a system stable after it goes live: its job is to ensure the traffic never exceeds the threshold the system can withstand, so the system does not crash. In general, capacity overload has the following causes:

  • Continuous business growth brings more traffic to the system
  • System resources shrink, for example a new application is deployed on the same machine and takes some of the resources
  • Request processing slows down, for example a growing data volume makes the database respond slowly, so each request takes longer and resources cannot be released
  • The number of requests increases because of retries
  • A sudden traffic surge, such as the Weibo system when news about a celebrity divorce breaks

4.1 Retry

Retrying some failed requests can greatly improve the user experience. Retries generally fall into two categories: requests that failed with a connection timeout and requests that failed with a response timeout. A connection timeout is often caused by a transient network fault; retrying such a request puts no pressure on the server, because the failed request never reached it. But retrying a request that timed out waiting for a response may put extra pressure on the server. As shown below:

(Figure: retry amplification across the client, service A, and service B)

Under normal circumstances, the client calls service A, service A calls service B, and service B is called exactly once. Now suppose service B responds slowly and times out, the client is configured to retry 2 times on failure, and service A is also configured to retry 2 times on failure. If service B keeps failing to respond, it ends up being called 9 times (3 attempts from the client, each triggering 3 attempts from service A). In a large distributed system with a long call chain where every service is configured to retry, retries put huge pressure on the downstream services and can even crash the system. So more retries are not better; setting them sensibly protects the system.

For retries, there are the following 3 suggestions:

  • Non-core business should not retry; if it does, the number of retries must be limited
  • The retry interval should grow exponentially (see the sketch below)
  • Retry based on the returned failure status; for example, if the server returns a defined rejection code, the client should not retry
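
Here is a minimal sketch of a bounded retry with an exponentially growing interval; the attempt limit, base delay, and the hypothetical unreliableCall are illustrative assumptions, not part of the original article:

```java
import java.util.concurrent.Callable;

public class RetryWithBackoff {

    // Retries a call a limited number of times, doubling the wait between attempts.
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long baseDelayMs) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;  // retries are limited: give up instead of hammering the server
                }
                long delay = baseDelayMs * (1L << (attempt - 1));  // exponential backoff: 100 ms, 200 ms, 400 ms...
                Thread.sleep(delay);
            }
        }
        throw last;
    }

    // Hypothetical remote call that keeps failing; replace with a real client call.
    static String unreliableCall() throws Exception {
        throw new RuntimeException("service B timed out");
    }

    public static void main(String[] args) {
        try {
            String result = callWithRetry(RetryWithBackoff::unreliableCall, 3, 100);
            System.out.println(result);
        } catch (Exception e) {
            System.err.println("all attempts failed: " + e.getMessage());
        }
    }
}
```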

4.2 Burst traffic

It is difficult to plan ahead for a sudden traffic surge. When one happens, the first option is to add resources. For example, in K8S, if a Deployment currently has 2 pods, you can scale it out to 4 pods. The command is as follows:

kubectl scale deployment springboot-deployment --replicas=4

If resources are already exhausted, you have to consider throttling. Several rate-limiting frameworks are recommended (a usage sketch follows the list):

  • google guava
  • netflix/concurrency-limits
  • sentinel
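
As an illustration of the first option, here is a minimal usage sketch of Guava's RateLimiter (requires the com.google.guava:guava dependency); the rate of 100 permits per second is an assumed value:

```java
import com.google.common.util.concurrent.RateLimiter;

public class ThrottleExample {
    // Allow at most 100 requests per second; requests beyond that rate are rejected immediately.
    private static final RateLimiter limiter = RateLimiter.create(100.0);

    static String handleRequest(String payload) {
        if (!limiter.tryAcquire()) {
            // Shed load instead of letting the burst overwhelm the system.
            return "429 Too Many Requests";
        }
        return "processed: " + payload;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(handleRequest("order-" + i));
        }
    }
}
```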

4.3 Capacity Planning

It is very important to do capacity planning early in a system's construction. The system's QPS can be estimated from the expected business volume, and a stress test can then be run against that QPS. The capacity suggested by the stress test results may still not cope with real production scenarios and emergencies, so reserve extra resources on top of the estimate, for example 2x the estimated capacity.
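
A minimal arithmetic sketch of this kind of estimate; every number below is an assumed example value:

```java
public class CapacityPlan {
    public static void main(String[] args) {
        // Assumed example values
        double estimatedPeakQps = 3000;      // estimated from business volume
        double qpsPerInstance = 400;         // measured in the stress test
        double reserveFactor = 2.0;          // reserve 2x the estimated capacity for emergencies

        double targetQps = estimatedPeakQps * reserveFactor;
        int instances = (int) Math.ceil(targetQps / qpsPerInstance);

        System.out.printf("Plan for %.0f QPS -> %d instances%n", targetQps, instances);
    }
}
```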

4.4 Service downgrade

There are three ways for the server side to downgrade service:

  • After the server is overloaded, new requests are rejected directly
  • Non-core services are suspended so that resources are reserved for core services
  • The client performs its own downgrade based on the proportion of requests the server rejects (see the sketch below). For example, after observing for 1 minute, if the client sent 1000 requests and the server rejected 100 of them, the client can use this as a reference and directly reject anything beyond 900 requests per minute.
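
Here is a minimal sketch of the third idea, where the client tracks what the server accepted in the last window and rejects excess calls locally; the window length and limits are assumed example values, and a real implementation would rotate the window on a timer:

```java
import java.util.concurrent.atomic.AtomicLong;

public class ClientSideDowngrade {
    // Counters for the current one-minute observation window (simplified: window rotation not shown).
    private final AtomicLong sent = new AtomicLong();
    private final AtomicLong rejectedByServer = new AtomicLong();

    // Budget derived from the previous window, e.g. 1000 sent - 100 rejected = 900 accepted per minute.
    private volatile long acceptedLastWindow = 900;

    boolean allowRequest() {
        // If we are already above what the server accepted last minute, reject locally
        // so the extra load never reaches the overloaded server.
        return sent.get() < acceptedLastWindow;
    }

    void recordResult(boolean rejected) {
        sent.incrementAndGet();
        if (rejected) {
            rejectedByServer.incrementAndGet();
        }
    }

    void rollWindow() {
        // At the end of each minute, use the observed acceptance as next minute's budget.
        acceptedLastWindow = Math.max(1, sent.get() - rejectedByServer.get());
        sent.set(0);
        rejectedByServer.set(0);
    }

    public static void main(String[] args) {
        ClientSideDowngrade guard = new ClientSideDowngrade();
        if (guard.allowRequest()) {
            // ... send the request, then record whether the server rejected it
            guard.recordResult(false);
        }
    }
}
```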

5 Summary

The microservice architecture brings many benefits, but it also brings technical challenges: service registration and discovery, load balancing, monitoring and management, release upgrades, access control, and so on. Service governance is about managing and preventing these problems so the system keeps running smoothly. The service governance approaches described in this article are the traditional ones; they sometimes intrude into the code, and the choice of framework can also constrain the programming language. In the cloud-native era, the emergence of Service Mesh has taken service governance into a new stage. For Service Mesh, see "Those things about Service Mesh".


Original article: blog.csdn.net/qq_28165595/article/details/131992835