Spring Cloud service avalanche caused by calling an upstream interface without a timeout

[Performance optimization method] Spring Cloud parameter optimization for tens of thousands of concurrent requests per second
The timeout scenario in that article is similar to mine.

Upstream and downstream in microservice architecture

The terms upstream and downstream are also used for services in systems built from microservices (or just old-fashioned distributed services).
The dependency rule and the value rule also apply in this case.

Service B is upstream because Service A depends on it. Service A is a downstream service because it adds value to Service B.

Note that the "flow" in the definition of upstream and downstream in this case is not the flow of data into the system through service A, but the flow of data through the system all the way to the user-facing service.

The closer a service is to the user (or any other end consumer), the further downstream it is.

Background

A few months ago I developed the XX fee balance query interface, which calls an interface provided by another party. It ran for several months without any problems. Then one night, the home page of our official account mini program suddenly became very slow. Investigation showed that the provider of the query interface had updated their service, which made the query very slow. On top of that, our request did not set a timeout, so every call kept waiting for data to come back. Connections piled up and the threads in the thread pool were exhausted, so other incoming requests that needed to call other services got stuck as well.

For example, after a user pays successfully for a phone bill top-up and the top-up completes, the user's accumulated orders, coupon status, and distributor commission all have to be updated. These operations call other services. Because no threads were left, those calls timed out or blocked, and OpenFeign then threw a timeout exception. Since there was no circuit breaker or anything similar, execution simply stopped at that point and left the business flow in an abnormal state. That is when the avalanche happened and the entire service call chain went down.
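The thread exhaustion described above can be reproduced in miniature with a plain JDK thread pool. This is only a sketch under stated assumptions: the class and method names are hypothetical, the slow upstream call is simulated with Thread.sleep, and a real Spring Cloud service would be using its web server's or client's thread pool rather than one created by hand.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; not part of Spring Cloud or OpenFeign.
class PoolExhaustionDemo {

    /**
     * Fills every worker thread with a simulated slow upstream call, then submits
     * one fast task. Returns true if the fast task could not finish within
     * waitMillis because it was queued behind the slow calls.
     */
    static boolean fastTaskStarved(int poolSize, long slowCallMillis, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            for (int i = 0; i < poolSize; i++) {
                pool.submit(() -> {
                    try {
                        Thread.sleep(slowCallMillis); // stands in for a call with no timeout
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // An unrelated, cheap piece of work (for example, updating a coupon status).
            Future<String> fast = pool.submit(() -> "coupon status updated");
            try {
                fast.get(waitMillis, TimeUnit.MILLISECONDS);
                return false; // a worker became free in time
            } catch (TimeoutException e) {
                return true;  // still queued: every worker is busy waiting on the slow call
            } catch (ExecutionException e) {
                throw new IllegalStateException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        } finally {
            pool.shutdownNow(); // interrupt the sleeping workers so the demo exits
        }
    }
}
```

Calling fastTaskStarved(2, 2000, 200) fills both workers with 2-second calls, so the fast task is still queued after 200 ms. With short simulated calls, such as fastTaskStarved(2, 50, 2000), the fast task gets a worker in time.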

Solution

Any request to an external interface must set a timeout; otherwise the service will collapse once the request volume is high enough. In addition, where possible, add a circuit breaker such as Sentinel so that the service can be degraded or circuit-broken.
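As an illustration of adding a timeout to an outbound call, here is a minimal sketch using the JDK's built-in java.net.http.HttpClient rather than OpenFeign (OpenFeign exposes comparable connect and read timeout settings per client). The class name, the fallback text, and the 2-second connect timeout are assumptions made for the example, not values from the original incident.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical client wrapper for illustration only.
class BalanceQueryClient {

    static final String FALLBACK = "balance temporarily unavailable";

    // Bound the time spent establishing a connection (assumed value).
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    /** Calls the external interface, giving up after requestTimeout instead of waiting forever. */
    static String queryBalance(String url, Duration requestTimeout) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(requestTimeout) // bound the wait for the response
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 ? response.body() : FALLBACK;
        } catch (IOException e) {
            // HttpTimeoutException is a subclass of IOException, so timeouts land here too.
            return FALLBACK;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return FALLBACK;
        }
    }
}
```

With this in place, a slow provider costs each calling thread at most the configured timeout, after which the thread is released and the caller gets the fallback value.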

Circuit breaking: skip the remote call and go straight to a local fast-failure method.
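The fast-failure idea can be sketched with a small hand-rolled circuit breaker in plain Java. This is not Sentinel's API; the class name, the consecutive-failure threshold, and the cool-down period are all assumptions chosen for illustration. It is also deliberately simplified: after the cool-down, several concurrent callers may get through as trial calls.

```java
import java.util.function.Supplier;

// Hand-rolled sketch for illustration; not Sentinel's API.
class SimpleCircuitBreaker {
    private final int failureThreshold; // consecutive failures before the breaker opens
    private final long openNanos;       // how long to stay open before allowing a trial call
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openNanos = openMillis * 1_000_000L;
    }

    /** Runs remoteCall unless the breaker is open; otherwise returns the local fallback at once. */
    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // fail fast: the remote service is not contacted
        }
        try {
            T result = remoteCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized boolean allowRequest() {
        if (!open) {
            return true;
        }
        if (System.nanoTime() - openedAt >= openNanos) {
            open = false;                               // cool-down elapsed: allow a trial call
            consecutiveFailures = failureThreshold - 1; // one more failure re-opens immediately
            return true;
        }
        return false;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = System.nanoTime();
        }
    }
}
```

A caller would wrap the remote request, for example breaker.call(() -> remoteQuery(), () -> "balance temporarily unavailable"), where remoteQuery is a placeholder for the actual call. Once the threshold is reached, later calls return the fallback without touching the remote service until the cool-down has passed.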
Degradation:
1. Refuse service
Determine the source of each request and, during peak hours, reject requests from low-priority applications so that core applications keep running normally. You can also reject requests at random and immediately return a "server busy" response to avoid too many simultaneous requests; this is especially common in e-commerce flash sales.
2. Shut down services
During a peak period, you can shut down unpopular or non-essential services to free up resources for core services. For example, during Double 11 every year, Taobao shuts down services unrelated to the core ordering flow, such as reviews and delivery confirmation, so that users can place orders and pay normally. Requests to the services that were shut down basically just get a "server busy" response.
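Both degradation approaches can be sketched as a small gate placed in front of request handlers. The sketch below is an assumption-laden illustration, not a Sentinel or Spring Cloud API: the class name, the two priority levels, the feature names, and the "server busy" text are all hypothetical, and the random source is injected so that the behavior is easy to test.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.DoubleSupplier;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not a Sentinel or Spring Cloud API.
class DegradationGate {
    enum Priority { CORE, LOW }

    static final String BUSY = "server busy, please try again later";

    private final double randomRejectRatio; // share of requests dropped at random during peaks
    private final DoubleSupplier random;    // must return values in [0, 1); injectable for tests
    private final Set<String> disabledFeatures = ConcurrentHashMap.newKeySet();
    private volatile boolean peak = false;

    DegradationGate(double randomRejectRatio, DoubleSupplier random) {
        this.randomRejectRatio = randomRejectRatio;
        this.random = random;
    }

    void setPeak(boolean peak) {
        this.peak = peak;
    }

    /** Shuts down a non-core feature until further notice. */
    void disable(String feature) {
        disabledFeatures.add(feature);
    }

    /** Runs the handler only if the request passes all degradation rules. */
    String handle(String feature, Priority priority, Supplier<String> handler) {
        if (disabledFeatures.contains(feature)) {
            return BUSY; // the feature has been shut down
        }
        if (peak && priority == Priority.LOW) {
            return BUSY; // low-priority callers are refused during peaks
        }
        if (peak && random.getAsDouble() < randomRejectRatio) {
            return BUSY; // random rejection to shed load
        }
        return handler.get();
    }
}
```

In production code the random source could be () -> ThreadLocalRandom.current().nextDouble(), and the peak flag and disabled feature list would typically come from a configuration center so they can be changed without a redeploy.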

Knowledge points:
What is a service circuit breaker?


Origin blog.csdn.net/Fire_Sky_Ho/article/details/126412735