A postmortem: how a sudden burst of HTTP 499 status codes emerged and how it was handled

Table of Contents

Problem Description

Problem Analysis

Problem Resolution

Concluding thoughts


Problem Description

1 19:38 The DBA noticed that DTS replication for the encryption and decryption database was lagging. Database monitoring showed a large volume of writes in the ALI data center.

2 19:41 The architecture team checked CAT monitoring and found more than 20,000 responses with status code 499 in total across several services, including the encryption and decryption service and the message center service.

3 19:43 DTS for the encryption and decryption database returned to normal, and the heavy writes in the ALI data center stopped.

4 20:00 The encryption and decryption service developers found that most of the traffic came from the message center service and immediately contacted the owner of that service to understand the business context.

5 20:27 By analyzing business logs, the encryption and decryption service developers found that none of the data being encrypted were mobile phone numbers, and immediately passed this on to the message center service owner.

6 20:37 The message center service owner contacted the sales staff. The cause: an operator had submitted 500,000 studentId values as mobile phone numbers when sending a push.

Terminology:

DTS: Data Transmission Service, a real-time data streaming service provided by Alibaba Cloud.

499 status code: the client sends a request to the server, the server takes longer to process it than the client's timeout allows, and the client closes the connection itself. 499 is a non-standard code; nginx, for example, logs it when the client closes the connection before the response is sent.

Why is DTS needed? The encryption and decryption service runs in two data centers, and DTS synchronizes data between them.
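To make the 499 definition concrete, here is a minimal in-process simulation, not the services from the incident. It assumes a hypothetical handler that takes longer than the client's timeout; the names `slow_handler`, `call_with_timeout`, `simulate`, and both timing constants are invented for illustration. The function returns the status a proxy in front of the service would typically log.

```python
import asyncio

CLIENT_TIMEOUT_S = 0.05  # hypothetical client-side timeout
SERVER_WORK_S = 0.5      # hypothetical (too slow) server processing time


async def slow_handler() -> int:
    """Stand-in for a server that needs SERVER_WORK_S to answer."""
    await asyncio.sleep(SERVER_WORK_S)
    return 200


async def call_with_timeout(timeout_s: float) -> int:
    """Return the status that would be logged for this request."""
    try:
        return await asyncio.wait_for(slow_handler(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The client gave up before the server answered: logged as 499.
        return 499


def simulate(timeout_s: float = CLIENT_TIMEOUT_S) -> int:
    return asyncio.run(call_with_timeout(timeout_s))
```

With the default timeout the client gives up first and the result is 499; with a timeout longer than the processing time the request completes with 200.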

Problem Analysis

Based on the timeline above, we analyzed the incident as follows:

1 When DTS for the encryption and decryption database was heavily delayed, why did R&D not notice? Because the DTS delay alarm was not routed to R&D.

2 Why was DTS for the encryption and decryption database heavily delayed? Because the ALI data center of that database was receiving a large volume of writes.

3 Why was the ALI data center of the encryption and decryption database receiving so many writes? The requests came from the message center service, but the message center developers confirmed that their traffic volume was the same as usual.

4 If the message center traffic was the same as usual, why did the encryption and decryption service write so much? This time the traffic consisted of studentId values that had never been encrypted before, so encrypting them produced a large number of database writes.

5 Normally mobile phone numbers are encrypted, so why were studentId values encrypted this time? An operator made a mistake and submitted studentId values as phone numbers.

6 Even if the operator made a mistake, why could studentId values be sent as phone numbers at all? The message center did not validate the phone number parameter.

7 How did the large volume of writes in the ALI data center of the encryption and decryption database affect the service?

The writes exhausted the database connections, so requests to the encryption and decryption service piled up and responses slowed down. Clients then disconnected on their own, producing a large number of 499 errors, and the service was temporarily unavailable.

8 Did the temporary unavailability of the encryption and decryption service affect the PX core path?

The PX core path includes login, and login needs to encrypt the phone number. Users who had already entered the classroom were not affected; users who tried to log in during the incident found login temporarily unavailable.

9 The encryption and decryption service returned a large number of 499 status codes, so why did R&D not notice? No 499 monitoring had been set up for R&D.
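The missing parameter check from point 6 can be sketched as follows. This is an assumption-laden illustration, not the message center's actual code: it assumes mainland China mobile numbers (11 digits, starting with 1, second digit 3 to 9), and the names `MOBILE_RE`, `is_mobile_number`, and `split_batch` are hypothetical.

```python
import re

# Assumption: 11-digit mainland China mobile numbers, e.g. 13800138000.
MOBILE_RE = re.compile(r"1[3-9][0-9]{9}")


def is_mobile_number(value: str) -> bool:
    """True only if the whole string looks like a mobile number."""
    return MOBILE_RE.fullmatch(value) is not None


def split_batch(values):
    """Separate a push batch into valid numbers and rejected entries."""
    valid, rejected = [], []
    for raw in values:
        candidate = raw.strip()
        if is_mobile_number(candidate):
            valid.append(candidate)
        else:
            rejected.append(candidate)
    return valid, rejected
```

Rejecting the invalid entries before they reach the encryption and decryption service means a mistaken batch of IDs fails fast at the caller instead of turning into database writes downstream.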

Problem Resolution

1 Route the DTS delay alarm to R&D.

2 Add a 499 status code alarm for R&D.

3 Integrate a rate-limiting component into the encryption and decryption service and apply rate limiting to the message center service.

4 Validate the parameters to be encrypted in both the front end and the back end of the operations configuration console, to prevent operator mistakes.
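The rate limiting in item 3 could look roughly like the token bucket below. The post does not say which rate-limiting component was adopted, so this is only a sketch of the general technique; `TokenBucket`, `PerCallerLimiter`, and the caller names are hypothetical, and the classes are not thread-safe as written.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_s` tokens/second."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerCallerLimiter:
    """One bucket per upstream caller, so one noisy caller cannot starve others."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self._args = (rate_per_s, capacity, clock)
        self._buckets = {}

    def allow(self, caller: str) -> bool:
        bucket = self._buckets.get(caller)
        if bucket is None:
            bucket = self._buckets[caller] = TokenBucket(*self._args)
        return bucket.allow()
```

Keeping a separate bucket per caller is the design choice that matters here: a flood from one upstream (such as a bulk push) is rejected early, while requests from other callers (such as login) still get through.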

Concluding thoughts

1 As an underlying service, the encryption and decryption service must keep strengthening its stability and availability. Upstream callers can never be fully trusted.

2 When developing a system, give more thought to its robustness, because you never know how people will actually use it.

3 For non-specialist users, the simpler and more foolproof the system, the better. Ideally users should not have to think at all, which avoids many problems.


Origin blog.csdn.net/jack1liu/article/details/112135898