Background
Hello, my name is Tong.
I was heading home from work last night when, on the subway, the boss suddenly called: the production environment of system B was responding slowly and it was affecting system A. Tens of thousands of couriers couldn't take orders, and about 300,000 orders were stuck. Go help locate the problem.
I got home around 8:30, got online immediately, and joined the call.
Restart
When I joined the call, colleagues were already helping to locate the problem. As the saying goes, a restart solves 80% of problems, and if a restart doesn't solve it, you haven't restarted enough times. Bah, no: restarting didn't solve it, so the problem still had to be located.
Facts proved it: even after the restart, a round of stress testing was useless. With 1000 concurrent users, the average response time was 3 to 4 seconds, and that was the result across several consecutive stress tests.
Upgrade configuration
The restart seemed to have no effect, so we entered the second stage: upgrade the configuration. Two 4-core 8G instances were upgraded to six 8-core 16G instances, and the database configuration was doubled as well. For problems that can be solved with money, we generally don't invest too much manpower ^^
Facts proved that adding configuration was also useless: at 1000 concurrency, the average response time of the stress test was still 3~4 seconds.
Interesting.
At this point, I, Brother Tong, stepped in.
View monitoring
After I got online, I checked the monitoring: the instance's CPU, memory, disk, network IO, and JVM heap memory usage all looked fine. This was a real headache.
Local stress test
We split into two groups: one prepared a local stress test, the other continued the analysis. The local stress test showed that locally, on a single machine with 1000 concurrency, everything was fine, no problems at all, with the average response time holding at a few hundred milliseconds.
It seemed that there really was no problem with the service itself.
Code walkthrough
With no other options left, we pulled up the code, and a group of us read through it together while the R&D colleague explained the business logic. Of course, he had already been roasted by the bosses for the broken code he wrote. In fact, before I stepped in, they had already changed one piece of code: a place where the redis command scan had been changed to keys *. There is a pitfall buried here, but it isn't the main problem yet; we'll come back to it later.
Reading through the code, I found a lot of redis operations, including a for loop that calls the redis get command on every iteration. Nothing else looked wrong, so the main problem was probably still concentrated in the redis part: the calls were simply too frequent.
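For illustration only, this is roughly the shape of that pattern; the class name, key prefix, and host below are made up and not the actual code:

```java
import java.util.ArrayList;
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

// Roughly the pattern described above (all names are made up):
// one redis GET per item inside a for loop, so a single request
// fans out into many network round trips.
public class OrderCache {

    private final JedisPool jedisPool = new JedisPool("redis-host", 6379); // placeholder host

    public List<String> loadOrders(List<String> orderIds) {
        List<String> results = new ArrayList<>();
        try (Jedis jedis = jedisPool.getResource()) {
            for (String orderId : orderIds) {
                String json = jedis.get("order:" + orderId); // one round trip per key
                if (json != null) {
                    results.add(json);
                }
            }
        }
        return results;
    }
}
```

Under 1000 concurrent requests, every extra round trip in a loop like this multiplies directly into load on redis.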
Add logging
After going through the code, there was basically nothing wrong apart from scan having been changed to keys * (which I still didn't know about at the time), so we added logs, section by section. OK, restart the service, run another round of stress testing.
Of course, the results did not change, so we analyzed the logs.
From the logs we found that calls to redis were sometimes very fast and sometimes very slow. It looked like the connection pool was too small: one batch of requests went through first while another batch waited for an idle redis connection.
Modify the number of redis connections
We checked the redis setup: standalone mode, 1G memory, and a default connection pool size of 8. The client was still the relatively old jedis, so we decisively switched to lettuce, Spring Boot's default, bumped the connection count to 50 first, restarted the service, and ran a round of stress tests.
The average response time dropped from 3~4 seconds to 2~3 seconds, which isn't much of an improvement. We kept increasing the connection count: with 1000 concurrent requests, each making many redis calls, there was bound to be waiting. This time we pushed the connection count straight to 1000, restarted the service, and stress tested again.
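As a reference point, here is a minimal sketch of raising the lettuce pool size in a Spring Boot service with spring-data-redis and commons-pool2; the bean layout, host, and idle settings are assumptions, not our actual configuration (depending on the Spring Boot version, the same knobs are also exposed as spring.redis.lettuce.pool.* properties):

```java
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettucePoolingClientConfiguration;

// A sketch of a pooled lettuce connection factory with the pool size raised
// to 1000 (values and host are placeholders).
@Configuration
public class RedisPoolConfig {

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        GenericObjectPoolConfig<?> poolConfig = new GenericObjectPoolConfig<>();
        poolConfig.setMaxTotal(1000); // raised from the commons-pool2 default of 8
        poolConfig.setMaxIdle(200);
        poolConfig.setMinIdle(50);

        LettucePoolingClientConfiguration clientConfig =
                LettucePoolingClientConfiguration.builder()
                        .poolConfig(poolConfig)
                        .build();

        RedisStandaloneConfiguration serverConfig =
                new RedisStandaloneConfiguration("redis-host", 6379); // placeholder host

        return new LettuceConnectionFactory(serverConfig, clientConfig);
    }
}
```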
It turned out there was still no significant improvement.
Check the log again
At this point there was no better idea, so we went back to the logs and checked the timing of the redis-related operations. We found that 99% of the get operations returned quickly, basically within 0~5 milliseconds, but there were always a few that only returned after 800~900 milliseconds.
So we thought there was nothing wrong with redis itself.
However, after several more rounds of stress testing, the response time never came down.
Feeling rather helpless, and with it already past 3:00 in the morning, the leader spoke up: get the HUAWEI CLOUD people on the line.
HUAWEI CLOUD troubleshooting
Finally we pulled in the relevant HUAWEI CLOUD people to investigate together. Of course they were reluctant, but who told us to be paying customers ^^
The HUAWEI CLOUD contact brought in their redis experts to help us check the redis metrics, and it finally turned out that the redis bandwidth was maxed out, which had triggered its throttling mechanism.
They temporarily tripled the redis bandwidth for us, and we ran another round of stress tests.
Holy crap, the average response time dropped straight to 200~300 milliseconds!!!
This really is a trap: you throttle the traffic, but you don't raise any alarm when the bandwidth is maxed out...
That's a real pain in the ass.
At this point we thought the problem was solved, and the leaders all went off to sleep~~
On production
Now that the cause was found, on to production for a round of stress testing~
We asked the HUAWEI CLOUD experts to triple the production bandwidth as well.
We pulled a hotfix branch from the production commit, disabled signature verification, restarted the service, and ran a round of stress tests.
It was over: the production environment was even worse, with an average response time of 5~6 seconds.
In the test environment we had changed the connection pool configuration, but production was still on jedis, so we changed that too and ran another round.
It made no practical difference; still 5~6 seconds.
What a pain in the ass.
View monitoring
Looking at the redis monitoring in HUAWEI CLOUD, the bandwidth and flow control were normal this time.
This time the anomaly was the CPU: under the stress test, redis's CPU shot straight up to 100%, which made the application respond slowly.
Wake up the HUAWEI CLOUD redis experts again
It was past four in the morning and everyone was out of ideas. HUAWEI CLOUD redis experts, time to wake up again!
We woke the HUAWEI CLOUD redis experts up again and had them help us analyze from the backend, and they found that 140,000 scans had been executed within 10 minutes~~
Evil scan
I asked the R&D staff where scan was used (this was the place they had changed earlier, which I didn't know about), and found that every request calls scan to fetch the keys starting with a certain prefix, 1000 entries per scan. We checked the total number of redis keys: about 110,000. In other words, a single request has to scan about 100 times; at 1000 concurrency, that's on the order of 100,000 scans. And we know that in redis both scan and keys * end up traversing the entire keyspace, which eats a lot of CPU, so 140,000 scan operations sent the CPU straight through the roof.
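To make that arithmetic concrete, here is a sketch of what each request was effectively doing, written against the Jedis API purely for illustration (the prefix and names are assumptions): a cursor loop that keeps calling scan until the cursor comes back to 0, so with roughly 110,000 keys and COUNT 1000, every request issues 100+ scan commands.

```java
import java.util.ArrayList;
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;

// A sketch of the per-request scan loop (Jedis 3.x style imports; prefix is a
// placeholder). SCAN walks the whole keyspace in COUNT-sized steps, so the
// number of calls grows with the total number of keys, not with the number
// of matches.
public class PrefixScanner {

    public List<String> keysWithPrefix(Jedis jedis, String prefix) {
        List<String> keys = new ArrayList<>();
        ScanParams params = new ScanParams().match(prefix + "*").count(1000);
        String cursor = ScanParams.SCAN_POINTER_START; // "0"
        do {
            ScanResult<String> result = jedis.scan(cursor, params);
            keys.addAll(result.getResult());
            cursor = result.getCursor();
        } while (!ScanParams.SCAN_POINTER_START.equals(cursor)); // done when the cursor returns to 0
        return keys;
    }
}
```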
Why didn't the CPU spike in the test environment?
Comparing the total number of redis keys between the test and production environments: the test environment has only about 900 keys, so each request scans (or runs keys *) just once, and nothing at all goes wrong.
Why does the production environment have so many keys?
We asked the R&D staff why there were so many keys in production and why no expiration time had been set.
The R&D staff said an expiration time was set, in code written by another colleague. When we opened that code, it was truly magical: it's not convenient for me to post it here, but in every case the expiration time failed to actually get set.
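For contrast, a minimal sketch (key name, value, and TTL are assumptions, not the actual fix) of writing the value and its expiration in a single command, so the key can never end up without a TTL:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// SET key value EX 3600 in one command; the key and TTL are placeholders.
// jedis.setex(key, 3600, value) would achieve the same thing.
public class CacheWriter {

    public void putWithTtl(Jedis jedis, String key, String value) {
        jedis.set(key, value, SetParams.setParams().ex(3600));
    }
}
```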
Current workaround
By this time it was already 4:30 in the morning. Although everyone was still quite excited, after the leadership's decision we left it alone for the time being: system A had suspended its calls to system B, so system B's traffic was essentially 0, and we would fix the problem during the day, in two steps.
The first step: clean up the redis data in the production environment, keeping only the small portion of necessary data.
The second step: change the data under that scanned prefix to hash storage, which narrows the scope of the scan, roughly as sketched below.
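A rough sketch of that second step, assuming everything under the scanned prefix moves into a single hash with a placeholder name; this illustrates the idea rather than the actual implementation:

```java
import java.util.Map;

import redis.clients.jedis.Jedis;

// Entries that used to be individual "prefix:<id>" string keys become fields
// of one hash, so a request reads a single key instead of SCANning the whole
// keyspace (hash name and field layout are placeholders).
public class PrefixHashStore {

    private static final String HASH_KEY = "prefix:index"; // placeholder hash name

    public void put(Jedis jedis, String id, String value) {
        jedis.hset(HASH_KEY, id, value);
    }

    public Map<String, String> loadAll(Jedis jedis) {
        // HGETALL touches only this hash; HSCAN can be used if it grows large
        return jedis.hgetAll(HASH_KEY);
    }
}
```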
Well, that's the end of this production incident investigation; I'll keep following up on it.
Summary
This production incident was a bit different from the incidents I've run into before, and can be summarized as follows:
- In the past it was always the application's own CPU, memory, disk, or JVM; this was the first time I'd hit redis bandwidth exhaustion and throttling;
- Since moving to HUAWEI CLOUD, there is still a lot we haven't mastered, including the monitoring metrics; we'll have to explore them gradually;
- The keys and scan commands must be disabled in redis, and the vast majority of keys should have an expiration time set!
Well, that's about all I have to write about this incident. I'll keep following up if anything new comes up; of course, it would be best if nothing new comes up ^^