[Outage Announcement] Releasing the .NET Core version of the blog site caused a large number of 500 errors

We are very sorry for the trouble caused by this morning's blog site outage; please bear with us. The outage was caused by our release of the .NET Core version of the blog site. Although we had prepared thoroughly, we still underestimated the complexity of the problems that arise under high concurrency.

Here is the background and a rough account of what happened:

This summer we have been working flat out to wrap up our entire .NET Core migration project by releasing the .NET Core versions of the blog site and the blog backend. Our other systems had already been migrated to .NET Core and had been running in production for some time, leaving only the toughest nut to crack: the blog system. This month we had nearly finished cracking it. Bringing it online would put a full stop to our whole .NET Core migration project, and we wanted to pass this milestone in time to welcome the official release of .NET Core 3.0.

Releasing the .NET Core versions of the blog site and the blog backend therefore became our most important job in August. Development of the .NET Core version of the blog site was completed in July. Since then we have been running a further closed beta while using a gray release to route part of the production traffic to it, in order to find and fix problems that our own tests had missed. After routing more production traffic to it last weekend for testing and fixes, we felt confident and judged that it was ready for an official release. The one gap was that we could not simulate, in a test environment, the complex high-concurrency scenarios the blog system faces.

So, with confidence on the one hand and worries about high concurrency on the other, we decided to release early this morning.

The deployment at release time was as follows. The blog system was based on .NET Core 3.0 Preview 7 (EF Core was still on 3.0 Preview 5). Seven Alibaba Cloud CentOS servers formed a docker swarm cluster: six 4-core 8 GB servers ran as worker nodes hosting the blog site application containers, and one 2-core 4 GB server served as the manager node (with no containers deployed on it). Each worker node ran one nginx container and one .NET Core blog application container. All requests were forwarded by Alibaba Cloud load balancing to the nginx containers, which in turn forwarded them to the .NET Core application containers. The nginx containers listened on port 80 of the worker node servers through port mapping.

This deployment setup had long been proven in our own use; the only thing it had not been proven on was a system with concurrency as high as the blog's.

Carrying the risk of high-concurrency problems on two fronts (docker swarm and .NET Core), we released at around 5:30 this morning.

At first traffic was light and concurrency was low, and there were no problems. Around 8:30 the problem appeared: many blog pages took more than a second to open (normally a few tens of milliseconds), while the same request made with curl inside the container took less than 10 ms.

$ docker exec -t $(docker ps -f name=blog_web -q) curl -H 'X-Forwarded-Proto:https' -w %{time_total} -o /dev/null -s localhost 
0.002876
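A single timing like the one above can be misleading under load, so it may help to repeat the measurement and look at the spread. The following is a minimal sketch, not the procedure used during the incident: `summarize_times` is a hypothetical helper that reads one `time_total` value per line and reports the minimum, average, and maximum in milliseconds.

```shell
# summarize_times (hypothetical helper): read one curl time_total value
# (in seconds) per line on stdin and print count, min, avg, max in ms.
summarize_times() {
  awk '
    NF == 0 { next }                      # skip blank lines
    {
      v = $1 + 0                          # force numeric interpretation
      if (n == 0) { min = v; max = v }
      n++; sum += v
      if (v < min) min = v
      if (v > max) max = v
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "n=%d min=%.1fms avg=%.1fms max=%.1fms\n", n, min * 1000, sum / n * 1000, max * 1000
    }'
}

# Possible usage, assuming the same container name filter as above
# (docker exec is run without -t so the output has no carriage returns):
#
#   for i in 1 2 3 4 5 6 7 8 9 10; do
#     docker exec "$(docker ps -f name=blog_web -q)" \
#       curl -H 'X-Forwarded-Proto:https' -w '%{time_total}\n' -o /dev/null -s localhost
#   done | summarize_times
```

Running the same loop from outside the container, against the published port 80, would give a comparable figure for the path that goes through the port mapping and nginx.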

We suspected nginx and prepared to rebuild the docker cluster without nginx, letting Kestrel listen on port 80 directly.

Later a colleague pointed out that nginx was not the problem: it was a performance problem with docker swarm port mapping under high concurrency, and the only fix was to switch the port mapping to host network mode.
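For readers unfamiliar with the distinction, docker swarm publishes ports through its ingress routing mesh by default, and `mode=host` in `--publish` binds the port directly on each node instead. The sketch below shows what such a service definition might look like. The service name `blog_nginx` and the image `nginx:stable` are placeholders, not the names used in the actual cluster, and `DRY_RUN` is a convention of this sketch that prints the command rather than running it.

```shell
# run (hypothetical helper): print the command when DRY_RUN=1 (the default),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create one nginx task per worker node and publish port 80 in host mode,
# bypassing the swarm ingress routing mesh.
# "blog_nginx" and "nginx:stable" are placeholder names (assumptions).
publish_nginx_host_mode() {
  run docker service create \
    --name blog_nginx \
    --mode global \
    --constraint node.role==worker \
    --publish mode=host,target=80,published=80 \
    nginx:stable
}

# Possible usage:
#   publish_nginx_host_mode             # prints the command only
#   DRY_RUN=0 publish_nginx_host_mode   # actually creates the service
```

With host-mode publishing only one task per node can bind a given port, which is why the sketch pairs it with `--mode global`.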

Around 9:30, as concurrency rose further, the nginx containers started returning 500 errors. We thought the load on the servers in the cluster was too high and added servers to the docker swarm cluster, but that did not help, and the 500 errors kept increasing.

With the 500 errors, a refresh sometimes fixed the page and sometimes it took several refreshes. We suspected that some server in the cluster was unstable, so we logged in to the cluster servers one by one and tested with curl from inside the container. Apart from one unstable server, curl response times on the others were normal. We took the unstable server offline, but the problem persisted, and as concurrency kept growing the 500 errors kept growing too.
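Intermittent errors of this kind are easier to compare across servers when they are counted rather than observed by refreshing. As a rough sketch, and not the check performed during the incident, the hypothetical helper `count_5xx` below tallies HTTP status codes collected with curl's `%{http_code}` write-out variable.

```shell
# count_5xx (hypothetical helper): read one HTTP status code per line on
# stdin and report how many responses were in the 5xx range.
count_5xx() {
  awk '
    NF == 0 { next }                      # skip blank lines
    {
      total++
      code = $1 + 0                       # force numeric interpretation
      if (code >= 500 && code <= 599) bad++
    }
    END { printf "total=%d 5xx=%d\n", total, bad }'
}

# Possible usage; the URL is a placeholder, not the real site address:
#
#   for i in $(seq 1 50); do
#     curl -s -o /dev/null -w '%{http_code}\n' https://blog.example.invalid/
#   done | count_5xx
```

Running the same loop against each worker node in turn would show whether the errors are concentrated on one server or spread across the cluster.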

After further analysis, we suspected that the 500 errors were caused by problems in network communication between the nginx containers and the .NET Core application containers under high concurrency. We therefore decided to abandon this release and, at around 10:30, rolled back to the .NET Framework version of the blog site running on Windows, after which service returned to normal.

Origin www.cnblogs.com/cmt/p/11302666.html