Typical Project Case 11: Major Accident in the Production Environment

One: Background

The arpro project runs two environments in production: environment A and environment B.
The point of this setup is failover: if the live A environment hits an unexpected problem (for example, a large-scale crash that makes it unavailable, or a fault on its servers), we can immediately switch users over to the B environment.
The A environment is meant to be fully consistent with the B environment.

This time, several major problems occurred in the arpro production environment:

  1. The production A environment was not built in time, so the versions running in production A and production B became inconsistent.
  2. The release log in ZenTao for May 1, 2022 has no associated requirements, which leaves a gap in the release record and will affect subsequent releases.
  3. The B environment build failed and the failure was not handled in time, which affects the timely release of the project.
  4. The B environment is currently serving live traffic when the A environment should be; the upgrade switchover was not performed in time.
  5. After the system has been running for a while, memory usage climbs steadily over time; within about a week it reaches the point where the system becomes unavailable (the one-week figure reflects current business volume, and the interval may shorten as volume grows).


Two: Ideas & solutions

Problems 1-4 above are mainly production release issues; problem 5 is mainly a technical issue.

For all five problems, we need to be clear that the production environment is not child's play: it demands a strong sense of responsibility, and it should be treated with awe.

Ideas & solutions for problems 1-4

First, make sure everyone understands the value of keeping two environments in production, and the importance of the production environment itself. On that basis, put a process system in place:

  1. A strict release process, with every item checked off one by one as it is completed
  2. A strict release approval process: later release operations can only proceed once approval has been granted
  3. Strict closed-loop processes (such as environment testing and post-launch testing)
  4. Only people of a certain rank may run builds of the production environment
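
The four rules above can be enforced by a gate that runs before any production build starts. The following is only a minimal sketch under stated assumptions: the `ReleaseGate` class, the `Role` values, and the `Check` items are hypothetical names invented for illustration, not part of the arpro project or of ZenTao.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

class ReleaseGate {
    // Hypothetical roles: the text only says "people of a certain rank".
    enum Role { DEVELOPER, RELEASE_MANAGER }

    // Hypothetical checklist items mirroring rules 1-3 above.
    enum Check { PRE_RELEASE_CHECKLIST_DONE, APPROVAL_GRANTED, POST_RELEASE_TEST_PLANNED }

    /** Returns every reason the build must not start; an empty list means "go". */
    static List<String> blockers(Role operator, Set<Check> completed) {
        List<String> reasons = new ArrayList<>();
        // Rule 4: only an authorized role may build production.
        if (operator != Role.RELEASE_MANAGER) {
            reasons.add("operator is not authorized to build production");
        }
        // Rules 1-3: every required step must have been completed.
        for (Check check : Check.values()) {
            if (!completed.contains(check)) {
                reasons.add("missing step: " + check);
            }
        }
        return reasons;
    }

    public static void main(String[] args) {
        Set<Check> done = EnumSet.of(Check.PRE_RELEASE_CHECKLIST_DONE);
        System.out.println(blockers(Role.DEVELOPER, done));
    }
}
```

Returning the full list of blockers, rather than failing on the first one, lets the person releasing see everything that is still missing in one pass.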

[Figure: example of the release process]

Ideas & solutions for problem 5

The usual cause is a flaw in the code that creates large objects and keeps them referenced, so the GC can never reclaim them. Over time, more and more of these unreclaimable objects accumulate and memory usage keeps growing.
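
To make the pattern concrete, here is a small hypothetical illustration; the source does not say what the actual leak in arpro was. It contrasts a static map that only ever grows with a bounded cache built on `LinkedHashMap`. The class name `CacheLeakDemo`, the `request-` keys, and the limit of 100 entries are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class CacheLeakDemo {
    // Leaky pattern: a static map is reachable for the life of the JVM,
    // so every value put here stays strongly referenced and is never collected.
    static final Map<String, byte[]> LEAKY = new HashMap<>();

    // Bounded alternative: an access-ordered LinkedHashMap that evicts
    // the least recently used entry once maxEntries is exceeded.
    static <K, V> Map<K, V> boundedCache(final int maxEntries) {
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, byte[]> bounded = boundedCache(100);
        for (int i = 0; i < 1000; i++) {
            byte[] payload = new byte[1024]; // stands in for a "large object"
            LEAKY.put("request-" + i, payload);
            bounded.put("request-" + i, payload);
        }
        System.out.println("leaky=" + LEAKY.size() + " bounded=" + bounded.size());
    }
}
```

The leaky map grows with every request, which matches the symptom of memory rising with running time and business volume. The bounded map stays at its configured size.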

Problems like this need case-by-case analysis. Take a JVM heap snapshot to produce a dump file, then open it in VisualVM (jvisualvm), the memory analysis tool bundled with JDK 1.8, and look for what is driving the memory growth.
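
One way to produce such a dump file is from inside the running JVM. The sketch below assumes a HotSpot-based JVM, where `com.sun.management.HotSpotDiagnosticMXBean` is available; the `HeapDumper` class name and the default output path are made up for the example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

class HeapDumper {
    /**
     * Writes a heap dump in .hprof format.
     * outputFile must not already exist; liveOnly=true dumps only reachable objects.
     */
    static void dumpHeap(String outputFile, boolean liveOnly) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(outputFile, liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed default location: a uniquely named file in the temp directory.
        String target = args.length > 0 ? args[0]
                : new File(System.getProperty("java.io.tmpdir"),
                           "heap-" + System.currentTimeMillis() + ".hprof").getPath();
        dumpHeap(target, true);
        System.out.println("Heap dump written to " + target);
    }
}
```

The same kind of file can be produced from outside the process with `jmap -dump:live,format=b,file=heap.hprof <pid>`, or automatically on failure by starting the JVM with `-XX:+HeapDumpOnOutOfMemoryError`. The resulting `.hprof` file can then be loaded into VisualVM to see which objects are being retained.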

Four: Summary

  1. Treat the production environment with awe
  2. Back the release process with formal mechanisms, such as checklists and approvals
  3. Isolate permissions by dividing roles

Origin blog.csdn.net/wangwei021933/article/details/129596833