This article is shared from the Huawei Cloud Community post "FT-FMEA Fusion Chaos Drill, Retail Operation System Resilience Architecture Online Verification Practice". Author: Nie Gang, "Huawei Cloud Deterministic Operation and Maintenance Case Collection (Issue 2)".
1. Business background
The business of a certain retail company covers more than 20 provinces and hundreds of cities, serving thousands of households, and is well regarded by the public. In recent years, facing the rise of new retail and continued growth in business scale, the company has committed to digitalizing its business end to end. It continues to develop new IT products covering supply chain, marketing, customer service, and store operations, and is progressively carrying out its digital transformation to reduce operating costs and improve operating efficiency.
One such system is a newly developed IT product of this retail enterprise. It has been deployed to the production environment and is about to begin onboarding offline business and receiving production traffic. Before traffic is switched over, chaos drills are used to uncover weaknesses in the architectural resilience of the production environment and to serve as acceptance testing, so that no major stability risks remain when traffic is formally switched over.
2. Business status
As its digital transformation and business scale expanded, the company developed a new store operation system, XX. The main body of the system is deployed in containers and depends on more than 15 surrounding systems. Some of these dependencies are legacy systems more than 10 years old, which pose significant availability risks. Because XX is responsible for the operation of all stores, the company wants it to be highly resilient to potential failures such as unexpected disasters, unavailable dependencies, sudden traffic spikes during promotions, and carrier network failures.
3. Solution practice
Chaos drills on the COC platform embody Huawei Cloud's best practices for chaos drills and cover the entire process: risk identification, emergency plan formulation, fault injection, and drill review. Risk identification uses the FT-FMEA risk analysis methodology, and fault injection uses a self-developed fault injection probe. The approach has been practiced within Huawei Cloud for more than 4 years, running more than 3,000 automated chaos drills every year and saving more than 1,500 hours of drill manpower. The design process is as follows:
1. Risk identification and management
Using the deployment architecture and external dependency graph of the XX application, the risks the application faces in the production environment are analyzed with the FT-FMEA failure analysis method to produce a set of failure modes. COC has the Huawei Cloud FT-FMEA method built in. It helps users analyze system risks efficiently and derive failure modes from several aspects, including system architecture, SLO requirements, fault scenario classification, fault occurrence conditions, and customer impact.
FMEA (Failure Mode and Effects Analysis) has its roots in military and aerospace reliability engineering, and NASA was an early adopter. It starts from the functional points of the business and lists the possible failure modes, their effects and causes, and the corresponding control methods. Each failure mode is rated for severity, probability of occurrence, and detectability, and the three ratings are multiplied to give the RPN (Risk Priority Number), from which the risk level of the failure mode can be judged. FMEA thus provides a risk-oriented failure analysis method. However, its severity, occurrence, and detectability scales each have as many as 10 levels, which are hard to assign consistently in practice and tend to make the number of failure modes diverge, reducing the efficiency of failure management. Huawei Cloud distilled FT-FMEA (a fault scenario analysis method based on a fault tolerance perspective) from its own practice. Building on FMEA and combining it with SRE practice scenarios, it consolidates the analysis into a 7-dimensional fault analysis framework aimed specifically at SRE scenarios. It improves the efficiency and quality of fault scenario analysis while keeping the analysis comprehensive and preventing the failure modes from diverging.
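To make the classic FMEA scoring concrete, here is a minimal Python sketch of the RPN calculation. It illustrates plain FMEA only, not FT-FMEA or any COC API; the `FailureMode` class, the failure mode names, and all ratings are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One FMEA row; each rating uses the conventional 1..10 scale."""
    name: str
    severity: int    # 1 = negligible impact, 10 = catastrophic
    occurrence: int  # 1 = very unlikely, 10 = almost certain
    detection: int   # 1 = always detected, 10 = practically undetectable

    def __post_init__(self) -> None:
        # Reject ratings outside the 1..10 scale.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be in 1..10, got {value}")

    @property
    def rpn(self) -> int:
        # Risk Priority Number: the product of the three ratings.
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return the failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical failure modes with illustrative ratings.
modes = [
    FailureMode("traffic spike during promotion", 7, 6, 3),       # RPN 126
    FailureMode("operator network failure", 9, 2, 2),             # RPN 36
    FailureMode("dependent legacy system unavailable", 8, 5, 4),  # RPN 160
]
ranked = rank_by_rpn(modes)

for mode in ranked:
    print(f"{mode.rpn:4d}  {mode.name}")
```

With three independent 10-level scales there are 1,000 possible rating combinations, which hints at why consistent scoring is hard in practice and why a more constrained framework can be attractive.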
The failure modes summarized after applying FT-FMEA on COC to the XX IT system are as follows. The original 90+ failure modes were merged into 30+, which laid a solid foundation for the subsequent emergency plan formulation and fault injection design.
2. Develop emergency plans
Based on the failure modes identified, and combining COC's built-in Huawei Cloud emergency plan template with the actual operation and maintenance situation of the retail enterprise, a corresponding emergency plan was developed for each failure mode. COC supports both fully automated emergency plans and hybrid plans that mix automated and manual steps, to meet the emergency recovery needs of different failure modes.
3. Develop a drill plan
Based on the failure modes and the busy business periods of the IT system, a drill plan is created on COC.
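One simple way to respect busy business periods is to discard any candidate drill window that overlaps them. The sketch below assumes same-day windows that do not cross midnight; the busy periods and candidate windows are hypothetical, and this is not how COC itself schedules drills.

```python
from datetime import time


def overlaps(start_a, end_a, start_b, end_b):
    """True if the half-open intervals [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


def allowed_windows(candidates, busy_periods):
    """Keep only the candidate windows that avoid every busy period."""
    return [
        (start, end)
        for start, end in candidates
        if not any(overlaps(start, end, b_start, b_end)
                   for b_start, b_end in busy_periods)
    ]


# Hypothetical busy periods for store operations.
busy_periods = [(time(9, 0), time(12, 0)), (time(17, 0), time(21, 0))]

# Hypothetical candidate drill windows.
candidates = [
    (time(2, 0), time(4, 0)),
    (time(11, 0), time(12, 30)),  # overlaps the morning peak
    (time(14, 0), time(15, 0)),
]

drill_windows = allowed_windows(candidates, busy_periods)
for start, end in drill_windows:
    print(f"{start:%H:%M} to {end:%H:%M}")
```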
4. Design fault injection plan, perform drills, and emergency recovery
Based on the failure modes and the deployment of the application, a fault injection plan is designed to verify the self-healing capability of the IT system, the effectiveness of the emergency plans, and the recovery skills of the operation and maintenance personnel.
1) For the selected failure mode, choose the attack target and attack scenario on COC to form a drill task that accurately reproduces the conditions under which the failure mode occurs.
2) Start the automated drill and observe whether the monitoring system detects the fault and raises alarms quickly, how long the IT system takes to self-heal, and whether the operation and maintenance personnel can follow the emergency plan proficiently. Finally, record the RTO of the system.
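The observations in step 2) can be reduced to two durations: time to detect, and the measured RTO. The sketch below assumes RTO is measured from fault injection to service recovery; the function names, timestamps, and 10-minute target are all hypothetical and not taken from the case.

```python
from datetime import datetime, timedelta


def drill_metrics(injected_at, alarmed_at, recovered_at):
    """Return (time_to_detect, measured_rto) for a single drill run."""
    if not injected_at <= alarmed_at <= recovered_at:
        raise ValueError("expected injected_at <= alarmed_at <= recovered_at")
    time_to_detect = alarmed_at - injected_at
    measured_rto = recovered_at - injected_at
    return time_to_detect, measured_rto


def meets_target(measured_rto, target_rto):
    """True if the measured RTO is within the target."""
    return measured_rto <= target_rto


# Hypothetical drill timeline.
injected_at = datetime(2024, 1, 1, 2, 0, 0)
alarmed_at = datetime(2024, 1, 1, 2, 1, 30)
recovered_at = datetime(2024, 1, 1, 2, 12, 0)

time_to_detect, measured_rto = drill_metrics(injected_at, alarmed_at, recovered_at)
target_rto = timedelta(minutes=10)  # assumed target, for illustration only

print("time to detect:", time_to_detect)
print("measured RTO:", measured_rto)
print("meets target:", meets_target(measured_rto, target_rto))
```

Recording detection time separately from RTO helps show whether a missed target comes from slow alarming or from slow recovery.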
5. Drill review and summary
The COC platform automatically scores the drill, and the drill's observation team records improvement items in COC. The system's RTO did not meet the target during this drill. In addition, a total of 18 problems were found. Typical problems include missing monitoring, functional bugs in the alarm system, discrepancies between the actual deployment of the IT system and its design diagrams, missing dial tests (synthetic probes) for the system, and operation and maintenance personnel who were not proficient with their tools.
4. Business improvement
This drill used the COC platform to conduct a full-process, multi-scenario chaos drill on the XX IT system. The results achieved were as follows:
1) The potential risks of the XX IT system were analyzed comprehensively. By using the FT-FMEA analysis method, the number of failure modes was reduced from 90+ to 30+ while keeping risk identification comprehensive, a reduction of about two thirds (66.66%), achieving the goal of failure mode convergence and improvement.
2) An emergency plan was developed for each failure mode and stored on the COC platform. The feasibility of the emergency plan was verified and improved through drills, and a reliable and efficient recovery capability was established for the potential risks faced by the IT system.
3) The automated drill capability of the COC chaos drill platform increased drill efficiency by more than 10 times, and 18 problems were discovered during the drill. After the improvements were implemented, the system SLO rose to 99.99%, meeting the reliability requirements that store operations place on the system.
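To put the 99.99% figure in perspective, an availability SLO can be converted into a downtime budget. The article does not state the measurement window, so the sketch below assumes the SLO is an availability percentage evaluated over a 365-day year or a 30-day month; the function name is illustrative.

```python
def downtime_budget_minutes(slo_percent, period_days=365.0):
    """Allowed downtime in minutes for an availability SLO over a period."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    unavailable_fraction = 1 - slo_percent / 100
    return unavailable_fraction * period_days * 24 * 60


yearly_budget = downtime_budget_minutes(99.99)                   # about 52.56 minutes
monthly_budget = downtime_budget_minutes(99.99, period_days=30)  # about 4.32 minutes

print(f"99.99% over 365 days: {yearly_budget:.2f} minutes")
print(f"99.99% over 30 days:  {monthly_budget:.2f} minutes")
```

A budget of a few minutes per month leaves little room for slow detection or manual recovery, which is why the drill measures RTO directly.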
5. Case summary
This case addresses the high availability requirements of the XX system of a retail enterprise and uses the COC platform for risk analysis, emergency plan formulation, and fault drills. The drill used the FT-FMEA risk analysis method to identify the risks facing the system quickly and efficiently, and used automated fault injection to verify both the system's risk points and the effectiveness of the emergency plans. The problems discovered in the drill were then fixed, raising the system SLO to 99.99% and meeting the reliability requirements that store operations place on the system.
Drills are the best way to test and improve system availability. Combined with the operation and maintenance conditions of retail enterprises, the following best practice principles for chaos drills are summarized:
1. Clarify the evaluation criteria
• Every stage of a chaos drill can generate value. The outputs and evaluation criteria for each stage of chaos engineering must be defined and built into the online drill platform.
• Chaos drills are a technique for proactively exposing risks. Timely incentives should encourage R&D and operation and maintenance personnel to expose risks proactively and to develop emergency plans for them.
2. Do failure mode analysis before conducting chaos drills
• The failure mode, as the starting point of the drill, determines the quality of the drill. The emergency plan, as the recovery method, safeguards the drill and enables rapid recovery from everyday faults.
• Analyzing failure modes with the FT-FMEA method identifies risks accurately while keeping the number of failure modes from diverging.
3. Use automated drills
• Automated drill tools can lower the threshold for drills, improve drill efficiency, and ensure the safety and accuracy of fault injection.
• Automated drill tools can manage drills online, so that drills are executed on schedule and drill experience is retained and accumulated.
4. Carry out drill operations
• The Blue Army (the team that plays the attacker in drills) can coordinate and organize larger-scale drill activities. While testing the resilience of each IT system, it can also set an example for individual systems and drive their routine drills, so that drilling becomes routine and leaves no blind spots.
• Promoting drill activities and publicizing drill results makes IT development and operation and maintenance personnel aware of the risks the system may face, and encourages them to build a quality culture into the R&D and operation and maintenance processes.