1. Preface
1. What is chaos engineering?
The concept of Chaos Engineering was proposed by Netflix in 2010. It is a new method for ensuring system stability: abnormal states are actively introduced into a system, and optimization strategies are determined from how the system behaves under various kinds of stress.
Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's ability to withstand unexpected events in a production environment.
2. Why practice chaos engineering?
Chaos engineering simulates the imperfect conditions of the real world by intentionally introducing faulty, abnormal or uncertain conditions. Its core idea is to verify and improve the robustness of a system step by step through actively injected faults, so that the system becomes more stable and reliable in complex real-world environments. Its purpose is to identify potential weaknesses, improve the robustness and resilience of application systems, reduce the impact of failures, and provide a better user experience.
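The idea of actively injecting faults and observing how the system copes can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the dependency `fetch_price`, the service function `get_price_with_fallback`, and the failure rate are assumptions, not part of JD's platform or of any particular chaos tool.

```python
import random


class InjectedFault(Exception):
    """Raised by the injector to simulate a failing dependency."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of the calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(sku):
    """Hypothetical downstream dependency (assumption for this sketch)."""
    return {"sku": sku, "price": 100}


def get_price_with_fallback(sku, dependency):
    """Service under test: return a degraded answer when the dependency fails."""
    try:
        return dependency(sku)
    except InjectedFault:
        return {"sku": sku, "price": None, "degraded": True}


def run_experiment(calls, failure_rate, seed=0):
    """Call the service repeatedly under injected faults and count the outcomes."""
    rng = random.Random(seed)
    dependency = with_fault_injection(fetch_price, failure_rate, rng)
    degraded = 0
    errors = 0
    for i in range(calls):
        try:
            result = get_price_with_fallback(f"sku-{i}", dependency)
        except Exception:
            # An exception escaping the service is a weakness the drill exposed.
            errors += 1
            continue
        if result.get("degraded"):
            degraded += 1
    return {"calls": calls, "degraded": degraded, "errors": errors}


if __name__ == "__main__":
    print(run_experiment(calls=1000, failure_rate=0.2))
```

In this sketch the drill passes if `errors` stays at zero while faults are being injected, meaning the service degraded gracefully instead of failing outright. A real drill would inject faults at the infrastructure or network level rather than inside the process.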
3. Principles of chaos engineering
Chaos engineering mainly follows these principles:
2. The development of chaos engineering at Y
Over the past three years, JD Chaos Engineering, as one of the three lines of defense, has played a very important role ahead of major promotions. Y's chaos practice has been continuously upgraded as well: it has set clear directions for improvement along two dimensions, application coverage and scenario coverage, and has achieved a series of breakthroughs and results in the group's chaos competitions.
1. Exploration stage (2021)
Back at the 618 promotion in 2021, Y's main aim was to explore through pilot projects. Chaos test coverage focused mainly on applications outside level 0/1, the drills concentrated on simple scenarios such as network disconnection, and both the offensive and defensive sides of the drills were initiated by R&D.
2. Development stage (2022)
With the iterative upgrades of JD Chaos Engineering in 2022, the drill scenarios and the usability of the system improved significantly. Y also focused on comprehensive coverage of chaos drill scenarios, expanding from basic resource failures to external dependency failures and then to advanced scenarios, in order to keep improving system stability. At the same time, the level 0/1 core systems were gradually covered, and a chaos drill operation manual, chaos drill specifications and similar documents were accumulated. In the drills, testing took the offensive side and R&D took the defensive side, which made the division of responsibilities clear.
3. Growth stage (2023)
Building on the lessons from hands-on practice in 2022, Y focused during the 2023 618 promotion on raising application coverage, which finally reached 99.68%, ranking first in retail. The practice strategy was to first complete the 9 major scenarios recommended by the system, as required by the group, while also selecting some specific scenarios in a targeted manner and improving system monitoring. In the end, level 0/1 applications achieved a health score above 95 points and all high-risk items were cleared. During the promotion period, every system's performance met the standard and no online incidents occurred. These staged results are inseparable from the fact that team members strictly abided by the following principles at each stage and treated each drill with high standards:
3. The difference between chaos engineering and traditional testing
Chaos engineering is an experimental method that helps us gain new insights into a system, and it usually opens up a broader cognitive space for complex systems. It is fundamentally different from existing methods that test known properties, such as functional testing and integration testing.
Traditional testing supplies a specific condition and expects the system to produce a specific binary result. It only checks the possible values of system attributes that are already known.
The mindset of chaos engineering is to actively look for faults, which makes it exploratory. For example, a degradation plan may have been prepared as planned, yet shutting down a node triggers an upstream service failure that escalates into an avalanche, an outcome that neither fault injection nor pre-planning would have detected.
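The avalanche described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the service names, the dependency map and the set of calls covered by a degradation plan are all invented for the example and do not describe any real JD system.

```python
def failed_services(dependencies, has_fallback, down):
    """Return every service that becomes unavailable once `down` is shut off.

    dependencies: maps each service to the services it calls.
    has_fallback: set of (caller, callee) pairs covered by a degradation plan.
    """
    failed = {down}
    changed = True
    while changed:
        changed = False
        for service, callees in dependencies.items():
            if service in failed:
                continue
            for callee in callees:
                # A caller fails when a callee is down and no fallback covers that call.
                if callee in failed and (service, callee) not in has_fallback:
                    failed.add(service)
                    changed = True
                    break
    return failed


# Hypothetical topology: both "price" and "inventory" call "cache-node".
topology = {
    "gateway": ["order"],
    "order": ["inventory", "price"],
    "inventory": ["cache-node"],
    "price": ["cache-node"],
    "cache-node": [],
}

# The degradation plan only covers the price -> cache-node call.
planned_fallbacks = {("price", "cache-node")}

if __name__ == "__main__":
    print(sorted(failed_services(topology, planned_fallbacks, "cache-node")))
```

With this made-up topology, the planned fallback keeps `price` alive, but `inventory` has no fallback, so the failure still propagates through `order` up to `gateway`. In a real system the dependency map is rarely known this precisely, which is why such paths tend to surface only when the node is actually shut down during a drill.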
4. Afterword
Chaos engineering is a sophisticated technique for improving the resilience of a technical architecture. It aims to nip failures in the bud, that is, to identify them before they cause disruption. By actively creating faults, it tests how the system behaves under various kinds of stress, so that problems can be identified and fixed and serious consequences avoided.
As new system functions keep being launched and dependent parties keep changing, a series of unknown failures may arise in the system. Therefore, the most important thing in practicing chaos engineering is sustainability: the value of chaos engineering is realized continuously by running more chaos experiments. Y is already on its way!
Authors: Li Jinping and Ma Chunrong, JD Retail
Source: JD Cloud Developer Community