Balancing Quality and Efficiency: How Shifting Testing Left Supports Block Storage R&D

  Author: Haru Shino

The cost of fixing a bug varies greatly with the stage at which it is found: the earlier a problem is found, the cheaper it is to fix. This article describes how the Alibaba Cloud block storage team shifted testing left in real business scenarios.

1. Why shift testing left?

A well-known principle of software engineering is that the earlier a problem is found, the lower the cost of fixing it. From the perspective of software engineering practice, the book "Code Complete" explains that the cost of a bug differs greatly across the requirements analysis, development, testing, and production stages. Measured by cost and similar metrics, fixing a bug in the integration testing phase costs 40 times as much as fixing it in the coding phase.

What does shifting testing left mean? It means extending testing to the left of the development timeline, so that testing gets involved before the code reaches the test phase. For example, testing extends into the development stage: testability is considered when the architecture is designed, and developers write self-tests as they develop. Shifting testing left is a concept. Code access control (a code gate) [1] is a typical practice of it: submitting code automatically triggers compilation and testing, and the submission is blocked if the build fails.

Figure 1 Code access control

By shortening the test feedback loop, code access control finds defects as early as possible. The code access control of the Alibaba Cloud block storage team makes more than 100 effective interceptions in a single day, and it has intercepted multiple business logic defects, more than 100 process crashes, data security defects, CPU/Mem resource usage defects, and more. Without code access control, these problems would be covered up and would accumulate.

2. When to shift testing left?

Is the initial stage of a system the best time to establish strict code access standards and a CI/CD system? Considering quality assurance alone, the answer is yes. From the perspective of overall business benefit, however, it is not the global optimum. Technical debt deserves a rational view: it is like a loan, and loans can be good or bad. You can borrow against future income to buy a house, but in return you pay interest, and the interest compounds: development becomes harder and harder.

RethinkDB lost the competition to MongoDB. Technically, RethinkDB was more polished than MongoDB, but it released a stable version three years later than MongoDB and missed the golden window of NoSQL. For details, see "RethinkDB: why we failed" [2]. The figure below shows how three companies, A, B, and C, balance technical debt against business timing:

  • Company A: focuses only on business and ignores technical debt;

  • Company B: keeps paying attention to technical debt, but is not sensitive to business timing;

  • Company C: keeps paying attention to both business and technical debt. It is sensitive to business opportunities, borrows at the beginning, keeps technical debt under control, and repays it when the time is right.

Figure 2 Technical debt (Source: "The Balance of Quality and Speed: Make 'Only Fast and Unbreakable' Faster and Longer", Ge Jun)

Alibaba Cloud Block Storage released ESSD [3] in early 2018. As the industry's first cloud disk service with one million IOPS, it delivered a 50-fold leap in performance; in real business scenarios, PostgreSQL database writes became 26 times faster. Having overdrawn a large amount of technical debt to take the lead in the market, the team then concentrated on repaying that debt and building up quality. One year later, ESSD reached the quality bar for large-scale deployment and was commercialized.

3. Principles and practice of shifting testing left

Having overdrawn a large amount of technical debt in advance, the team had developed a "rough and fast" R&D habit. Changing habits is hard, so shifting testing left is difficult to put into practice. Expectations need to be adjusted from the top down: in its early stage, shifting testing left inevitably slows down the project delivery cycle, but in the long run overall efficiency is higher.

The principles of shifting testing left are not the same as the principles of testing. From practice, three principles of shifting testing left can be summarized as follows:

Table 1 Principles of shifting testing left

Principle 1: Reach consensus on shift-left standards

Establishing shift-left standards and reaching consensus on them within the team is the most basic principle of shifting testing left. The code access control system must enforce code coverage gates and static code quality scanning. The business coverage [4] requirement is that every function must have corresponding test cases added in the code access control stage.

For example, the block storage cloud disk is a distributed storage system. A Cluster in Docker environment that brings up a cluster with one click, within seconds, serves as the Function Test scaffolding, which makes full-link E2E testing feasible in the code access control stage. When a new feature is reviewed, manual test reports are not accepted as functional testing; only a Code Review of the Function Test List is accepted. This prevents manual tests from never being turned into automated cases and the same problem from recurring again and again.

Principle 2: Insist on rapid feedback

Detect early and treat early: the earlier the treatment, the lower the cost of repair. For every defect that escaped to the production environment because of inadequate architecture design, coding, or testing, keep asking: could this problem have been intercepted at an earlier testing stage?

Take cascading avalanche failures in a distributed system as an example. An unreasonable RPC timeout, combined with infinite retries and no queue depth limit on concurrency, keeps the error loop running, and the resulting vicious cycle overwhelms the distributed system. Full-link stress testing at the limit is necessary. At the same time, for single-function verification, automated cases must also be added to code access control, so that the rate limiting, retry, and timeout mechanisms are verified automatically against their design and implementation expectations.
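The kind of automated check described above can be sketched as follows. This is a minimal illustration, not the block storage team's implementation: `GuardedCaller`, `Overloaded`, and `count_attempts_until_give_up` are hypothetical names, the RPC is simulated by a plain function, and backoff between retries is left out for brevity.

```python
import threading


class Overloaded(Exception):
    """Raised when the in-flight limit (queue depth) is already reached."""


class GuardedCaller:
    """Hypothetical wrapper: queue depth limit plus a bounded retry budget."""

    def __init__(self, max_in_flight, max_attempts):
        if max_in_flight < 1 or max_attempts < 1:
            raise ValueError("limits must be at least 1")
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._max_attempts = max_attempts

    def call(self, rpc, timeout_s):
        # Reject immediately instead of queueing without bound.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("queue depth limit reached")
        try:
            last_error = None
            for _ in range(self._max_attempts):
                try:
                    return rpc(timeout_s)
                except TimeoutError as err:
                    last_error = err  # retry, but only within the budget
            raise last_error
        finally:
            self._slots.release()


def count_attempts_until_give_up(max_attempts):
    """Drive a simulated RPC that always times out and count the attempts."""
    attempts = []

    def always_times_out(timeout_s):
        attempts.append(timeout_s)
        raise TimeoutError("simulated RPC timeout")

    guard = GuardedCaller(max_in_flight=1, max_attempts=max_attempts)
    try:
        guard.call(always_times_out, timeout_s=0.5)
    except TimeoutError:
        pass
    return len(attempts)
```

A gate case built on this sketch would assert that the number of attempts equals the retry budget and that a call arriving while the limit is exhausted fails fast with `Overloaded` rather than piling up.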

Principle 3: Continuously decompose problems

Continuously decomposing problems is the core principle of shifting testing left. A system is broken down into multiple subsystems through abstraction and layering, so that each developer faces only a limited amount of information at a time and can dig into the details of each subsystem.

For example, a complex problem is split into the functional semantics of individual, loosely coupled modules, and each module adds its own contract test coverage. Consider the hot upgrade of a Server in a distributed system (a hot upgrade is an upgrade that does not affect service). The central control node, the Master, must schedule the services away from the Server process before the upgrade; the Server processes are responsible for migrating the services (the old process Unloads, the new process Loads); and the Client must learn from the Master/Server that the service has been rescheduled and switch to the new Location. The scenario is therefore broken down into case coverage in each of the Client, Master, and Server modules.
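To make the decomposition concrete, here is a toy in-memory model of the hot upgrade flow with one contract per module. It is only a sketch under stated assumptions: the `Server`, `Master`, and `Client` classes, their method names, and the `disk-1` service name are all hypothetical and do not reflect the real block storage interfaces.

```python
class Server:
    """Hypothetical Server: serves only the services it has loaded."""

    def __init__(self, name):
        self.name = name
        self.loaded = set()

    def load(self, service):
        self.loaded.add(service)

    def unload(self, service):
        self.loaded.discard(service)

    def serve(self, service):
        if service not in self.loaded:
            raise LookupError(f"{service} is not served by {self.name}")
        return f"{service}@{self.name}"


class Master:
    """Hypothetical Master: owns the service -> Server location table."""

    def __init__(self):
        self.location = {}

    def assign(self, service, server):
        server.load(service)
        self.location[service] = server

    def migrate(self, service, new_server):
        # New process Loads first, then the old process Unloads.
        old_server = self.location[service]
        new_server.load(service)
        self.location[service] = new_server
        old_server.unload(service)

    def locate(self, service):
        return self.location[service]


class Client:
    """Hypothetical Client: caches Locations and refreshes a stale one."""

    def __init__(self, master):
        self.master = master
        self.cache = {}

    def request(self, service):
        if service not in self.cache:
            self.cache[service] = self.master.locate(service)
        try:
            return self.cache[service].serve(service)
        except LookupError:
            # Stale Location: ask the Master again and retry once.
            self.cache[service] = self.master.locate(service)
            return self.cache[service].serve(service)
```

In this sketch the Server contract is "an unloaded service is refused", the Master contract is "after migration the Location points at the new Server", and the Client contract is "a request still succeeds after migration". Each can be checked in its own module without a full cluster.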

In practice, shifting testing left went through the following three phases:

Phase 1: Build test scaffolding

A simple and reliable CI/CD test framework is the infrastructure foundation. There are many CI frameworks in the industry, including open source ones such as Jenkins, GitLab CI, Travis CI, and Tekton, as well as Alibaba's Aone and Ant's Linkin.

For the block storage distributed system at the bottom layer of IaaS, a single unit test can require 4 CPU cores and 6 GB of memory, and a single machine cannot run nearly 10,000 access control cases in a timely manner. Existing systems in the industry and inside the company could not meet the requirements for distributed compilation, building, and testing, so the block storage team developed its own access control system based on Kubernetes and Jenkins.

Figure 3 EBS CI access control system

Case resource isolation is implemented through Kubernetes. Each case runs in its own container, and CPU/Mem resource limits are set on the running container to avoid conflicts between cases (for example, a memory leak in one case exhausts memory, contention follows, and cgroup enforcement may kill other cases by mistake). The case environment is discarded after use to keep environments consistent, and tests are triggered automatically when code is submitted (Test as a Service). With this foundation in place, the access control system is responsible for running the tests, and developers are responsible for writing cases and fixing unstable ones.
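As an illustration of per-case isolation, the sketch below builds a Kubernetes Pod manifest with CPU and memory limits for one case. The function name `case_pod_manifest`, the `ci-case` label, and the default values are assumptions chosen for illustration (the defaults echo the 4 cores and 6 GB mentioned above, written as the Kubernetes quantity `6Gi`); the article does not show the team's actual manifests.

```python
import json


def case_pod_manifest(case_name, image, cpu="4", memory="6Gi"):
    """Build a Pod manifest that runs one test case in its own container.

    case_name is used as a name prefix, so it must be a valid
    Kubernetes name (lowercase letters, digits, and dashes).
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            # The API server appends a random suffix to generateName.
            "generateName": f"{case_name}-",
            "labels": {"app": "ci-case"},  # hypothetical label
        },
        "spec": {
            # Never restart: the environment is thrown away after one run.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "case",
                    "image": image,
                    "resources": {
                        # A leaking case hits its own limit instead of
                        # starving its neighbours.
                        "limits": {"cpu": cpu, "memory": memory},
                    },
                }
            ],
        },
    }


def manifest_as_json(case_name, image):
    """Serialize the manifest; Kubernetes accepts JSON as well as YAML."""
    return json.dumps(case_pod_manifest(case_name, image), indent=2)
```

Because the Pod is never restarted and is created fresh for each run, every case starts from the same environment, which matches the "use once and discard" idea in the text.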

Phase 2: High-frequency testing, rapid testing

"If it hurts, do it more often": high-frequency testing is the key weapon for bringing unstable cases under control. The less often a problem recurs, the harder it is to investigate, so expose unstable failures as often as possible to lower the threshold for investigation. Test runs are sped up through distributed concurrent execution, faster builds (incremental compilation and distributed compilation), and layered testing.

Because the block storage access control system is built on Kubernetes, test concurrency can be scaled out horizontally. During the day, the CI system runs code access control gates and self-service build tasks; at night, it runs hundreds of rounds of regression over the access control cases and the E2E cases. High-frequency testing greatly increases the exposure of low-probability timing bugs. Overcommitting CPU resources in the access control system is equivalent to simulating CPU underclocking, and this has repeatedly uncovered low-probability defects such as data security issues and process crashes.

Figure 4 Number of rounds of EBS test runs at all levels
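A short calculation shows why running hundreds of rounds matters for low-probability bugs. The sketch below assumes that each round fails independently with the same probability, which is a simplification, and the function names are illustrative rather than taken from the team's tooling.

```python
import math


def exposure_probability(p_fail, rounds):
    """Chance of seeing at least one failure in `rounds` independent runs."""
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    return 1.0 - (1.0 - p_fail) ** rounds


def rounds_needed(p_fail, confidence):
    """Smallest number of rounds that exposes the bug with `confidence`."""
    if not 0.0 < p_fail < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("arguments must be strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= c for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))
```

Under these assumptions, a bug that fails one run in a thousand is almost never seen in a single run, is seen about a quarter of the time over 300 rounds, and needs on the order of a few thousand rounds to be exposed with 95% confidence.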

Phase 3: Unstable case governance

Governing unstable cases may be the most challenging part of an access control system. Left ungoverned, they lead to the broken window effect, and all previous efforts to shift testing left go to waste. Unstable cases are a chronic disease, best treated in small steps with high-frequency feedback. In practice, the first step is to hold brainstorming meetings where everyone airs complaints about the instability of access control, so that consensus, principles, and standards emerge spontaneously and people take the initiative. The second step is to find the key people who drive the change: the Team Leader and core members cooperate closely, maintain red and black lists, and name a star performer every month, recognizing serious data security and crash defects found through unstable cases. The team emphasizes the strong link between case quality and each developer's own interests, and sets a rigid standard: a commit that introduces failures with a probability of more than 5% is reverted within one day. On the governance of unstable cases, Google and Microsoft published papers in 2016 and 2017; see the Google paper [5] and the Microsoft paper [6].

Block storage governs unstable cases mainly through high-frequency testing and cultural evangelism. The block storage access control system gates on the full set of cases: if any case fails, the code submission is blocked. As the number of cases grows, the stability required of each case rises. If just 3 of the tens of thousands of cases in the access control system each have a pass rate of 90%, the overall pass rate is only about 72.9% (0.9 × 0.9 × 0.9 = 0.729). High-frequency runs expose and reproduce failures, and they also verify the effect of a fix on repaired cases. The team reached a consensus from the top down: production issues take priority over unstable cases, which take priority over new feature development. The figure below shows the failed regression rounds of the top 15 unstable block storage cases over the past six months. The numerator is the number of failed rounds, and the denominator is the number of rounds run that day. The higher the failure ratio, the darker the color.

Figure 5 EBS Access Control Top 15 Unstable Case Pass Rate
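The pass rate arithmetic and the ranking behind a chart like Figure 5 can be sketched in a few lines. This is an illustration under the assumption that cases fail independently; `overall_pass_rate`, `rank_unstable`, and the sample case names are hypothetical, not the team's tooling or data.

```python
import math


def overall_pass_rate(case_pass_rates):
    """Pass rate of a full gate where every case must pass.

    Assumes cases fail independently; stable cases contribute 1.0.
    """
    return math.prod(case_pass_rates)


def rank_unstable(rounds_by_case, top_n):
    """Rank cases by failure ratio (failed rounds / total rounds).

    rounds_by_case maps a case name to (failed_rounds, total_rounds).
    Cases with zero total rounds are skipped.
    """
    ratios = [
        (name, failed / total)
        for name, (failed, total) in rounds_by_case.items()
        if total > 0
    ]
    # Highest failure ratio first; ties are broken by case name.
    ratios.sort(key=lambda item: (-item[1], item[0]))
    return ratios[:top_n]
```

With three cases at a 90% pass rate and every other case perfectly stable, `overall_pass_rate([0.9, 0.9, 0.9])` gives roughly 0.729, which is why a full gate over a large suite demands very stable individual cases.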

Many challenges arise when shifting testing left. It is hard to unite knowing and doing, and even harder for an entire team to abide by the same standards and specifications. The block storage system has millions of lines of code. In the past year, the number of lines of production code grew by about 20%, and the number of lines of test code grew by about 100%. Nearly 10,000 access control cases have been governed continuously, and the pass rate of access control has risen from 4.7% to 70%. The follow-up plan is to raise the pass rate further through precise testing.

Look up at the sky, and keep your feet on the ground. For the coding practice of shifting testing left, learning TDD (Test-Driven Development) is recommended. "Professionalism"... and other books by leading programmers abroad all recommend TDD. TDD is not a panacea. Its main way of thinking is to work out the behavior of the system first and only then start coding. Once the tests are clear, the behavior of the API/system under development is clear, and so are the semantics of each API/function/method. How can the quality of a test be judged? A good test describes the What, in Given/When/Then form; a bad test describes the How. If a test must be completely rewritten after every change to a method/function, it may be time to reconsider both the test implementation and the design of the system itself.
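A small example may help show what a "What" test in Given/When/Then form looks like. The `InMemoryDisk` class below is a hypothetical stand-in invented for this sketch, not a block storage API; the tests use Python's standard `unittest` module.

```python
import unittest


class InMemoryDisk:
    """Hypothetical fixed-size disk used only to illustrate the tests."""

    def __init__(self, size):
        self._data = bytearray(size)

    def write(self, offset, payload):
        if offset < 0 or offset + len(payload) > len(self._data):
            raise ValueError("write out of range")
        self._data[offset:offset + len(payload)] = payload

    def read(self, offset, length):
        if offset < 0 or length < 0 or offset + length > len(self._data):
            raise ValueError("read out of range")
        return bytes(self._data[offset:offset + length])


class DiskBehaviorTest(unittest.TestCase):
    def test_read_returns_last_written_bytes(self):
        # Given an empty disk
        disk = InMemoryDisk(size=16)
        # When four bytes are written at offset 4
        disk.write(4, b"abcd")
        # Then reading the same range returns those bytes
        self.assertEqual(disk.read(4, 4), b"abcd")

    def test_write_past_end_is_rejected_and_changes_nothing(self):
        # Given an empty disk
        disk = InMemoryDisk(size=16)
        # When a write would run past the end
        with self.assertRaises(ValueError):
            disk.write(14, b"abcd")
        # Then the disk content is unchanged
        self.assertEqual(disk.read(0, 16), bytes(16))


def run_disk_tests():
    """Run the behavior tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DiskBehaviorTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Both tests state observable behavior only. They never touch `_data`, so the storage could be reimplemented (for example, as a sparse map) without rewriting them, which is the property the paragraph above asks of a good test.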

Further reading

[1] Code access control

[2] RethinkDB: why we failed

[3] Behind Alibaba Cloud's first million IOPS cloud disk

[4] Business coverage

[5] Google paper

[6] Microsoft paper


Origin blog.csdn.net/AlibabaTech1024/article/details/124813985