Full-link stress testing comes to the B side for the first time! Taoxi's practice for a high-difficulty stress test, made public

Background


"This year's Double 11 was a super-sized outburst of the world's largest content-driven shopping mall. Every second brought intense resonance among consumers, technology, content, and the business ecosystem. The combination of real-time demands, complexity, and sustained peaks made it a global technology summit. During Double 11 in 2020, Alibaba's transaction peak reached 583,000 orders per second, and the merchant links behind it came under unprecedented pressure." This is how Alibaba Vice President Tang Xing described this year's Double 11.

Alibaba's merchant business domain spans nearly 20 of the group's business lines, more than 100 scenarios, and 400+ links, and some of these businesses are deeply integrated into the shopping-guide and transaction links. The merchant link business serves tens of millions of merchants and hundreds of core third-party service provider partners. In the past, the gap in "water level" between the platform's and the merchants' IT infrastructure made it difficult for the platform to help merchants transform and upgrade their systems or carry out full-link acceptance. Alibaba has since launched a new generation of digital infrastructure for merchants and ecosystem partners, using cloud-native technology and cloud IT governance capabilities to help them rebuild their systems' high-availability baseline. As a result, more and more merchants now have data processing capabilities on the same ultra-large scale as Alibaba's.

In conventional preparation, each business relied on single-link stress testing, with no interconnection between tests and no overall control. This easily leaves quality blind spots, lets hidden dangers go unnoticed, and carries high risk. Problems in the merchant link directly affect merchants' ability to produce and operate, and seriously damage the experience of merchants and consumers. Full-link stress testing in the industry is generally aimed at C-side (consumer) scenarios, while B-side (business) scenarios often receive too little attention. Merchant-side scenarios are structurally very complex, involving internal systems and many third-party systems, so full-link stress testing there faces many challenges. To guard against risk effectively, improve the system stability of each business and of third-party partners, and improve user experience, this year we overcame these difficulties and carried out full-link stress testing of the merchant links for the first time.


Challenges



▐ Core difficulties


The merchant business covers many forms, including merchant tools, messaging, multimedia, and algorithms, and the calls between applications are intricate. Sorting out the dependencies of the core links and making sure no core link is missed is the first challenge in carrying out stress testing.

During execution, the test has to simulate the superposition and coupling of traffic from hundreds of scenarios, leave room to handle emergencies, and at the same time ramp traffic up to the peak in batches. This is a major challenge for carrying out the stress test.

The merchant business also involves the systems of many partner service providers. Because of the complexity of the Alibaba ecosystem, the systems that service providers connect to the Alibaba platform are heterogeneous and diverse, and differ widely in performance. Awareness of these issues also varies from one service provider to another.

▐ Solution


In response to the above challenges, our core solutions are:

  • Unified process specifications: coordinated organization of stress tests and unified acceptance reviews.

  • Integrated tool support for stress testing.

  • Stress tests that cover contingency plans, rate limiting, drills, monitoring, and so on, to simulate the real conditions of a big promotion.


Full link stress test solution

1. Unified process specification

Every business that takes part in the full-link stress test must meet clear entry and exit criteria. The basic entry principle is that a scenario can enter the full-link stress test only after it has passed its single-link stress test. Exit criteria include system water level, traffic, response time, cache hit rate, rate limiting, JVM metrics, and so on. Service provider systems are held to the same standards as Alibaba's own systems.

Scenario review, link review, the full-link stress test itself, and the post-test review are all carried out in a unified way, and problems found during the reviews are optimized and resolved in parallel.

2. Tool support integration

Building on the group's existing stress testing capabilities, we focused on a one-stop stress testing tool covering merchant link analysis, traffic assessment, stress test models, automated result verification, and more, to solve the core problems of full-link stress testing for merchants.


Link analysis

Basic principles for selecting stress test scenarios:

  • High-traffic scenarios

  • Core link scenarios

  • Complex business link scenarios

  • Traffic diffusion scenarios

Following these principles, we built a scenario recommendation tool for link analysis on top of the group's middleware, and generated the link model through a combination of tool output and manual checking. The principle is sketched in the figure below: for one call to an interface of application A, we obtain the trace through its downstream applications and the databases or caches it accesses, and analyze how much downstream traffic that single call generates.

Link diagram

The workflow is as follows:

  • Collect call statistics for services, storage, and other targets, obtain the number of downstream calls, and derive the amplification factor of downstream requests.

  • Obtain entry links whose traffic exceeds a given threshold.

  • Obtain links whose call depth or number of call dependencies exceeds a given threshold.

  • Obtain links whose number of repeated calls exceeds a given threshold.
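
The workflow above can be sketched roughly as follows. This is a minimal illustration, not the group's actual tool: the trace format (an `entry` application plus a list of `(caller, callee)` call edges), the function names, and the threshold parameters are all assumptions made for the example.

```python
from collections import Counter


def downstream_amplification(traces):
    """Average number of calls each downstream target receives per entry request."""
    if not traces:
        return {}
    totals = Counter()
    for trace in traces:
        for _caller, callee in trace["calls"]:
            totals[callee] += 1
    return {target: count / len(traces) for target, count in totals.items()}


def call_depth(trace):
    """Length of the longest caller -> callee chain that starts at the entry."""
    children = {}
    for caller, callee in trace["calls"]:
        children.setdefault(caller, set()).add(callee)

    def depth(node, seen):
        unvisited = children.get(node, set()) - seen
        if not unvisited:
            return 0
        return 1 + max(depth(child, seen | {child}) for child in unvisited)

    return depth(trace["entry"], {trace["entry"]})


def recommend_links(traces_by_entry, qps_by_entry, qps_min, depth_min, amplification_min):
    """Map each entry worth stress testing to the reasons it was selected."""
    recommended = {}
    for entry, traces in traces_by_entry.items():
        if not traces:
            continue
        reasons = []
        if qps_by_entry.get(entry, 0) > qps_min:
            reasons.append("high traffic")
        if max(call_depth(t) for t in traces) > depth_min:
            reasons.append("deep call chain")
        if max(downstream_amplification(traces).values(), default=0) > amplification_min:
            reasons.append("traffic amplification")
        if reasons:
            recommended[entry] = reasons
    return recommended
```

The output of such a tool would still go through the manual check described above before it becomes the link model.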

Traffic assessment

Conventional traffic assessment usually estimates interface traffic from monitoring data and upstream calls. Such estimates are idealized and error-prone. If the initial traffic is estimated poorly, the stress test either falls short of its target or far overshoots it, and relying on the stress test itself to discover these problems is too costly. We therefore introduced deep learning, using algorithms to make intelligent predictions and evaluate traffic more accurately. It works as follows:

  • Collect the application's inlet and outlet traffic, RT, QPS, error rate, system water level, and other information.

  • Apply deep learning to this base data to produce an algorithm model for the application. Given the inlet traffic, the model estimates the outlet traffic and the water level of the application's system. The effect is shown in the figure.

Interface traffic simulation comparison chart

Its core ideas are:

  • The algorithm keeps learning from real online data every day and optimizes itself to produce a reliable model.

  • The user enters the estimated traffic of an interface, and the platform outputs the estimated downstream interface traffic, storage QPS, number of machines, and system water level.
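
To make the input/output shape of such a model concrete, here is a deliberately simplified sketch. It swaps the deep learning model described above for an ordinary least-squares line, which is only a stand-in. The class name `TrafficModel`, the sample format `(inlet_qps, outlet_qps, water_level)`, and the output keys are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("need at least two distinct inlet values")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


class TrafficModel:
    """Estimates outlet traffic and water level from inlet traffic (linear stand-in)."""

    def __init__(self, samples):
        # samples: observed (inlet_qps, outlet_qps, water_level) tuples from monitoring
        inlet = [s[0] for s in samples]
        self._outlet = fit_line(inlet, [s[1] for s in samples])
        self._water = fit_line(inlet, [s[2] for s in samples])

    def predict(self, inlet_qps):
        outlet_slope, outlet_bias = self._outlet
        water_slope, water_bias = self._water
        return {
            "outlet_qps": outlet_slope * inlet_qps + outlet_bias,
            "water_level": water_slope * inlet_qps + water_bias,
        }
```

Refitting the model on each day's monitoring samples mirrors the "keeps learning from real online data" idea, although a real system would need a model that captures non-linear behavior near saturation.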

Model building

Based on the business scenarios, we built two stress test models: a 0:00 model and a non-0:00 model. The 0:00 model covers scenarios whose traffic peaks at midnight, and the non-0:00 model covers scenarios that peak at other times. Designing a stress test model is not simply a matter of covering every business. It also has to account for traffic coupling between businesses: for links with an upstream-downstream relationship, part of the downstream stress test traffic comes from upstream. Besides coupling within the merchant domain, coupling with other domains in the group must be considered as well. For this reason there are many stress test plans that control when stress test traffic is admitted and when it exits.

Where traffic is coupled, the case of insufficient coupled traffic must also be considered, because an upstream stress test does not always deliver enough traffic downstream. So, in addition to making sure upstream traffic can flow through, the downstream business must be able to top up traffic itself: when the traffic arriving from upstream falls short, it adds enough of its own to validate its own system.
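
The top-up rule can be written as a small calculation. This is a sketch under the assumption that coupled traffic equals the upstream QPS multiplied by an amplification factor. The function name and parameters are hypothetical.

```python
def supplemental_qps(target_qps, upstream_qps, amplification=1.0):
    """Traffic the downstream business must generate itself to reach its target.

    target_qps:    load the downstream system needs in order to be validated
    upstream_qps:  stress test traffic currently generated upstream
    amplification: downstream calls caused by one upstream request (assumed known)
    """
    if min(target_qps, upstream_qps, amplification) < 0:
        raise ValueError("traffic values must be non-negative")
    coupled_qps = upstream_qps * amplification
    # Never return a negative top-up: excess coupled traffic needs no supplement.
    return max(0.0, target_qps - coupled_qps)
```

For example, with a downstream target of 10,000 QPS, 3,000 QPS upstream, and an amplification of 2, the downstream business would need to add 4,000 QPS of its own.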

3. Stress test execution

Stress test data is prepared with the Phoenix platform (see Appendix 1), developed in-house by the Taoxi Technical Quality Department, which records online traffic for "panning", so the scenario data is more realistic and effective. Recording also makes data preparation more efficient: record once, use many times. For scenario management, one person can launch all scenarios with a single click, which saves a great deal of manpower. The full-link stress tests for merchants, transactions, and shopping guides run at the same time, and all 0:00 traffic is driven to 100% to verify traffic dependencies and superposition across business domains. The time between the start of the test and the moment traffic reaches 100% is very limited, and in early rounds of stress testing many problems usually surface during this window. They are hard to find during preparation. The main ones are:

  1. Uneven distribution of load.

  2. Models that need urgent adjustment.

  3. Data files that need urgent correction.

  4. Abnormal behavior of individual load-generating machines.

  5. Service calls throttled as a side effect of stress test controls.

  6. Emergency control of stress test tasks.

The engineers running the test must keep unaffected businesses ramping up as planned while quickly diagnosing the problems that arise and deciding how to handle them. A scenario that can be fixed quickly is first removed from the stress test task on its own and re-associated with the task once it has been adjusted. For a scenario that cannot be fixed quickly, business-side colleagues must be coordinated promptly to run it as a single-link test. Different response strategies have to be worked out on site for these problems, mainly in the following situations:

  • During the stress test, one scenario must be unloaded urgently while the others stay in place.

  • During the stress test, a scenario has not reached its target but system resources are already a bottleneck. Its load must be held at the current level for troubleshooting while the other businesses continue to ramp up.

  • When the load has been raised to 100%, individual businesses fall short of expectations and their load must be increased further.

Problems of this kind occur very frequently in early rounds of stress testing. Without adequate preparation, they block the rhythm of the whole stress test.
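
The three situations amount to per-scenario control inside a shared, batched ramp-up. The sketch below shows one possible shape for that control. The class and method names, the stage percentages, and the rule that a re-attached scenario rejoins at the current stage are all assumptions for illustration, not the behavior of the actual platform.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    percent: int = 0
    state: str = "ramping"  # "ramping", "holding", or "detached"


class StressTestTask:
    """Ramps all attached scenarios in batches, with per-scenario overrides."""

    def __init__(self, names, stages=(25, 50, 75, 100)):
        self.stages = stages
        self.step = -1  # index of the current stage, -1 before the first batch
        self.scenarios = {name: Scenario(name) for name in names}

    def advance(self):
        """Move to the next stage; only scenarios in the 'ramping' state follow."""
        if self.step + 1 >= len(self.stages):
            return
        self.step += 1
        for scenario in self.scenarios.values():
            if scenario.state == "ramping":
                scenario.percent = self.stages[self.step]

    def detach(self, name):
        """Situation 1: unload one scenario urgently, leave the others untouched."""
        self.scenarios[name].state = "detached"
        self.scenarios[name].percent = 0

    def hold(self, name):
        """Situation 2: freeze one scenario at its current load for troubleshooting."""
        self.scenarios[name].state = "holding"

    def reattach(self, name):
        """Re-associate a fixed scenario; here it rejoins at the current stage."""
        scenario = self.scenarios[name]
        scenario.state = "ramping"
        scenario.percent = self.stages[self.step] if self.step >= 0 else 0

    def boost(self, name, percent):
        """Situation 3: push one scenario beyond the shared plan."""
        self.scenarios[name].percent = percent

    def loads(self):
        return {name: s.percent for name, s in self.scenarios.items()}
```

Keeping the overrides on the scenario rather than on the task is what lets unaffected businesses keep ramping while one scenario is being fixed.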

In addition, the full-link stress test is backed by a complete monitoring system, from the client (success rate, error volume, public sentiment, and so on) to the server (interface success rate, middleware success rate, water level, and so on). This avoids the blind spot of watching only service availability while overlooking degradation on the client and in the user experience.

Among the hundreds of merchant full-link stress test scenarios, messaging, open-platform services, and mini programs are typical unconventional scenarios that are hard to test. The rest of this article focuses on full-link stress testing practice in these scenarios.


▐ Core scenario 1: End-to-end full-link stress test of the IM messaging system

Traditional server-side stress tests usually use short HTTP or TCP connections: the client sends one request and the connection is closed after the response. This approach does not suit an IM messaging system. An IM client keeps a long-lived connection to the server, and when user A sends a message to user B, user B passively receives a push from the server rather than actively pulling data. This mode makes stress testing very challenging.

To solve this, we developed NIO-based thin clients for Mobile Taobao (Shoutao) and Qianniu. Running on servers, they simulate real users holding long-lived connections to the backend, and they integrate part of the business logic in parameterized form. For example, after receiving a pushed message the thin client replies with an ACK to confirm receipt, and then marks messages as read according to a configured read ratio.


Long connection thin client

Stress testing the message link requires simulating clients that stay connected over a long-lived connection after logging in. Conventional stress testing tools in the industry do not fit the messaging scenario, so a tool had to be developed independently. The core approach is as follows:

  • Develop a thin client integrated into the stress test engine to simulate end-to-end scenarios

    • Simulate message replies

    • Simulate read receipts

    • Simulate message roaming

  • Deploy the stress test engine on CDN nodes to simulate real users sending and receiving messages

  • Create long-lived connections to simulate users' real long-lived connections online

Long connection pressure test architecture diagram
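
The thin clients described above are NIO-based. The following sketch uses Python's `asyncio` instead, purely to illustrate the idea of a long-lived connection that answers server pushes with an ACK and, according to a read ratio, a read receipt. The newline-delimited JSON wire format and the frame fields (`type`, `msg_id`, `user_id`) are hypothetical, not the real IM protocol.

```python
import asyncio
import json
import random


class ThinClient:
    """Simulated user that stays connected and reacts to pushed messages."""

    def __init__(self, user_id, read_ratio, seed=None):
        self.user_id = user_id
        self.read_ratio = read_ratio  # fraction of pushed messages marked as read
        self._rng = random.Random(seed)

    def on_push(self, message):
        """Return the frames to send back for one pushed message."""
        frames = [{"type": "ack", "msg_id": message["msg_id"]}]
        if self._rng.random() < self.read_ratio:
            frames.append({"type": "read", "msg_id": message["msg_id"]})
        return frames

    async def run(self, host, port):
        """Log in, then keep the connection open until the server closes it."""
        reader, writer = await asyncio.open_connection(host, port)
        try:
            await self._send(writer, {"type": "login", "user_id": self.user_id})
            while True:
                line = await reader.readline()
                if not line:  # empty read means the server closed the connection
                    break
                for frame in self.on_push(json.loads(line)):
                    await self._send(writer, frame)
        finally:
            writer.close()
            await writer.wait_closed()

    @staticmethod
    async def _send(writer, frame):
        writer.write((json.dumps(frame) + "\n").encode())
        await writer.drain()
```

Because each client is a coroutine rather than a thread, one process could hold many simulated connections, for example with `asyncio.gather(*(client.run(host, port) for client in clients))`.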

Connecting message services across the full link

Beyond the full message link itself, which includes basic message services such as upstream and downstream messages, read and unread status, message content models (text, cards, multimedia messages), and message roaming, second-party teams inside the group and third-party partners have built further messaging services on top of these basics, such as robots, intelligent assistance, order verification, address modification, and intelligent customer service.

Message business graph

Because the message business is so complex, we have to coordinate the stress test data of every business party and tune each party's load during the actual test. For example, if an account has enabled both service A and service B, a message sent to that account triggers both services, so traffic is superimposed and the final stress test result is inaccurate. To solve this, we unified the message service stress test scripts. By assigning accounts, enabling services, and initiating traffic in a unified way, we resolved a series of problems such as mutual interference from superimposed traffic and scripts that could not be reused across business lines.
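
One way to read "assigning accounts in a unified way" is that each test account is given to exactly one messaging service, so that a message to that account cannot trigger two services at once. The sketch below implements that interpretation. The function name, the quota format, and the disjoint-pool rule itself are assumptions.

```python
def assign_accounts(accounts, service_quota):
    """Give every service its own disjoint pool of test accounts.

    accounts:      list of stress test account ids
    service_quota: {service_name: number_of_accounts_needed}
    """
    needed = sum(service_quota.values())
    if needed > len(accounts):
        raise ValueError(f"need {needed} accounts, only {len(accounts)} available")
    pool = iter(accounts)
    # Accounts are consumed in order, so no account is handed out twice.
    return {
        service: [next(pool) for _ in range(quota)]
        for service, quota in service_quota.items()
    }
```

With disjoint pools, the load each service receives depends only on its own quota, which makes the per-service results easier to interpret.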

▐ Core scenario 2: Full-link stress test for third-party service providers

Taoxi e-commerce is an ecosystem linking consumers, the platform, merchants, and third-party service providers. Using the platform's open interfaces, third-party service providers develop e-commerce support tools for customer management, order management, and interactive marketing that help merchants operate more efficiently. Guaranteeing the full link for merchants therefore requires driving and enabling third-party service providers to take part in the full-link stress test.

Taoxi e-commerce ecosystem interaction diagram


Full-link stress test of order push

Among the many scenarios involving third parties, the order scenario is the core. The order push link covers the entire process from the moment a consumer places an order on Taobao to the moment the goods are received. It spans multiple internal and external systems and directly affects consumers' shopping experience, so the importance of stress testing it is self-evident.

Order push full link interaction diagram

The full-link stress test of order push faces the following pain points:

    • There are many service providers and no unified stress testing tool.

    • The stress test order model is hard to construct and not very accurate.

    • Stress test results are hard to collect, the data analysis workload is heavy, and standards are not uniform.

To address this, we developed an order push full-link stress test platform for service providers. The platform has the following characteristics:

One-click generation of stress test models from historical order data, convenient and accurate

Users only need to specify the start and end time of a historical period. The system automatically analyzes the online orders from that period and generates a simulated stress test order model based on the buyer, discount, and product information of those orders.

Order push pressure test model
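
A simple way to picture model generation is as an empirical distribution built from the orders in the chosen window, from which simulated orders are then sampled. The sketch below does exactly that. The order fields (`created`, `buyer_type`, `discount`, `product`), the `stress_test` flag, and the function names are hypothetical and only loosely mirror the buyer, discount, and product information mentioned above.

```python
import random
from collections import Counter
from datetime import datetime


def build_order_model(orders, start: datetime, end: datetime):
    """Share of each (buyer_type, discount, product) combination in [start, end)."""
    counts = Counter(
        (o["buyer_type"], o["discount"], o["product"])
        for o in orders
        if start <= o["created"] < end
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no historical orders in the selected period")
    return {combo: n / total for combo, n in counts.items()}


def generate_orders(model, n, seed=None):
    """Sample n simulated orders that follow the historical mix."""
    rng = random.Random(seed)
    combos = list(model)
    weights = [model[c] for c in combos]
    return [
        {"buyer_type": b, "discount": d, "product": p, "stress_test": True}
        for b, d, p in rng.choices(combos, weights=weights, k=n)
    ]
```

Marking every generated order with a flag such as `stress_test` is one common way to keep simulated orders distinguishable from real ones downstream. How the real platform does this is not described in the article.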

Automatic collection and analysis of stress test result data, comprehensive and intuitive

The tool automatically collects the relevant stress test metrics from monitoring data, compares them with historical data in the report analysis stage, calculates a stress test quality score against the stress test standards, draws stress test conclusions, and gives optimization suggestions.

Schematic diagram of order push pressure test report generation


Support for Service Mesh, making stress testing in production easier and safer

The stress test platform supports cloud-native Service Mesh technology. Adapting a production environment for stress testing only requires changing the relevant configuration of the Pod nodes, with no changes to the system's business code. This greatly reduces the cost of stress testing in production and lowers the security risk that code changes might introduce.

▐ Core scenario 3: Stress test for third-party mini programs

Mini programs are an important form of Taoxi's openness on the app side. Because third-party mini programs are integrated into Taoxi apps such as Mobile Taobao and Qianniu, their stability affects the stability and user experience of those apps. For technical and security reasons, third-party service providers cannot run online stress tests against mini program interfaces on their own. We therefore developed a mini program stress testing platform that gives service provider partners an efficient, easy-to-use official stress testing tool.

The third-party mini program stress testing platform is simple to use and highly automated, which greatly lowers the technical threshold and the stress testing cost for service providers. Its core capabilities are as follows:

  • The system automatically analyzes online interface traffic, then creates and issues stress test tasks.

  • The service provider only needs to edit the stress test parameters and run the stress test task.

  • After the stress test passes and the report is submitted, the system sets the rate-limit protection value according to the stress-tested traffic value.

Third-party mini program stress test flowchart
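
The last step, deriving a rate-limit protection value from the stress-tested traffic, could look like the sketch below. The article does not say how the value is derived, so the `safety_factor` parameter and the rounding rule are assumptions chosen only to show the idea.

```python
import math


def rate_limit_from_stress_test(passed_qps, safety_factor=0.8):
    """Rate limit derived from the QPS an interface sustained in a passed stress test.

    safety_factor is a hypothetical margin: 1.0 uses the tested value as is,
    smaller values keep headroom below what was actually verified.
    """
    if passed_qps <= 0:
        raise ValueError("passed_qps must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    # Round down so the limit never exceeds the verified capacity.
    return math.floor(passed_qps * safety_factor)
```

Tying the limit to a verified value means an interface is never allowed more traffic in production than it has been shown to handle.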

In addition, we organized a joint rehearsal with the top core service providers. Through full-link stress testing and simulated peak service traffic, we exercised the emergency plans and communication mechanisms for handling anomalies and emergencies that may occur during the big promotion, forming a complete guarantee chain of stress testing, change freeze, and rehearsal:

  • Stress test the core third-party mini programs at the same time as the full-link stress test, to simulate a more realistic Double 11 environment.

  • Run anomaly drills during the stress test by deliberately injecting problems and faults.

  • Handle the problems and faults as emergencies, exercising communication and coordination mechanisms and emergency handling procedures.


Summary


The full-link stress test uncovered 130+ internal system problems, all of which were resolved before the big promotion. During the whole promotion period, the merchant business systems had zero failures, and merchant problem reports fell by 50% compared with previous years. Merchants and consumers enjoyed a smooth experience, and the interests and experience of merchants were protected.

More than one hundred core ISVs took part in the acceptance of the full-link stress test. The full-link stress test helped merchants and ecosystem partners rebuild their systems' high-availability baseline, bringing merchants to the same water level as Taobao in information and data processing and giving them the same ultra-large-scale data application capabilities as Alibaba. It helped merchant systems find and fix thousands of performance problems and greatly improved their stability, in terms of both efficiency and capability.

In the future, we plan to keep optimizing in three areas to provide merchants and partners with better services:

1. Intelligent stress testing based on algorithm capabilities.

2. Comprehensive automated analysis of scenarios, plans, rate limiting, and stress test results.

3. Multi-dimensional monitoring, with intelligent analysis, localization, and handling of massive alarm information.

Appendix 1:

The Phoenix ecosystem builds on the unified underlying capabilities provided by JVM-Sandbox and aims to create a jointly built ecosystem of atomic module capabilities, offering open and fast module development, management, authentication, and deployment (the Magic's Cube platform). By combining module capabilities, it derives higher-level product modules (Phoenix Platform 2.0) such as record and replay, fault injection, strong and weak dependency analysis, system mocks, rapid problem localization, and test quality evaluation, and it opens up APIs and capabilities over the data and standard capabilities produced at each stage. It delivers solutions in areas such as business regression, attack and defense drills, and architecture governance, and is committed to improving overall system stability quickly and efficiently.

Open source address:

https://github.com/alibaba/jvm-sandbox-repeater?spm=ata.13261165.0.0.11ee30bfW4qEoF

JVM-Sandbox is an AOP framework that weaves classes dynamically through Instrumentation. With carefully constructed bytecode enhancement logic, sandbox modules can intercept the methods of a target application with non-intrusive runtime AOP, without violating JDK constraints.

Open source address:

https://github.com/alibaba/JVM-Sandbox?spm=ata.13261165.0.0.5a094b01By8WRH

Taoxi Technology Department, Quality Team: we are hiring

We are responsible for the business quality of the entire Taobao and Tmall main sites, with rich and varied business scenarios and technical challenges. Here you can learn how a world-class Double 11 is safeguarded and help build the most challenging big-promotion stability products in the face of its massive peak traffic. Here you can safeguard interactive products with more than 100 million DAU and experience the industry's most complex marketing mechanics and their technical appeal. Here you can get close to today's booming content e-commerce and see how Li Jiaqi, Wei Ya, and others create best-sellers. And here you can explore front-line business and growth strategies driven by big data, and open up new tracks in e-commerce with new technologies such as 3D, AI, and 5G.

Here you will also work with a group of excellent colleagues. We pursue technical excellence, use the industry's most cutting-edge R&D technologies and ideas, innovate in quality assurance methods, tools, and platforms, improve R&D and testing efficiency, and keep improving user experience. Driven by technology, we are building an industry-leading quality system, and every optimization we make benefits hundreds of millions of users. We look forward to having you join us in building Taoxi's technical quality. Contact: [email protected]

Author| Dahai, Trace

Edit| Orange

Produced| Alibaba's new retail technology

Origin blog.csdn.net/Taobaojishu/article/details/111306348