E-commerce flash sales: an Alibaba solution

[Introduction] Xu Hanbin spent more than four years in R&D at Alibaba and Tencent, where he was responsible for upgrading and rebuilding web systems handling over 100 million requests per day. He is currently building a startup at Xiaoman Technology, working on SaaS service technology.


E-commerce flash sales (seckill) and snap-ups are familiar to all of us. From a technical point of view, however, they are a huge test for a web system. When a web system receives tens of thousands of requests, or more, within a single second, system optimization and stability become critical. This article focuses on the technical implementation and optimization of flash sales and snap-ups, and along the way explains, from the technical side, why it is so hard for us to grab train tickets.

1. Challenges brought by large-scale concurrency 

In past work, I once faced a flash-sale feature with 50,000 requests per second of high concurrency. During that period the whole web system ran into many problems and challenges. If a web system is not optimized in a targeted way, it can easily fall into an abnormal state. Let's discuss the ideas and methods of optimization together.

1. Reasonable design of request interface

A flash-sale or snap-up page is usually divided into two parts: one is static content such as HTML, and the other is the backend request interface that actually takes part in the sale.

Static content such as HTML is usually deployed through a CDN, so the pressure there is generally small; the real bottleneck is the backend request interface. This backend interface must be able to support highly concurrent requests, and, just as importantly, it must be as "fast" as possible, returning the result to the user in the shortest possible time. To achieve this, the interface's backend storage should ideally operate at memory level; it is not appropriate to hit storage such as MySQL directly. If the business genuinely requires it, use asynchronous writes.
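As a rough illustration of the asynchronous-write idea, here is a minimal Python sketch (assuming redis-py and a hypothetical write_order_to_mysql helper): the flash-sale interface only touches Redis, while a separate worker drains the queued orders into MySQL later. Key names and fields are made up for the example.

    import json
    import redis

    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    def accept_order(account_id, item_id):
        # The hot path only writes to memory-level storage (Redis).
        r.rpush("seckill:pending_orders", json.dumps({
            "account_id": account_id,
            "item_id": item_id,
        }))

    def persist_worker():
        # Runs outside the request path, draining orders into MySQL.
        while True:
            _, raw = r.blpop("seckill:pending_orders")   # blocks until an order arrives
            order = json.loads(raw)
            write_order_to_mysql(order)                  # hypothetical slow DB insert

    def write_order_to_mysql(order):
        pass  # placeholder: INSERT INTO orders ...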

 

Of course, some flash sales and snap-ups use "delayed feedback", meaning the result is not known at the moment of the sale; only after some time can users see on the page whether they succeeded. However, this is a "lazy" approach, it gives users a poor experience, and users can easily regard it as a "black box operation".

2. The challenge of high concurrency: must be "fast"

We usually measure the throughput of a web system in QPS (Queries Per Second, the number of requests handled per second), and this figure is critical for high-concurrency scenarios of tens of thousands of requests per second. For example, suppose the average response time for one business request is 100 ms, the system has 20 Apache web servers, and MaxClients is configured as 500 (the maximum number of Apache connection processes).

Then the theoretical peak QPS of our web system (calculated under ideal conditions) is:

20 * 500 / 0.1 = 100,000 (100k QPS)

Wow, our system looks very powerful: it can handle 100,000 requests per second, so a 50,000 requests/s flash sale seems like a "paper tiger". Reality, of course, is not so ideal. In a real high-concurrency scenario the machines are all under heavy load, and the average response time rises sharply.

As far as the web server is concerned, the more connection processes Apache opens, the more context switching the CPU has to handle, which increases CPU consumption and directly drives up the average response time. So the MaxClients value mentioned above must be chosen with hardware factors such as CPU and memory in mind; more is not necessarily better. You can benchmark with ab (ApacheBench, the tool that ships with Apache) and pick an appropriate value. Next, we choose Redis as memory-level storage; under high concurrency, the storage's response time is critical. Network bandwidth is also a factor, but such request packets are generally small and rarely become the bottleneck. Load balancing rarely becomes the system bottleneck either, so I will not discuss it here.

Now the problem arrives. Suppose that under the high concurrency of 50,000 requests/s, our system's average response time rises from 100 ms to 250 ms (in reality it can be even worse):

20 * 500 / 0.25 = 40,000 (40k QPS)

That leaves our system with 40,000 QPS; facing 50,000 requests per second, there is a shortfall of 10,000 in between.
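For readers who want to play with the numbers, this tiny Python snippet just reproduces the back-of-the-envelope estimate used above (peak QPS is roughly servers * connections per server / average response time):

    servers = 20
    max_clients = 500

    for avg_response_seconds in (0.1, 0.25):
        peak_qps = servers * max_clients / avg_response_seconds
        print("avg response %d ms -> %d QPS" % (avg_response_seconds * 1000, peak_qps))

    # avg response 100 ms -> 100000 QPS
    # avg response 250 ms -> 40000 QPS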

Then the real nightmare begins. Picture a highway toll gate: 5 cars arrive per second and 5 cars pass per second, so the gate works normally. Suddenly only 4 cars can get through per second while the traffic flow stays the same, and the inevitable result is a massive traffic jam (5 lanes suddenly become 4).

Likewise, in a given second, all 20 * 500 available connection processes are working at full capacity, yet 10,000 new requests still arrive with no connection process free to serve them, and the system can be expected to fall into an abnormal state.

 

In fact, something similar happens even in ordinary, non-high-concurrency business scenarios: one business interface has a problem and its response time becomes extremely slow, the response time of the entire web request stretches out, the web server's available connections are gradually used up, and other, perfectly normal business requests find no connection process available.

The more frightening problem is user behavior. The less usable the system is, the more frequently users click, and the resulting vicious circle eventually leads to an "avalanche" (one web machine goes down, its traffic spills over onto the machines that are still healthy, those machines go down in turn, and so on), dragging down the entire web system.

3. Restart and overload protection

If an "avalanche" occurs in the system, restarting the service rashly will not solve the problem. The most common phenomenon is that it hangs up immediately after starting up. At this time, it is best to deny the traffic at the ingress layer, and then restart it. If the service like redis/memcache also hangs, you need to pay attention to "warm-up" when restarting, and it may take a long time.

In flash-sale and snap-up scenarios, the traffic often exceeds what our system has prepared for and imagined, so overload protection is necessary. If the system detects that it is fully loaded, rejecting requests is itself a protective measure. Setting up filtering on the front end is the easiest, but it is the kind of behavior that users will "point a thousand fingers at". It is more appropriate to place the overload protection at the CGI entry layer, so that the client's requests can be rejected quickly.
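A minimal sketch of entry-layer overload protection in Python, assuming a generic handler function and an illustrative in-flight threshold: if too many requests are already being processed, fail fast instead of queueing them.

    import threading

    MAX_INFLIGHT = 800  # illustrative threshold; tune it to the measured capacity
    _inflight = threading.BoundedSemaphore(MAX_INFLIGHT)

    def handle_with_overload_protection(handler, request):
        # Fast rejection: do not wait for a slot, answer "busy" immediately.
        if not _inflight.acquire(blocking=False):
            return "503 Service Busy, please retry later"
        try:
            return handler(request)
        finally:
            _inflight.release()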

2. The means of cheating: offense and defense

Flash sales and snap-ups receive "massive" numbers of requests, but in reality much of that volume is padding. Many users, in order to "grab" the goods, use auxiliary tools such as "ticket-brushing" helpers to send as many requests to the server as possible on their behalf, and some advanced users write powerful automatic request scripts. The reasoning is simple: among the requests taking part in a flash sale or snap-up, the more requests you send, the higher your probability of success.

These are all "cheating methods", but if there is "offense", there is "defense". This is a battle without gunpowder smoke.

1. The same account sending multiple requests at once

Some users use browser plug-ins or other tools to send hundreds of requests or more from their own account the moment the flash sale starts. Such users undermine the fairness of flash sales and snap-ups.

In systems without proper data-safety handling, this kind of request can also cause another sort of damage: some judgment conditions can be bypassed. Take a simple claim logic: first check whether the user already has a participation record; if not, the claim succeeds, and finally the participation record is written. The logic is very simple, but under high concurrency it hides a deep flaw. Multiple concurrent requests are distributed by the load balancer to several web servers on the intranet; they all send their query to storage first, and within the time window before any one request's participation record is successfully written, the others' query results all come back as "no participation record". There, the logical check is at risk of being bypassed.

 


Response plan:

At the program entry, allow only one request per account to be accepted and filter out the rest. This not only solves the problem of the same account sending N requests, it also keeps the subsequent logic safe. One implementation is to write a flag bit through a memory cache service such as Redis (only one request is allowed to write it successfully, for example combined with the optimistic-locking behavior of watch); only the request that writes successfully continues to participate.
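Here is a minimal sketch of the flag-bit idea with redis-py. Instead of watch, it uses SET with NX, which is already an atomic "write only if absent" and is enough for "only one request may write successfully"; the key name and TTL are illustrative assumptions.

    import redis

    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    def try_enter_seckill(account_id, activity_id, ttl_seconds=600):
        # SET key value NX EX ttl: only the first concurrent writer gets True,
        # every later request from the same account is filtered out.
        flag_key = "seckill:%s:entered:%s" % (activity_id, account_id)
        return bool(r.set(flag_key, 1, nx=True, ex=ttl_seconds))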

 

Alternatively, implement a service yourself that puts requests from the same account into a queue and processes them one at a time.

2. Multiple accounts sending multiple requests at once

In their early days, many companies placed almost no restrictions on account registration, so it was easy to register large numbers of accounts. This gave rise to special "studios" that accumulate huge numbers of "zombie accounts" by writing automatic registration scripts, from tens of thousands up to hundreds of thousands of accounts, specializing in all kinds of brushing behavior (this is also where Weibo's "zombie followers" come from). For example, if tens of thousands of "zombie accounts" join a retweet lottery on Weibo, the probability of winning rises dramatically.

Such accounts are used in flash sales and snap-ups for the same reason, for example rushing to buy iPhones on the official website, or scalping train tickets.

 

Response plan:

This kind of scenario can be handled by detecting the request frequency of individual IPs. If you find that a certain IP's request frequency is very high, you can pop up a verification code for it or simply block its requests (a frequency-counting sketch follows the list below):

  1. The core purpose of the pop-up verification code is to identify real users. You may have noticed that the verification codes on some websites look like wild scribbles, sometimes so distorted that we can hardly read them. They do this so that the verification-code image cannot be recognized easily, because a powerful "automatic script" can recognize the characters in the image and fill the code in automatically. In fact, some more creative verification codes work even better, such as asking you a simple question or having you perform a simple operation (Baidu Tieba's verification code, for example).
  2. Banning IPs outright is a bit crude, because some real users happen to share the same exit IP, so there can be "collateral damage". However, this method is simple and efficient and, depending on the actual scenario, can achieve good results.
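As promised above, here is a minimal sketch of per-IP frequency counting with redis-py, using a simple counter that expires after a short window; the limit and window are illustrative assumptions, and a real deployment would tune them and handle shared exit IPs more carefully.

    import redis

    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    def ip_over_limit(ip, limit=20, window_seconds=1):
        # Count requests per IP inside a short time window.
        key = "ratelimit:%s" % ip
        count = r.incr(key)
        if count == 1:
            # First hit in this window: start the expiry clock.
            r.expire(key, window_seconds)
        return count > limit

    # Usage: if ip_over_limit(client_ip): show a captcha or reject the request.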

3. Multiple accounts sending requests from different IPs

As the saying goes, as the Tao rises one foot, the devil rises ten. Where there is offense there will be defense, and it never ends. After these "studios" discovered that you were limiting the request frequency of individual machine IPs, they came up with a "new attack plan" for this scenario, which is to keep changing IPs.

 

Some readers will wonder where these random IP services come from. Some organizations occupy batches of independent IPs and build random proxy-IP services on top of them, which they sell to these "studios" for a fee. There are also darker approaches, such as infecting ordinary users' computers with a Trojan. The Trojan does not disrupt the normal operation of the user's computer; it does only one thing, forwarding IP packets, which turns the ordinary user's computer into a proxy IP exit. In this way hackers obtain a large number of independent IPs and then build a random IP service out of them, simply to make money.

Response plan:

Frankly, requests in this scenario behave almost exactly like real users and are hard to distinguish; tightening the restrictions further can easily cause "collateral damage" to real users. At this point, you can usually only limit such requests by raising the business threshold, or screen the accounts out in advance through "data mining" of account behavior.

Zombie accounts also share some common characteristics: they are likely to fall in the same number range or even be consecutive, and they tend to have low activity, low levels, and incomplete profile information. Based on these characteristics, set an appropriate participation threshold, for example requiring a minimum account level to join the flash sale. Business measures like these can also filter out some of the zombie accounts.

4. The rush for train tickets

Having read this far, do you see why you cannot grab a train ticket? If you just grab tickets honestly, it really is hard. Through the multi-account approach, train-ticket scalpers occupy a large share of the tickets, and some powerful scalpers are even more "skilled" at handling verification codes.

When advanced scalpers brush tickets, they use real people to recognize the verification codes: a relay software service in the middle displays the verification-code image, a real person looks at the picture, fills in the real code, and sends it back to the relay software. With this approach, the protection and restriction provided by verification codes are nullified, and there is currently no good solution.

 

Because train tickets are tied to identity documents, there is also a "ticket transfer" way of operating. The general method is: first start a ticket-grabbing tool with the buyer's ID card and keep it sending requests; the scalper's account then refunds the ticket, and the buyer's ID card immediately purchases the released ticket. When a train is sold out, not many people are watching for returned tickets, and the scalpers' grabbing tools are very powerful, so even if we happen to see a refunded ticket, we may not be able to beat them to it.

 

In the end, the scalper successfully transferred the train ticket to the buyer's ID card.

Response plan:

There is no good solution here either. The only thing worth trying is "data mining" on account data. These scalper accounts also share common characteristics, such as frequent ticket grabbing and refunds and unusually high activity around holidays. Analyze them and apply further processing and screening.

3. Data security under high concurrency

We know that when multiple threads write to the same file there is a "thread safety" problem (code is thread-safe if, when multiple threads run it at the same time, the result is the same as when a single thread runs it, that is, the result matches expectations). With a MySQL database you can rely on its own locking mechanisms, but MySQL is not recommended for large-scale concurrency. In flash-sale and snap-up scenarios there is another problem, "overselling": if it is not controlled carefully, more items get sold than exist. We have all heard of e-commerce sites running snap-up events where, after buyers successfully place orders, the merchant refuses to recognize the orders as valid and refuses to ship. The problem here is not necessarily a dishonest merchant; it may be an overselling risk at the technical level of the system.

1. Why overselling happens

Suppose that in a snap-up scenario we have only 100 items in total. At the last moment, 99 items have already been consumed and only the last one is left. At that instant the system receives several concurrent requests, all of which read the same stale stock figure and still see the last item as available, so they all pass the remaining-stock check, and the final result is overselling. (This is the same pattern as the scenario described earlier in the article.)

 

In that situation, the concurrent user B also "snaps up successfully", so one more person obtains the product than should. This scenario is very easy to hit under high concurrency.

2. Pessimistic locking idea

There are many ways to tackle thread safety; let's start the discussion from the direction of "pessimistic locking".

Pessimistic locking means that while the data is being modified, it is held in a locked state to exclude modifications from outside requests; anything that meets the locked state must wait.
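As an illustration only (the article notes that hitting MySQL directly is not recommended at this scale), here is a minimal sketch of the pessimistic pattern using MySQL row locks via SELECT ... FOR UPDATE with the pymysql client; the connection parameters, table, and column names are made up for the example.

    import pymysql

    conn = pymysql.connect(host="localhost", user="app", password="secret",
                           database="shop", autocommit=False)

    def buy_one(item_id):
        with conn.cursor() as cur:
            # FOR UPDATE holds a row lock, so other transactions must wait here.
            cur.execute("SELECT stock FROM items WHERE id = %s FOR UPDATE", (item_id,))
            (stock,) = cur.fetchone()
            if stock <= 0:
                conn.rollback()
                return False
            cur.execute("UPDATE items SET stock = stock - 1 WHERE id = %s", (item_id,))
            conn.commit()
            return True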

 

The solution above does solve the thread-safety problem, but don't forget that our scenario is "high concurrency". That means there will be many such modification requests, each waiting its turn for the "lock"; some threads may never get the chance to grab it, and those requests will die waiting there. At the same time, the sheer number of such requests instantly pushes up the system's average response time, the available connections are exhausted, and the system falls into an abnormal state.

3. FIFO queue idea

Fine, then let's modify the scenario slightly: put the requests directly into a queue and process them FIFO (First In, First Out), so that no request is left waiting forever for the lock. Reading this, doesn't it feel a bit like forcibly turning multithreading back into single-threading?
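A minimal sketch of the FIFO idea in Python: requests are appended to a bounded queue and a single worker consumes them in order, so the critical section is serialized; handle_purchase is a hypothetical placeholder for the real stock deduction and order writing.

    import queue
    import threading

    request_queue = queue.Queue(maxsize=10000)  # bounded, so memory cannot grow without limit

    def handle_purchase(account_id):
        pass  # hypothetical: deduct stock, write the order, notify the user

    def worker():
        # The single consumer: requests are processed strictly first in, first out.
        while True:
            account_id = request_queue.get()
            try:
                handle_purchase(account_id)
            finally:
                request_queue.task_done()

    threading.Thread(target=worker, daemon=True).start()

    def enqueue(account_id):
        try:
            request_queue.put_nowait(account_id)  # reject instead of blocking when the queue is full
            return True
        except queue.Full:
            return False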

 

Now we have solved the lock problem: all requests are handled in a "first in, first out" queue. But a new problem appears. In a high-concurrency scenario, there are so many requests that the queue's memory may be "blown up" in an instant, and the system falls into an abnormal state again. Designing a huge in-memory queue is another option, but the speed at which the system drains requests from the queue can never keep up with the rate at which requests pour madly into it. In other words, the requests in the queue pile up more and more, the web system's average response time still degrades sharply, and the system still ends up in an abnormal state.

4. Optimistic locking idea

At this point, we can discuss the idea of "optimistic locking". Optimistic locking is a looser locking mechanism than "pessimistic locking", and it is mostly implemented with a version number. The idea is that every request for this data is eligible to modify it, but each one also reads the data's version number; only the update whose version number still matches succeeds, and the others are told that the snap-up failed. With this approach we no longer need to worry about queues, though it does increase the CPU's computational overhead. On the whole, however, this is the better solution.

 

Many pieces of software and services support "optimistic locking"; for example, watch in Redis is one of them. With this kind of implementation, we keep the data safe.
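To make the Redis watch approach concrete, here is a minimal sketch with redis-py: if another client changes the stock key between WATCH and EXEC, the transaction aborts and we retry a few times or report failure. The key name and retry count are illustrative assumptions.

    import redis

    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    def buy_with_optimistic_lock(item_key, max_retries=3):
        for _ in range(max_retries):
            with r.pipeline() as pipe:
                try:
                    pipe.watch(item_key)                 # start watching the stock key
                    stock = int(pipe.get(item_key) or 0)
                    if stock <= 0:
                        pipe.unwatch()
                        return False                     # sold out
                    pipe.multi()
                    pipe.decr(item_key)
                    pipe.execute()                       # raises WatchError if the key changed meanwhile
                    return True
                except redis.WatchError:
                    continue                             # someone got there first; try again
        return False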

4. Summary

The Internet is developing rapidly, and as more people use Internet services, high-concurrency scenarios become more and more common. E-commerce flash sales and snap-ups are two typical high-concurrency scenarios on the Internet. Although the specific technical solutions to these problems may differ widely, the challenges encountered are similar, so the ideas for solving them are similar too.
