Improper use of a Redis distributed lock caused a major accident: 100 bottles of Feitian Moutai were oversold!

The use of distributed locks based on Redis is nothing new nowadays.

This article analyzes an accident caused by a Redis distributed lock in one of our real projects, and the fixes we applied. Flash-sale orders in our project are guarded by a distributed lock. On one occasion, operations ran a flash sale of Feitian Moutai. There were 100 bottles in stock, yet another 100 bottles were oversold! You know how scarce Feitian Moutai is!

The accident was classified as a P0 major accident, and we could only accept it. The whole project team had its performance score docked. Afterwards, the CTO named me personally to lead the fix.

Okay, let's go.

Accident scene

After some digging, I learned that this flash-sale interface had never oversold before. So why did it happen this time?
The reason is that the products in earlier flash sales were not scarce, whereas this event offered Feitian Moutai. The event-tracking data showed that basically every metric had doubled, so you can imagine how hot the event was! Without further ado, here is the core code; the confidential parts are replaced with pseudo-code.

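public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response;
    String key = "key:" + request.getSeckillId();
    try {
        // Acquire the lock with a 10s expiration; the value "val" is the same for every caller
        Boolean lockFlag = redisTemplate.opsForValue().setIfAbsent(key, "val", 10, TimeUnit.SECONDS);
        if (lockFlag) {
            // Call the user service over HTTP for user-related verification
            // Verify the user's eligibility for the activity

            // Inventory check (non-atomic: read first, then compare)
            Object stock = redisTemplate.opsForHash().get(key + ":info", "stock");
            assert stock != null;
            if (Integer.parseInt(stock.toString()) <= 0) {
                // Business exception
            } else {
                redisTemplate.opsForHash().increment(key + ":info", "stock", -1);
                // Create the order
                // Publish the order-created event
                // Build the response VO
            }
        }
    } finally {
        // Release the lock (unconditionally, whoever holds it)
        stringRedisTemplate.delete(key);
        // Build the response VO
    }
    return response;
}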

The above code relies on the lock's 10s expiration time to give the business logic enough time to execute, and uses a try-finally block to make sure the lock is released promptly. The business code also checks the inventory. It looks very safe. Don't worry, let's continue the analysis.

Cause of the accident

The Feitian Moutai flash sale attracted a large number of new users to download and register in our app. Among them were many bonus hunters (the so-called "wool party") who use professional tools to register new accounts in bulk and place fraudulent orders. Of course, our user system was prepared: Alibaba Cloud human-machine verification, three-factor authentication, our self-developed risk-control system, and other measures blocked a large number of illegitimate users. Credit where it is due. But precisely because of this, the user service was running under a high load.

The moment the flash sale started, a large number of user-verification requests hit the user service, and the user service gateway briefly responded slowly. Some requests took longer than 10s, but because the HTTP request timeout was set to 30s, the interface stayed blocked on user verification. After 10s the distributed lock expired, so a new request could acquire it; in effect the lock was taken over. When a blocked request finally completed, it executed the lock-release logic and thereby released a lock now held by another thread, letting yet more requests compete for the lock. This is a truly vicious cycle. At that point only the inventory check stood in the way, but the inventory check is not atomic: it reads the stock and then compares it. That is how the overselling tragedy happened.
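One way to keep a slow dependency from outliving the lock is to bound the verification call with a timeout shorter than the lock's expiration time. The following is only a minimal sketch under assumptions: `verifyWithin` is a hypothetical helper, the `Callable` stands in for the HTTP call to the user service, and a real service would reuse a shared thread pool or simply configure its HTTP client's timeout.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedVerification {
    /** Runs the check but gives up after timeoutMillis; returns false on timeout or failure. */
    static boolean verifyWithin(Callable<Boolean> check, long timeoutMillis) {
        // A pool per call keeps the sketch self-contained; share one pool in real code.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Boolean> future = pool.submit(check);
        try {
            return Boolean.TRUE.equals(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow call so the request fails fast
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean fast = verifyWithin(() -> true, 500);
        boolean slow = verifyWithin(() -> { Thread.sleep(2_000); return true; }, 100);
        System.out.println(fast + " " + slow); // prints "true false"
    }
}
```

With a bound like this, a request whose verification is too slow gives up before the 10s lock can expire underneath it.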

Accident analysis

Careful analysis shows that this flash-sale interface has serious safety hazards under high concurrency, concentrated in three places:

  • No fault tolerance for risks in other systems.
    The user service was under heavy load and its gateway responded slowly, but the interface had no way of coping with that. This was the trigger for the overselling.
  • The seemingly safe distributed lock is actually not safe at all.
    Although the lock is taken with set key value [EX seconds] [PX milliseconds] [NX|XX], if thread A runs so long that the lock expires before A releases it, thread B can acquire the lock. When thread A finishes, its release actually deletes thread B's lock. Thread C can then acquire the lock, and when thread B finishes and releases, it actually deletes the lock set by thread C. This is the direct cause of the overselling.
  • Non-atomic inventory check.
    A non-atomic inventory check gives inaccurate results under concurrency. This is the root cause of the overselling.
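The thread A / B / C timeline in the second point can be replayed without a Redis server. This is only an illustrative sketch: `ExpiredLockDemo` is a hypothetical in-memory stand-in for Redis, the clock is passed in by hand (in seconds), and the timestamps are made up to match the 10s lock in the article.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExpiredLockDemo {
    // In-memory stand-in for Redis; time is passed in explicitly, in seconds.
    private final Map<String, String> owner = new HashMap<>();
    private final Map<String, Long> expiresAt = new HashMap<>();

    /** Like SET key owner NX EX ttl: succeeds only if the key is absent or has expired. */
    boolean lock(String key, String who, long now, long ttl) {
        Long expiry = expiresAt.get(key);
        if (expiry != null && expiry > now) {
            return false;
        }
        owner.put(key, who);
        expiresAt.put(key, now + ttl);
        return true;
    }

    /** Like DEL key: removes the lock regardless of who holds it. */
    void unsafeUnlock(String key) {
        owner.remove(key);
        expiresAt.remove(key);
    }

    /** Replays the timeline and returns the result of each lock attempt in order. */
    static List<Boolean> replay() {
        ExpiredLockDemo redis = new ExpiredLockDemo();
        List<Boolean> acquired = new ArrayList<>();
        acquired.add(redis.lock("k", "A", 0, 10));  // t=0: A gets the lock
        acquired.add(redis.lock("k", "B", 5, 10));  // t=5: B is rejected, A's lock is live
        acquired.add(redis.lock("k", "B", 12, 10)); // t=12: A is still blocked, lock expired, B gets in
        redis.unsafeUnlock("k");                    // t=13: A finishes and deletes B's lock
        acquired.add(redis.lock("k", "C", 13, 10)); // t=13: C gets in while B is still working
        return acquired;
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints [true, false, true, true]
    }
}
```

After the last step, B and C are both inside the section the lock was supposed to protect.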

The analysis shows that the underlying problem is that the inventory check depends entirely on the distributed lock. As long as the lock is set and deleted correctly, the inventory check works. Once the lock is no longer safe and reliable, the inventory check is useless.

Solution

After knowing the reason, we can prescribe the right medicine.

Implement a relatively safe distributed lock

Definition of relatively safe: every set is paired one-to-one with its own del, so a client never deletes a lock that belongs to someone else. In practice, even with one-to-one set and del, the business cannot be made absolutely safe, because the lock's expiration time is always bounded. Not setting an expiration time, or setting a very long one, brings other problems, so that is not a sensible option. To implement a relatively safe distributed lock, we must rely on the key's value: on release, the unique value is used to ensure that a lock held by someone else is not deleted. We implement an atomic get-and-compare with a Lua script, as follows:

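public void safedUnLock(String key, String val) {
    // Delete the key only if it still holds our value; GET and DEL run atomically inside the script.
    // (The original script named a local variable "in", which is a reserved word in Lua.)
    String luaScript = "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end";
    RedisScript<Long> redisScript = RedisScript.of(luaScript, Long.class);
    // The value is passed as a script argument (ARGV[1]), not wrapped in a collection
    redisTemplate.execute(redisScript, Collections.singletonList(key), val);
}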

With this Lua script, a client can only release a lock that still holds its own value.
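The same compare-then-delete idea can be tried out in plain Java, where `ConcurrentHashMap.remove(key, value)` removes an entry only if it is still mapped to the given value, atomically. This is only an analogy under assumptions: `OwnerCheckedUnlock` is a hypothetical in-memory stand-in, and `expire` simulates the lock's expiration by hand.

```java
import java.util.concurrent.ConcurrentHashMap;

public class OwnerCheckedUnlock {
    private final ConcurrentHashMap<String, String> locks = new ConcurrentHashMap<>();

    /** Acquires the lock if the key is free; the token identifies the holder. */
    boolean lock(String key, String token) {
        return locks.putIfAbsent(key, token) == null;
    }

    /** Simulates the expiration time elapsing. */
    void expire(String key) {
        locks.remove(key);
    }

    /** Releases the lock only if it still holds this caller's token (atomic compare-and-remove). */
    boolean unlock(String key, String token) {
        return locks.remove(key, token);
    }

    public static void main(String[] args) {
        OwnerCheckedUnlock redis = new OwnerCheckedUnlock();
        redis.lock("k", "token-A");
        redis.expire("k");                              // A's lock expires while A is blocked
        redis.lock("k", "token-B");                     // B acquires the lock
        System.out.println(redis.unlock("k", "token-A")); // prints false: A cannot delete B's lock
        System.out.println(redis.unlock("k", "token-B")); // prints true: B releases its own lock
    }
}
```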

Achieve safe inventory verification

With a deeper understanding of concurrency, we can see that operations such as get-and-compare or read-and-save are not atomic. To make them atomic we could again use a Lua script. In our case, however, each order can buy only one bottle, so we can rely on the atomicity of a single Redis command instead of a Lua script. The reason is:

// Redis returns the value after the operation, and the whole operation is atomic
Long currStock = redisTemplate.opsForHash().increment("key", "stock", -1);

See it? The separate inventory check in the original code is completely superfluous.
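The "decrement first, then judge the returned value" pattern can be checked locally under concurrency. This is a rough sketch with an `AtomicInteger` standing in for the Redis hash field; the class name, the thread-pool size, and the buyer count are all made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicStockDemo {
    /** Simulates `buyers` concurrent requests against `stock` units; returns the number of orders created. */
    static int sell(int stock, int buyers) {
        AtomicInteger remaining = new AtomicInteger(stock); // stand-in for the "stock" field in Redis
        AtomicInteger orders = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(16);
        CountDownLatch done = new CountDownLatch(buyers);
        for (int i = 0; i < buyers; i++) {
            pool.execute(() -> {
                // Decrement atomically, then decide based on the value that came back
                if (remaining.decrementAndGet() >= 0) {
                    orders.incrementAndGet();
                }
                done.countDown();
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return orders.get();
    }

    public static void main(String[] args) {
        System.out.println(sell(100, 1000)); // prints 100
    }
}
```

Exactly `stock` of the decrements can return a non-negative value, so the number of orders never exceeds the stock, however the threads interleave.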

Improved code

After the above analysis, we decided to create a new DistributedLocker class specifically for handling distributed locks.

public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response;
    String key = "key:" + request.getSeckillId();
    // A unique value per request, so that only the holder can release the lock
    String val = UUID.randomUUID().toString();
    try {
        Boolean lockFlag = distributedLocker.lock(key, val, 10, TimeUnit.SECONDS);
        if (!lockFlag) {
            // Business exception
        }

        // Verify the user's eligibility for the activity
        // Inventory check, guaranteed by the atomicity of Redis itself
        Long currStock = stringRedisTemplate.opsForHash().increment(key + ":info", "stock", -1);
        if (currStock < 0) {
            // The stock has already been used up.
            // Business exception.
            log.error("[Flash-sale order] out of stock");
        } else {
            // Create the order
            // Publish the order-created event
            // Build the response
        }
    } finally {
        distributedLocker.safedUnLock(key, val);
        // Build the response
    }
    return response;
}

Deep thinking

Is distributed lock necessary?

After the improvement, we can see that the atomic deduction in Redis alone is enough to guarantee that we do not oversell. That is correct. But without the lock, every request would run through the business logic. Because that logic depends on other systems, the pressure on those systems would increase, hurting performance and stability, and the loss would outweigh the gain. The distributed lock intercepts part of the traffic.

Selection of distributed locks

Someone proposed using RedLock to implement the distributed lock. RedLock is more reliable, but at the cost of some performance. In this scenario, the extra reliability is worth far less than the performance given up for it. For scenarios with extremely high reliability requirements, RedLock is an option.

Thinking again: is the distributed lock necessary?

Because the bug had to be fixed urgently, we made the optimization, stress-tested it in the test environment, and deployed it online right away. The optimization proved successful: performance improved slightly, and no overselling occurred even when the distributed lock failed. Is there still room for optimization? Yes! Since the service is deployed as a cluster, we can divide the inventory equally among the servers in the cluster and notify each server of its share by broadcast. The gateway layer hashes the user ID to decide which server handles a request. In this way, inventory deduction and checking can be done in the application's local cache, and performance improves further!

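// Initialized in advance via a message; ConcurrentHashMap gives efficient thread safety
private static final ConcurrentHashMap<Long, Boolean> SECKILL_FLAG_MAP = new ConcurrentHashMap<>();
// Set in advance via a message. AtomicInteger makes each decrement atomic; the map is also a
// ConcurrentHashMap so that entries written by the message listener are visible to request threads
private static final Map<Long, AtomicInteger> SECKILL_STOCK_MAP = new ConcurrentHashMap<>();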

...

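public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response;

    Long seckillId = request.getSeckillId();
    // Treat a missing flag the same as a closed activity
    if (!Boolean.TRUE.equals(SECKILL_FLAG_MAP.get(seckillId))) {
        // Business exception
    }
    // Verify the user's eligibility for the activity
    // Inventory check
    if (SECKILL_STOCK_MAP.get(seckillId).decrementAndGet() < 0) {
        // This server's share is sold out; close the activity locally
        SECKILL_FLAG_MAP.put(seckillId, false);
        // Business exception
    }
    // Create the order
    // Publish the order-created event
    // Build the response
    return response;
}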

With this change, we no longer need to rely on Redis at all, and both performance and safety improve further! Of course, this solution does not handle complex scenarios such as dynamically scaling the cluster out or in. If those must be handled, it is better to go straight to the distributed-lock solution.
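The handler shows the per-server deduction, but not how the stock is divided or how the gateway picks a server. One possible sketch of those two steps follows; `StockPartitioner`, `split`, and `route` are hypothetical names, and the routing simply takes the user ID's hash modulo a fixed server count, which is exactly the part that breaks when the cluster is resized.

```java
import java.util.Arrays;

public class StockPartitioner {
    /** Splits the total stock across the servers; the first (total % servers) servers get one extra unit. */
    static int[] split(int total, int servers) {
        int[] shares = new int[servers];
        for (int i = 0; i < servers; i++) {
            shares[i] = total / servers + (i < total % servers ? 1 : 0);
        }
        return shares;
    }

    /** Maps a user ID to a server index; the same user always lands on the same server. */
    static int route(long userId, int servers) {
        return Math.floorMod(Long.hashCode(userId), servers);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split(100, 3))); // prints [34, 33, 33]
        System.out.println(route(42L, 3));                  // prints 0
    }
}
```

Because every server only sells its own share, the shares sum to the total stock and no cross-server coordination is needed during the sale. A server that sells out early rejects its users even if other servers still have stock, which is the price of this design.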

Summary

Overselling a scarce product is definitely a major accident. If the oversold quantity is large, it can even have a very serious operational and social impact on the platform. This accident taught me that no line of code in a project should be taken lightly; otherwise, in some scenarios, code that normally works fine becomes a deadly killer! A developer designing a solution must think it through thoroughly. And how do we learn to think things through? Only by continuous learning!


Origin blog.csdn.net/weixin_49527334/article/details/111858176