ByteDance first-round interview: What is the relationship between transaction compensation and transaction retry?

Preface

In Nien's reader community (50+), Nien regularly coaches members on architecture interviews and helps them land high-end offers. Recently, during interviews at ByteDance and Ping An, a member ran into a question that comes up very frequently but is hard to answer well. It looks like the following:

  • Talk about the compensation mechanism in distributed systems. What is the relationship between compensation and retry?
  • What is the relationship between "transaction compensation" and "retry"?
  • Talk about how to design the compensation mechanism in a distributed system.

Here Nien gives a systematic review of the topic, so that you can show solid technical strength and impress the interviewer.

This question and its reference answer are also included in the V99 version of our "Nien Java Interview Guide PDF", for later readers to use in improving their architecture, design, and development skills.

For the latest PDFs of "Nien Architecture Notes", "Nien High Concurrency Trilogy" and "Nien Java Interview Guide", please go to the official account [Technical Freedom Circle].

1. Why should we consider the compensation mechanism?

Applications running in a distributed environment face a major problem when they communicate: a business process usually has to integrate multiple services, and even a single call may pass through DNS services, network cards, switches, routers, load balancers and other equipment.

Take an e-commerce shopping scenario as an example:

Client---->Shopping Cart Microservice---->Order Microservice---->Payment Microservice.

This kind of call chain is very common.

So why do we need to consider compensation mechanisms?

As mentioned earlier, a single cross-machine call may pass through DNS services, network cards, switches, routers, load balancers and other equipment. None of these is guaranteed to be stable all the time, and a fault in any link during data transmission can cause the call to fail.

In a distributed scenario, a complete business operation is composed of multiple cross-machine calls, so the chance that at least one of them fails grows with every additional call.
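To put rough numbers on this, here is a minimal sketch. It assumes that every call fails independently and that all calls have the same success rate; both are simplifications, and the class name and the figures are purely illustrative.

```java
final class CallChainReliability {
    /**
     * Probability that every call in the chain succeeds, assuming the calls
     * fail independently and share the same per-call success rate.
     */
    static double chainSuccessProbability(double perCallSuccess, int calls) {
        return Math.pow(perCallSuccess, calls);
    }

    /** Probability that at least one call in the chain fails. */
    static double chainFailureProbability(double perCallSuccess, int calls) {
        return 1.0 - chainSuccessProbability(perCallSuccess, calls);
    }
}
```

With three calls that each succeed 99% of the time, `chainSuccessProbability(0.99, 3)` is about 0.9703, so roughly 3 in 100 business operations hit at least one failure.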

In a microservice environment this matters even more, because the business must stay consistent across services.

That is, if a step fails, you either keep retrying until all steps complete successfully, or roll the service calls back to the previous state.

Therefore, we can understand business compensation this way: when an operation fails with an exception, an internal mechanism eliminates the "inconsistent" state that the exception caused.

The terms "compensation", "transaction compensation" and "retry" often appear together. What is the relationship between them?

2. How to compensate?

Business compensation can be implemented in two main ways: rollback (transaction compensation) and retry.

  • Rollback (transaction compensation) is a reverse operation: it treats the current operation as failed and resolves the problem by undoing the business process;

  • Retry is a forward operation: it keeps trying to complete the business process, on the assumption that success is still possible.

Typically, business transaction compensation needs the support of a workflow engine. This transactional workflow engine chains the services together and performs compensation on the workflow to achieve eventual consistency.

Because "compensation" is already an extra process, timeliness is not the first consideration. The core principle of compensation is therefore: better slow than wrong.

A compensation scheme should not be decided hastily; it needs careful evaluation. Errors cannot be avoided completely, but the aim is to minimize them.

1. Rollback

There are two forms of rollback:

  • Explicit rollback (calls a reverse interface): calls the reverse interface to undo the previous operation, or cancels a previous operation that has not finished yet (this requires locking resources);
  • Implicit rollback (no reverse interface call): the rollback needs no extra handling by the caller; the failure handling is usually provided by the downstream service.

Explicit rollback

An explicit rollback usually involves two things:

  • First, identify the failed operation and its state in order to determine the rollback scope. A business process is planned at design time, so the scope is usually easy to determine. Note, however, that if not every service in the process provides a "rollback interface", the services that do provide one should be placed first when orchestrating the services, so that there is still a chance to "roll back" when a later service fails.
    In short, make sure the rollback interface has a chance to be called; the best choice is to call those services first.
  • Second, provide the business data that the "rollback" operation needs. The more rollback data is provided, the more robust the program, because the receiver of the "rollback" can run business checks, such as whether the accounts match and whether the amounts are consistent.

The structure and size of this data are not fixed in advance, so it is convenient to serialize it as JSON and store it in a NoSQL database.
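The explicit rollback described above can be sketched as a small workflow runner. This is only an illustration under assumptions: `CompensatingWorkflow` and `Step` are hypothetical names, each step is assumed to offer a reverse operation, and the state is kept in memory, whereas a real workflow engine would persist its progress and retry failed compensations.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class CompensatingWorkflow {
    /** One step of the business process together with its reverse operation. */
    static final class Step {
        final Runnable action;
        final Runnable compensation;

        Step(Runnable action, Runnable compensation) {
            this.action = action;
            this.compensation = compensation;
        }
    }

    /**
     * Runs the steps in order. If a step throws, the compensations of the steps
     * that already completed are called in reverse order; the failed step itself
     * is assumed to have left nothing behind.
     * Returns true if every step succeeded, false if the workflow was rolled back.
     */
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action.run();
                completed.push(step);  // remember it so it can be undone later
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();  // most recent step first
                }
                return false;
            }
        }
        return true;
    }
}
```

For a chain such as create order, reserve stock, pay, a failure in the payment step would trigger "restore stock" and then "cancel order".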

Implicit rollback

Implicit rollback has relatively few use cases. The caller does no extra work for the rollback, because the downstream service has built-in mechanisms such as "reservation" and "expire on timeout".

For example:

In an e-commerce scenario, the goods in an order are first reserved in inventory while the system waits for the user to pay within a specified time. If the user does not pay in time, the reserved inventory is released.
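A minimal in-memory sketch of such a reservation with a timeout follows. The class and method names are hypothetical, and the current time is passed in explicitly so that the behaviour is easy to follow; a real inventory service would keep this state in a database and release expired holds from a scheduled job.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

final class StockReservations {
    private static final class Reservation {
        final int quantity;
        final long expiresAtMillis;

        Reservation(int quantity, long expiresAtMillis) {
            this.quantity = quantity;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private int available;
    private final long holdMillis;
    private final Map<String, Reservation> reservations = new HashMap<>();

    StockReservations(int initialStock, long holdMillis) {
        this.available = initialStock;
        this.holdMillis = holdMillis;
    }

    /** Holds stock for an order; returns false if stock is short or the order already holds some. */
    synchronized boolean reserve(String orderId, int quantity, long nowMillis) {
        releaseExpired(nowMillis);
        if (quantity > available || reservations.containsKey(orderId)) {
            return false;
        }
        available -= quantity;
        reservations.put(orderId, new Reservation(quantity, nowMillis + holdMillis));
        return true;
    }

    /** Payment arrived: the hold becomes a sale. Returns false if the hold has already expired. */
    synchronized boolean confirmPaid(String orderId, long nowMillis) {
        releaseExpired(nowMillis);
        return reservations.remove(orderId) != null;
    }

    /** The implicit rollback: expired holds give their stock back without any caller involvement. */
    synchronized void releaseExpired(long nowMillis) {
        Iterator<Reservation> it = reservations.values().iterator();
        while (it.hasNext()) {
            Reservation r = it.next();
            if (r.expiresAtMillis <= nowMillis) {
                available += r.quantity;
                it.remove();
            }
        }
    }

    synchronized int available() {
        return available;
    }
}
```

The upstream order service never calls a reverse interface here: an unpaid order simply loses its hold once the deadline has passed.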

How to implement rollback

For cross-database transactions, the common solutions are two-phase commit and three-phase commit, which aim at ACID guarantees. However, these two methods are generally not advisable in high-availability architectures, because holding locks across databases costs a lot of performance.

A high-availability architecture usually does not require strong consistency and aims for eventual consistency instead. To achieve it, you can consider transaction tables, message queues, compensation mechanisms, the TCC pattern (Try, then Confirm or Cancel) and the Saga pattern (split the transaction into steps, each with a compensation).

2. Retry

"Retry" means we believe the failure is temporary rather than permanent, so we try the call again. Its biggest advantage is that no extra reverse interface is needed, which lowers code maintenance and long-term development costs. A reverse interface also has to change whenever the business changes, so in many cases retry is worth considering.

Applicable scenarios

Compared with rollback, however, retry applies to fewer scenarios.

  • When the downstream system returns a request timeout, or is affected by a temporary condition such as rate limiting, a retry is worth considering.
  • If the result is a clear business error, such as insufficient balance or no permission, there is no point in retrying.
  • For some middleware or RPC frameworks, if the response is an error such as 503 or 404 whose recovery time cannot be predicted, there is no point in retrying either.
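The three cases above can be captured in a small decision function. This is a sketch only: the `Failure` categories are a hypothetical classification of the cases in the list, and mapping real exceptions or status codes onto them is left to the caller.

```java
final class RetryDecision {
    /** Failure categories taken from the list above (the enum itself is hypothetical). */
    enum Failure {
        TIMEOUT,              // the downstream request timed out
        RATE_LIMITED,         // temporarily rejected by rate limiting
        BUSINESS_ERROR,       // e.g. insufficient balance, no permission
        UNRECOVERABLE_STATUS  // e.g. 503 or 404, recovery time cannot be predicted
    }

    /** Returns true only for failures that are expected to be temporary. */
    static boolean shouldRetry(Failure failure) {
        switch (failure) {
            case TIMEOUT:
            case RATE_LIMITED:
                return true;   // a later attempt may succeed
            default:
                return false;  // retrying would only repeat the same error
        }
    }
}
```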

Retry strategy

To implement retry, we need a retry strategy. The mainstream strategies are the following:

1. Retry immediately: if the failure is temporary, caused for example by a network packet collision or a traffic peak on a hardware component, it is appropriate to retry immediately. However, there should be no more than one immediate retry; if it fails, switch to another strategy.

2. Fixed interval: retry at a constant interval, for example every 5 minutes. Note that strategies 1 and 2 are mostly used for interactive operations in front-end systems.

3. Incremental interval: the interval grows by a fixed increment on each retry, for example by 15 minutes each time.

return (retryCount - 1) * incrementInterval;

Its main purpose is to lower the priority of tasks that keep failing, so that new retry tasks can enter the queue ahead of them.

4. Exponential interval: similar to the incremental interval, but the delay grows much faster.

return pow(2, retryCount);  // 2 to the power retryCount, not the XOR operator ^

5. Full jitter: adds randomness on top of the exponential interval. It suits scenarios where a large number of requests arrive at the same moment and the load needs to be spread out.

return random(0, pow(2, retryCount));

6. Equal jitter: a compromise between exponential interval and full jitter that keeps a fixed part of the delay and randomizes only the rest.

int baseNum = pow(2, retryCount);
return baseNum + random(0, baseNum);

[Figure: retry interval of strategies 3, 4, 5, and 6; the x-axis is the number of retries.]
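As a runnable counterpart to the formulas for strategies 3 to 6, here is a minimal Java sketch. The class and method names are illustrative, the delays are in unspecified time units as in the formulas, and the random range is assumed to exclude its upper bound.

```java
import java.util.concurrent.ThreadLocalRandom;

final class RetryBackoff {
    /** Strategy 3, incremental interval: 0, step, 2*step, ... for retryCount = 1, 2, 3, ... */
    static long incremental(int retryCount, long incrementInterval) {
        return (retryCount - 1) * incrementInterval;
    }

    /** Strategy 4, exponential interval: 2^retryCount (retryCount must be between 0 and 62). */
    static long exponential(int retryCount) {
        return 1L << retryCount;
    }

    /** Strategy 5, full jitter: uniformly random in [0, 2^retryCount). */
    static long fullJitter(int retryCount) {
        return ThreadLocalRandom.current().nextLong(exponential(retryCount));
    }

    /** Strategy 6, equal jitter as given in the text: base plus a random value in [0, base). */
    static long equalJitter(int retryCount) {
        long base = exponential(retryCount);
        return base + ThreadLocalRandom.current().nextLong(base);
    }
}
```

The shift `1L << retryCount` is used instead of `2 ^ retryCount` because `^` is the XOR operator in Java.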

Pitfalls of retrying

Retry is attractive because of its low development cost, but as soon as a retry involves interface calls, idempotence has to be considered.

Idempotence originated as a mathematical concept and was later adopted in programming. It means that performing an operation multiple times has the same effect as performing it once.

Therefore, once a function supports retry, every interface along the whole call chain has to be idempotent, so that repeated calls do not change the business data again.

The way to achieve idempotence is to filter out duplicate requests:

  1. Assign a unique identifier to each request.
  2. During a retry, determine whether the request has already been executed or is being executed. If so, discard the request.

For the first point, you can use a global ID generator, an ID generation service, or simply assign each request a GUID or UUID.

For the second point, you can use AOP to run a check before and after the business code.

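// [Before the method executes]
if (isExistLog(requestId)) {
    // 1. The request has been received before (it was recorded in step 3).
    // 2. Fetch the stored result to see whether the earlier request has finished (written in step 4).
    var lastResult = getLastResult(requestId);
    if (lastResult == null) {
        // Still being processed: suspend and wait for it to complete.
        return waitResult(requestId);
    }
    return lastResult;
}
log(requestId);  // 3. Record that the request has been received.

// [Business logic]
var result = doSomething();

// [After the method executes]
logResult(requestId, result);  // 4. Store the result for later duplicates.
return result;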
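The same check can be made runnable with a concurrent map. This is a sketch under assumptions: `IdempotentExecutor` is a hypothetical helper that keeps results in memory inside a single process, while a real service would keep the request log in a shared store such as a database and expire old entries.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class IdempotentExecutor<R> {
    // requestId -> result of the first execution (completed once it finishes)
    private final ConcurrentHashMap<String, CompletableFuture<R>> results = new ConcurrentHashMap<>();

    /**
     * Runs businessCall at most once per requestId. A duplicate request waits
     * for the first execution if needed and returns its result.
     */
    R execute(String requestId, Supplier<R> businessCall) {
        CompletableFuture<R> mine = new CompletableFuture<>();
        // putIfAbsent records the request atomically, so two concurrent
        // duplicates cannot both decide that they are the first one.
        CompletableFuture<R> existing = results.putIfAbsent(requestId, mine);
        if (existing != null) {
            return existing.join();  // already received: reuse (or wait for) the result
        }
        try {
            R result = businessCall.get();
            mine.complete(result);   // store the result for later duplicates
            return result;
        } catch (RuntimeException e) {
            results.remove(requestId, mine);  // forget the failure so a later retry can run again
            mine.completeExceptionally(e);    // wake up duplicates that are waiting
            throw e;
        }
    }
}
```

Recording the request and checking for it happen in a single atomic `putIfAbsent` call, which closes the gap between "check" and "record" that exists when the two are separate steps.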

If the "compensation" process goes through a message queue (MQ), the deduplication can be implemented directly in the SDK that wraps the MQ client: the producer assigns a globally unique identifier to each request, and the consumer uses that identifier to discard duplicates.

Best practices for retrying

Retries are particularly useful together with degradation under high load. They should also be subject to rate limiting and circuit breaking; retries work best when combined with both.

When adding a compensation mechanism, weigh the cost against the benefit. For less important problems, choose "fail fast" instead of "retry".

An overly aggressive retry policy (an interval that is too short, or too many retries) can harm downstream services and needs special attention.

Always set a termination policy for retries. When rollback is difficult or costly, longer intervals and more retries are acceptable; the "saga" pattern often mentioned in DDD is based on this idea. This only holds, however, if other operations are not blocked by reserved or locked scarce resources (for example, with serial operations 1, 2, 3, 4, 5, operations 3, 4 and 5 cannot proceed while operation 2 is still outstanding).
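A termination policy can be as simple as a cap on the number of attempts. The following sketch assumes a hypothetical `BoundedRetry` helper that treats every `RuntimeException` as retryable and doubles the delay after each failure; a production version would also cap the delay and check whether the failure is worth retrying at all.

```java
import java.util.function.Supplier;

final class BoundedRetry {
    /**
     * Calls the operation until it succeeds or maxAttempts is reached, sleeping
     * baseDelayMillis * 2^(attempt - 1) between attempts. Once the attempts are
     * exhausted, the last failure is rethrown so the caller can fall back to
     * rollback or manual handling.
     */
    static <T> T call(Supplier<T> operation, int maxAttempts, long baseDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(baseDelayMillis << (attempt - 1));  // exponential interval
                }
            }
        }
        throw last;  // termination: stop retrying and report the failure
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // keep the interrupt status for the caller
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }
}
```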

3. Things to note about the business compensation mechanism

1. ACID or BASE

In distributed systems, ACID and BASE are two consistency models with different strengths:

  • ACID provides strong consistency but scales poorly, and should only be used when necessary;
  • BASE provides weaker consistency but scales well, supports asynchronous batch processing, and suits most distributed transactions.

For retries and rollbacks we usually do not need strong consistency; eventual consistency is enough.

2. Things to note when designing business compensation

  • To complete a business process, the services involved need to support idempotency, and the upstream caller needs a retry mechanism;
  • The state of the whole process has to be maintained and monitored carefully, so it is best not to scatter that state across different components. Ideally a single business process controller, that is, a workflow engine, is responsible for it, which means the workflow engine itself must be highly available and stable;
  • The business logic and flow of compensation do not have to be the strict reverse of the forward flow. Sometimes the steps can run in parallel, and sometimes the compensation can be simpler.
    In general, design the reverse compensation flow at the same time as the forward business flow;
  • The business logic of compensation is tightly coupled to the specific business and is hard to generalize;
  • It is best for the downstream business side to provide a short-term resource reservation mechanism. For example, in e-commerce, product inventory can be reserved for 15 minutes while waiting for the user to pay. If no payment is received, the inventory is released and the order is rolled back to its previous state, waiting for the user to place the order again.

So, this is the "textbook" answer

With the approach above in mind, let's go back to the interview questions from the beginning:

  • Talk about the compensation mechanism in distributed systems. What is the relationship between compensation and retry?
  • What is the relationship between "transaction compensation" and "retry"?
  • Talk about how to design the compensation mechanism in a distributed system.

The material above is a complete, "textbook" answer to these questions.

In the future, Nien will analyze more answers based on industry cases.

Of course, if you encounter such problems, you can ask Nien for help.

references

https://zhuanlan.zhihu.com/p/258741780


Origin blog.csdn.net/crazymakercircle/article/details/132456546