Interviewer: How do you solve distributed-system consistency in cross-regional scenarios?

Guide

Alimei's Guide: "Cross-regional" underlies the concepts of "remote active-active" and "remote multi-active" deployment. As the business grows rapidly, our services need to be deployed across regions to serve nearby access in each region and to provide cross-regional disaster recovery. This inevitably involves the problem of cross-regional distributed consistency. The network latency introduced by cross-regional networks, and the series of problems that latency brings with it, pose a great challenge to the design and construction of a cross-regional distributed consistency system. The industry has produced many solutions, all hoping to solve the consistency problem in cross-regional scenarios.


This article shares the Alibaba Nüwa team's exploration of distributed consistency systems in cross-regional scenarios. Structured around "What, How, Future", it introduces the requirements and challenges of cross-regional scenarios, common systems in the industry, the team's thinking on cross-regional trade-offs and system design, and its view on the future development of cross-regional consistency systems, in the hope of uncovering and solving more needs and challenges in such scenarios.

1. Cross-regional needs and challenges


1 Requirements

Cross-regional deployment is a challenge brought about by rapid business growth under the group's globalization strategy. For example, Taobao's unitized business and AliExpress's regionalized business both face an unavoidable problem: the consistency of data reads and writes across regions.

The core requirements can be summarized as follows:

  • Cross-regional business scenarios

Cross-regional configuration synchronization and service discovery are two common business requirements for a cross-regional consistent coordination service. Cross-regional deployment provides nearby-access capability and reduces service latency. Depending on the business scenario, this splits into cases such as multi-region writes versus simplified single-region writes, and strongly consistent reads versus eventually consistent reads. Cross-regional session management and cross-regional distributed locks built on top of these also urgently need mature solutions.

  • Service and resource expansion

When the service capacity of a machine room in one region reaches its upper limit and cannot grow further, the consistency system must be able to scale out across multiple machine rooms within a region, and also across regions.

  • Cross-regional disaster recovery

When a machine room or an entire region suffers a catastrophic failure, the consistency system must be able to quickly migrate business from one region to another through cross-regional service deployment, completing disaster recovery and failover to achieve high availability.

2 Challenges

Combining network latency and business requirements, we can summarize the challenges that a cross-regional consistency system needs to resolve:

  • Latency: cross-regional network delays of tens of milliseconds

The core problem of multi-regional deployment is high network latency. Take one of our online cross-regional clusters as an example: its machines span four regions, Hangzhou, Shenzhen, Shanghai, and Beijing. Measured from the Hangzhou machine room, the latency to Shanghai is about 6 ms, while the latency to Shenzhen and Beijing approaches 30 ms. Network latency within the same machine room, or between machine rooms in the same region, is generally within a few milliseconds; cross-region access latency is an order of magnitude higher.

  • Horizontal scaling: Quorum server count is limited

Distributed consistency systems based on Paxos and its variants inevitably hit the Replication Overhead problem when adding nodes; in general a Quorum has no more than nine members, so we cannot simply place consistency-system nodes directly in every region. The system must instead be able to scale out continuously to meet the expansion needs of services and resources.

  • Storage limit: a single node stores limited data and failover recovery is slow

Whether it is MySQL or a Paxos-based consistency system, every node maintains and loads the full mirror (snapshot) of the data, so capacity is limited by a single cluster. Moreover, during failover recovery, if a node's data version lags too far behind, recovering by pulling a mirror from another region leaves it unavailable for a long time.

2. Our exploration

1 Industry solutions

The industry has many designs for cross-regional consistency systems, mainly documented in the paper [1] and in several open-source implementations. Here are some common ones:

Cross-regional deployment


Figure 1 Direct cross-regional deployment

With direct cross-regional deployment, read requests are served by local-region nodes and are therefore fast, consistency and availability are guaranteed by Paxos, and there is no single point of failure. The shortcomings are equally obvious: it runs into the horizontal-scaling problem described in Part 1, i.e. Replication Overhead as the Quorum grows. And as the number of Quorum nodes increases, under the very high latency of cross-regional networks, each majority agreement takes a long time, so write efficiency is very low.

Single-region deployment + Learner role


Figure 2 Introducing the Learner role

By introducing the Learner role (such as the Observer in ZooKeeper or etcd's raft learner [2]), a role that only synchronizes data without participating in majority voting, write requests are forwarded to a single region (Region A in Figure 2), which avoids the voting latency of direct multi-region deployment. This method solves both horizontal scaling and latency, but because all voting members are deployed in one region, a catastrophic event in that region's machine rooms makes the write service unavailable. This is the deployment method adopted by Otter [3].

Multi-service + Partition & single-region deployment + Learner


Figure 3 Multiple services each handling a partition

Data is divided into partitions by some rule; each region runs a Quorum, different regions' Quorums are responsible for different partitions, and the Quorums synchronize each other's partition data via Learners and forward requests as needed. This ensures that a failure in one region only affects the availability of that region's partition. However, this scheme has a correctness problem: operations may violate sequential consistency [4] (see the paper [1]).

In practice there are many variants tailored to business scenarios, each optimized and weighed to compensate for its defects. The most common industry solution is single-region deployment + Learner, which achieves high availability and efficiency through multi-active deployment within one city plus cross-regional data synchronization via Learners. The other schemes have their own optimizations as well. Direct cross-regional deployment can reduce latency and bandwidth cost by reducing inter-region communication during agreement, as in TiDB's Follower Replication [5]. The correctness problem of multi-service + Partition & single-region deployment + Learner can be fixed as described in the paper [1] by adding a sync operation before reads, sacrificing some read availability to preserve consistency.

The conclusions are summarized below; the key items are explained in detail later:

[Table: comparison of the industry solutions above]

2 Cross-regional trade-offs

From the requirements and challenges summarized in Part 1 and the survey of industry solutions above, we can distill the core trade-offs a Paxos-based distributed consistency system faces in cross-regional scenarios:

  • Write operations are slow: reaching agreement through a cross-regional consistency protocol takes too long
  • Multi-active deployment within a single region cannot provide availability under extreme conditions
  • The system needs horizontal scalability, the core capability of any distributed system

To address these three issues, we designed a cross-regional consistency system that decouples logs from mirrors.

3 Decoupling logs from mirrors across regions


Figure 4 Schematic diagram of log-mirror decoupling

As shown in Figure 4, our system is divided into a back-end log synchronization channel and front-end full state machines, a decoupled architecture of log and mirror. The back-end cross-regional global log synchronization channel guarantees strong consistency of request logs across regions. The front-end full state machines, deployed in each region, process client requests and interact with the back-end log service, providing a globally strongly consistent metadata access service whose interface can be adapted quickly to business requirements by modifying the state machine.

Under this architecture, where the global log is separated from the local mirror, besides the operational and scalability benefits that decoupling itself brings, we can also solve many of the problems of the coupled architecture. The following analyzes how this architecture addresses the major problems raised earlier:

  • Write efficiency

In deployment form this looks similar to direct multi-region multi-node deployment with a Learner role added in each region; it combines the two earlier methods along with their advantages and disadvantages. The biggest difference is that our logs and mirrors are decoupled: the cross-regional part is simple log synchronization, light and efficient, and because each region has only one voting node it also saves cross-regional bandwidth (similar to TiDB's Follower Replication). Furthermore, the back-end log synchronization channel can run multiple consistency groups, dividing the data into partitions with each group responsible for different partitions; a sketch of such routing follows.
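As a hedged illustration of that multi-group idea, the Go sketch below routes keys to consistency groups by hash. The group count, the FNV hash, and all names are assumptions for illustration, not the actual Nüwa scheme.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// numGroups is an assumed number of back-end consistency groups.
const numGroups = 4

// groupFor maps a key to the consistency group (Quorum) that owns its
// partition, so each group only needs to replicate its own log slice.
func groupFor(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key)) // hash/fnv's Write never returns an error
	return int(h.Sum32()) % numGroups
}

func main() {
	for _, k := range []string{"/config/app1", "/locks/job42"} {
		fmt.Printf("%-14s -> group %d\n", k, groupFor(k))
	}
}
```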

Since most reads in business scenarios are of local data, the schemes differ little on reads; the analysis below therefore focuses on the latency of write operations (or strongly consistent reads), with a small tally sketch after the breakdown:

RTT (Round-Trip Time) can be understood simply as the time elapsed between sending a request and receiving its response. Because cross-regional network latency dominates, the RTTs below refer to cross-regional RTTs.

(1) Direct cross-regional deployment

For a typical leader-based protocol, a request falls into one of two cases:

  • Accessing from the leader's region: 1 RTT (ignoring the much smaller intra-region latency)
Client -> Leader ----> Follower ----> Leader -> Client
  • Accessing from a follower's region: 2 RTTs
Client -> Follower ----> Leader ----> Follower ----> Leader ----> Follower -> Client

(2) Single-region deployment + Learner synchronization

In the scheme with multi-active deployment inside one region and Learner synchronization between regions, the latency is:

  • 0 RTT in the local (writing) region
Client -> Quorum -> Client
  • 1 RTT from other regions
Client -> Learner ----> Quorum ----> Learner -> Client

(3) Multi-service Partition, single-region deployment + Learner synchronization (results similar to (2))

  • Writing the local region's partition: 0 RTT
  • Writing across partitions: 1 RTT

(4) Log-mirror decoupled architecture (results similar to (1))

  • Writing the local region's partition: 1 RTT
Client ->Frontend -> LogChannel(local) ----> LogChannel (peer) ----> LogChannel (local) -> Frontend -> Client
  • Writing across partitions: 2 RTTs (Paxos two-phase commit / forwarding to the leader)
Client ->Frontend -> LogChannel (local) ----> LogChannel (peer) ----> LogChannel (local) ----> LogChannel (peer) ----> LogChannel (local) -> Frontend -> Client
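To make the comparison concrete, here is a minimal tally of the write latencies above, assuming the roughly 30 ms cross-regional RTT measured in Part 1; the flat per-RTT cost is a simplifying assumption.

```go
package main

import "fmt"

func main() {
	const rttMs = 30 // assumed cross-regional round trip, per Part 1
	// Cross-regional RTTs per write, taken from the breakdown above.
	schemes := []struct {
		name string
		rtts int
	}{
		{"(1) direct cross-region, write via leader", 1},
		{"(1) direct cross-region, write via follower", 2},
		{"(2) single region + Learner, local write", 0},
		{"(2) single region + Learner, remote write", 1},
		{"(4) log-mirror decoupled, local partition", 1},
		{"(4) log-mirror decoupled, cross partition", 2},
	}
	for _, s := range schemes {
		fmt.Printf("%-46s %d RTT ~ %d ms\n", s.name, s.rtts, s.rtts*rttMs)
	}
}
```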

The comparison shows that as long as write operations run a consistency protocol across regions, there is at least 1 RTT of latency; and if the Paxos Quorum is deployed in only a single region, availability cannot be guaranteed under every extreme condition. We can therefore balance availability against write efficiency according to business needs. The log-mirror decoupled architecture guarantees availability and correctness under extreme conditions in multi-region deployment, at an efficiency slightly worse than single-region deployment + Learner. It is, however, lighter and more efficient than direct multi-region deployment, because the Quorum does not grow with horizontal expansion and thus voting efficiency is unaffected. Compared with the multi-service Partition scheme it has no efficiency advantage, but it wins on operability, correctness, and availability.

Both direct cross-regional deployment and single-region deployment + Learner satisfy strong consistency; ZooKeeper and etcd document this, so it is not repeated here. The multi-service Partition scheme, however, does not satisfy sequential consistency, mainly because multiple services cannot guarantee the commit order of write operations, as shown below:


Figure 5 Sequential consistency

As the figure shows, when two clients modify x and y concurrently, sequential consistency cannot be guaranteed under highly concurrent writes.

Sequential consistency means that all clients' operations can be arranged in a single valid order. In the example of Figure 5, both

set1(x,5) => get1(y)->0 => set2(y,3) => get2(x)->5 

or

set2(y,3) => get2(x)->0 => set1(x,5) => get1(y)->3 

satisfy sequential consistency.

The consistency of the log-mirror decoupled architecture can be understood simply as cross-regional deployment + Learner. Write (and strongly consistent read) operations have a sync option: an operation returns success only after the back-end log commit succeeds and the corresponding log has been pulled, so it is guaranteed to have pulled the logs of all other clients' preceding writes, which satisfies sequential consistency. A sketch of this sync-before-read idea follows.
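This is a minimal sketch with hypothetical Frontend and LogChannel types: the frontend reads the latest committed log index from the back end, then waits until its local state machine has applied at least that far before answering.

```go
package main

import (
	"fmt"
	"sync"
)

// LogChannel stands in for the back-end global log service.
type LogChannel struct {
	mu        sync.Mutex
	committed int64
}

func (l *LogChannel) Committed() int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.committed
}

// Frontend is the per-region state machine that pulls and applies logs.
type Frontend struct {
	mu      sync.Mutex
	cond    *sync.Cond
	applied int64
	store   map[string]string
	log     *LogChannel
}

func NewFrontend(log *LogChannel) *Frontend {
	f := &Frontend{store: map[string]string{}, log: log}
	f.cond = sync.NewCond(&f.mu)
	return f
}

// apply is invoked by the log-pulling loop, one entry at a time, in order.
func (f *Frontend) apply(index int64, key, val string) {
	f.mu.Lock()
	f.store[key] = val
	f.applied = index
	f.cond.Broadcast()
	f.mu.Unlock()
}

// SyncGet blocks until every entry committed before the call has been
// applied locally, which is what makes the read sequentially consistent.
func (f *Frontend) SyncGet(key string) string {
	target := f.log.Committed()
	f.mu.Lock()
	defer f.mu.Unlock()
	for f.applied < target {
		f.cond.Wait()
	}
	return f.store[key]
}

func main() {
	log := &LogChannel{committed: 1}
	f := NewFrontend(log)
	go f.apply(1, "x", "5")     // the puller catches up asynchronously
	fmt.Println(f.SyncGet("x")) // waits for index 1, then prints 5
}
```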

  • Availability

Availability is similar to that of direct cross-regional multi-node deployment. The front-end state machine can forward requests when a region's back-end node is down, and can still serve reads when the back-end global log service is unavailable, providing high availability for reads and writes under extreme conditions.

Meanwhile, because the mirror is stored in each region's state machine, when a front-end state machine goes down the client can switch to another front end. During failover recovery, data can be pulled directly from the back end; if a node lags too far behind, it only needs to pull a mirror from another front end in the same region rather than synchronizing mirrors across regions, which keeps front-end downtime extremely short.

  • Horizontal scalability

Horizontal scalability is the core capability of a distributed service. Among the schemes above, direct cross-regional deployment scales extremely poorly; the Learner-based approaches do solve horizontal scaling, but none of them is as clean as the log-mirror decoupled design.

Summarize and compare the above key issues:

[Table: comparison of the schemes on the key issues above]

3. More cross-regional possibilities

With the back-end log decoupled from the front-end mirror, our cross-regional exploration splits into two parts: keeping back-end log synchronization lightweight and efficient, and making the front-end state machine flexible and rich.

  • Lightweight shows in the architecture: the back end synchronizes only lightweight incremental logs, so its storage pressure is minimal.
  • Efficient shows in the back-end consistency protocol: being lightweight, it only needs to handle voting and election and to improve log synchronization efficiency; back-end resources are not spent on business logic.
  • Flexible shows in the architecture: the front end can define the logs it uploads, so CAS, transactions, and the like can be packed into logs for the front end to parse and process.
  • Rich shows mainly in the front-end state machine: the flexibility of the log leaves plenty of room for exploration, letting us build state machines that handle all kinds of complex transactions on demand.

New architectures bring new problems. This part mainly explores how to absorb the strengths of existing systems and exploit the lightness and flexibility of log-mirror decoupling to make the consistency protocol efficient and the state machine rich in cross-regional scenarios, along with our thinking and plans for the system's future development. The overall goal is a back-end consensus protocol that is refined and precise, and a front-end state machine that grows bigger and stronger.

1 Efficient back-end consensus protocol

From the earlier discussion of write efficiency, when the same data is written from multiple regions, latency can only be held to 2 RTTs: in cross-regional scenarios latency is dominated by cross-regional network communication, and whether writes are forwarded to a leader or go through leaderless Paxos two-phase commit, they cost 2 RTTs. A leaderless protocol such as the Paxos variant EPaxos [6], however, can improve cross-regional write efficiency as much as possible: its latency splits into a Fast Path of 1 RTT and a Slow Path of 2 RTTs.

Quoting a sentence from the article introducing EPaxos:

If the logs of concurrent proposals do not conflict, EPaxos only needs to run the PreAccept phase to commit (Fast Path); otherwise it must also run the Accept phase (Slow Path).

Compared with the partitioned schemes, choosing EPaxos as the back-end consistency protocol guarantees availability under extreme conditions with 1 RTT latency in most cases. This is the advantage of leaderless consistency protocols in cross-regional scenarios, mainly because the RTT spent forwarding to a leader is eliminated. Our system currently uses a basic Paxos implementation, whose latency in multi-region write scenarios is theoretically similar to a leader-based protocol; we expect to adopt EPaxos later to accelerate cross-regional writes. A toy illustration of the fast/slow path decision follows.
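The sketch below illustrates only the decision quoted above: a command whose keys do not interfere with in-flight commands commits on the Fast Path in 1 RTT, otherwise on the Slow Path in 2. It is a toy conflict check under assumed types, not EPaxos itself.

```go
package main

import "fmt"

// cmd is a toy command identified by the keys it touches.
type cmd struct {
	id   int
	keys map[string]bool
}

// interferes reports whether two commands touch any common key.
func interferes(a, b cmd) bool {
	for k := range a.keys {
		if b.keys[k] {
			return true
		}
	}
	return false
}

// commitRTTs returns the cross-regional RTTs needed to commit c,
// given the commands currently in flight.
func commitRTTs(c cmd, inflight []cmd) int {
	for _, o := range inflight {
		if o.id != c.id && interferes(c, o) {
			return 2 // Slow Path: PreAccept + Accept
		}
	}
	return 1 // Fast Path: PreAccept only
}

func main() {
	a := cmd{1, map[string]bool{"x": true}}
	b := cmd{2, map[string]bool{"y": true}}
	c := cmd{3, map[string]bool{"x": true}}
	fmt.Println(commitRTTs(a, []cmd{b})) // 1: disjoint keys, Fast Path
	fmt.Println(commitRTTs(a, []cmd{c})) // 2: both touch x, Slow Path
}
```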

Since the back end does not implement business logic, efficiency is its biggest requirement, with correctness and stability of course essential. The front-end state machine, by contrast, offers rich scenarios to design for.

CAS operation

Implementing CAS under this architecture is very natural: since the back end holds only a consistent log, every CAS request naturally has a commit order. For example:

Two clients write the value of the same Key at the same time:


Figure 6 Schematic diagram of CAS operation

Initially the key's value is 0. Client 1 and Client 2 concurrently issue CAS operations on it: CAS(key, 0, 1) and CAS(key, 0, 2). Once both are submitted and committed, the back-end Quorum's agreement order gives the replication log a definite sequence, so the two concurrent CAS operations are naturally converted into sequential execution. When a Frontend synchronizes these two log entries, it applies them to the local state machine in turn: CAS(key, 0, 1) succeeds and updates the key to 1, CAS(key, 0, 2) fails, and the front end returns the success or failure of each CAS request to the corresponding client.

The principle is to turn concurrent operations into a serially executed sequence, avoiding locks in cross-regional scenarios. By contrast, if the back end itself maintained a kv structure, completing this operation would require a cross-regional distributed lock, which is clumsier and offers no efficiency guarantee. By synchronizing only the log and shifting complex computation to the Frontend, the front-end state machine can be built flexibly to implement CAS or richer transaction features (for this architecture, see Pravega's StateSynchronizer [7]). A minimal sketch follows.
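Here is a minimal sketch of that serialization under hypothetical types: the log order fixed by the back-end Quorum makes exactly one of the two conflicting CAS operations from Figure 6 succeed, with no lock anywhere.

```go
package main

import "fmt"

// casOp is one committed log entry carrying a CAS request.
type casOp struct {
	client   string
	key      string
	expect   int
	newValue int
}

type stateMachine struct {
	kv map[string]int
}

// apply executes one entry; because entries are applied strictly in log
// order, conflicting CAS operations resolve deterministically.
func (s *stateMachine) apply(op casOp) bool {
	if s.kv[op.key] == op.expect {
		s.kv[op.key] = op.newValue
		return true
	}
	return false
}

func main() {
	sm := &stateMachine{kv: map[string]int{"key": 0}}
	// The back-end Quorum already fixed this order when committing.
	log := []casOp{
		{"client1", "key", 0, 1}, // CAS(key, 0, 1)
		{"client2", "key", 0, 2}, // CAS(key, 0, 2)
	}
	for _, op := range log {
		fmt.Printf("%s CAS(%s,%d,%d) -> %v\n",
			op.client, op.key, op.expect, op.newValue, sm.apply(op))
	}
	// client1 succeeds and sets key=1; client2 fails, as in Figure 6.
}
```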

Global ID

Global ID generation is a common requirement: a distributed system must produce unique IDs. Common solutions include UUIDs, the Snowflake algorithm, and schemes built on databases, Redis, or ZooKeeper.

Similar to generating a Global ID from a ZooKeeper znode's data version, under this log-mirror separation architecture one can simply call the CAS interface: create a key to serve as the Global ID and operate on it atomically each time. Thanks to the CAS design above, no locking is needed even under cross-regional concurrency, and the usage resembles atomic key operations in Redis; a sketch follows.
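A sketch of a Global ID allocator on top of that CAS interface; the cas function here is an in-memory stand-in for the system's real call, so names and signatures are assumptions.

```go
package main

import "fmt"

// store is an in-memory stand-in for the replicated state machine.
var store = map[string]int{"global_id": 0}

// cas is a stand-in for the CAS interface from the previous section;
// the real call would go through the replicated log.
func cas(key string, expect, newValue int) bool {
	if store[key] == expect {
		store[key] = newValue
		return true
	}
	return false
}

// nextID retries until it atomically claims current+1. Because every
// successful CAS is serialized by the log, each ID is unique and increasing.
func nextID() int {
	for {
		cur := store["global_id"]
		if cas("global_id", cur, cur+1) {
			return cur + 1
		}
	}
}

func main() {
	fmt.Println(nextID(), nextID(), nextID()) // 1 2 3
}
```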

2 Watch operation

The subscription (watch) function is indispensable in a distributed coordination service and is the most common business requirement. Below are our findings on ZooKeeper and etcd:

The industry's more mature distributed coordination systems with subscription notification are etcd and ZooKeeper; we take these two as examples to explain their solutions.

etcd keeps multiple historical versions of data (MVCC), distinguished by monotonically increasing revision numbers. As long as a client passes in the historical revision it cares about, the server can push all subsequent events to it.

ZooKeeper keeps no historical versions, only the current state, so a client cannot subscribe to historical data; it can only subscribe to change events after the current state. A subscription therefore comes bundled with a read: the server sends the current data to the client and then pushes subsequent events. To avoid subscribing to stale data and events in abnormal scenarios such as failover, the client refuses to connect to servers holding older data (this relies on the server returning its current global zxid with each request).

[Table: comparison of etcd and ZooKeeper subscription mechanisms]

Of the two, etcd better matches our interface design. etcd v3 uses HTTP/2 TCP connection multiplexing, improving watch performance, and since it is likewise a log-plus-state-machine structure, our design mainly follows etcd v3, borrowing its approach to subscribing to multiple keys and to returning all historical events. To implement etcd-style subscription: when the front-end state machine synchronizes and parses the log, each write updates both the kv state machine Store and a watchableStore maintained specifically for the watch interface (the implementation can fully reference etcd); then, according to the log revision, all historical events after the subscribed revision are returned to the client. Subscription over multiple keys likewise uses an interval tree as the watcher's range-key storage structure, enabling watcher notification for range watches. A toy sketch of the revision-replay idea follows.
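The sketch below shows only the revision-replay idea borrowed from etcd v3: every applied entry gets a monotonically increasing revision, and a watcher arriving with an old revision first receives the events it missed. Exact-key matching replaces the interval tree, and all names are illustrative.

```go
package main

import "fmt"

type event struct {
	rev int64
	key string
	val string
}

// watchableStore keeps every applied event keyed by revision, mirroring
// the watch-specific store the text describes updating alongside Store.
type watchableStore struct {
	history []event
}

func (s *watchableStore) put(key, val string) {
	rev := int64(len(s.history) + 1) // revisions increase monotonically
	s.history = append(s.history, event{rev, key, val})
}

// watch replays every event on key with revision > fromRev; a real
// implementation would then keep streaming new events on the channel.
func (s *watchableStore) watch(key string, fromRev int64, ch chan<- event) {
	for _, e := range s.history {
		if e.key == key && e.rev > fromRev {
			ch <- e
		}
	}
	close(ch)
}

func main() {
	s := &watchableStore{}
	s.put("a", "1")
	s.put("b", "2")
	s.put("a", "3")
	ch := make(chan event, 8)
	s.watch("a", 0, ch) // subscribe from revision 0: full history for "a"
	for e := range ch {
		fmt.Printf("rev=%d %s=%s\n", e.rev, e.key, e.val)
	}
}
```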

3 Lease mechanism

Implementing an efficient lease mechanism in a leaderless system is a big challenge. With no leader, any node can maintain leases, so leases are spread across nodes and must switch smoothly to other nodes when one becomes unavailable. The difficulty lies in preventing lease keepalive traffic in the back-end consistency protocol from growing with the number of leases and hurting system performance; ideally keepalive messages are handled directly at the front end without touching the back end. Our idea is therefore to aggregate the leases between clients and a front end into a single lease between that front end and the back end, so that client keepalive messages are processed locally at the front end; a sketch follows.
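A sketch of that aggregation under assumed names and TTL policy: the front end tracks each client lease locally and keeps just one lease alive with the back end, so per-client keepalives never cross regions.

```go
package main

import (
	"fmt"
	"time"
)

type frontend struct {
	clientLeases map[string]time.Time // client ID -> local expiry
	backendLease time.Time            // the single aggregated lease
}

// keepAlive handles a client heartbeat entirely within the local region.
func (f *frontend) keepAlive(clientID string, ttl time.Duration) {
	f.clientLeases[clientID] = time.Now().Add(ttl)
}

// renewBackend runs periodically: it drops expired client leases and,
// while any remain alive, renews the one cross-regional back-end lease.
func (f *frontend) renewBackend(ttl time.Duration) {
	now := time.Now()
	for id, exp := range f.clientLeases {
		if exp.Before(now) {
			delete(f.clientLeases, id)
		}
	}
	if len(f.clientLeases) > 0 {
		f.backendLease = now.Add(ttl) // one renewal covers all clients
	}
}

func main() {
	f := &frontend{clientLeases: map[string]time.Time{}}
	f.keepAlive("client-1", 10*time.Second)
	f.keepAlive("client-2", 10*time.Second)
	f.renewBackend(30 * time.Second)
	fmt.Println("active client leases:", len(f.clientLeases))
	fmt.Println("backend lease expires:", f.backendLease.Format(time.RFC3339))
}
```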

4. Conclusion

As the globalization strategy advances, cross-regional needs will become more and more urgent, and the real pain points of cross-regional scenarios will come into sharper focus. We hope this research and exploration offers you ideas and a reference, and we will keep exploring the possibilities of the cross-regional log-mirror separation architecture.



Origin: blog.csdn.net/Ppikaqiu/article/details/112967220