Distributed Storage Engines in Practice: Understanding the Reliability of Distributed Storage

Background

  The previous article in this series, Distributed Storage Engines in Practice: Consistent Hashing at Large Companies, introduced how a distributed hash table (DHT) limits data redistribution churn in distributed storage. This article focuses on how a distributed storage system guarantees data reliability. Reliability is one of the most important metrics of a storage system, and the usual way to improve it is to store multiple replicas: the data is copied several times and the copies are placed in different locations as redundant backups. In theory, the more replicas stored, the higher the reliability, but the usable space shrinks accordingly. For example, with 2 replicas in a 100T raw space, only 50T of data can be stored (50T × 2 = 100T); raising the replica count to 5 leaves room for only 20T. Keeping many replicas protects against data loss but yields low space utilization, so there is a trade-off between the two.
  High reliability of a software system (often discussed together with availability, HA, High Availability) is commonly measured in "X nines", where X is typically a number from 3 to 5: the ratio of the system's normal operating time to the total time over one year. Under the industry's typical five-nines requirement, (1 − 99.999%) × 365 × 24 × 60 ≈ 5.26 minutes, meaning the service may be interrupted for at most about 5.26 minutes per year of operation. In practice, three replicas is the common choice.
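
  To make the arithmetic concrete, here is a small Python sketch (illustrative only; the helper names are mine) that computes usable capacity for a given replica count and the allowed annual downtime for X nines:

```python
def usable_capacity(raw_tb: float, replicas: int) -> float:
    """Raw capacity divided by replica count: every byte is stored `replicas` times."""
    return raw_tb / replicas

def annual_downtime_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    availability = 1 - 10 ** (-nines)  # e.g. 5 nines -> 0.99999
    return (1 - availability) * 365 * 24 * 60

print(usable_capacity(100, 2))               # 50.0 TB with 2 replicas
print(usable_capacity(100, 5))               # 20.0 TB with 5 replicas
print(round(annual_downtime_minutes(5), 2))  # 5.26 minutes for five nines
```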

Disk-level fault tolerance

  What is disk-level reliability? Simply put, with N replicas, the data must still be readable after any N−1 disks fail. Applied to 3 replicas, this requires the three copies to be stored across different disks. The schematic is shown below:
[Figure: disk-level fault tolerance — three replicas placed on three different disks of one node]

  With any two of those disks failed, the data can still be read. But since the three disks all belong to the same physical node, a failure of that node loses all three replicas at once. This is where node-level fault tolerance is needed.

Node-level fault tolerance

  Extending the idea of disk-level fault tolerance: with N replicas, the data must still be readable after any N−1 nodes fail, which requires the replicas to be stored across nodes rather than on a single one. In the figure below, each blue box is a physical node and each white box is a hard disk; the three replicas A1, A2, and A3 are placed on three different nodes.
[Figure: node-level fault tolerance — replicas A1, A2, A3 placed on three different nodes]

Rack-level fault tolerance

  With N replicas, the data must still be readable after any N−1 racks fail, which requires the N replicas to be distributed across N different racks, as shown below: green represents a physical rack, blue a physical node, and white a physical hard disk.
[Figure: rack-level fault tolerance — replicas distributed across three different racks]
  The three figures above reveal a containment rule: rack-level fault tolerance implies node-level fault tolerance, and node-level fault tolerance in turn implies disk-level fault tolerance.
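  To make the containment rule concrete, here is a minimal placement sketch in Python. The topology layout and function names are assumptions for illustration, not the article's implementation: picking N distinct racks automatically yields distinct nodes and distinct disks.

```python
import random

# Illustrative topology: rack -> node -> disks (all names are hypothetical)
TOPOLOGY = {
    "rack1": {"node1": ["disk1", "disk2"], "node2": ["disk1", "disk2"]},
    "rack2": {"node3": ["disk1", "disk2"], "node4": ["disk1", "disk2"]},
    "rack3": {"node5": ["disk1", "disk2"], "node6": ["disk1", "disk2"]},
}

def place_replicas(n):
    """Choose one (rack, node, disk) target per replica, all on distinct racks.
    Distinct racks imply distinct nodes, which imply distinct disks, so
    rack-level fault tolerance subsumes the two weaker levels."""
    placement = []
    for rack in random.sample(sorted(TOPOLOGY), n):  # n distinct racks
        node = random.choice(sorted(TOPOLOGY[rack]))
        disk = random.choice(TOPOLOGY[rack][node])
        placement.append((rack, node, disk))
    return placement

print(place_replicas(3))  # e.g. [('rack1','node2','disk1'), ('rack2','node3','disk2'), ...]
```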
  To summarize, the above describes three levels of fault-tolerant data placement. How is this reliability achieved on the actual storage I/O path? Because storage must sustain high concurrency, heavyweight distributed transactions are a poor fit; the rest of this article describes the combination of techniques used instead.

A reliability algorithm based on optimistic locking

  The local operations and data exchanges of a distributed system can be treated as events. Ideally, if the order and causality of events formed a total order, the processes in the system could reach agreement simply by handling events in that order. In reality, the spatial isolation of processes, clock drift, and network delay make it impossible to determine a total order of events, so a Vector Clock is adopted to track causality instead. On top of this, the Quorum NRW model is used: N is the number of replicas, R is the minimum number of replicas that must respond for a read to succeed, and W is the minimum number for a write to succeed. As long as W + R > N, every read quorum overlaps every write quorum, which guarantees strong consistency. With three replicas, N:R:W = 3:2:2, which ensures the latest data is always read.
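
  The quorum condition itself is a one-line check; a minimal sketch (hypothetical helper, not from the article):

```python
def quorum_is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """W + R > N means any read quorum must intersect any write quorum:
    a read set of size R and a write set of size W cannot both fit into
    N replicas without sharing at least one, so a read always touches a
    replica holding the latest acknowledged write."""
    return w + r > n

print(quorum_is_strongly_consistent(3, 2, 2))  # True  (NRW = 3:2:2)
print(quorum_is_strongly_consistent(3, 1, 1))  # False (a read may see stale data)
```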

Write operation flow

  The client obtains the triplet of three replicas and communicates with the master. When the master receives a write request, it forwards the write synchronously to the two backups. As soon as the master plus at least one backup have succeeded, the master returns success to the client; whether the remaining backup succeeds does not affect the result reported to the outside.
[Figure: write flow — master forwards the write to two backups and acknowledges after master + 1 backup succeed]
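
  A rough sketch of this write path under W = 2, assuming hypothetical callables for the master and backup writes (a real engine would do this over the network with timeouts):

```python
from concurrent.futures import ThreadPoolExecutor

W = 2  # master plus at least one backup must acknowledge

def write_with_quorum(master_write, backup_writes, value):
    """The master applies the write locally, forwards it to the backups in
    parallel, and reports success once W replicas (master included) have
    acknowledged the write."""
    acks = 1 if master_write(value) else 0
    with ThreadPoolExecutor(max_workers=len(backup_writes)) as pool:
        for ok in pool.map(lambda write: write(value), backup_writes):
            if ok:
                acks += 1
            if acks >= W:
                return True  # quorum reached: acknowledge the client
    return False

# master and one backup succeed, the other backup fails -> still a success
ok = write_with_quorum(lambda v: True, [lambda v: True, lambda v: False], "age=1")
print(ok)  # True
```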

Read operation flow

  The flow is similar to the figure above: the client reads from the three replicas and considers the read successful once two of them respond. Suppose a write updated the three replicas as follows: replica 1 and replica 2 were written successfully, while the write to replica 3 failed. With W = 2, the client still considers the write successful.

  Now read the data. A read succeeds only after at least two replicas respond, so 2 of the 3 replicas are chosen; the possible combinations are only (replica 1, replica 2), (replica 1, replica 3), and (replica 2, replica 3). Every combination contains replica 1 or replica 2, so at least one replica holding the updated value is always read, which guarantees that the client sees the most recently written data.
[Figure: read flow — any 2-of-3 read set overlaps the 2-of-3 write set]
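
  The overlap argument can be verified exhaustively for the 3-replica case (a toy check, not production code):

```python
from itertools import combinations

replicas = ["replica1", "replica2", "replica3"]
write_set = {"replica1", "replica2"}  # the W = 2 replicas that took the write

# Every possible R = 2 read set intersects the W = 2 write set.
for read_set in combinations(replicas, 2):
    overlap = set(read_set) & write_set
    assert overlap, "a read quorum missed the latest write"
    print(read_set, "-> sees the latest value via", sorted(overlap))
```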

Vector Clock explained with an example

  A vector clock is a vector whose components are (node:version) pairs, where node identifies a node in the distributed system and version is the version number on that node; the version is incremented whenever that node handles a write request. When nodes synchronize data, the (node:version) information travels with it, and comparing these vectors determines the ordering of write operations. For example, compared with (node1:v2) or with (node1:v1, node2:v1), the data tagged (node1:v1) is clearly stale, because the other clock dominates it component by component.
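
  A minimal comparison routine for this (node:version) representation, sketched in Python (the function name and dict encoding are my own choices):

```python
def compare(a, b):
    """Compare two vector clocks encoded as {node: version} dicts.
    Returns 'before' if a happened before b, 'after' for the reverse,
    'equal' if identical, and 'concurrent' if neither dominates."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a == b:
        return "equal"
    if a_le_b:
        return "before"   # b dominates: a's data is stale
    if b_le_a:
        return "after"
    return "concurrent"   # a genuine version conflict

print(compare({"node1": 1}, {"node1": 2}))              # before
print(compare({"node1": 1}, {"node1": 1, "node2": 1}))  # before
```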

  Below, a three-replica example walks through the write process and the evolution of the vector clocks:

time | operation | result
1:00 | Client 1 writes age=1 to node1 | node1's data now carries the clock (node1:v1)
2:00 | Client 2 writes age=2 to node1 | node1's clock advances to (node1:v2)
3:00 | node1 synchronizes age=2 to node3 | after receiving it, node3's data becomes (node1:v2)
4:00 | node1 synchronizes age=2 to node2 | after receiving it, node2's data becomes (node1:v2)
5:00 | Client 3 updates age=3 on node2 | node2's clock becomes (node1:v2, node2:v1)
6:00 | Client 4 updates age=4 on node3 | node3's clock becomes (node1:v2, node3:v1)

  The final data distribution on each node is as follows:

node | data (vector clock)
node1 | (node1:v2)
node2 | (node1:v2, node2:v1)
node3 | (node1:v2, node3:v1)

  Now analyze the read path. Under NRW = 3:2:2, a read succeeds once two replicas respond. If node1 and node2 are read, node2's data is clearly the newer one, since (node1:v2, node2:v1) dominates (node1:v2); likewise, reading node1 and node3 yields node3's data as the newer one. But if node2 and node3 are read, neither clock dominates the other, so it cannot be decided which write is newer: this is a version conflict. The client has to resolve it itself, for example with a simple comparison of the data's timestamps.
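
  Plugging the final clocks of node2 and node3 into the same dominance test shows the conflict, along with one possible last-write-wins resolution (the timestamps here are illustrative, not the article's prescription):

```python
clock_node2 = {"node1": 2, "node2": 1}  # holds age=3, written at 5:00
clock_node3 = {"node1": 2, "node3": 1}  # holds age=4, written at 6:00

nodes = set(clock_node2) | set(clock_node3)
n2_dominated = all(clock_node2.get(n, 0) <= clock_node3.get(n, 0) for n in nodes)
n3_dominated = all(clock_node3.get(n, 0) <= clock_node2.get(n, 0) for n in nodes)
print(n2_dominated, n3_dominated)  # False False -> concurrent: version conflict

# One possible client-side policy: last-write-wins on the data's timestamp.
candidates = [("05:00", 3), ("06:00", 4)]  # (timestamp, age) pairs
print(max(candidates)[1])                  # 4 -> keep the later write
```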

Vector Clock summary

  • It cannot resolve every scenario; when clocks are concurrent, the client must make the final decision.
  • The vector grows as the number of participating nodes grows, so it needs to be merged and truncated periodically (a sketch follows below).
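
  A common way to bound the vector's size, as in Dynamo-style systems, is to keep a last-updated timestamp per entry and drop the oldest entries beyond a threshold. A sketch under those assumptions (the threshold and encoding are not from the article):

```python
MAX_ENTRIES = 10  # assumed threshold, tuned per system

def truncate(clock):
    """clock maps node -> (version, last_update_ts). Keep only the most
    recently updated MAX_ENTRIES entries; dropping old entries bounds the
    vector's size at the cost of losing some causality information."""
    if len(clock) <= MAX_ENTRIES:
        return clock
    newest = sorted(clock.items(), key=lambda kv: kv[1][1], reverse=True)
    return dict(newest[:MAX_ENTRIES])

clock = {f"node{i}": (1, i) for i in range(12)}  # 12 entries, ts = i
print(sorted(truncate(clock)))  # only the 10 most recently updated nodes remain
```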
