etcd Architecture Principles

The core terminology of etcd

  • Raft : The consensus algorithm etcd uses to ensure strong consistency of data across the distributed system.
  • Node : An instance of the Raft state machine.
  • Member : An etcd instance; it manages a Node and serves client requests.
  • Cluster : A group of Members working together as a single etcd cluster.
  • Peer : The name for another Member in the same etcd cluster.
  • Client : The client that sends HTTP requests to the etcd cluster.
  • WAL : Write-ahead log, the log format etcd uses for persistent storage.
  • Snapshot : A snapshot of etcd's data state, taken to keep the number of WAL files from growing without bound.
  • Entry : A single entry in the Raft log.
  • Proxy : An etcd mode that provides a reverse proxy in front of an etcd cluster.
  • Leader : The node, produced by election in the Raft algorithm, that handles all data submissions.
  • Follower : A node that did not win the election; it acts as a subordinate node and helps provide the algorithm's strong consistency guarantee.
  • Candidate : When a Follower has not received the Leader's heartbeat for longer than a certain period (the Leader is assumed to have failed), it switches to Candidate and starts an election.
  • Term : The period from when a node becomes Leader until the start of the next election.
  • Vote : A vote cast in an election.
  • Index : The sequence number of a data item; Raft locates data by Term and Index.
  • Commit : A commit; data is considered committed once it has been durably written to the log.
  • Propose : A proposal requesting that a majority of Nodes agree to a data write.

etcd software architecture

(Figure: etcd software architecture)

  • HTTP Server : Accepts API requests from clients, as well as synchronization and heartbeat requests from other etcd nodes.

  • Store : Handles the various features etcd supports, including data indexing, node state changes, monitoring and feedback, event handling and execution, and so on. It is the concrete implementation of most of the API functionality etcd exposes to users.

  • Raft : The concrete implementation of the strong-consistency consensus algorithm; it is the core of etcd.

  • WAL (Write Ahead Log) : etcd's persistent storage format. etcd keeps the state of all data and the node index in memory, and additionally persists them through the WAL; in the WAL, all data is logged before it is committed.

    • Snapshot is a state snapshot taken to keep the amount of logged data from growing too large;
    • Entry is the concrete log record stored in the WAL.

Usually, when a user's request is sent, it is forwarded by the HTTP Server to the Store for the specific transaction processing. If it involves modifying node data, it is handed to the Raft module for the state change and log record, then synchronized to the other etcd nodes to confirm the commit; finally the data is committed and synchronized once more.

Raft

In newer versions of etcd, the raft package is the concrete implementation of the Raft consensus algorithm.

What does a term in Raft mean?

In the Raft algorithm, viewed in terms of time, a term lasts from the start of one election to the start of the next. Viewed functionally, if a Follower does not receive the Leader's heartbeat, it ends the current term, becomes a Candidate, and starts an election, which helps the cluster recover when the Leader fails. When votes are cast in an election, Nodes with a smaller term value cannot win. If the Cluster never fails, a Term can continue indefinitely. In addition, a split vote can also push the cluster directly into the next election.


How does the Raft state machine switch?

When Raft first starts running, a Node enters the Follower state by default and waits for heartbeat messages from the Leader. If the wait times out, the state switches from Follower to Candidate and the Node enters the next Term to start an election. When it receives votes from a majority of the Cluster's nodes, the Node becomes the Leader. A Leader may suffer a network failure that causes the other Nodes to elect a new Leader for the next term; the old Leader is then switched back to Follower. A Candidate that discovers, while waiting for the other Nodes' votes, that a Leader has already been elected also switches back to Follower.

(Figure: Raft state machine transitions)
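A minimal toy sketch of these transitions in Go (not etcd's actual raft package; the types and helpers here are hypothetical and only illustrate the Follower → Candidate → Leader flow with a randomized election timeout):

package main

import (
    "fmt"
    "math/rand"
    "time"
)

type State int

const (
    Follower State = iota
    Candidate
    Leader
)

// node is a toy model of a single Raft participant.
type node struct {
    state State
    term  uint64
}

// electionTimeout returns a randomized timeout so that nodes rarely
// become Candidates at exactly the same moment.
func electionTimeout() time.Duration {
    return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// onHeartbeatTimeout: Follower -> Candidate, start a new term and an election.
func (n *node) onHeartbeatTimeout() {
    if n.state == Follower {
        n.state = Candidate
        n.term++
    }
}

// onMajorityVotes: Candidate -> Leader once a quorum has voted for this node.
func (n *node) onMajorityVotes() {
    if n.state == Candidate {
        n.state = Leader
    }
}

// onHigherTerm: any state -> Follower when a newer term is observed,
// e.g. an old Leader rejoining after a network failure.
func (n *node) onHigherTerm(term uint64) {
    if term > n.term {
        n.term = term
        n.state = Follower
    }
}

func main() {
    n := &node{state: Follower}
    fmt.Println("waiting up to", electionTimeout(), "for a heartbeat")
    n.onHeartbeatTimeout() // no heartbeat arrived in time
    n.onMajorityVotes()    // a majority voted for this Candidate
    fmt.Println("state:", n.state, "term:", n.term)
}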

How to ensure that the leader is elected in the shortest time and prevent conflicts in elections?

In the Raft state machine diagram you can see that the Candidate state has a timeout, and that timeout is a random value. After each Node becomes a Candidate, the moment at which it starts a new round of election is therefore different, so there is a time gap between them. Within that gap, if Candidate1 receives campaign information whose term value is greater than that of its own campaign (that is, the other side is already in a new Term), and that Candidate2 who wants to become Leader in the new round contains all of the committed data, then Candidate1 votes for Candidate2. This keeps the probability of an election conflict very small.

How to prevent other Candidates from initiating a vote to become a leader when some data is missing?

In the Raft election mechanism, the timeout is a random value, and the first Node whose timeout expires raises the Term number and initiates a new round of voting. Normally, the other Nodes vote when they receive the election notice. However, if the Node that initiated the election holds an incomplete copy of the data committed in the previous Term, the other Nodes refuse to vote for it. This mechanism prevents a Node that is missing data from becoming Leader.
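A small Go sketch of the "at least as up-to-date" check a voter applies before granting its vote (illustrative only; the function and parameter names are hypothetical, not etcd identifiers):

package main

import "fmt"

// voteFor reports whether a voter grants its vote: the candidate's log must
// be at least as up-to-date as the voter's (compare the last log term first,
// then the last log index).
func voteFor(candLastTerm, candLastIndex, myLastTerm, myLastIndex uint64) bool {
    if candLastTerm != myLastTerm {
        return candLastTerm > myLastTerm
    }
    return candLastIndex >= myLastIndex
}

func main() {
    // A candidate missing entries from the previous term is rejected.
    fmt.Println(voteFor(3, 10, 3, 12)) // false
    fmt.Println(voteFor(3, 12, 3, 10)) // true
}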

What happens if a node of Raft goes down?

Normally, if a Follower goes down and the number of remaining available nodes still exceeds half, the cluster keeps working normally with no impact. If the Leader goes down, the Followers stop receiving heartbeats, time out, start a campaign to collect votes, produce a Leader for a new Term, and continue to provide service for the Cluster.

Note that etcd currently has no mechanism for automatically changing the Instances (total number of nodes) of the whole Cluster. That is, unless the membership API is called explicitly, a Node that has gone down still counts toward the total, and any request still needs confirmation from more than half of that total to proceed.


Why does the Raft algorithm not need to consider the Byzantine Generals problem when determining the number of available nodes?

In the Byzantine Generals problem, a distributed architecture that can tolerate n failed nodes while still providing normal service needs a total of 3n+1 nodes, whereas Raft needs only 2n+1. The main reason is that the Byzantine Generals problem allows nodes to report false data, while etcd assumes that all Nodes are honest. Before an election, a Node tells the other Nodes its own Term number and the Index value at the end of the previous Term; these values are accurate, and the other Nodes decide whether to vote based on them. In addition, etcd strictly restricts data to flow from Leader to Follower, which keeps the data consistent and correct.
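The difference is easy to see as arithmetic; the Go sketch below is illustrative only and not part of etcd:

package main

import "fmt"

// minNodesByzantine returns the minimum cluster size needed to tolerate
// f Byzantine (possibly lying) nodes: 3f + 1.
func minNodesByzantine(f int) int { return 3*f + 1 }

// minNodesCrash returns the minimum cluster size Raft needs to tolerate
// f crashed (honest but unreachable) nodes: 2f + 1.
func minNodesCrash(f int) int { return 2*f + 1 }

func main() {
    for f := 1; f <= 3; f++ {
        fmt.Printf("f=%d  byzantine=%d  raft=%d\n", f, minNodesByzantine(f), minNodesCrash(f))
    }
}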

Which node in the cluster does the client read and write data from?

To ensure strong data consistency, data in an etcd Cluster flows from Leader to Follower: all Follower data must be consistent with the Leader's, and inconsistent data is overwritten.

That is, all user requests to update data first reach the Leader, which then notifies the other nodes to update as well; once a majority of the nodes has acknowledged, the data is committed. A committed data item is one that Raft has stored stably and will never modify again. Finally, the committed data is synchronized to the remaining Followers. Because each Node keeps an accurate backup of Raft's committed data (in the worst case, the committed data has simply not yet been fully synchronized), any node can handle read requests.

In fact, users can read from and write to any Node in the etcd Cluster:

  • Read : You can read from any Node, because the data kept by each node is strongly consistent.
  • Write : The etcd Cluster first elects a Leader. If the write request arrives at the Leader, it is written directly and the Leader then distributes the write to all Followers; if the write request arrives at another Follower node, it is forwarded to the Leader, which writes it and then distributes it to all other nodes in the cluster.
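A minimal sketch of this from the client's side, using the v3 Go client (the import path and endpoint address are assumptions that depend on your etcd version and deployment; error handling is trimmed). Reads can go to any member, and writes are routed to the Leader internally:

package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    // Any member's endpoint works; the client does not need to know the Leader.
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"127.0.0.1:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()

    // Write: routed to the Leader and committed once a quorum acknowledges it.
    if _, err := cli.Put(ctx, "message", "Hello world"); err != nil {
        panic(err)
    }

    // Read: can be served by any member.
    resp, err := cli.Get(ctx, "message")
    if err != nil {
        panic(err)
    }
    for _, kv := range resp.Kvs {
        fmt.Printf("%s = %s\n", kv.Key, kv.Value)
    }
}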

How to ensure data consistency?

etcd uses the Raft protocol to keep the state of every Node in the Cluster consistent. Simply put, an etcd Cluster is a distributed system in which multiple Nodes communicate with one another to provide a single service to the outside; each Node stores the complete data, and the Raft protocol ensures that the data each Node maintains is consistent.

Each Node in the etcd Cluster maintains a state machine, and at any given time there is at most one effective master node in the Cluster, namely the Leader Node. The Leader handles all write operations from clients, and the Raft protocol ensures that the changes write operations make to the state machine are reliably synchronized to the other Follower Nodes.

How to elect Leader node?

Suppose an etcd Cluster has 3 Nodes and no Leader has been elected when the Cluster starts up. The Raft algorithm then uses a randomized Timer to initialize the Leader election: each of the 3 Nodes runs a Timer whose duration is random, and the first Node to finish its Timer, say Node1, sends the other two Nodes a request to become Leader. When the other Nodes receive the request, they respond with their votes, and that first Node is elected Leader.

After becoming Leader, the node sends notifications to the other nodes at regular intervals to confirm that it is still the Leader. If the Followers stop receiving these notifications, for example because the Leader node is down or has lost connectivity, the other nodes repeat the election process and elect a new Leader.

How to judge whether the writing is successful?

etcd considers a write successful once the Leader has processed the write request and distributed it to a majority of the other nodes. The formula for the size of that majority is Quorum = N/2 + 1, where N is the total number of nodes. In other words, etcd does not have to write the data to every node for a write to count; writing to a "majority of nodes" is enough.
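A tiny Go sketch of this quorum arithmetic (illustrative only, not etcd code):

package main

import "fmt"

// quorum returns the number of members whose acknowledgement is required
// for a write to be committed: N/2 + 1, using integer division.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many members may fail while the cluster
// can still reach quorum.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
    for n := 1; n <= 7; n++ {
        fmt.Printf("instances=%d quorum=%d tolerated failures=%d\n", n, quorum(n), faultTolerance(n))
    }
}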

How to determine the number of nodes in etcd Cluster?

(Figure: Instances (total nodes) versus Quorum)

The left side of the figure above shows the Quorum (number of quorum members) corresponding to the Instances (total number of nodes) in the cluster. Instances − Quorum gives the number of fault-tolerant nodes, i.e. the number of node failures the cluster can tolerate.

Therefore, the recommended minimum number of nodes for an etcd Cluster is 3, because with 1 or 2 Instances the number of tolerable failures is 0: once one node goes down, the whole cluster stops working.

Furthermore, when deciding the number of Instances in an etcd Cluster, an odd number of nodes is strongly recommended, such as 3, 5, 7, ..., because a 6-node cluster's fault tolerance is no better than a 5-node cluster's: both tolerate 2 failed nodes. Once more than 2 nodes fail, fewer than the 4 nodes required for Quorum remain, and the whole cluster becomes unavailable.


What is the performance of the Raft algorithm implemented by etcd?

A single-instance node supports about 2,000 data writes per second. The more nodes there are, the slower writes become, because data synchronization between nodes adds network latency that depends on the actual environment; read performance, however, grows stronger, because every node can handle user read requests.

Store

Store, as the name suggests, is the underlying logic implemented by etcd and provides the corresponding API support. To understand the Store, you only need to start from etcd's API. The most common CRUD API calls are listed below.

  • Set a value for a key stored in etcd:
curl http://127.0.0.1:2379/v2/keys/message -X PUT -d value="Hello world"

{
    "action":"set",                     # the operation that was performed
    "node":{
        "createdIndex":2,           # a value that increments every time the etcd Node changes
        "key":"/message",          # the request path
        "modifiedIndex":2,          # similar to node.createdIndex; operations that change modifiedIndex include set, delete, update, create, compareAndSwap and compareAndDelete
        "value":"Hello world"      # the stored content
    }
}
  • Query the value stored by a key in etcd:
curl http://127.0.0.1:2379/v2/keys/message -X GET
  • Modify a key's value:
curl http://127.0.0.1:2379/v2/keys/message -XPUT -d value="Hello etcd"
  • Delete a key:
curl http://127.0.0.1:2379/v2/keys/message -XDELETE

WAL

The data storage of etcd is divided into two parts:

  • Memory storage : Besides sequentially recording all of the users' changes to node data, the in-memory store also indexes the user data and builds heaps and other structures that make queries convenient.
  • Persistent (hard disk) storage : Persistence uses the WAL (Write Ahead Log, write-ahead log) for record storage.

(Figure: WAL LogEntry structure)

The WAL log is binary; parsing it yields the LogEntry data structure shown above. Its fields are:

The first field is type, which has only two values:

  1. 0 means Normal;
  2. 1 means ConfChange, where ConfChange denotes the synchronization of a change to etcd's own configuration, such as a new node joining.

The second field is term; each term represents one Leader's term of office, and it changes whenever the Leader changes.

The third field is index, a strictly and monotonically increasing sequence number that identifies the change.

The fourth field is data, binary content that holds the serialized protobuf structure of the Raft Request object.
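These fields correspond to the log entry type in etcd's raft protobuf definitions; the Go sketch below is a simplified rendering of that layout (the real type is generated protobuf code with additional plumbing):

package main

import "fmt"

// EntryType mirrors the two entry types described above.
type EntryType int32

const (
    EntryNormal     EntryType = 0 // a normal data change
    EntryConfChange EntryType = 1 // a cluster configuration change, e.g. a new member joining
)

// Entry is a simplified rendering of the WAL log entry layout described above.
type Entry struct {
    Type  EntryType // field 1: Normal or ConfChange
    Term  uint64    // field 2: the Leader term the entry was proposed in
    Index uint64    // field 3: strictly increasing change sequence number
    Data  []byte    // field 4: serialized Raft request payload
}

func main() {
    e := Entry{Type: EntryNormal, Term: 3, Index: 42, Data: []byte("...")}
    fmt.Printf("%+v\n", e)
}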

There is a tools/etcd-dump-logs tool in the etcd source tree that can dump the WAL into text for viewing, which helps when analyzing the Raft protocol.

The Raft protocol itself does not care about the application data, that is, the data part of an entry; consistency is achieved by synchronizing the WAL log, with each Node applying the data received from the Leader to its local storage. Raft only cares about the synchronization state of the log. If the local storage implementation has bugs, for example it fails to apply the data correctly, that can also lead to data inconsistency.

In the WAL scheme, all data is logged before it is committed. etcd's persistent storage directory contains two subdirectories:

  1. One is WAL: it stores the change records of all transactions;
  2. The other is Snapshot: it stores the state of all of etcd's data at a given point in time.

Through the combination of WAL and Snapshot, etcd can effectively perform operations such as data storage and node failure recovery.

Why do I need Snapshot?

As usage grows, the data stored in the WAL increases sharply. To keep the disk from filling up quickly, etcd takes a snapshot every 10,000 records by default, after which the WAL files from before the snapshot can be deleted. As a result, the operation history that can be queried through the API defaults to the most recent 1,000 records.

On first startup, etcd stores its startup configuration in the directory specified by the data-dir configuration item. The configuration includes the Local Node ID, the Cluster ID, and the initial cluster information. Users should avoid restarting etcd from an out-of-date data directory, because a Node started with an expired data directory will be inconsistent with the other Nodes in the Cluster; for example, it may ask the Leader again for information it had already agreed, before restarting, that the Leader would store. Therefore, to maximize cluster safety, once there is any possibility of data corruption or loss, remove that Node from the Cluster and then add a new Node with an empty data directory.

The biggest role of the WAL (Write Ahead Log) is to record the entire history of every data change. In etcd, every data change is written to the WAL before it is committed. Storing data through a WAL gives etcd two important capabilities:

  1. Fast failure recovery : When your data is damaged, you can quickly recover to the state before the damage by replaying all the modification operations recorded in the WAL.
  2. Data rollback (undo) and redo : Since all modification operations are recorded in the WAL, rolling back or redoing only requires replaying the operations in the log backwards or forwards.
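A conceptual Go sketch of what "replaying the WAL" means (not etcd's recovery code; the op type and its fields are hypothetical stand-ins for the serialized requests real entries carry):

package main

import "fmt"

// op is a hypothetical logged modification.
type op struct {
    index      uint64
    key, value string
    tombstone  bool // true if this operation deletes the key
}

// replay rebuilds an in-memory state by applying every logged operation in
// order, up to and including maxIndex. Replaying up to an earlier index is
// effectively an undo; replaying further is a redo.
func replay(log []op, maxIndex uint64) map[string]string {
    state := make(map[string]string)
    for _, o := range log {
        if o.index > maxIndex {
            break
        }
        if o.tombstone {
            delete(state, o.key)
        } else {
            state[o.key] = o.value
        }
    }
    return state
}

func main() {
    log := []op{
        {index: 1, key: "message", value: "Hello world"},
        {index: 2, key: "message", value: "Hello etcd"},
        {index: 3, key: "message", tombstone: true},
    }
    fmt.Println(replay(log, 2)) // state before the delete was applied
    fmt.Println(replay(log, 3)) // state after the full log
}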

What are the naming rules for WAL and Snapshot?

WAL files in etcd's data directory are stored in the format $seq-$index.wal. The initial WAL file is 0000000000000000-0000000000000000.wal, meaning it is WAL file number 0 and its initial Raft state index is 0. After running for a while, the log may need to be cut, and new entries are placed in a new WAL file.

Suppose the cluster has run up to Raft state 20 when the WAL file needs to be cut; the next WAL file then becomes 0000000000000001-0000000000000021.wal. If another cut is performed after 10 more operations, the next WAL file name becomes 0000000000000002-0000000000000031.wal. As you can see, the number before the "-" increases by 1 with each cut, and the number after the "-" is determined by the Raft state index at which the new file actually starts.

Snapshot storage and naming are easier to understand: snapshots are stored in the format $term-$index.snap. Here term and index record the state of the Raft node at the moment the snapshot was taken, namely the current term number and the position of the data item.
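A small Go sketch of how such names can be produced, assuming zero-padded hexadecimal fields as in the example names above:

package main

import "fmt"

// walName and snapName build file names of the form $seq-$index.wal and
// $term-$index.snap, with each number rendered as 16-digit zero-padded hex.
func walName(seq, index uint64) string {
    return fmt.Sprintf("%016x-%016x.wal", seq, index)
}

func snapName(term, index uint64) string {
    return fmt.Sprintf("%016x-%016x.snap", term, index)
}

func main() {
    fmt.Println(walName(0, 0))      // 0000000000000000-0000000000000000.wal
    fmt.Println(walName(1, 0x21))   // 0000000000000001-0000000000000021.wal
    fmt.Println(snapName(2, 0x100)) // 0000000000000002-0000000000000100.snap
}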

etcd data model

etcd is designed to reliably store infrequently updated data and to provide a reliable Watch facility. It exposes the historical versions of key-value pairs to support inexpensive snapshots and watching of historical events. These design goals require a persistent, multi-version, concurrency-friendly data model.

When a new version of a key-value pair is saved, the previous version still exists. In effect, key-value pairs are immutable: etcd never updates them in place but always generates a new data structure. To keep historical versions from growing without bound, etcd's storage supports compaction (Compact), which deletes old versions.

Logical view

From a logical point of view, etcd's storage is a flat binary key space. The key space maintains a lexicographic index over the keys (byte strings), so range queries are inexpensive.

The key space maintains multiple revisions (Revisions); each atomic change operation (a transaction may consist of several sub-operations) creates a new revision. Throughout the cluster's lifetime, revisions increase monotonically. Revisions are indexed as well, so range scans over revisions are also efficient. The compaction operation takes a revision number, and revisions smaller than it are removed.

A key's life cycle (from creation to deletion) is called a generation, and each key can have multiple generations. Creating a key increments the key's version (Version); if the key does not exist at the current revision, the version starts at 1. Deleting a key creates a tombstone, sets the version to 0, and ends the current generation. Each modification of the key's value increments its version number, so within one generation the version increases monotonically.

During compaction, any generation that ended before the compaction revision is removed, and value modifications made before that revision are removed as well, except for the most recent one.
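A sketch of how revisions appear through the v3 Go client (the import path and endpoint are assumptions that depend on your etcd version and deployment; error handling is trimmed):

package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"127.0.0.1:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    // Each write produces a new revision of the key space.
    put1, err := cli.Put(ctx, "message", "Hello world")
    if err != nil {
        panic(err)
    }
    put2, err := cli.Put(ctx, "message", "Hello etcd")
    if err != nil {
        panic(err)
    }

    // The older version is still readable by asking for the first write's revision.
    old, _ := cli.Get(ctx, "message", clientv3.WithRev(put1.Header.Revision))
    cur, _ := cli.Get(ctx, "message")
    fmt.Printf("rev %d: %s\n", put1.Header.Revision, old.Kvs[0].Value)
    fmt.Printf("rev %d: %s\n", put2.Header.Revision, cur.Kvs[0].Value)

    // Compact history older than the latest revision; reading the old
    // revision afterwards returns a "required revision has been compacted" error.
    if _, err := cli.Compact(ctx, put2.Header.Revision); err != nil {
        panic(err)
    }
}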

Physical view

etcd stores data in a persistent B+ tree. For efficiency, each revision stores only the state changes (delta) relative to the previous revision. A single revision may correspond to multiple keys in the B+ tree.

The key of the key-value pair is a triple (Major, Sub, Type):

  • Major: Stores the revision of the key value.
  • Sub: Used to distinguish different keys in the same revision.
  • Type: An optional suffix used for special values; for example, t indicates that the value contains a tombstone.

The value of each key-value pair contains the delta relative to the previous revision. Because the B+ tree is ordered by the lexical byte order of these keys, range scans over revisions are fast, and it is convenient to find the value changes from one revision to the next.

etcd also maintains an in-memory B-tree index to speed up range scans over keys. The keys of this index are the user-facing key names, and its values are pointers to the corresponding positions in the persistent B+ tree.
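An illustrative Go rendering of the (Major, Sub, Type) triple described above (the struct, field names, and string encoding are assumptions for illustration, not etcd's internal identifiers):

package main

import "fmt"

// revisionKey models the physical storage key: Major is the revision,
// Sub distinguishes keys changed within the same revision, and Tombstone
// marks a deletion (the optional "t" suffix).
type revisionKey struct {
    Major     int64
    Sub       int64
    Tombstone bool
}

func (k revisionKey) String() string {
    s := fmt.Sprintf("%016x_%016x", k.Major, k.Sub)
    if k.Tombstone {
        s += "t"
    }
    return s
}

func main() {
    // Two keys modified in the same revision share Major and differ in Sub.
    fmt.Println(revisionKey{Major: 5, Sub: 0})
    fmt.Println(revisionKey{Major: 5, Sub: 1, Tombstone: true})
}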

etcd's Proxy mode

Proxy mode means that etcd acts as a reverse proxy and forwards client requests to an available etcd Cluster. This way, you can deploy a Proxy-mode etcd as a local service on every machine; as long as these etcd proxies run normally, your service discovery will be stable and reliable.

Accordingly, a Proxy-mode etcd does not itself join the strongly consistent etcd Cluster. Likewise, the Proxy does not increase the reliability of the cluster, but it also does not reduce the cluster's write performance.

(Figure: etcd Proxy mode)

Why does Proxy mode replace Standby mode?

In fact, every core node (Peer) added to etcd increases the Leader's burden to some extent in terms of network, CPU, and disk, because every change of information must be synchronized and backed up to it. Adding core nodes makes the whole cluster more reliable, but once their number reaches a certain level, the gain in reliability becomes less obvious while the cluster's write-synchronization performance drops. Therefore, adding lightweight Proxy-mode etcd Nodes is an effective alternative to adding etcd core nodes directly.

Proxy mode actually replaces the earlier Standby mode. Besides acting as a forwarding proxy, a node in Standby mode would switch from Standby to normal node mode when a failure left the cluster short of core nodes; when the failed node recovered and the number of core nodes in etcd was back at the preset value, it would switch back to Standby mode.

In newer versions of etcd, however, Proxy mode is enabled only when the etcd Cluster is initially started and the number of core nodes is found to already meet the requirement; there is no automatic conversion back and forth afterwards. The main reasons are as follows:

  • etcd is a component used to guarantee high availability, so the system resources it needs, including memory, disk, and CPU, should all be fully guaranteed. Letting the cluster transform automatically and change its core nodes at will cannot guarantee machine performance. Therefore, etcd officially encourages preparing dedicated machine clusters to run etcd in large deployments.
  • Because the etcd cluster already supports high availability, the failure of some machines does not cause a loss of functionality; when a machine fails, the administrator therefore has enough time to inspect and repair it.
  • Automatic conversion makes the etcd cluster more complicated, especially now that etcd supports monitoring and interaction across multiple network environments; switching between different networks is more error-prone and can make the cluster unstable.

Source: blog.csdn.net/Jmilk/article/details/108912201