Analysis of Elasticsearch Implementation Principles (Part 2)

Introduction

The first part analyzed the basic implementation principles of Elasticsearch for reading, writing, updating, and storage. This part introduces how Elasticsearch realizes three characteristics of distributed systems (consensus, concurrency, and consistency), along with two concepts internal to shards: the translog (Write Ahead Log, WAL) and Lucene segments.
This chapter mainly covers the following topics:
Consensus: split-brain and the quorum mechanism
Concurrency
Consistency: ensuring consistent writes and reads
Translog (Write Ahead Log, WAL)
Lucene segments

Consensus (split-brain and quorum mechanism)

Consensus is a fundamental challenge for distributed systems. It requires all processes/nodes in a distributed system to agree on the value/state of a given piece of data. There are many consensus algorithms, such as Raft and Paxos, which have been mathematically proven to work, but Elasticsearch implements its own consensus system (zen discovery), for reasons described by Shay Banon (the creator of Elasticsearch).

The zen discovery module has two parts:
Ping: the process by which nodes discover each other.
Unicast: a module containing a list of hostnames that controls which nodes to ping (see the example below).
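A minimal unicast configuration in elasticsearch.yml could look like the following sketch; the host names are placeholders for illustration:

 discovery.zen.ping.unicast.hosts: ["es-node-1", "es-node-2", "es-node-3"]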

Elasticsearch is a peer-to-peer system in which all nodes communicate with each other, and there is a single active master node that updates and controls the cluster state and cluster-wide operations. A new Elasticsearch cluster holds an election, run by the nodes as part of the ping process: among all master-eligible nodes, one is elected as the master node and the other nodes join it.

The default ping_interval is 1 second and the default ping_timeout is 3 seconds. When nodes join the cluster, they send a join request to the master with a join_timeout that defaults to 20 times the ping_timeout.
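These timeouts can be tuned in elasticsearch.yml; the sketch below uses the pre-7.x zen discovery setting names with the default values quoted above:

 discovery.zen.ping_timeout: 3s    # how long to wait for ping responses during master election
 discovery.zen.join_timeout: 60s   # join request timeout, 20 * ping_timeout by default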

If the master node fails, the other nodes in the cluster start pinging again and a new election begins. This ping process also helps when a node wrongly believes the master is down while the master is still reachable by other nodes.

Note: By default, client nodes and data nodes do not contribute to the election process. This can be changed with the following settings in the elasticsearch.yml file:

 discovery.zen.master_election.filter_client: false
 discovery.zen.master_election.filter_data: false

Failure detection works as follows: the master node pings all other nodes to check whether they are alive, and all nodes ping the master back to confirm that it is alive.
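Fault detection has its own group of zen discovery settings; the sketch below uses the pre-7.x setting names with their commonly documented defaults:

 discovery.zen.fd.ping_interval: 1s   # how often a node is pinged
 discovery.zen.fd.ping_timeout: 30s   # how long to wait for a ping response
 discovery.zen.fd.ping_retries: 3     # failed retries before a node is considered down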

With the default configuration, Elasticsearch can run into the split-brain problem: during a network partition, a node may decide that the master is dead and elect itself as master, which results in a cluster with multiple master nodes. This can lead to data loss, because the divergent cluster states may be impossible to merge correctly. The situation can be avoided by setting the following property to a quorum of the master-eligible nodes:

discovery.zen.minimum_master_nodes = int(# of master-eligible nodes / 2) + 1

(Figures a and b: cluster behavior with and without minimum_master_nodes set)

This parameter requires a quorum of the election-eligible nodes to join a new election before the election can complete and a new master can be accepted. It is an extremely important parameter for cluster stability and can be updated dynamically if the cluster size changes. Figures a and b show the two cases where the minimum_master_nodes property is set and not set, respectively.
Note: For a production cluster, it is recommended to have three dedicated master nodes; only one is active at any given point in time, and the other nodes do not serve any client requests.
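For example, with three master-eligible nodes the formula gives int(3/2) + 1 = 2. Because minimum_master_nodes is a dynamic setting, it can also be updated at runtime through the cluster settings API; a minimal sketch, assuming the three-node example above:

 PUT /_cluster/settings
 {
   "persistent": {
     "discovery.zen.minimum_master_nodes": 2
   }
 }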

Concurrency (concurrency control)

Elasticsearch is a distributed system and supports concurrent requests. When a create/update/delete request hits the primary shard, it is also sent to the replica shards in parallel, but the order in which these requests arrive is indeterminate. In this case, Elasticsearch uses optimistic concurrency control to ensure that newer versions of a document are not overwritten by older versions.
Every indexed document has a version number, which is incremented each time the document changes. Version numbers ensure that changes to a document are applied in order. To make sure that updates from our application do not lose data, Elasticsearch's API allows you to specify the current version number that the change should apply to. If the version specified in the request is older than the version stored in the shard, the request fails, which means the document has already been updated by another process.
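As an illustration, a version-aware update in the 2.x-style API (the index, type, id, and field below are placeholders) might look like this:

 PUT /website/blog/1?version=5
 {
   "title": "Updated title"
 }

If the document's current version is not 5, the request is rejected with a 409 Conflict (version_conflict_engine_exception), signalling that another process has already updated the document.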
How should failed requests be handled? This can be controlled at the application level. There are other locking options available; you can read about them at https://www.elastic.co/guide/en/elasticsearch/guide/2.x/concurrency-solutions.html
When we send concurrent requests to Elasticsearch, the next question is: how do we keep these requests consistent? It is not clear-cut how to answer where Elasticsearch falls in terms of the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem), and that debate is a topic of its own.

However, we will review how to achieve consistent writes and reads using Elasticsearch.

Consistency (ensuring consistent writes and reads)

For write operations, Elasticsearch supports a consistency level that is different from most other databases: a preliminary check of how many shards are available is made before the write is allowed. The settable values are one, quorum, and all. The default is quorum, which means writes are only allowed when a majority of the shards are available. Even when a majority of shards are available, a write to a replica can still fail for some reason; in that case the replica is considered faulty and the shard is rebuilt on a different node.
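As a sketch of this older write-consistency check, the level can be passed as a per-request parameter on an index request (the index, type, and document below are placeholders; in Elasticsearch 5.x and later this parameter was replaced by wait_for_active_shards):

 PUT /my_index/my_type/1?consistency=quorum
 {
   "user": "alice"
 }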
For read operations, new documents are not searchable until after the refresh interval. To ensure that search results come from the latest version of a document, replication can be set to sync (the default), so that a write operation does not return until it has completed on both the primary and the replica shards. In this case, a search request to any shard returns results from the latest version of the document.
Even if your application sets replication=async for faster indexing, the _preference parameter can be set to primary for search requests. That way, search requests query the primary shard, which ensures that the results come from the latest version of the document.
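A sketch of such a primary-only search (the index name is a placeholder; older releases spell the value _primary, and the option was removed in later versions):

 GET /my_index/_search?preference=_primary
 {
   "query": { "match_all": {} }
 }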
Now that we understand how Elasticsearch handles consensus, concurrency, and consistency, let's look at some important concepts inside a shard that give Elasticsearch its characteristics as a distributed search engine.

Translog

The concept of a write ahead log (WAL), or transaction log (translog), has existed in the database world since relational databases were developed. The basic principle is that intended changes must be logged and committed before the actual changes are committed to disk; this is how the translog ensures data integrity in the event of a failure.
The Lucene index changes when new documents are indexed or old documents are updated, and those changes must be committed to disk for persistence. Committing after every write request would be a very expensive operation, so Lucene persists multiple changes to disk at once instead. As described in the previous post, a flush operation (Lucene commit) is performed every 30 minutes by default, or when the translog becomes too large (512 MB by default). With commits that infrequent, it would be possible to lose all the changes made between two Lucene commits. To avoid this problem, Elasticsearch uses the translog: all index/delete/update operations are first written to the translog, and the translog is fsync'ed after every index/delete/update operation (or every 5 seconds by default) to ensure the changes are persisted. The client receives a write acknowledgment only after the translog has been fsync'ed on both the primary and the replica shards.
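The behavior described above corresponds to a few index-level translog settings; the sketch below uses the setting names from the 2.x/5.x documentation with the defaults quoted in this section:

 index.translog.durability: request            # fsync the translog after every request before acknowledging it (default)
 index.translog.sync_interval: 5s              # fsync interval used when durability is set to async
 index.translog.flush_threshold_size: 512mb    # trigger a flush (Lucene commit) once the translog reaches this size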
In the event of a hardware failure or a restart between two Lucene commits, the translog is replayed to recover any changes lost since the last Lucene commit and apply them to the index.
It is recommended to explicitly flush the translog before restarting an Elasticsearch instance: startup will be faster because the translog to be replayed will be empty. The POST /_all/_flush command can be used to flush all indices in the cluster.
During a translog flush, the segments in the filesystem cache are committed to disk, making the changes to the index persistent.
Now let's take a look at the Lucene segment:

Lucene Segments
Lucene indexes consist of segments, which are themselves fully functional inverted indexes. Segments are immutable, which allows Lucene to add new documents to the index incrementally without rebuilding the index from scratch. For every search request, all segments are searched, and each segment consumes CPU cycles, file handles, and memory. This means that the more segments there are, the lower the search performance.
To solve this problem, Elasticsearch merges small segments into a larger segment (as shown in the figure below), commits the newly merged segment to disk, and deletes the old, smaller segments.

(Figure: small segments being merged into a larger segment)

Merging happens automatically in the background without interrupting indexing or searching. Because segment merging can exhaust resources and affect search performance, Elasticsearch throttles the resources used by the merge process so that enough resources remain for search.
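To inspect the segments of each index, the cat API can be used, and a merge can be forced manually on an index that is no longer being written to (the index name is a placeholder; before Elasticsearch 2.1 the force-merge endpoint was called _optimize):

 GET /_cat/segments?v
 POST /my_index/_forcemerge?max_num_segments=1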

Reference document
Anatomy of an Elasticsearch Cluster: Part II
https://blog.insightdatascience.com/anatomy-of-an-elasticsearch-cluster-part-ii-6db4e821b571
