Elasticsearch 8.X reindex source code analysis and speed-up guide

1. Reindex source code on GitHub

For easy verification, here is the GitHub address of the reindex source code.

https://github.com/elastic/elasticsearch/blob/001fcfb931454d760dbccff9f4d1b8d113f8708c/server/src/main/java/org/elasticsearch/index/reindex/ReindexRequest.java


2. The essence of reindex, as seen in the source code

The essence of the reindex operation is to read documents from one or more source indexes and index these documents into a target index, which may also involve some transformation of the documents.

The following are the key points of the reindex operation derived from the source code:

2.1 Source and target

ReindexRequest defines the source index (from which documents are read) and the target index (into which documents are indexed).

2.2 Query and filtering

You can define a query on the source index (through the setSourceQuery method) to select which documents are reindexed.

In other words, only the data that matches a given search query is migrated.
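As a rough illustration of query-based filtering, the sketch below builds the JSON body for a filtered POST _reindex request. The index names, the status field, and the helper name build_filtered_reindex are made up for the example; only the source.query placement reflects the reindex API.

```python
import json


def build_filtered_reindex(source_index, dest_index, query):
    """Return a _reindex request body that copies only documents matching `query`."""
    return {
        "source": {"index": source_index, "query": query},
        "dest": {"index": dest_index},
    }


# Hypothetical index names and field; adjust them to your own mapping.
body = build_filtered_reindex(
    "old_index", "new_index", {"term": {"status": "published"}}
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after POST _reindex in the Kibana console.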

2.3 Document conversion

If a script is provided, it can modify or transform each document before it is moved from the source index to the target index.
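A minimal sketch of a scripted reindex body follows. The Painless one-liner, the migrated_by field, and the helper name are illustrative assumptions, not something taken from the source code discussed here.

```python
import json


def build_scripted_reindex(source_index, dest_index, script_source, params=None):
    """Return a _reindex body that runs a Painless script on every document."""
    script = {"lang": "painless", "source": script_source}
    if params:
        script["params"] = params
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "script": script,
    }


# Hypothetical transformation: tag every migrated document.
body = build_scripted_reindex(
    "old_index",
    "new_index",
    "ctx._source.migrated_by = params.tag",
    {"tag": "reindex"},
)
print(json.dumps(body, indent=2))
```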

2.4 Batch processing

Documents are batch-read from the source index and batch-indexed to the target index.

The size of the batch can be adjusted through the setSourceBatchSize method.

The source code does not state an upper limit for this value (the default is 1000 documents per batch), but keep the following in mind:

  • A very large scroll size can stress the cluster, because it increases memory usage and data transfer between cluster nodes.

  • It is therefore important to choose a scroll size that gives good performance without overloading the cluster.
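In the REST API, the batch size corresponds to the size field inside source. The sketch below is one way to build such a body; the helper name and the value 2000 are arbitrary choices for illustration, and the right value for a given cluster has to be found by testing.

```python
import json

DEFAULT_BATCH_SIZE = 1000  # batch size reindex uses when source.size is not set


def build_batched_reindex(source_index, dest_index, batch_size=DEFAULT_BATCH_SIZE):
    """Return a _reindex body whose source.size sets the documents per batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return {
        "source": {"index": source_index, "size": batch_size},
        "dest": {"index": dest_index},
    }


# Hypothetical index names; 2000 is only an example value to benchmark.
body = build_batched_reindex("old_index", "new_index", batch_size=2000)
print(json.dumps(body, indent=2))
```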

2.5 Remote source index

[Figure 1: reindex between indexes in the current cluster]

Reindex can not only move documents between indexes in the current Elasticsearch cluster (shown in Figure 1), but it can also read documents from a remote Elasticsearch cluster (shown in Figure 2).

[Figure 2: reindex from a remote cluster]

This is achieved through the RemoteInfo class, which contains all necessary information about the remote cluster.

The information included is as follows:

  • The address of the remote cluster (may be a URL or URI)

  • Authentication information (such as username and password)

  • Request headers (specific header information customized for remote requests)

  • Connection timeout and socket timeout

  • Other configuration information needed to interact with the remote cluster
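The items listed above map onto the remote object of a reindex request body. The sketch below shows one possible way to assemble it; the host, credentials, header, and timeout values are placeholders, and the helper name is hypothetical.

```python
import json


def build_remote_reindex(host, source_index, dest_index, username=None,
                         password=None, headers=None,
                         connect_timeout="30s", socket_timeout="30s"):
    """Return a _reindex body that reads documents from a remote cluster."""
    remote = {
        "host": host,
        "connect_timeout": connect_timeout,
        "socket_timeout": socket_timeout,
    }
    if username is not None:
        # Basic-auth credentials for the remote cluster.
        remote["username"] = username
        remote["password"] = password
    if headers:
        # Extra HTTP headers sent with every remote request.
        remote["headers"] = headers
    return {
        "source": {"remote": remote, "index": source_index},
        "dest": {"index": dest_index},
    }


# Placeholder host and credentials, for illustration only.
body = build_remote_reindex(
    "http://otherhost:9200", "old_index", "new_index",
    username="user", password="pass", socket_timeout="1m",
)
print(json.dumps(body, indent=2))
```

Note that the destination cluster only accepts remote hosts listed in its reindex.remote.whitelist setting.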

2.6 Verification

Before the reindex operation runs, a series of checks (in the validate method) ensures that the request is valid.

2.7 Serialization/Deserialization

The ReindexRequest class contains methods for serializing requests to, and deserializing them from, the network transport format.

This allows Elasticsearch nodes to communicate efficiently with each other and perform reindex requests.

2.8 Output

ReindexRequest can be converted to a descriptive string (using the toString method) or an XContent format (usually JSON, using the toXContent method), which is useful for logging and debugging.

To summarize, the essence of a reindex operation is to read documents from a source index, possibly perform some transformations, and then index those documents into a target index.

This operation can be performed between indexes in the current cluster or across clusters. This is a powerful method that can be used for tasks such as data migration, index reorganization, data transformation, etc.

3. Speeding up reindex

The speed of the reindex operation is affected by several factors. If you want to speed up the reindex operation, here are some suggestions:

3.1 Adjust batch size

ReindexRequest has a setSourceBatchSize method that allows us to set the number of documents per batch.

/**
 * Sets the scroll size for setting how many documents are to be processed in one batch during reindex
 */
public ReindexRequest setSourceBatchSize(int size) {
    this.getSearchRequest().source().size(size);
    return this;
}

Increasing the batch size may improve performance, but be aware that batches that are too large may cause memory issues or request timeouts.

3.2 slice parallel processing

Slices can really help speed up re-indexing operations in Elasticsearch. Slices are a way to break up a large query into smaller parts and execute them in parallel, making the overall operation faster.

In the ReindexRequest class we can see the method forSlice (TaskId slicingTask, SearchRequest slice, int totalSlices) which allows us to create a sub-slice for a given scrolling request.

How to put it into practice?

Setting the number of slices: when we run a reindex operation, we can set the slices parameter to specify how many slices we want.

For example, with slices: 5, Elasticsearch tries to split the query into 5 subqueries and distribute the documents across them as evenly as possible.

Parallel execution speeds things up

With slicing, each slice can run in parallel on a separate thread or node. If we have multiple nodes or enough spare resources, slicing can significantly increase reindex speed.

Actual command:

In the Elasticsearch REST API, the command to perform a re-indexing operation with slicing may be as follows:

POST _reindex
{
  "source": {
    "index": "old_index",
    "size": 1000,
    "slice": {
      "id": 0,
      "max": 5
    }
  },
  "dest": {
    "index": "new_index"
  }
}

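In the above command, max divides the source index into 5 slices and id selects the current slice. This is manual slicing: the request as written processes slice 0 only.

To execute all slices in parallel, you need to send this command once for each slice id (in this case, 0 to 4). Alternatively, the slices request parameter (for example, POST _reindex?slices=5) asks Elasticsearch to do the splitting automatically.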
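The manual slicing described above means one request body per slice id. A small sketch that generates all of them (the helper name is made up; the index names follow the example above):

```python
import json


def build_slice_requests(source_index, dest_index, total_slices, batch_size=1000):
    """Return one _reindex body per slice id, from 0 to total_slices - 1."""
    bodies = []
    for slice_id in range(total_slices):
        bodies.append({
            "source": {
                "index": source_index,
                "size": batch_size,
                "slice": {"id": slice_id, "max": total_slices},
            },
            "dest": {"index": dest_index},
        })
    return bodies


requests_to_send = build_slice_requests("old_index", "new_index", total_slices=5)
for body in requests_to_send:
    print(json.dumps(body))
```

Each printed body is sent as its own POST _reindex request; together the five requests cover the whole source index.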

Notes on slices

While slicing can speed up the operation, it also increases the load on the cluster, because each slice creates its own scroll context. Make sure your Elasticsearch cluster has enough resources to handle the number of slices you choose.

The optimal number of slicing operations depends on the data, query, and cluster configuration. It may take some performance testing to find the optimal number of slices.

In general, slices can significantly improve the speed of re-indexing operations (we will verify this in a moment and prove it with facts), but you need to ensure that you use it correctly so that you increase the speed without overburdening the cluster.

3.3 Optimize query

If we use a query to filter documents in the reindex request, make sure the query is efficient. Avoid complex or expensive queries where possible, for example deeply nested queries, wildcard queries, and fuzzy queries.

3.4 Add hardware resources

Increasing the CPU, memory, and I/O capabilities of the Elasticsearch node can increase the speed of reindex.

If we are reindexing from a remote cluster, make sure both clusters have sufficient resources.

This is suitable for situations where the amount of data is extremely large.

3.5 Optimize index settings

Temporarily disable features such as refresh and replicas on the target index, then enable them again once reindexing is complete:

Set index.number_of_replicas to 0 to disable replicas.

Set index.refresh_interval to -1 to disable refresh.
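These two settings can be applied with PUT new_index/_settings before the reindex and reverted afterwards. The sketch below only builds the two request bodies; the restored values (1 replica, a 1s refresh interval) are assumptions and should be replaced with whatever the target index actually needs.

```python
import json


def bulk_load_settings():
    """Settings body to apply to the target index before reindexing."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}


def restore_settings(replicas=1, refresh_interval="1s"):
    """Settings body to apply after reindexing (assumed original values)."""
    return {
        "index": {
            "number_of_replicas": replicas,
            "refresh_interval": refresh_interval,
        }
    }


before = bulk_load_settings()
after = restore_settings()
print("PUT new_index/_settings", json.dumps(before))
print("PUT new_index/_settings", json.dumps(after))
```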

3.6 Using Ingest Pipelines

If we are using scripts to transform documents during reindex, consider using an ingest pipeline instead, which may be more efficient than a script.
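A reindex request selects an ingest pipeline through the pipeline field of dest. A minimal sketch, assuming a pipeline named my_pipeline has already been created (the pipeline name and the helper name are hypothetical):

```python
import json


def build_pipeline_reindex(source_index, dest_index, pipeline):
    """Return a _reindex body that routes documents through an ingest pipeline."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index, "pipeline": pipeline},
    }


# "my_pipeline" is a placeholder for an existing ingest pipeline.
body = build_pipeline_reindex("old_index", "new_index", "my_pipeline")
print(json.dumps(body, indent=2))
```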

3.7 Network optimization

If reindexing from a remote cluster, make sure the network connection is fast and has low latency. Limit other network-intensive operations so that the reindex requests can make full use of the available bandwidth.

This is a secondary suggestion and largely common sense, but it is still worth keeping in mind.

3.8 Restrict other operations

Try to run reindex during off-peak hours for the cluster, and limit other resource-intensive operations at the same time, such as large searches, other indexing jobs, or segment merging.

3.9 Check plug-ins and external scripts

Make sure there are no plugins or external scripts affecting the performance of the reindex operation.

3.10 Monitor and tune

Use Elasticsearch monitoring tools, such as the Elastic Stack monitoring function, to monitor the performance of the reindex operation. This helps us identify bottlenecks and tune accordingly.

With these recommendations in mind, it is best to test in an environment that closely resembles production to find the settings and optimization strategy that work best.

4. Verifying that slices speed up reindex

4.1 Preparation work

  • Condition 1 - Select or create a large enough data set.

A large index is required for the performance difference to be noticeable. Small data sets may not show significant differences.

  • Condition 2 - Ensure cluster health.

Make sure the Elasticsearch cluster is healthy, all nodes are online, and there are no pending tasks before starting testing.

  • Condition 3 - Stop other heavy operations.

Make sure there are no other large queries or index operations running on the cluster that could affect the performance test results.

4.2 Re-indexing without using slices

  • Record start time.

  • Use the _reindex API to perform reindexing operations without using slices.

  • Record completion time.

  • Calculate duration.

## Method 1: direct migration.
## Response: "took": 4005 (milliseconds)

POST _reindex
{
  "source": {
    "index": "image_index"
  },
  "dest": {
    "index": "image_index_direct"
  }
}

GET image_index_direct/_count
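The record start time, run, record completion time, calculate duration steps can be wrapped in a small helper. In the sketch below, fake_reindex stands in for whatever client call actually sends the request; it and timed_ms are hypothetical names, and no request is really sent.

```python
import time


def timed_ms(action):
    """Run `action()` and return (result, elapsed wall-clock milliseconds)."""
    start = time.perf_counter()
    result = action()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_reindex():
    """Stand-in for a real POST _reindex call; sleeps instead of migrating data."""
    time.sleep(0.05)
    return {"took": 50}


result, elapsed_ms = timed_ms(fake_reindex)
print(f"client-side duration: {elapsed_ms:.0f} ms, reported took: {result['took']} ms")
```

The took value in the response is measured on the server side, so the client-side duration is normally a little larger.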

4.3 Re-indexing using slices

  • Choose a number of slices: for example, if there are 5 data nodes, we might want to try 5 slices.

  • Record start time.

  • Use the _reindex API to perform a reindex operation, creating a separate request for each slice. You can use concurrency tools (such as the parallel command or script) to run all requests in parallel.

  • Record the time when all slices are completed.

  • Calculate the total duration.

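## Method 2: with slices for parallel processing. This request covers slice 0 only; repeat it with id 1 to 4 to migrate the whole index.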
 
POST _reindex
{
  "source": {
    "index": "image_index",
    "slice": {
      "id": 0,
      "max": 5
    }
  },
  "dest": {
    "index": "image_index_slice"
  }
}
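One way to send the per-slice requests concurrently is a thread pool, sketched below. The send argument is a hypothetical callable that would POST one body to _reindex and return the parsed response; fake_send replaces it here so the example runs without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def run_slices_in_parallel(send, source_index, dest_index, total_slices):
    """Build one body per slice id and pass them to `send` concurrently."""
    bodies = [
        {
            "source": {
                "index": source_index,
                "slice": {"id": slice_id, "max": total_slices},
            },
            "dest": {"index": dest_index},
        }
        for slice_id in range(total_slices)
    ]
    # One worker per slice so that all requests are in flight at the same time.
    with ThreadPoolExecutor(max_workers=total_slices) as pool:
        return list(pool.map(send, bodies))


def fake_send(body):
    """Stand-in for an HTTP call; echoes which slice it was asked to process."""
    return {"slice": body["source"]["slice"]["id"]}


responses = run_slices_in_parallel(fake_send, "image_index", "image_index_slice", 5)
print(responses)
```

The total duration to record is the time until the last slice returns, not the sum of the individual took values.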

4.4 Comparison

Compare the total time of two re-indexings. In theory, the version using slices should be faster, especially in clusters with multiple nodes and large amounts of data.

As shown in the video, I first verified this on a small scale.

For a 16 MB index holding tens of thousands of documents, the migration results compare as follows:

Migration method     Time taken
Direct migration     4005 ms
Slice migration      1123 ms

For a 112 MB index holding 150,000 Changjin Lake movie reviews, the migration results compare as follows:

Migration method     Time taken
Direct migration     About 30000 ms (as seen in the video playback afterwards)
Slice migration      10263 ms

From these reindex results on two data sets of different sizes, adding slices made the migration roughly 3 to 3.6 times faster!
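The speed-up factors follow directly from the measured times in the two tables. A quick check of the arithmetic (the 30000 ms baseline is the approximate figure read from the video):

```python
def speedup(baseline_ms, sliced_ms):
    """Return how many times faster the sliced run was than the baseline."""
    return baseline_ms / sliced_ms


small = speedup(4005, 1123)    # 16 MB data set
large = speedup(30000, 10263)  # 112 MB data set, approximate baseline

print(f"16 MB data set:  {small:.2f}x")
print(f"112 MB data set: {large:.2f}x")
```

This gives about 3.57x for the 16 MB data set and about 2.92x for the 112 MB data set.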

For clusters with more nodes and larger data volumes, we look forward to your feedback on the results.

5. Summary

A plan only becomes convincing once you put it into practice and it produces measurable results!

come on!

Origin blog.csdn.net/wojiushiwo987/article/details/132463417