Will putting many types under one Elasticsearch index make queries slow?

It mainly depends on the data volume.
ES index optimization can be approached from two angles: one is the indexing (write) process, the other is the retrieval process. This article covers both, starting with indexing. I have shown how to create an index and import data in earlier articles, but you may find that indexing data is slow. Understanding how indexing works lets you optimize it in a targeted way. Compared with the plain Lucene indexing process, the ES indexing process adds a distributed layer, and ES mainly relies on the translog to keep data consistent across nodes. Based on this, a first optimization can be made through the index settings:
"index.translog.flush_threshold_ops": "100000"
"index.refresh_interval": "-1",
The first parameter controls how many operations accumulate in the translog before a flush is triggered; the default is 5000, and each flush is relatively time- and resource-consuming. You can raise this value, or set it to -1 to disable automatic flushing and flush manually instead. The second parameter is the refresh interval. The default is 1s, meaning the index is refreshed periodically throughout its life cycle so that incoming data becomes searchable; a refresh is similar to a commit in Lucene. Recall that after addDocument, a document cannot be retrieved until it has been committed. For bulk loading you can therefore turn automatic refresh off, issue one manual refresh after the initial indexing finishes, and then set index.refresh_interval back to whatever value your requirements call for. This improves the efficiency of the indexing process.
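As a concrete sketch (assuming an index named twitter on a local node, matching the examples later in this article, and the ES 1.x-era API the article targets), these settings can be applied dynamically through the index settings API, with a manual refresh and a restored interval once the bulk load is done:

$ curl -XPUT 'http://localhost:9200/twitter/_settings' -d '{
  "index": {
    "translog.flush_threshold_ops": "100000",
    "refresh_interval": "-1"
  }
}'
# ... bulk indexing runs here ...
$ curl -XPOST 'http://localhost:9200/twitter/_refresh'
$ curl -XPUT 'http://localhost:9200/twitter/_settings' -d '{
  "index": { "refresh_interval": "1s" }
}'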
In addition, if the index has replicas, data is synchronized to the replicas immediately during indexing. I personally recommend setting the number of replicas to 0 during bulk indexing and restoring it as needed once indexing is complete; this also improves indexing efficiency.
"number_of_replicas": 0
Having covered the optimization of the indexing process, let's turn to slow retrieval. Retrieval speed depends heavily on index quality, and index quality in turn depends on several factors.
1. Number of shards

The number of shards is closely related to retrieval speed: either too few or too many shards makes retrieval slow. Too many shards means more files have to be opened during a search and more communication between servers; too few shards means each shard's index is too large, which also slows retrieval.
Before settling on a shard count, run a test with a single server, a single index, and a single shard. For example, I created a one-shard index on an IBM-3650 machine and measured retrieval speed at different data volumes; the workable capacity of a single shard came out at about 20 GB.

The number of shards is therefore: total data volume / single-shard capacity.

Our current data volume is over 400 million documents, and the index is about 1.5 TB. Since this is document data, each record is under 8 KB. Retrieval now stays below 100 ms in the normal case and below 500 ms in special cases; in long-running concurrency tests with 200, 400, 800, 1000, and 1000+ users, the worst case stayed under 750 ms.
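Plugging the numbers above into the formula: 1.5 TB / 20 GB per shard comes to roughly 75 shards. A sketch of creating an index sized this way (the index name docs is illustrative, and replicas start at 0 per the earlier advice):

$ curl -XPUT 'http://localhost:9200/docs' -d '{
  "settings": {
    "number_of_shards": 75,
    "number_of_replicas": 0
  }
}'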
2. Number of replicas

The number of replicas mainly affects the stability of the index. If an ES node crashes, shards are often lost; replicas ensure the integrity of the data in that case. It is advisable to adjust the replica count right after the index is built and Optimize has been executed.

A common misconception is that more replicas mean faster retrieval. This is wrong: replicas do not increase retrieval speed. In my own tests, retrieval speed decreased slightly as the number of replicas grew, so you need to find a balanced value. In addition, once replicas are in place, running the same search twice may return different values. This can be caused by unsynchronized translogs or by shard routing; you can use ?preference=_primary to force the search to run on the primary shards.
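A sketch of such a query against the twitter index used elsewhere in this article (the match query itself is illustrative):

$ curl -XGET 'http://localhost:9200/twitter/_search?preference=_primary' -d '{
  "query": { "match": { "user": "kimchy" } }
}'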
3. Word segmentation

The impact of word segmentation (analysis) on the index can be large or small, depending on how you handle it. Many people assume that a bigger dictionary means better segmentation and higher index quality, but this is not the case. There are many segmentation algorithms, and most are dictionary-based, which means the size of the vocabulary determines the size of the index; segmentation is therefore directly linked to index bloat. The dictionary should not be as large as possible, but should contain only terms that strongly characterize your documents. For example, when indexing academic papers, the closer the dictionary matches the vocabulary of papers, the smaller it can be, and the index size can be reduced substantially while retrieval remains complete and accurate. A smaller index in turn means faster retrieval.
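As a hedged sketch of trimming the term dictionary at the analysis layer (the index name papers and analyzer name paper_analyzer are illustrative; a real domain dictionary would be plugged in through its own token filter), a custom analyzer with a stop-word filter keeps low-value terms out of the index:

$ curl -XPUT 'http://localhost:9200/papers' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "paper_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}'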
4. Index segments

Index segments are the segments concept from Lucene. As described above, the ES indexing process involves refreshes and translog flushes, so an index almost never ends up with a single segment. The segment count is directly related to retrieval: the more segments, the slower the search. Merging down to a single segment can improve retrieval speed by nearly half.
$ curl -XPOST 'http://localhost:9200/twitter/_optimize?max_num_segments=1'
5. Deleted documents

When a document is deleted in Lucene, the data is not removed from disk immediately; instead a new .del file is generated in the Lucene index, and the marked data still participates in retrieval. Lucene checks during the search whether each hit has been deleted and filters it out if so, which reduces retrieval efficiency. You can therefore run a cleanup to physically expunge deleted documents.
$ curl -XPOST 'http://localhost:9200/twitter/_optimize?only_expunge_deletes=true'
