Seven Principles for Using Elasticsearch Well: Easy Fun ES

Ants brother 2019-05-20 14:46:33 Internet

First, hardware environment selection

If conditions allow, use SSD drives and good CPUs. ES's power comes from both its distributed architecture and Lucene; improving IO will greatly improve ES speed and performance. For memory, machines with 64G of RAM are generally the preferred node configuration.

Second, system topology design

When designing the topology of an ES cluster, the Hot-Warm architecture pattern is generally used, i.e. three different types of nodes are provisioned: Master nodes, Hot nodes, and Warm nodes.

Master node settings: usually three dedicated master nodes are provisioned to give the best elasticity and scalability. Care must be taken to set the discovery.zen.minimum_master_nodes property to prevent split-brain problems, using the formula N / 2 + 1 (where N is the number of master-eligible nodes). These nodes should also set node.data: false; because master nodes do not participate in query or indexing operations and are only responsible for cluster management, their CPU, memory, and disk configuration can be much lower than that of data nodes.
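The quorum formula and the dedicated-master settings above can be sketched as follows (the dict is simply an elasticsearch.yml equivalent, not from the original article):

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum formula N / 2 + 1 from the article (integer division)."""
    return master_eligible // 2 + 1

# A dedicated master node: eligible for election, holds no data,
# runs no ingest pipelines.
dedicated_master_config = {
    "node.master": True,
    "node.data": False,
    "node.ingest": False,
    "discovery.zen.minimum_master_nodes": minimum_master_nodes(3),
}

print(dedicated_master_config["discovery.zen.minimum_master_nodes"])  # 2
```

With three master-eligible nodes the quorum is 2, so the cluster survives the loss of one master candidate without risking split-brain.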

Hot node settings: these are the indexing (write) nodes, and they also hold the most recently and frequently used indices. Since indexing is IO- and CPU-intensive, SSD disks are recommended for good write performance; three or more such nodes are generally configured. A Hot node is marked with:

node.attr.box_type: hot

For an index, setting

index.routing.allocation.require.box_type: hot

routes that index's writes to the hot nodes.
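As a sketch, the index-creation body that pins an index to the hot tier might look like this (the shard and replica counts are illustrative, not from the article):

```python
# Settings body for PUT /<index>: every shard of this index will be
# allocated only to nodes started with node.attr.box_type: hot.
hot_index_settings = {
    "settings": {
        "index.routing.allocation.require.box_type": "hot",
        "index.number_of_shards": 3,
        "index.number_of_replicas": 1,
    }
}
```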

Warm node settings: for read-only indices that are accessed infrequently. Because they are not accessed often, ordinary disks can generally be used. Memory and CPU should be configured consistently with the Hot nodes; three or more nodes are generally used.

When an index is no longer frequently queried, setting

index.routing.allocation.require.box_type: warm

marks the index as warm, ensuring it is no longer kept on the hot nodes and making sensible use of the SSD resources. Once this property is set, ES automatically moves the index onto the warm nodes. At the same time, index.codec: best_compression can be configured (in elasticsearch.yml on the warm nodes) to guarantee compression on the warm tier.
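A sketch of the settings-update body that migrates an existing index to the warm tier, combining the two settings above:

```python
# Body for PUT /<index>/_settings: ES relocates the shards to nodes
# tagged node.attr.box_type: warm, and best_compression takes effect
# as segments are rewritten.
warm_migration_settings = {
    "index.routing.allocation.require.box_type": "warm",
    "index.codec": "best_compression",
}
```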

Coordinating nodes: a coordinating node distributes a query across the shards and then integrates the data returned by each shard before returning the result. In an ES cluster, every node can act as a coordinating node, but a dedicated coordinating node can be set up by setting node.master, node.data, and node.ingest all to false. Such nodes need a better CPU and more memory.

Third, ES memory settings

ES is built on Lucene, and Lucene's power lies in how well it uses operating-system memory to cache index data and thereby provide fast query performance. Lucene stores index segments in individual immutable files, which is very OS-friendly: the OS can keep the index files in its cache for fast access. Therefore we should leave half of the physical memory to Lucene and give the other half to ES (the JVM heap). For ES memory settings, the following principles can be followed:

1. When the machine has less than 64G of memory, follow the general rule: 50% to ES, 50% left to Lucene.

2. When the machine has more than 64G of memory, follow these guidelines:

a. If the main usage scenario is full-text search, it is recommended to give the ES heap 4~32G of memory and reserve the rest for the operating system, for Lucene's use (the segments cache), to provide faster query performance.

b. If the main usage scenarios are aggregation or sorting, mostly on numeric, date, geo_point, and not_analyzed string fields, it is likewise recommended to allocate 4~32G to the ES heap and leave the rest of the memory to the operating system for Lucene's use (the doc values cache), which provides fast document-based aggregation and sorting performance.

c. If the usage scenario is sorting or aggregation over analyzed string data, a larger heap is needed; it is recommended to run multiple ES instances on the machine, keeping each instance's heap below 50% of its share (and no more than 32G — with a heap of 32G or less the JVM can use compressed object pointers to save space), leaving the remainder to Lucene.

3. Do not swap. Once memory is allowed to swap to disk, it will cause fatal performance problems.

Set bootstrap.memory_lock: true in elasticsearch.yml to keep the JVM memory locked and guarantee ES performance.
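The heap-sizing rule above (50% of RAM, never more than 32G so compressed object pointers stay enabled) can be expressed as a small function:

```python
def es_heap_gb(machine_ram_gb: int) -> int:
    """Heap size per the article: half of RAM, capped at 32 GB."""
    return min(machine_ram_gb // 2, 32)

print(es_heap_gb(64))   # 32
print(es_heap_gb(16))   # 8
print(es_heap_gb(128))  # 32 (the remaining 96G is left to Lucene / the OS)
```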

4. GC settings principles:

a. Keep the existing GC settings; the default is Concurrent-Mark-Sweep (CMS). Do not switch to G1GC, because the current G1 still has many bugs.

b. Keep the existing thread pool settings. The thread pools have been well optimized since ES 1.x, so the defaults can be kept; the default thread pool size equals the number of CPU cores. If you must change it, set it according to the formula ((CPU cores * 3) / 2) + 1, and never exceed 2 times the number of CPU cores; modifying the default configuration is not recommended, as it can thrash the CPU.
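The article's sizing formula, with its 2x-cores cap applied, can be sketched as:

```python
def thread_pool_size(cpu_cores: int) -> int:
    """((cores * 3) / 2) + 1, capped at 2x cores, per the article."""
    return min((cpu_cores * 3) // 2 + 1, cpu_cores * 2)

print(thread_pool_size(8))  # 13
```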

Fourth, cluster shard settings

Once an ES index is created, its shard settings cannot be adjusted. In ES, each shard corresponds to a Lucene index, and reading and writing a Lucene index consumes a lot of system resources, so the number of shards must not be set too large. Allocating a reasonable number of shards when creating an index is therefore very important. In general, follow these guidelines:

1. Keep the disk capacity occupied by each shard below the maximum JVM heap configured for ES (generally no more than 32G; see the JVM sizing principles above). Therefore, if the total index capacity is around 500G, the shard count should be around 16; of course, it is best to also take principle 2 into account.

2. Consider the number of nodes. A node is generally a single physical machine. If the number of shards is too large, far exceeding the number of nodes, many shards are likely to land on one node; if that node fails, even with multiple replicas, data may still be lost and the cluster cannot recover. Therefore the shard count is generally set to no more than 3 times the number of nodes.
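Combining the two guidelines, a sketch of shard-count selection (the function name and the idea of taking the minimum of the two rules are my framing, not the article's):

```python
import math

def shard_count(index_size_gb: float, node_count: int,
                max_shard_gb: float = 32.0) -> int:
    """Rule 1: keep each shard under the max heap (~32G).
    Rule 2: never exceed 3x the node count."""
    by_capacity = math.ceil(index_size_gb / max_shard_gb)
    cap = node_count * 3
    return min(by_capacity, cap)

print(shard_count(500, 6))  # 16 -- matches the article's 500G example
```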

Fifth, Mapping modeling

1. Avoid nested and parent/child types whenever possible; nested queries are slow, and parent/child queries are slower still — a hundred times slower than nested queries. Resolve relationships at the mapping design stage if at all possible (e.g. by using a wide table design or a smarter data structure), and do not use parent-child mappings.

2. If you must use nested fields, make sure the number of nested fields stays small; the current ES default limit is 50. Reference:

index.mapping.nested_fields.limit: 50

This is because, for a document, every nested field value generates a separate hidden document, which dramatically increases the doc count and hurts query efficiency, especially JOIN efficiency.

3. Avoid using dynamic values as field names (keys); a constantly growing dynamic mapping can bring the cluster down. Likewise, control the number of fields: fields the business does not use should not be indexed. Controlling the number of indexed fields, the mapping depth, and the indexed field types is the most important part of ES performance tuning. Below are some of ES's defaults for field counts and mapping depth:

index.mapping.nested_objects.limit: 10000

index.mapping.total_fields.limit: 1000

index.mapping.depth.limit: 20
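To actively enforce tighter bounds than the defaults, these limits can be set per index at creation time; a sketch (the specific values below are illustrative, stricter than the defaults listed above):

```python
# Settings body for PUT /<index>: mapping updates that would exceed
# these limits are rejected instead of growing unbounded.
strict_mapping_settings = {
    "settings": {
        "index.mapping.total_fields.limit": 200,
        "index.mapping.depth.limit": 5,
        "index.mapping.nested_fields.limit": 10,
    }
}
```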

Sixth, index optimization settings

1. Set refresh_interval to -1 and number_of_replicas to 0: disabling the periodic refresh and not keeping replicas improves write performance (restore both once the bulk load is finished).

2. Modify the index_buffer_size settings; this can be set as a percentage or as a specific size, and different sizes can be tested depending on the size of the cluster:

indices.memory.index_buffer_size: 10% (default)

indices.memory.min_index_buffer_size: 48mb (default)

indices.memory.max_index_buffer_size

3. Modify the translog-related settings:

a. Control the frequency at which data is synced from memory to disk, to reduce disk IO; sync_interval can be set larger.

index.translog.sync_interval: 5s (default)

b. Control the size of the translog data block; once the threshold size is reached, it is flushed to a Lucene index file.

index.translog.flush_threshold_size: 512mb (default)
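A sketch combining the write-optimization knobs above into one settings-update body for a bulk-load phase (the 30s and 1gb values are illustrative enlargements of the defaults, and the settings should be restored after loading):

```python
# Body for PUT /<index>/_settings before a bulk load: no periodic
# refresh, no replicas, and a lazier translog.
bulk_load_settings = {
    "index.refresh_interval": "-1",
    "index.number_of_replicas": 0,
    "index.translog.sync_interval": "30s",
    "index.translog.flush_threshold_size": "1gb",
}
```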

4. Use of the _id field: avoid custom _id values as far as possible, to avoid version lookups for the IDs; it is recommended to use ES's default ID generation strategy, or numeric IDs, as the primary key.

5. Use of the _all and _source fields: pay attention to the scenario and the actual need. The _all field concatenates all indexed fields, which is convenient for full-text search; if there is no such requirement, it can be disabled. _source stores the original document content; if there is no need to retrieve the original document data, the fields kept in _source can be restricted by setting its includes and excludes properties.
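A sketch of a mapping that applies both ideas (the field names "title", "content", and "raw_payload" are made-up examples):

```python
# Mapping body: _all disabled, and _source trimmed so large fields that
# never need to be returned are not stored.
mapping = {
    "mappings": {
        "_all": {"enabled": False},
        "_source": {
            "includes": ["title", "content"],
            "excludes": ["raw_payload"],
        },
    }
}
```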

6. Configure each field's index property (analyzed vs. not_analyzed) reasonably — i.e. control, according to business needs, whether the field is tokenized or not. Fields that are only needed for group-by should be set to not_analyzed, to improve query and aggregation efficiency.

Seventh, query optimization

1. The more fields a query_string or multi_match query touches, the slower the query. At the mapping stage you can use the copy_to property to index the values of multiple columns into a new field, and then query that single new field in multi_match.
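A sketch of the copy_to pattern (the field names are made-up examples):

```python
# Mapping: both name fields are copied into full_name at index time,
# so a search can hit one field instead of multi-matching two.
mapping = {
    "properties": {
        "first_name": {"type": "text", "copy_to": "full_name"},
        "last_name":  {"type": "text", "copy_to": "full_name"},
        "full_name":  {"type": "text"},
    }
}

query = {"query": {"match": {"full_name": "john smith"}}}
```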

2. When querying date fields, note that queries using now are in fact never served from the query cache; so consider from a business point of view whether now is really necessary — after all, the query cache can greatly improve query efficiency.

3. The result-set size of a query must not be set to an absurdly large value — for example, never call query.setSize(Integer.MAX_VALUE) — because internally ES must allocate a data structure sized to hold the specified number of results.

4. Try to avoid scripts; if you must use them, choose the painless or expressions engine. Once a query uses a script, be sure to control what it returns, and never write an infinite loop, because ES has no timeout control over running scripts: as long as the current script has not finished executing, the query stays blocked.


5. Avoid aggregation queries nested too deep; a too-deep group-by consumes a lot of memory and CPU. It is recommended to assemble the result in the business/service layer through program logic, or to optimize via the pipeline aggregation approach.

6. Reuse pre-indexed data to improve aggregation performance: for example, replace range aggregations with terms aggregations. Say the goal is to group by age bracket — juvenile (under 14), youth (14-28), middle-aged (29-50), elderly (51 and over): you can add an age_group field at index time and classify the data in advance. Then, instead of running a range aggregation on age, simply run a terms aggregation on the age_group field.
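The pre-indexing idea can be sketched as a classifier applied at index time, alongside the two aggregation bodies it trades between (bucket labels follow the article; the agg name "ages" is a made-up example):

```python
def age_group(age: int) -> str:
    """Pre-compute the bucket at index time instead of at query time."""
    if age < 14:
        return "juvenile"
    if age <= 28:
        return "youth"
    if age <= 50:
        return "middle-aged"
    return "elderly"

# What the article advises against: bucketing at query time.
range_agg = {"aggs": {"ages": {"range": {
    "field": "age",
    "ranges": [{"to": 14}, {"from": 14, "to": 29},
               {"from": 29, "to": 51}, {"from": 51}],
}}}}

# The cheaper alternative over the pre-computed field.
terms_agg = {"aggs": {"ages": {"terms": {"field": "age_group"}}}}
```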

7. Cache settings and usage:

a. QueryCache: when ES executes a query, filter clauses use the query cache; if the business scenario involves many filter queries, it is recommended to make the query cache larger to speed up searches.

indices.queries.cache.size: 10% (default) — this can be set as a percentage, or as a specific value such as 256mb.

The query cache can also be disabled (it is enabled by default) by setting index.queries.cache.enabled: false.

b. FieldDataCache: the field data cache is used heavily during sorting and aggregation, so in scenarios with a lot of aggregation or sorting it is necessary to size it via indices.fielddata.cache.size: 30%, or a specific value such as 10GB. But if the data in the scenario changes frequently, setting this cache is not a good idea, because the cost of loading the cache is particularly high.

c. ShardRequestCache: after a query request is initiated, each shard returns its results to the coordinating node (Coordinating Node), which integrates them into the final result.

If the scenario warrants it, this cache can be enabled by setting

index.requests.cache.enable: true

Note, however, that the shard request cache only caches data of the hits.total, aggregations, and suggestions types; it does not cache the content of hits. The size of the cache space can be controlled with

indices.requests.cache.size: 1% (default)
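A sketch collecting the cache settings above into node-level (elasticsearch.yml) and index-level groups (the specific sizes are illustrative examples, not recommendations from the article):

```python
# Node-level cache sizing (elasticsearch.yml equivalents).
node_level_caches = {
    "indices.queries.cache.size": "256mb",
    "indices.fielddata.cache.size": "30%",
    "indices.requests.cache.size": "2%",
}

# Index-level toggles.
index_level_caches = {
    "index.requests.cache.enable": True,
    "index.queries.cache.enabled": True,
}
```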

Origin blog.csdn.net/u013322876/article/details/90573535