ELK: installing the ik tokenizer plugin for Elasticsearch
Word segmentation
Word segmentation (analysis) divides a piece of text into individual keywords. When a user searches, the query is segmented; the data in the database (or index) is segmented as well, and matching is then performed between the two sets of terms.
By default, Elasticsearch treats every Chinese character as a separate word, which obviously does not meet real-world requirements, so we install the Chinese tokenizer ik to solve this problem.
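To see why a dictionary helps, here is a toy sketch (not IK's actual implementation) contrasting the default per-character splitting with a simple forward maximum-match over a tiny hypothetical dictionary:

```python
# Hypothetical mini-dictionary; a real analyzer like ik ships a large one.
DICTIONARY = {"中国", "中华人民共和国", "人民", "共和国"}

def split_per_char(text):
    """Default behaviour without ik: every character becomes a token."""
    return list(text)

def forward_max_match(text, dictionary, max_len=8):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary word starting there; fall back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(split_per_char("中华人民共和国"))                  # ['中', '华', '人', '民', '共', '和', '国']
print(forward_max_match("中华人民共和国", DICTIONARY))   # ['中华人民共和国']
```

The dictionary-based pass keeps meaningful words together, which is exactly what ik does (with far more sophisticated matching and a much larger dictionary).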
- Elasticsearch built-in analyzers
  - Standard: the default analyzer; splits on word boundaries and lowercases
  - Simple: splits on non-letters (symbols are filtered out) and lowercases
  - Stop: lowercases and filters stop words (the, a, is)
  - Whitespace: splits on whitespace, no lowercasing
  - Keyword: no segmentation; the input is emitted as a single token
  - Pattern: splits by regular expression, default \W+ (non-word characters)
  - Language: analyzers for more than 30 common languages
  - Custom Analyzer: a user-defined analyzer
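Any of these analyzers can be tried out with the `_analyze` API. For example, a console-style request using the standard analyzer (the index name and text are illustrative):

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "Hello Elasticsearch"
}
```

The standard analyzer would return the lowercased tokens `hello` and `elasticsearch`.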
ik tokenizer
IK offers two granularities:
ik_smart
: performs the coarsest-grained split
ik_max_word
: splits the text at the finest granularity
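The difference is easiest to see with the `_analyze` API. For a phrase like 中华人民共和国, the two modes typically behave as follows (exact output depends on the ik version and dictionary):

```json
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国"
}
```

`ik_smart` tends to return the whole phrase as a single token, while the same request with `"analyzer": "ik_max_word"` returns every dictionary word it can find inside the phrase (中华人民共和国, 中华人民, 中华, 华人, 人民共和国, and so on).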
1. Download the ik tokenizer
The version of the ik tokenizer must match the version of your Elasticsearch installation.
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.1/elasticsearch-analysis-ik-6.4.1.zip
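As an alternative to downloading and unzipping by hand, the ik release zip can also be installed with the `elasticsearch-plugin` tool shipped with ES (run from the ES installation directory; this handles steps 2 and the directory layout for you):

```shell
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.1/elasticsearch-analysis-ik-6.4.1.zip
```

If you use this route, skip the manual unzip step below and go straight to restarting Elasticsearch.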
2. Unzip and copy the files into the ES installation directory, under plugins/ik
unzip elasticsearch-analysis-ik-6.4.1.zip -d <es install directory>/plugins/ik
File structure
3. Restart Elasticsearch
4. Test the results
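A quick way to test is to call the `_analyze` API with one of the ik analyzers, for example with curl (assuming ES is listening on localhost:9200):

```shell
curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "ik_max_word",
  "text": "朱葛小明"
}'
```

Before any custom dictionary is configured, a name like this is not in ik's shipped dictionary, so it will be split into fragments rather than kept as one token.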
If the segmentation does not meet expectations, the ik tokenizer supports custom dictionaries. For example, [朱葛小明] here is a person's name, so I can add it to a custom dictionary.
(1) Modify the original dictionary: edit the default dictionary file main.dic directly.
(2) Create a new dictionary file: add a new file my.dic and put your own entries in it, one word per line. Then modify the IKAnalyzer.cfg.xml file in the ik/config directory so it points to your own dictionary file.
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- Users can configure their own extension dictionary here -->
<entry key="ext_dict">my.dic</entry>
<!-- Users can configure their own extension stop-word dictionary here -->
<entry key="ext_stopwords"></entry>
</properties>
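The dictionary file itself is a plain UTF-8 text file with one entry per line. For this example, my.dic would contain:

```
朱葛小明
```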
After the change, restart Elasticsearch for it to take effect.
Effect after modification: