ELK -- Elasticsearch: installing the ik tokenizer plugin


Word segmentation

Word segmentation divides a piece of text into individual keywords. When we search, our query string is segmented, the data in the database or index is segmented as well, and the two sets of terms are then matched against each other.
By default, Elasticsearch segments Chinese text one character at a time, which obviously does not meet real-world requirements, so we install the Chinese tokenizer ik to solve this problem.

Elasticsearch's built-in tokenizers:

  • Standard - the default tokenizer; splits on word boundaries and lowercases
  • Simple - splits on non-letter characters (symbols are filtered out), lowercases
  • Stop - lowercases and filters out stop words (the, a, is)
  • Whitespace - splits on spaces, does not lowercase
  • Keyword - no segmentation; the input is output as a single term
  • Pattern - splits on a regular expression, default \W+ (non-word characters)
  • Language - analyzers for more than 30 common languages
  • Custom Analyzer - a user-defined tokenizer
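
As a quick sanity check, the behavior of any built-in analyzer can be inspected with the _analyze API. A minimal sketch, assuming Elasticsearch is listening on localhost:9200:

curl -X POST "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "Hello, Elasticsearch World"
}'

The response lists the tokens produced; with standard, the text above comes back lowercased and split on word boundaries (hello, elasticsearch, world).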

ik tokenizer

IK offers two segmentation granularities (compared in the sketch after this list):

  1. ik_smart: performs the coarsest-grained split
  2. ik_max_word: splits the text at the finest granularity
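
Once the plugin is installed (steps below), the two granularities can be compared with the same _analyze API; a sketch, again assuming localhost:9200:

curl -X POST "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}'

With ik_smart this text typically stays as the single term 中华人民共和国, while ik_max_word additionally emits overlapping sub-terms such as 中华, 人民, and 共和国.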

1. Download the ik tokenizer

The version of the IK tokenizer must match the version of ES you have installed.

wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.1/elasticsearch-analysis-ik-6.4.1.zip

2. Unzip and copy the files into plugins/ik under the ES installation directory

# $ES_HOME stands for your Elasticsearch installation directory
unzip elasticsearch-analysis-ik-6.4.1.zip -d $ES_HOME/plugins/ik

File structure:
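
An approximate sketch of the unpacked plugins/ik directory (names based on the 6.4.1 release; treat the exact jar list as an assumption):

plugins/ik/
├── config/                              # dictionaries (main.dic, stopword.dic, ...) and IKAnalyzer.cfg.xml
├── elasticsearch-analysis-ik-6.4.1.jar  # the plugin itself
├── plugin-descriptor.properties
└── (dependency jars such as httpclient and commons-codec)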

3. Restart Elasticsearch
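How you restart depends on how ES runs; for a plain tarball install, a minimal sketch (the pkill pattern and the -d daemon flag are assumptions about your setup):

# stop the running node, then start it again in the background
pkill -f org.elasticsearch.bootstrap.Elasticsearch
$ES_HOME/bin/elasticsearch -d

On startup, the log should show the analysis-ik plugin being loaded.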

4. Test the results
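
A minimal test sketch against the _analyze API (the sample sentence is just an illustration):

curl -X POST "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "ik_smart",
  "text": "朱葛小明在学习Elasticsearch"
}'

With the stock dictionary, a name like 朱葛小明 is usually split into fragments rather than kept whole.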

If the result does not meet expectations, the ik tokenizer supports custom dictionaries. For example, 朱葛小明 here is a person's name, so I can add it to a dictionary of my own. There are two approaches:

(1) Modify the original dictionary: edit the default dictionary file main.dic directly.

(2) Create a new dictionary file: add a new file my.dic and put your own terms in it, one word per line. Then edit the IKAnalyzer.cfg.xml file in the ik/config directory so that it points to your own dictionary file, as shown below.
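
For this example, my.dic contains a single line:

朱葛小明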

<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionary here -->
    <entry key="ext_dict">my.dic</entry>
    <!-- Users can configure their own extension stop-word dictionary here -->
    <entry key="ext_stopwords"></entry>
</properties>

After the change, restart ES for it to take effect.

Effect after modification:
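
To verify, re-running the _analyze request from step 4 (same assumptions as above) should now return 朱葛小明 as a single token:

curl -X POST "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "ik_smart",
  "text": "朱葛小明"
}'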
