Elasticsearch Study Notes - Chapter 5: Document Analysis in Detail

5. Document Analysis

In Elasticsearch, the text content of a document is processed by an analyzer, which tokenizes and normalizes it. This process is called document analysis.

5.1 What is an analyzer?

In Elasticsearch, an analyzer is a software module responsible for two main functions: tokenization (word segmentation) and normalization.

  • Tokenization

    Tokenization is the process of splitting a sentence into individual words (tokens) according to certain rules.

  • Normalization

    Normalization is the process of transforming, modifying, and enriching the tokens produced by tokenization, using features such as stemming, synonyms, and stop words.

An analyzer consists of exactly one tokenizer, zero or more character filters, and zero or more token filters. Its structure is shown in the figure below.

Insert image description here

  • character filter

    Processes the input text at the character level.

  • tokenizer

    Takes the character stream produced by the character filters and splits the text into terms according to its rules.

  • term filter

    Further processes the terms produced by the tokenizer.

Character filters, tokenizers, and term filters will be introduced in detail in subsequent chapters.

  • The overall workflow of an analyzer
    1. The text is first filtered and processed by the character filters.
    2. The tokenizer splits the filtered text into terms according to the specified rules.
    3. The term filters further process the terms produced in step 2.
    4. The resulting terms are added to the inverted index.
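
    As a quick preview of this pipeline, the _analyze API lets us combine all three stages in a single request (each component is covered in detail in the following sections); a minimal sketch:

    // POST http://localhost:9200/_analyze
    // request body (sketch)
    {
        "char_filter": ["html_strip"],   // 1. strip HTML tags
        "tokenizer": "standard",         // 2. split into terms
        "filter": ["lowercase"],         // 3. lowercase each term
        "text": "<h1>I Like Steak</h1>"
    }
    // expected terms (sketch): ["i", "like", "steak"]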

5.2 Character filter

5.2.1 Function

Character filters are applied at the character level; each character of the text passes through these filters in order. They can perform the following functions:

  • Remove unwanted characters from the input stream

    For example, HTML tags such as <h1> or <div> can be stripped from the input text.

  • Add or replace other characters in an existing stream

    For example, Greek letters can be replaced with their equivalent English words, or 0 and 1 can be replaced with false and true.

Elasticsearch provides three character filters that we can use to build our own custom analyzers.

5.2.2 html_strip character filter

  • effect

    The html_strip character filter strips HTML elements (for example <h1>) from the text and decodes HTML entities such as &amp;.

  • Example

    Use the html_strip character filter to process the following text

    <h1>html_strip & test hellow</h1>
    

    After the text passes through this character filter, only the text "html_strip & test hellow" remains. We can send a GET/POST request to http://localhost:9200/_analyze with the following parameters in the request body:

    {
        "text": "<h1>html_strip & test hellow</h1>",  // the text to analyze
        "tokenizer": "standard",                      // the standard tokenizer; covered in a later section
        "char_filter": ["html_strip"]                 // the character filter to apply
    }
    

    The results can be obtained as follows:

    Insert image description here

    // response
    {
        "tokens": [
            {
                "token": "html_strip",
                "start_offset": 4,
                "end_offset": 14,
                "type": "<ALPHANUM>",
                "position": 0
            },
            {
                "token": "test",
                "start_offset": 17,
                "end_offset": 21,
                "type": "<ALPHANUM>",
                "position": 1
            },
            {
                "token": "hellow",
                "start_offset": 22,
                "end_offset": 28,
                "type": "<ALPHANUM>",
                "position": 2
            }
        ]
    }
    

    As the response shows, after being processed by the html_strip character filter and the standard tokenizer, Elasticsearch splits the input text into three terms, and the HTML elements have been removed.

    The standard tokenizer splits the text into individual words and drops special characters such as &; it is explained in a later section.

  • Keep some HTML elements

    By default, the html_strip character filter removes all HTML elements from the text. It also supports keeping certain HTML elements: the escaped_tags parameter specifies the HTML elements to retain.

    For example, we enter the following text

    <pre><h1>html_strip & test hellow</h1></pre>
    

    We want the following text to remain after passing the html_strip character filter

    <pre>html_strip & test hellow</pre>
    

    In this case we can no longer use the html_strip character filter directly; we need to define a custom analyzer. (Each parameter will be explained in detail later; here we focus on the escaped_tags parameter of the html_strip character filter.)

    Create an index using a custom analyzer

    // PUT http://localhost:9200/character-filter-test
    // request body
    {
        "settings": {
            "analysis": {
                "analyzer": {                                    // analyzers defined in this index
                    "test_html_strip_filter_analyzer": {         // name of the custom analyzer
                        "tokenizer": "keyword",
                        "char_filter": ["my_html_strip_filter"]  // character filters used by this analyzer
                    }
                },
                "char_filter": {                                 // character filters defined in this index
                    "my_html_strip_filter": {                    // name of the custom character filter
                        "type": "html_strip",                    // the filter type is html_strip
                        "escaped_tags": ["pre"]                  // HTML elements to keep
                    }
                }
            }
        }
    }
    

    Analyze text using a custom analyzer

    // POST http://localhost:9200/character-filter-test/_analyze
    // request body
    {
        "text": "<pre><h1>html_strip & test hellow</h1></pre>",  // the text to analyze
        "analyzer": "test_html_strip_filter_analyzer"            // the analyzer to use
    }
    

    The result is as follows:

    Insert image description here

    As the result shows, the h1 tag has been removed, while the pre tag is kept.

    Note: the result still contains & because our analyzer uses the keyword tokenizer, which is introduced in detail later. The point here is simply that the pre tag is retained.
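
    Based on the configuration above, the expected response would look roughly like this (a sketch; the keyword tokenizer emits the whole text as one term, with the offsets omitted):

    // expected response (sketch)
    { "tokens": [ { "token": "<pre>html_strip & test hellow</pre>", "position": 0 } ] }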

5.2.3 mapping character filter

  • effect

    The Mapping character filter replaces any occurrence of the specified string with the specified replacement string.

  • Example

    Replace "CN" in the entered text content with "China".

    // POST http://localhost:9200/_analyze
    // request body
    {
        "text": "I come from CN",
        "tokenizer": "keyword",
        "char_filter": [{
            "type": "mapping",
            "mappings": [
                "CN => 中国"
            ]
        }]
    }
    

    The result is as follows:

    Insert image description here

    You can see that "CN" in the text has been replaced with "中国".

  • Use a custom analyzer

    In practice, the mapping character filter is generally used as part of a custom analyzer.

    Create index with custom mapping character filter

    // PUT http://localhost:9200/mapping-filter-test
    // request body
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "test_mapping_filter_analyzer": {
                        "tokenizer": "keyword",
                        "char_filter": ["my_mapping_filter"]
                    }
                },
                "char_filter": {
                    "my_mapping_filter": {
                        "type": "mapping",
                        "mappings": [
                            "CN => 中国",
                            "HUBEI => 湖北",
                            "WUHAN => 武汉"
                        ]
                    }
                }
            }
        }
    }
    

    Test the analyzer

    // GET http://localhost:9200/mapping-filter-test/_analyze
    // request body
    {
        "text": "I come from WUHAN,HUBEI,CN",
        "analyzer": "test_mapping_filter_analyzer"
    }
    

    The result is as follows:

    Insert image description here

    As you can see, the input text has been replaced according to the mappings we configured.

  • Use configuration files

    When there are many mappings, maintaining them inline in the index settings becomes inconvenient. Elasticsearch therefore supports reading mappings from a file in a specified directory via the mappings_path parameter of the mapping character filter.

    Create an analysis folder in the config subdirectory of the Elasticsearch installation directory, place a my_mapping.txt file in it, and maintain the mappings in that file, as shown in the figure below.

    Insert image description here
    Insert image description here

    Note: the txt file must be UTF-8 encoded, and each mapping must be on its own line.
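
    The screenshots are not reproduced here; assuming the file carries the same mappings as the inline example above, my_mapping.txt would contain:

    CN => 中国
    HUBEI => 湖北
    WUHAN => 武汉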

    Create an index for testing

    // PUT http://localhost:9200/mapping-filter-test2
    // request body
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "test_mapping_filter_analyzer": {
                        "tokenizer": "keyword",
                        "char_filter": ["my_mapping_filter"]
                    }
                },
                "char_filter": {
                    "my_mapping_filter": {
                        "type": "mapping",
                        "mappings_path": "analysis/my_mapping.txt"
                    }
                }
            }
        }
    }
    

    mappings_path specifies the configuration file to read; the path can be either absolute or relative to the config directory.

    // GET http://localhost:9200/mapping-filter-test2/_analyze
    // request body
    {
        "text": "I come from WUHAN,HUBEI,CN",
        "analyzer": "test_mapping_filter_analyzer"
    }
    

    The result is as follows:

    Insert image description here

5.2.4 pattern_replace character filter

  • effect

    The pattern_replace character filter replaces characters that match a regular expression with the specified replacement string.

  • Example

    Here we demonstrate with a custom analyzer directly, replacing a phone number in the text with the Chinese phrase "这是一段电话号码" ("this is a phone number").

    // PUT http://localhost:9200/pattern-replace-filter-test
    // request body
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "test_pattern_replace_filter_analyzer": {
                        "tokenizer": "keyword",
                        "char_filter": ["my_pattern_replace_filter"]
                    }
                },
                "char_filter": {
                    "my_pattern_replace_filter": {
                        "type": "pattern_replace",
                        "pattern": "^(13[0-9]|14[01456879]|15[0-35-9]|16[2567]|17[0-8]|18[0-9]|19[0-35-9])\\d{8}$",
                        "replacement": "这是一段电话号码"
                    }
                }
            }
        }
    }
    

    Test

    // GET http://localhost:9200/pattern-replace-filter-test/_analyze
    // request body
    {
        "text": "13111111111",
        "analyzer": "test_pattern_replace_filter_analyzer"
    }
    

    The result is as follows:

    Insert image description here
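
    Since the screenshot is not reproduced here, the expected result is roughly as follows (a sketch; the whole input matches the pattern and is replaced before the keyword tokenizer runs):

    // expected response (sketch)
    { "tokens": [ { "token": "这是一段电话号码", "position": 0 } ] }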

5.3 Tokenizer

5.3.1 Function

A tokenizer is a component for analyzing and processing text. Its basic job is to split the original text into smaller-grained terms according to pre-established rules; the granularity depends on the tokenizer's rules.

It works after the character filter (if present).

Elasticsearch has a number of built-in tokenizers. We will introduce several commonly used ones in detail; for the others, please see the official documentation.

5.3.2 standard tokenizer

  • effect

    This tokenizer divides text into terms at the word boundaries defined by the Unicode Text Segmentation algorithm, and it removes most punctuation marks.

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I like steak !",
        "tokenizer": "standard"
    }
    

    The result is as follows:

    Insert image description here

    As you can see, each word becomes its own term, and the exclamation mark is removed.

5.3.3 letter tokenizer

  • effect

    This tokenizer splits the text whenever it encounters a non-letter character, and the non-letter characters themselves are dropped.

    Note: Chinese characters are treated as letters, so consecutive Chinese characters are not split apart.

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I,like&steak和西红柿!and#you",
        "tokenizer": "letter"
    }
    

    The result is as follows:

    Insert image description here
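
    The screenshot is not reproduced here; the expected terms are roughly:

    // expected terms (sketch; case is preserved and "steak和西红柿" stays together)
    ["I", "like", "steak和西红柿", "and", "you"]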

5.3.4 whitespace tokenizer

  • effect

    This tokenizer splits the text whenever it encounters whitespace.

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I like&steak!",
        "tokenizer": "whitespace"
    }
    

    The result is as follows:

    Insert image description here
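
    The expected terms are roughly as follows; "like&steak!" stays together because only whitespace triggers a split:

    // expected terms (sketch)
    ["I", "like&steak!"]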

5.3.5 lowercase tokenizer

  • effect

    This tokenizer is similar to the letter tokenizer: it splits on non-letter characters and drops them, and it also converts each resulting term to lowercase.

    It is functionally equivalent to the letter tokenizer combined with the lowercase term filter, but more efficient because it does both in a single pass.

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I,like&steak和西红柿!and#you",
        "tokenizer": "lowercase"
    }
    

    The result is as follows:

    Insert image description here
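
    The expected terms are roughly the same as for the letter tokenizer, but lowercased:

    // expected terms (sketch)
    ["i", "like", "steak和西红柿", "and", "you"]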

5.3.6 classic tokenizer

  • effect

    This tokenizer segments text based on English grammar rules and is well suited to English documents. It handles acronyms, company names, email addresses, and Internet host names well.

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "My mail is [email protected]",
        "tokenizer": "classic"
    }
    

    The result is as follows:

    Insert image description here

5.3.7 keyword tokenizer

  • effect

    This tokenizer is a "noop" tokenizer that takes any text given and outputs the exact same text as a single term. It can be combined with term filters to normalize output, such as lowercase email addresses.

    That is, treat the input text as a whole

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I like steak !",
        "tokenizer": "keyword"
    }
    

    The result is as follows:

    Insert image description here
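
    The expected output is a single term identical to the input. To normalize the output as mentioned above, the keyword tokenizer can be combined with a term filter; a sketch:

    // GET http://localhost:9200/_analyze
    // request body (sketch)
    {
        "text": "I like steak !",
        "tokenizer": "keyword",
        "filter": ["lowercase"]
    }
    // expected term (sketch): "i like steak !"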

5.3.8 pattern tokenizer

  • effect

    This tokenizer uses a regular expression either to split text into terms whenever the pattern matches a word separator, or to capture matching text as terms.

    The default pattern is \W+, which splits the text whenever non-word characters are encountered.

  • Example

    Split text into terms when matching word separators

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "test,pattern,tokenizer!!!",
        "tokenizer": {
            "type": "pattern",
            "pattern": ","
        }
    }
    

    According to "," word segmentation

    The result is as follows:

    Insert image description here

    Capture matching text as terms

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "\"test\"\"pattern,tokenizer!!!\"",
        "tokenizer": {
            "type": "pattern",
            "pattern": "\"((?:\\\\\"|[^\"]|\\\\\")+)\"",
            "group": 1
        }
    }
    

    This captures the values enclosed in double quotes.

    The group parameter refers to the capture group returned by the group() API of Java's Matcher object.

    The result is as follows:

    Insert image description here
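
    The expected terms are roughly the two quoted values, i.e. the text captured by group 1:

    // expected terms (sketch)
    ["test", "pattern,tokenizer!!!"]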

5.4 Term filter

5.4.1 Function

Term filters accept a stream of terms from the tokenizer and can modify terms (e.g. lowercase), remove terms (e.g. remove stop words), or add terms (e.g. synonyms).

Elasticsearch has a number of built-in term filters. We will introduce several commonly used ones in detail; for the others, please see the official documentation.

5.4.2 lowercase term filter

  • effect

    This term filter converts each term to lowercase, e.g. Man to man. It also supports language-specific lowercasing for Greek, Irish, and Turkish.

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I Like StEak !",
        "tokenizer": "keyword",
        "filter": ["lowercase"]
    }
    

    The result is as follows:

    Insert image description here

5.4.3 uppercase term filter

  • effect

    This term filter will change each term from lower case to upper case, for example, man to MAN.

    Notice:

    Depending on the language, one uppercase character can map to multiple lowercase characters, so using the uppercase filter can lose information. The lowercase term filter is therefore generally recommended instead.

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I Like StEak !",
        "tokenizer": "keyword",
        "filter": ["uppercase"]
    }
    

    The result is as follows:

    Insert image description here

5.4.4 length term filter

  • effect

    Removes terms that are shorter or longer than the specified character length. For example, you can exclude terms shorter than 2 characters and terms longer than 5 characters.

    Notice:

    This filter removes entire terms that do not meet the criteria. If you want to shorten terms to a specific length, use the truncate term filter .

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I Like StEak !",
        "tokenizer": "standard",
        "filter": [{
            "type": "length",
            "min": 2,
            "max": 4
        }]
    }
    

    The result is as follows:

    Insert image description here
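
    The standard tokenizer first produces "I", "Like", and "StEak"; the length filter then removes "I" (1 character) and "StEak" (5 characters), so the expected result is roughly:

    // expected terms (sketch)
    ["Like"]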

5.4.5 truncate term filter

  • effect

    This filter truncates terms that exceed the specified character limit. This limit defaults to 10, but can be customized using the length parameter.

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I Like StEak !",
        "tokenizer": "keyword",
        "filter": [{
            "type": "truncate",
            "length": 5
        }]
    }
    

    The result is as follows:

    Insert image description here
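
    The keyword tokenizer emits "I Like StEak !" as one term, which the filter truncates to its first 5 characters, so the expected result is roughly:

    // expected term (sketch)
    "I Lik"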

5.4.6 ASCII folding term filter

  • effect

    This filter converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if they exist. For example, the filter changes à to a.

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "açaí à la carte",
        "tokenizer": "keyword",
        "filter": ["asciifolding"]
    }
    

    The result is as follows:

    Insert image description here
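
    The expected result is roughly a single term with the accents folded away:

    // expected term (sketch)
    "acai a la carte"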

5.4.7 stop term filter

  • effect

    This filter removes stop words from terms.

    If there are no custom stop words, the filter will delete the following English stop words by default:

    a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

    In addition to English, the filter supports predefined stop word lists for many languages, and you can also specify your own stop words as an array or a file.

    Please see the official documentation for the detailed configuration.

  • Example

    Here is just a simple example.

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "a quick fox jumps over the lazy dog",
        "tokenizer": "standard",
        "filter": ["stop"]
    }
    

    The result is as follows:

    Insert image description here
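
    The expected terms are roughly the input words with "a" and "the" removed:

    // expected terms (sketch)
    ["quick", "fox", "jumps", "over", "lazy", "dog"]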

5.5 Common built-in analyzers

Elasticsearch provides some built-in analyzers for users to use.

Next, we will introduce some commonly used built-in analyzers. For more detailed built-in analyzers, please see the official documentation.

5.5.1 standard analyzer

  • effect

    This is the default analyzer; it is used when no analyzer is specified. It tokenizes text based on the Unicode Text Segmentation algorithm, removes most punctuation marks, lowercases terms, and supports removing stop words. It works for most languages, but its support for Chinese is poor: Chinese text is split character by character rather than by word.

  • composition

    • character filter

      none

    • tokenizer

      standard tokenizer

    • term filter

      lowercase term filter

      stop term filter (not enabled by default)

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I Like StEak !",
        "analyzer": "standard"
    }
    

    The result is as follows:

    Insert image description here
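
    The expected terms are roughly the lowercased words with the punctuation removed:

    // expected terms (sketch)
    ["i", "like", "steak"]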

5.5.2 simple analyzer

  • effect

    The analyzer breaks text into terms at any non-letter character (such as digits, spaces, hyphens, and apostrophes), discards the non-letter characters, and converts uppercase to lowercase.

  • composition

    • character filter

      none

    • tokenizer

      lowercase tokenizer

    • term filter

      none

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I Like &StEak 和西红柿 2333 !",
        "analyzer": "simple"
    }
    

    The result is as follows:

    Insert image description here
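
    The expected terms are roughly as follows; the digits and punctuation are dropped, and the consecutive Chinese characters stay together:

    // expected terms (sketch)
    ["i", "like", "steak", "和西红柿"]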

5.5.3 stop analyzer

  • effect

    This analyzer is the same as the simple analyzer, but adds support for removing stop words.

  • composition

    • character filter

      none

    • tokenizer

      lowercase tokenizer

    • term filter

      stop term filter

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I Like &StEak the 和西红柿 2333 !",
        "analyzer": "stop"
    }
    

    The result is as follows:

    Insert image description here

5.5.4 pattern analyzer

  • effect

    This analyzer uses a regular expression to split text into terms and converts the terms to lowercase. The regular expression should match the delimiters, not the terms themselves; it defaults to \W+ (all non-word characters).

    Note: Java regular expressions are used here. A badly written regular expression can run very slowly or even throw a StackOverflowError.

  • composition

    • character filter

      none

    • tokenizer

      pattern tokenizer

    • term filter

      lowercase term filter

      stop term filter (not enabled by default)

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I,Like&StEak,the,和西红柿,2333 !",
        "analyzer": "pattern"
    }
    

    The result is as follows:

    Insert image description here
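
    The expected terms are roughly as follows; with the default \W+ pattern, the Chinese characters are treated as delimiters and dropped:

    // expected terms (sketch)
    ["i", "like", "steak", "the", "2333"]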

5.5.5 whitespace analyzer

  • effect

    The analyzer breaks text into terms whenever it encounters whitespace characters.

  • composition

    • character filter

      none

    • tokenizer

      whitespace tokenizer

    • term filter

      none

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I Like &StEak the 和西红柿 2333 !",
        "analyzer": "whitespace"
    }
    

    The result is as follows:

    Insert image description here

5.5.6 keyword analyzer

  • effect

    This parser is a "noop" parser that returns the entire input string as a single term.

  • composition

    • character filter

      none

    • tokenizer

      keyword tokenizer

    • term filter

      none

  • Example

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "I Like &StEak the 和西红柿 2333 !",
        "analyzer": "keyword"
    }
    

    The result is as follows:

    Insert image description here

5.6 Custom Analyzer

When Elasticsearch's built-in analyzers cannot meet business needs, we can define custom analyzers to achieve business needs.

Custom analyzers are also composed of 0 or more character filters + a tokenizer + 0 or more term filters .

Commonly used configuration parameters of the analyzer are as follows:

| Parameter | Meaning |
| --- | --- |
| type | Analyzer type. Accepts built-in analyzer types. For a custom analyzer, use custom or omit this parameter. |
| tokenizer | The tokenizer to use; can be a built-in or a custom tokenizer. |
| char_filter | The character filters to use; can be built-in or custom. Accepts an array. |
| filter | The term filters to use; can be built-in or custom. Accepts an array. |

Below we will explain each configuration through examples.

// PUT http://localhost:9200/custom-analyzer-test
// request body
{
    "settings": {
        "analysis": {
            "analyzer": {                                   // analyzers defined in this index
                "my_analyzer": {                            // name of the custom analyzer
                    "type": "custom",                       // type: custom analyzer
                    "tokenizer": "my_tokenizer",            // the custom tokenizer defined below
                    "char_filter": ["&_to_and"],            // character filters used by this analyzer
                    "filter": ["my_stopwords", "lowercase"] // term filters used by this analyzer
                }
            },
            "char_filter": {                                // character filters defined in this index
                "&_to_and": {                               // name of the custom character filter
                    "type": "mapping",                      // the filter type is mapping
                    "mappings": ["& => and"]
                }
            },
            "tokenizer": {                                  // tokenizers defined in this index
                "my_tokenizer": {
                    "type": "pattern",
                    "pattern": ","
                }
            },
            "filter": {                                     // term filters defined in this index
                "my_stopwords": {
                    "type": "stop",
                    "stopwords": ["the", "a"]
                }
            }
        }
    }
}

Test a custom analyzer

// GET http://localhost:9200/custom-analyzer-test/_analyze
// request body
{
    "text": "I,Like & StEak,the,和西红柿,2333 !",
    "analyzer": "my_analyzer"
}

The result is as follows:

Insert image description here
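
The pipeline runs as follows: the character filter turns "&" into "and", the pattern tokenizer splits on ",", the stop filter drops "the", and the lowercase filter lowercases the rest. The expected terms are therefore roughly:

// expected terms (sketch)
["i", "like and steak", "和西红柿", "2333 !"]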

The workflow diagram of this custom analyzer is shown in the figure below.

Insert image description here

5.7 Chinese analyzers

5.7.1 Elasticsearch’s analysis support for Chinese

The analyzers that ship with Elasticsearch do not support Chinese well: they cannot automatically recognize Chinese phrases such as 学习 (study) or 学校 (school).

For example, analyze "我要去学校学习" ("I am going to school to study") with the standard analyzer.

// GET http://localhost:9200/_analyze
// request body
{
    "text": "我要去学校学习",
    "analyzer": "standard"
}

The result is as follows:

Insert image description here
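
The expected output is roughly one term per Chinese character:

// expected terms (sketch)
["我", "要", "去", "学", "校", "学", "习"]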

As the result shows, the analyzers provided by Elasticsearch cannot meet our needs for Chinese word segmentation.

Therefore, the IK tokenizer is commonly used as an extension.

5.7.2 Introduction to the IK tokenizer

The IK tokenizer is a free, open-source Java tokenizer and one of the more popular Chinese tokenizers today. It is simple and stable, and it supports custom dictionaries; however, for particularly good results you need to maintain your own vocabulary.

5.7.3 IK tokenizer - installation

  • download link

    https://github.com/medcl/elasticsearch-analysis-ik

    Note: you must choose the IK tokenizer version that matches your Elasticsearch version.

  • deploy

    Unzip the downloaded IK tokenizer into the plugins subdirectory of the Elasticsearch installation directory.

Insert image description here

After decompression, restart Elasticsearch.

Note: If the error is as follows when starting:

java.lang.IllegalStateException: Could not load plugin descriptor for plugin directory [commons-codec-1.9.jar]
	at org.elasticsearch.plugins.PluginsService.readPluginBundle(PluginsService.java:403) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.plugins.PluginsService.findBundles(PluginsService.java:388) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.plugins.PluginsService.getPluginBundles(PluginsService.java:381) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:152) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.node.Node.<init>(Node.java:317) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.node.Node.<init>(Node.java:266) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:227) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:227) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:393) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127) [elasticsearch-cli-7.8.0.jar:7.8.0]
	at org.elasticsearch.cli.Command.main(Command.java:90) [elasticsearch-cli-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) [elasticsearch-7.8.0.jar:7.8.0]
Caused by: java.nio.file.NoSuchFileException: D:\Software\Work\elasticsearch-7.8.0\plugins\commons-codec-1.9.jar\plugin-descriptor.properties
	at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:79) ~[?:?]
	at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[?:?]
	at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102) ~[?:?]
	at sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:230) ~[?:?]
	at java.nio.file.Files.newByteChannel(Files.java:361) ~[?:1.8.0_91]
	at java.nio.file.Files.newByteChannel(Files.java:407) ~[?:1.8.0_91]
	at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384) ~[?:1.8.0_91]
	at java.nio.file.Files.newInputStream(Files.java:152) ~[?:1.8.0_91]
	at org.elasticsearch.plugins.PluginInfo.readFromProperties(PluginInfo.java:156) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.plugins.PluginsService.readPluginBundle(PluginsService.java:400) ~[elasticsearch-7.8.0.jar:7.8.0]
	... 15 more
[2023-04-18T17:23:45,388][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [LAPTOP-AN1JMLBC] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: Could not load plugin descriptor for plugin directory [commons-codec-1.9.jar]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:174) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127) ~[elasticsearch-cli-7.8.0.jar:7.8.0]
	at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.8.0.jar:7.8.0]
Caused by: java.lang.IllegalStateException: Could not load plugin descriptor for plugin directory [commons-codec-1.9.jar]
	at org.elasticsearch.plugins.PluginsService.readPluginBundle(PluginsService.java:403) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.plugins.PluginsService.findBundles(PluginsService.java:388) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.plugins.PluginsService.getPluginBundles(PluginsService.java:381) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:152) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.node.Node.<init>(Node.java:317) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.node.Node.<init>(Node.java:266) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:227) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:227) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:393) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170) ~[elasticsearch-7.8.0.jar:7.8.0]
	... 6 more
Caused by: java.nio.file.NoSuchFileException: D:\Software\Work\elasticsearch-7.8.0\plugins\commons-codec-1.9.jar\plugin-descriptor.properties
	at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:79) ~[?:?]
	at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[?:?]
	at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102) ~[?:?]
	at sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:230) ~[?:?]
	at java.nio.file.Files.newByteChannel(Files.java:361) ~[?:1.8.0_91]
	at java.nio.file.Files.newByteChannel(Files.java:407) ~[?:1.8.0_91]
	at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384) ~[?:1.8.0_91]
	at java.nio.file.Files.newInputStream(Files.java:152) ~[?:1.8.0_91]
	at org.elasticsearch.plugins.PluginInfo.readFromProperties(PluginInfo.java:156) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.plugins.PluginsService.readPluginBundle(PluginsService.java:400) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.plugins.PluginsService.findBundles(PluginsService.java:388) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.plugins.PluginsService.getPluginBundles(PluginsService.java:381) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:152) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.node.Node.<init>(Node.java:317) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.node.Node.<init>(Node.java:266) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:227) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:227) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:393) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170) ~[elasticsearch-7.8.0.jar:7.8.0]
	... 6 more

This happens because the plugin's jar files were extracted directly into the plugins directory. Create an ik subdirectory under plugins, move the extracted files into it, and restart Elasticsearch.

5.7.4 IK tokenizer configuration

The config directory of the IK tokenizer contains the configuration files that ship with it.

Insert image description here

The meaning of each configuration is shown in the following table

| Configuration file | Meaning |
| --- | --- |
| main.dic | Main vocabulary |
| preposition.dic | Function words and modal particles (also, and, but, etc.) |
| stopword.dic | English stop words |
| quantifier.dic | Units of measurement and measure words |
| suffix.dic | Suffix words (province, city, institute, etc.) |
| surname.dic | Chinese surnames (the "Hundred Family Surnames") |
| extra_main.dic | Extended main vocabulary |
| extra_single_word.dic, extra_single_word_full.dic, extra_single_word_low_freq.dic | Extended word libraries |
| extra_stopword.dic | Extended stop words |
| IKAnalyzer.cfg.xml | IK tokenizer configuration file |

5.7.5 IK tokenizer - built-in analyzers

There are two analyzers built into the IK tokenizer: ik_max_word and ik_smart.

  • ik_max_word

    Splits the text at the finest granularity.

  • ik_smart

    Splits the text at the coarsest granularity.

  • Comparing the two analyzers with examples

    ik_max_word

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "我要去学校学习",
        "analyzer": "ik_max_word"
    }
    

    The result is as follows:

    Insert image description here

    ik_smart

    // GET http://localhost:9200/_analyze
    // request body
    {
        "text": "我要去学校学习",
        "analyzer": "ik_smart"
    }
    

    The result is as follows:

Insert image description here

5.7.6 IK tokenizer - extending the dictionary

When we need certain words to be treated as a specific phrase, we can add that phrase to a custom dictionary.

For example, "英雄联盟" (League of Legends) is not in the default dictionary, so the tokenizer splits it apart.

Let’s test it below

// GET http://localhost:9200/_analyze
// request body
{
    "text": "英雄联盟",
    "analyzer": "ik_smart"
}

The result is as follows:

Insert image description here

As you can see, "League of Legends" will be split into two words: "Heroes" and "League". This is not what we want. We want "League of Legends" to be treated as one term.

At this time, we can configure it in the IKAnalyzer.cfg.xml file in the config directory of the ik word segmenter.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- configure your own extension dictionaries here -->
	<entry key="ext_dict"></entry>
	<!-- configure your own extension stop word dictionaries here -->
	<entry key="ext_stopwords"></entry>
</properties>

We make the following configuration

<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- configure your own extension dictionaries here -->
	<entry key="ext_dict">mydic.dic</entry>
</properties>

Create the mydic.dic file in the config directory, write "英雄联盟" into it, and then restart Elasticsearch.

Next we test again

// GET http://localhost:9200/_analyze
// request body
{
    "text": "英雄联盟",
    "analyzer": "ik_smart"
}

The result is as follows:

Insert image description here

As you can see, "League of Legends" is treated as a term and will not be split.

Reference

[Silicon Valley] ElasticSearch tutorial from beginner to master (based on the new features of the ELK stack, Elasticsearch 7.x + 8.x)
