Table of Contents
Customize extension words and stop words in IK
ES underlying index principle
IK tokenizer
1. Definition: word segmentation splits a piece of text into keywords
Example text: "I am Xiao Ming's classmate"
Word segmentation principle: split the text into keywords and remove stop words
2. Built-in analyzers in ES
1. standard analyzer (the default): English is split into words; Chinese is split into single characters
2. simple analyzer: English is split into words and digits are removed; Chinese is not segmented
3. Test different tokenizers
GET /_analyze
{
  "analyzer": "simple",
  "text": "redis 非常好用 111"
}
- The standard analyzer produces: redis, 非, 常, 好, 用, 111
- The simple analyzer produces: redis, 非常好用
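The difference between the two built-in analyzers can be imitated in a few lines of Python. This is only a toy approximation for the sample text above, not the Lucene implementation: the function names `simple_like` and `standard_like` are hypothetical, and the CJK check covers only the basic ideograph range.

```python
def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs range (assumption: enough for this sample)."""
    return "\u4e00" <= ch <= "\u9fff"


def simple_like(text):
    """Toy stand-in for the simple analyzer: keep runs of letters, lowercase them, drop everything else."""
    tokens, run = [], []
    for ch in text:
        if ch.isalpha():
            run.append(ch.lower())
        elif run:
            tokens.append("".join(run))
            run = []
    if run:
        tokens.append("".join(run))
    return tokens


def standard_like(text):
    """Toy stand-in for the standard analyzer: word/number runs stay together, each Chinese character is its own token."""
    tokens, run = [], []
    for ch in text:
        if not is_cjk(ch) and ch.isalnum():
            run.append(ch.lower())
            continue
        if run:  # a run ends at whitespace, punctuation, or a Chinese character
            tokens.append("".join(run))
            run = []
        if is_cjk(ch):
            tokens.append(ch)
    if run:
        tokens.append("".join(run))
    return tokens


sample = "redis 非常好用 111"
print(standard_like(sample))  # ['redis', '非', '常', '好', '用', '111']
print(simple_like(sample))    # ['redis', '非常好用']
```

Neither result is useful for Chinese search: one splits every character apart and the other does not split at all, which is the gap a dictionary-based tokenizer such as IK fills.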
4. IK tokenizer: an open-source tokenizer for ES, available on GitHub
Note: the IK tokenizer version must match the ES version exactly
5. What is the difference between ik_max_word and ik_smart?
- ik_max_word: splits the text at the finest granularity, exhausting every possible combination. For example, "I am Xiao Ming's classmate" is split into "I am Xiao Ming's classmate", "I am", "I am Xiao Ming", "Xiao Ming's classmate", and "classmate".
- ik_smart: splits the text at the coarsest granularity. For example, "I am Xiao Ming's classmate" is split into "I am Xiao Ming's classmate".
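The contrast between finest-grained and coarsest-grained splitting can be sketched with a small dictionary matcher. This is a toy under stated assumptions, not IK's actual algorithm or dictionary: the functions `finest` and `coarsest`, the sample text, and the word list are all hypothetical, and greedy longest-match is used here only as a simple stand-in for coarse-grained splitting.

```python
def finest(text, dictionary):
    """Emit every dictionary word found at any position, overlaps included."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                found.append(text[start:end])
    return found


def coarsest(text, dictionary):
    """Greedy forward longest match; characters not covered by the dictionary become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        end = i + 1  # fall back to a single character
        for candidate in range(len(text), i, -1):
            if text[i:candidate] in dictionary:
                end = candidate
                break
        tokens.append(text[i:end])
        i = end
    return tokens


# Hypothetical dictionary and text, chosen only to make the overlap visible.
words = {"中华", "人民", "共和国", "中华人民", "人民共和国", "中华人民共和国"}
text = "中华人民共和国"

print(finest(text, words))    # ['中华', '中华人民', '中华人民共和国', '人民', '人民共和国', '共和国']
print(coarsest(text, words))  # ['中华人民共和国']
```

The fine-grained output favors recall, since any sub-word can match a search term, while the coarse-grained output yields fewer, longer tokens.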
PUT /emp
{
  "mappings": {
    "emp": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "ik_max_word"
        },
        "age": {
          "type": "integer"
        },
        "bir": {
          "type": "date"
        },
        "content": {
          "type": "text",
          "analyzer": "ik_max_word"
        },
        "address": {
          "type": "keyword"
        }
      }
    }
  }
}
Customize extension words and stop words in IK
1. Extension words
Definition: a word that the stock IK tokenizer does not segment as a keyword, but that you want it to treat as a keyword, such as popular internet slang
Configuration: edit the IK configuration file IKAnalyzer.cfg.xml, located in the /plugins/ik/config directory under the ES installation directory
Modify the configuration file to add the following configuration:
<!-- Users can configure their own extension dictionary here -->
<entry key="ext_dict">ext.dic</entry>
2. Stop words
Definition: a word that the stock IK tokenizer segments as a keyword, but that for some reason must not appear as a keyword
Add the following entry to the same configuration file:
<entry key="ext_stopwords">stopext.dic</entry>
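The effect of the two dictionaries can be illustrated with a toy segmenter. This is a sketch under assumptions, not IK's code: the `segment` function, the one-word base dictionary, and the sample text are hypothetical, and the two Python sets merely stand in for the contents of ext.dic and stopext.dic.

```python
def segment(text, dictionary, stop_words=frozenset()):
    """Greedy longest-match segmentation that drops any token listed in stop_words."""
    tokens, i = [], 0
    while i < len(text):
        end = i + 1  # unknown characters fall back to single-character tokens
        for candidate in range(len(text), i, -1):
            if text[i:candidate] in dictionary:
                end = candidate
                break
        token = text[i:end]
        if token not in stop_words:
            tokens.append(token)
        i = end
    return tokens


base_dictionary = {"同学"}   # hypothetical stand-in for the built-in dictionary
extension_words = {"小明"}   # stand-in for the contents of ext.dic
stop_words = {"的"}          # stand-in for the contents of stopext.dic
text = "小明的同学"          # "Xiao Ming's classmate"

# Without the extension word, the name is broken into single characters.
print(segment(text, base_dictionary))                                # ['小', '明', '的', '同学']
# With the extension word, the name becomes one keyword.
print(segment(text, base_dictionary | extension_words))              # ['小明', '的', '同学']
# With the stop word as well, the particle no longer appears as a keyword.
print(segment(text, base_dictionary | extension_words, stop_words))  # ['小明', '同学']
```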
3. Configure remote extension dictionary
Queries in ES
1. Query: query string or Query DSL
Keyword query: calculates document scores, sorts the results, and so on
2. Filter query: relatively efficient
Selects the documents that meet the conditions without calculating document scores or sorting; ES automatically caches the results of commonly used filters
A bool query must be used to combine the two.
Note: when a filter and a query are used together, the filter clause is executed first, and then the query clause is executed.
Filters are suited to narrowing down a large range of data, while queries are suited to matching data precisely. In general, filter the data first and then run the query against what remains.
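A combined request might look like the sketch below, which builds the request body as a Python dictionary and prints it as JSON. The helper name `build_search_body` and the sample values are hypothetical; the `content` and `age` fields come from the emp mapping above, and the body is meant to be sent to the index's `_search` endpoint.

```python
import json


def build_search_body(keyword, min_age, max_age):
    """Build a bool query: a scored full-text match plus an unscored range filter."""
    return {
        "query": {
            "bool": {
                # Query context: contributes to the relevance score.
                "must": [{"match": {"content": keyword}}],
                # Filter context: only includes or excludes documents, no scoring.
                "filter": [{"range": {"age": {"gte": min_age, "lte": max_age}}}],
            }
        }
    }


body = build_search_body("redis", 20, 30)
print(json.dumps(body, indent=2, ensure_ascii=False))
```

Putting the age restriction in `filter` rather than `must` keeps it out of the score calculation, which matches the advice to filter first and query second.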