ES search framework: configuring the IK tokenizer

Elasticsearch's default Chinese tokenization is poor: even a slightly longer sentence cannot be matched properly. I therefore installed the IK Chinese tokenizer to handle word segmentation for the index.

References:

https://blog.csdn.net/w1014074794/article/details/119762827

https://www.bbsmax.com/A/6pdDqDaXzw/


1. Installation

Official tutorial:

https://github.com/medcl/elasticsearch-analysis-ik (pay attention to the version compatibility table)


1. Download

Download prebuilt packages from here: https://github.com/medcl/elasticsearch-analysis-ik/releases

According to the version matrix, I am running ES 7.10.2, so I need to download the matching ik 7.10.2 release (if the versions do not match, the IK tokenizer will not load).


2. Unzip

Create an ik directory under the plugins folder of the ES installation directory, extract the zip file into the ik directory, and then delete the zip.


3. Restart ES and test

(1) Output of the default (standard) tokenizer

GET /_analyze
{
  "analyzer": "standard",
  "text": "中华人民共和国"
}


(2) Output of the IK tokenizer

① ik_max_word

Performs the finest-grained segmentation, exhausting all possible word combinations; suitable for term queries.

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}


② ik_smart

Performs the coarsest-grained segmentation; suitable for phrase queries.

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国"
}

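Once an index exists whose fields are mapped to the IK analyzer (as set up in the next section), the same _analyze API can also be pointed at a concrete field to confirm which analyzer is actually applied. A minimal sketch, assuming a hypothetical index_name with a content field (not from the original project):

GET /index_name/_analyze
{
  "field": "content",
  "text": "中华人民共和国"
}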

2. Using IK in the project

Delete the old index, create a new index that specifies the tokenizer (set the tokenizer on the relevant fields), re-import the data, and then test the retrieval results.

Detailed explanation of index fields: https://www.cnblogs.com/hld123/p/16538466.html

The fields property allows the same text to be indexed in multiple different ways. For example, a String field city can use the text type for full-text search while a keyword sub-field is used for aggregation and sorting.

PUT index_name
{
  "mappings": {         # mappings definition
    "properties": {     # field properties, fixed keyword
      "city": {         # field name
        "type": "text", # the city field is of type text
        "fields": {     # multi-field definition, fixed keyword
          "raw": {      # sub-field name
            "type": "keyword",   # sub-field type
            "ignore_above": 256  # for keyword fields, ignore_above (256 by default under dynamic mapping) is the maximum value length that will be indexed; longer values are still stored but not indexed, so they cannot be found by search
          }
        }
      }
    }
  }
}

Use the fine-grained mode at index time by setting the analyzer property to ik_max_word; use the smart mode at query time by setting the search_analyzer property to ik_smart.
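As a minimal sketch of how the two analyzers combine on one field (index_name and the title field here are placeholders, not from the original project):

PUT index_name
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

With this mapping, documents are segmented exhaustively when indexed, while query strings are split coarsely, which tends to reduce noise in the results.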


1. Create a JSON object as the index mapping

Since the mapping involves many data types, it is kept in a JSON file and converted into a JSON object.

(1) pom.xml

<!-- convert a JSON file into a JSON object -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.54</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-io</artifactId>
    <version>1.3.2</version>
</dependency>

(2) JsonUtil

package org.project.es.common.util;

import java.io.InputStream;
import org.apache.commons.io.IOUtils;
import com.alibaba.fastjson.JSONObject;
/**
 * Converts a JSON file on the classpath into a JSONObject.
 * @author Administrator
 */
public class JsonUtil {
    public static JSONObject fileToJson(String fileName) {
        JSONObject json = null;
        try (
                InputStream is = Thread.currentThread().getContextClassLoader().getResourceAsStream(fileName);
        ) {
            json = JSONObject.parseObject(IOUtils.toString(is, "utf-8"));
        } catch (Exception e) {
            System.out.println(fileName + " could not be read: " + e);
        }
        return json;
    }
    public static void main(String[] args) {
        String fileName = "doc/policy.json";
        JSONObject json = JsonUtil.fileToJson(fileName);
        System.out.println(json);
    }
}
Output:


2. Create an index

public static void createIndex(RestHighLevelClient client, String index) throws IOException {
        // 1. Create the index request object
        CreateIndexRequest request = new CreateIndexRequest(index);
        // 2. Build the settings (basic index configuration) and add them to the request
        Settings setting = Settings.builder()
                // number of primary shards; this cannot be changed once the index is created
                .put("index.number_of_shards", 1)
                // refresh interval: how long until newly written data becomes visible to search.
                // Defaults to 1s; smaller values are closer to real time but noticeably slow down indexing.
                // Since near-real-time search is not required here and indexing speed matters more, it is set to 30s.
                .put("index.refresh_interval", "30s")
                // maximum number of shards of this index allowed on a single node
                .put("index.routing.allocation.total_shards_per_node", 3)
                // how often data is synced to disk; for performance, documents written to ES are first
                // buffered in memory and only written to disk once a flush is triggered
                .put("index.translog.sync_interval", "30s")
                // number of replicas per primary shard; this can be changed later
                .put("index.number_of_replicas", 1)
                // maximum number of records a single request may return
                .put("index.max_result_window", "10000000")
                .build();
        request.settings(setting);

        // 3. Use the utility class to convert the JSON mapping file into a JSON object and add it to the request
        JSONObject mapping = JsonUtil.fileToJson("doc/policy.json");
        request.mapping(mapping);

        // 4. Send the request and read the response
        CreateIndexResponse response = client.indices().create(request, RequestOptions.DEFAULT);
        boolean acknowledged = response.isAcknowledged();
        // 5. Print the acknowledgement status
        System.out.println("operation acknowledged = " + acknowledged);
    }

View the index:


3. Import data

After the index is created, import the MySQL data into ES so it can be retrieved.
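The import itself can be done however the project already loads data; as one hedged illustration, the REST _bulk endpoint accepts alternating action and document lines (index_name, the _id values, and the title field below are placeholders, not from the original project):

POST /index_name/_bulk
{ "index": { "_id": "1" } }
{ "title": "中华人民共和国成立公告" }
{ "index": { "_id": "2" } }
{ "title": "人民代表大会制度" }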

4. Test Retrieval

Previously, with the default tokenizer, the search results were unsatisfactory. After installing the IK tokenizer, I ran the search tests again: the results were noticeably better, and retrieval remained very fast.
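A retrieval test can be issued with a simple match query; since the field was indexed with ik_max_word and searched with ik_smart, the query string is segmented coarsely before matching. A minimal sketch, again using the placeholder index_name and title field:

GET /index_name/_search
{
  "query": {
    "match": {
      "title": "共和国"
    }
  }
}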



Origin blog.csdn.net/qq_51641196/article/details/130037732