Elasticsearch in Practice (2): Chinese and Pinyin Autocompletion and Spelling Correction with Spring Boot

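Series index

Elasticsearch in Practice (1): Unified search with Spring Boot and Elasticsearch
Elasticsearch in Practice (2): Chinese and pinyin autocompletion and spelling correction with Spring Boot and Elasticsearch
Elasticsearch in Practice (3): Search recommendations with Spring Boot and Elasticsearch
Elasticsearch in Practice (4): Metric aggregations and drill-down analysis with Spring Boot and Elasticsearch
Elasticsearch in Practice (5): E-commerce log tracking and hot search terms with Spring Boot and Elasticsearch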

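I. Install the pinyin analysis plugin

1. Download links

Source code: https://github.com/medcl/elasticsearch-analysis-pinyin
Releases: https://github.com/medcl/elasticsearch-analysis-pinyin/releases
This article uses version 7.4.0: https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.4.0/elasticsearch-analysis-pinyin-7.4.0.zip
The plugin version must match the Elasticsearch version. The examples below also use the ik_smart tokenizer, which belongs to the separate IK analysis plugin (elasticsearch-analysis-ik), so that plugin has to be installed as well.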

2. Download and install

mkdir -p /mydata/elasticsearch/plugins/elasticsearch-analysis-pinyin-7.4.0
cd /mydata/elasticsearch/plugins/elasticsearch-analysis-pinyin-7.4.0
# Download
wget https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.4.0/elasticsearch-analysis-pinyin-7.4.0.zip

# Unpack, then delete the archive
unzip elasticsearch-analysis-pinyin-7.4.0.zip
rm -f elasticsearch-analysis-pinyin-7.4.0.zip
# Restart Elasticsearch (558eded797f9 is the container ID on the author's machine; use your own)
docker restart 558eded797f9

3. Filter options

When creating an index you can define a custom analyzer in the settings and then attach it to a field in the mapping:

{
    "indexName": "product_completion_index",
    "map": {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 2,
            "analysis": {
                "analyzer": {
                    "ik_pinyin_analyzer": {
                        "type": "custom",
                        "tokenizer": "ik_smart",
                        "filter": "pinyin_filter"
                    }
                },
                "filter": {
                    "pinyin_filter": {
                        "type": "pinyin",
                        "keep_first_letter": true,
                        "keep_separate_first_letter": false,
                        "keep_full_pinyin": true,
                        "keep_original": true,
                        "limit_first_letter_length": 16,
                        "lowercase": true,
                        "remove_duplicated_term": true
                    }
                }
            }
        },
        "mapping": {
            "properties": {
                "name": {
                    "type": "text"
                },
                "searchkey": {
                    "type": "completion",
                    "analyzer": "ik_pinyin_analyzer"
                }
            }
        }
    }
}
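
To make the filter options more concrete, here is a minimal sketch in plain Java, with no Elasticsearch dependency, of the kind of terms these settings are meant to produce for one word. The class PinyinTokenSketch, its hand-written syllable table, and the tokens method are illustrative assumptions; the real plugin uses its own dictionary, and its exact output and token order may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: mimics the effect of a few pinyin filter options.
// The syllable table is a hand-written assumption, not the plugin's dictionary.
final class PinyinTokenSketch {

    private static final Map<Character, String> SYLLABLES = new LinkedHashMap<>();
    static {
        SYLLABLES.put('小', "xiao");
        SYLLABLES.put('米', "mi");
        SYLLABLES.put('手', "shou");
        SYLLABLES.put('机', "ji");
    }

    public static List<String> tokens(String word, boolean keepFullPinyin, boolean keepFirstLetter,
                                      boolean keepOriginal, int limitFirstLetterLength) {
        // A set is used in the spirit of remove_duplicated_term
        Set<String> out = new LinkedHashSet<>();
        StringBuilder firstLetters = new StringBuilder();
        for (char c : word.toCharArray()) {
            String syllable = SYLLABLES.get(c);
            if (syllable == null) {
                continue; // characters outside the toy table are skipped
            }
            if (keepFullPinyin) {
                out.add(syllable); // keep_full_pinyin: one term per syllable
            }
            firstLetters.append(syllable.charAt(0));
        }
        if (keepFirstLetter && firstLetters.length() > 0) {
            // keep_first_letter: joined initials, capped by limit_first_letter_length
            int end = Math.min(firstLetters.length(), limitFirstLetterLength);
            out.add(firstLetters.substring(0, end));
        }
        if (keepOriginal) {
            out.add(word); // keep_original: the untouched input
        }
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        System.out.println(tokens("小米", true, true, true, 16));
    }
}
```

With terms like these in the index, a user can type the Chinese text, the full pinyin, or just the initials and still reach the same entry.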

II. Build a custom suggestion corpus

1. Create the index and mapping

// Imports used by this method (Elasticsearch 7.x high-level REST client):
import java.util.Map;
import org.elasticsearch.client.IndicesClient;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;

/*
 * @Description: Create an index together with its settings, mapping, and the custom pinyin analyzer.
 * The settings may be empty (the custom pinyin analyzer is defined inside the settings).
 * The mapping may be empty.
 * @Method: addIndexAndMapping
 * @Param: [commonEntity]
 * @Return: boolean
 *
 */
@SuppressWarnings("unchecked")
public boolean addIndexAndMapping(CommonEntity commonEntity) throws Exception {
    // Build the create-index request
    CreateIndexRequest request = new CreateIndexRequest(commonEntity.getIndexName());
    // Parameters sent by the front end
    Map<String, Object> map = commonEntity.getMap();
    // Walk the top-level "settings" and "mapping" entries
    for (Map.Entry<String, Object> entry : map.entrySet()) {
        Object value = entry.getValue();
        // Ignore anything that is not a non-empty map
        if (!(value instanceof Map) || ((Map<?, ?>) value).isEmpty()) {
            continue;
        }
        if ("settings".equals(entry.getKey())) {
            request.settings((Map<String, Object>) value);
        } else if ("mapping".equals(entry.getKey())) {
            request.mapping((Map<String, Object>) value);
        }
    }
    // client is the REST client held by the enclosing class
    IndicesClient indices = client.indices();
    // Send the request
    CreateIndexResponse response = indices.create(request, RequestOptions.DEFAULT);
    // true if the cluster acknowledged the request
    return response.isAcknowledged();
}

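Contents of CommonEntity:
Everything under settings is the index configuration; it is passed through as-is and follows the Elasticsearch DSL.
Everything under mapping describes the fields; it is passed through as-is and follows the Elasticsearch DSL.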

{
    "indexName": "product_completion_index",
    "map": {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 2,
            "analysis": {
                "analyzer": {
                    "ik_pinyin_analyzer": {
                        "type": "custom",
                        "tokenizer": "ik_smart",
                        "filter": "pinyin_filter"
                    }
                },
                "filter": {
                    "pinyin_filter": {
                        "type": "pinyin",
                        "keep_first_letter": true,
                        "keep_separate_first_letter": false,
                        "keep_full_pinyin": true,
                        "keep_original": true,
                        "limit_first_letter_length": 16,
                        "lowercase": true,
                        "remove_duplicated_term": true
                    }
                }
            }
        },
        "mapping": {
            "properties": {
                "name": {
                    "type": "keyword"
                },
                "searchkey": {
                    "type": "completion",
                    "analyzer": "ik_pinyin_analyzer"
                }
            }
        }
    }
}
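
If the request is assembled on the Java side instead of being posted as JSON, the same structure can be built from nested maps and handed to a method such as addIndexAndMapping. The following is only a sketch: the class IndexDefinitionSketch and its helper methods are hypothetical names, and only the keys and values are taken from the JSON above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the "map" part of the request body as nested Java maps.
final class IndexDefinitionSketch {

    // Small convenience for single-entry maps
    public static Map<String, Object> of(String key, Object value) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put(key, value);
        return m;
    }

    public static Map<String, Object> pinyinFilter() {
        Map<String, Object> f = new LinkedHashMap<>();
        f.put("type", "pinyin");
        f.put("keep_first_letter", true);
        f.put("keep_separate_first_letter", false);
        f.put("keep_full_pinyin", true);
        f.put("keep_original", true);
        f.put("limit_first_letter_length", 16);
        f.put("lowercase", true);
        f.put("remove_duplicated_term", true);
        return f;
    }

    public static Map<String, Object> build() {
        Map<String, Object> analyzer = new LinkedHashMap<>();
        analyzer.put("type", "custom");
        analyzer.put("tokenizer", "ik_smart");
        analyzer.put("filter", "pinyin_filter");

        Map<String, Object> analysis = new LinkedHashMap<>();
        analysis.put("analyzer", of("ik_pinyin_analyzer", analyzer));
        analysis.put("filter", of("pinyin_filter", pinyinFilter()));

        Map<String, Object> settings = new LinkedHashMap<>();
        settings.put("number_of_shards", 1);
        settings.put("number_of_replicas", 2);
        settings.put("analysis", analysis);

        Map<String, Object> searchkey = new LinkedHashMap<>();
        searchkey.put("type", "completion");
        searchkey.put("analyzer", "ik_pinyin_analyzer");

        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("name", of("type", "keyword"));
        properties.put("searchkey", searchkey);

        // Top-level keys match what addIndexAndMapping looks for: "settings" and "mapping"
        Map<String, Object> map = new LinkedHashMap<>();
        map.put("settings", settings);
        map.put("mapping", of("properties", properties));
        return map;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```

LinkedHashMap is used only so that the printed structure keeps the same key order as the JSON; a plain HashMap would work equally well for the request itself.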

Alternatively, run the equivalent request directly in Kibana:

PUT product_completion_index
{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 2,
        "analysis": {
            "analyzer": {
                "ik_pinyin_analyzer": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": "pinyin_filter"
                }
            },
            "filter": {
                "pinyin_filter": {
                    "type": "pinyin",
                    "keep_first_letter": true,
                    "keep_separate_first_letter": false,
                    "keep_full_pinyin": true,
                    "keep_original": true,
                    "limit_first_letter_length": 16,
                    "lowercase": true,
                    "remove_duplicated_term": true
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "keyword"
            },
            "searchkey": {
                "type": "completion",
                "analyzer": "ik_pinyin_analyzer"
            }
        }
    }
}

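2. Bulk-index documents

// Imports used by this method (Elasticsearch 7.4 high-level REST client):
import java.util.Map;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.rest.RestStatus;

/*
 * @Description: Bulk-index documents; the index and its mapping are created automatically if they do not exist.
 * @Method: bulkAddDoc
 * @Param: [commonEntity]
 * @Return: org.elasticsearch.rest.RestStatus
 *
 */
public static RestStatus bulkAddDoc(CommonEntity commonEntity) throws Exception {
    // Bulk request bound to the target index
    BulkRequest bulkRequest = new BulkRequest(commonEntity.getIndexName());
    // One IndexRequest per document sent by the front end
    for (Map<String, Object> doc : commonEntity.getList()) {
        bulkRequest.add(new IndexRequest().source(XContentType.JSON, SearchTools.mapToObjectGroup(doc)));
    }
    // Execute the bulk request
    BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
    return bulkResponse.status();
}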
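
SearchTools.mapToObjectGroup is not shown in the article. IndexRequest.source(XContentType, Object...) expects an even number of arguments laid out as alternating field names and values, so the helper presumably flattens a map into such an array. A minimal sketch under that assumption (the class name SearchToolsSketch is made up here, and the author's real helper may behave differently):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the article's SearchTools helper.
final class SearchToolsSketch {

    // Flattens {k1=v1, k2=v2} into [k1, v1, k2, v2].
    public static Object[] mapToObjectGroup(Map<String, Object> doc) {
        Object[] pairs = new Object[doc.size() * 2];
        int i = 0;
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            pairs[i++] = e.getKey();   // field name
            pairs[i++] = e.getValue(); // field value
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("searchkey", "小米手机");
        doc.put("name", "小米(MI)");
        System.out.println(java.util.Arrays.toString(mapToObjectGroup(doc)));
    }
}
```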

public static void main(String[] args) throws Exception {
    // Bulk insert
    CommonEntity commonEntity = new CommonEntity();
    commonEntity.setIndexName("product_completion_index"); // index name
    List<Map<String, Object>> list = new ArrayList<>();
    commonEntity.setList(list);
    // "小米手机" appears twice; the completion query later drops the duplicate
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米手机").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米11").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米电视").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米9").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米手机").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米手环").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米笔记本").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米摄像头").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "adidas男鞋").putData("name", "adidas男鞋"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "adidas女鞋").putData("name", "adidas女鞋"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "adidas外套").putData("name", "adidas外套"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "adidas裤子").putData("name", "adidas裤子"));
    bulkAddDoc(commonEntity);
}
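
CommonEntity and CommonMap are the author's own helper classes, and their source is not included in the article. Judging only from the calls used here, they could look roughly like the sketch below; the field names, defaults, and anything beyond the accessors actually called in the article are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical request holder; only the accessors used in this article are included.
final class CommonEntity {
    private String indexName;
    private Map<String, Object> map = new HashMap<>();
    private List<Map<String, Object>> list = new ArrayList<>();
    private String suggestFileld; // spelling kept to match the accessor names used in the article
    private String suggestValue;
    private int suggestCount;

    public String getIndexName() { return indexName; }
    public void setIndexName(String indexName) { this.indexName = indexName; }
    public Map<String, Object> getMap() { return map; }
    public void setMap(Map<String, Object> map) { this.map = map; }
    public List<Map<String, Object>> getList() { return list; }
    public void setList(List<Map<String, Object>> list) { this.list = list; }
    public String getSuggestFileld() { return suggestFileld; }
    public void setSuggestFileld(String suggestFileld) { this.suggestFileld = suggestFileld; }
    public String getSuggestValue() { return suggestValue; }
    public void setSuggestValue(String suggestValue) { this.suggestValue = suggestValue; }
    public int getSuggestCount() { return suggestCount; }
    public void setSuggestCount(int suggestCount) { this.suggestCount = suggestCount; }

    public static void main(String[] args) {
        CommonEntity entity = new CommonEntity();
        entity.setIndexName("product_completion_index");
        entity.getList().add(new CommonMap<String, Object>()
                .putData("searchkey", "小米手机")
                .putData("name", "小米(MI)"));
        System.out.println(entity.getIndexName() + " " + entity.getList());
    }
}

// Hypothetical: a HashMap whose put can be chained.
final class CommonMap<K, V> extends HashMap<K, V> {
    private static final long serialVersionUID = 1L;

    public CommonMap<K, V> putData(K key, V value) {
        put(key, value);
        return this; // returning this is what allows putData(...).putData(...)
    }
}
```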

3. Check the indexed documents

GET product_completion_index/_search

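III. Product search with Chinese and pinyin autocompletion

1. Concepts

Term suggester: analyzes the input text into terms and offers suggestions for each term.
Phrase suggester: builds on the term suggester and also takes the relationship between neighbouring terms into account.
Completion suggester: designed for auto-completion, that is, suggesting entries while the user is still typing.
Context suggester: a completion suggester whose results can additionally be filtered or boosted by a context.

A completion query in Kibana looks like this: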

GET product_completion_index/_search
{
    "from": 0,
    "size": 100,
    "suggest": {
        "czbk-suggest": {
            "prefix": "小米",
            "completion": {
                "field": "searchkey",
                "size": 20,
                "skip_duplicates": true
            }
        }
    }
}
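
Conceptually, a completion suggester answers the question "which stored inputs start with this prefix?". Elasticsearch does this with a finite state transducer built at index time; the sketch below swaps that for a sorted set, which is far simpler but shows the same three ideas from the query above: the prefix, the size limit, and skip_duplicates. PrefixCompletionSketch and suggest are hypothetical names, and the result order here is plain string order rather than the ordering Elasticsearch applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Simplified stand-in for a completion suggester: prefix lookup over a sorted set.
final class PrefixCompletionSketch {

    public static List<String> suggest(List<String> corpus, String prefix, int size) {
        // A set drops duplicate inputs, similar in effect to skip_duplicates
        TreeSet<String> sorted = new TreeSet<>(corpus);
        List<String> result = new ArrayList<>();
        // All entries sharing the prefix are contiguous and start at the first entry >= prefix
        for (String candidate : sorted.tailSet(prefix, true)) {
            if (!candidate.startsWith(prefix) || result.size() >= size) {
                break;
            }
            result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("小米手机", "小米11", "小米电视", "小米手机", "adidas男鞋");
        System.out.println(suggest(corpus, "小米", 10));
    }
}
```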

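2. Chinese autocompletion in Java

// Elasticsearch imports used by this method (7.x high-level REST client);
// CollectionUtils is whichever utility class the project already uses.
import java.util.ArrayList;
import java.util.List;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.ScoreSortBuilder;
import org.elasticsearch.search.sort.SortOrder;
import org.elasticsearch.search.suggest.SuggestBuilder;
import org.elasticsearch.search.suggest.completion.CompletionSuggestion;
import org.elasticsearch.search.suggest.completion.CompletionSuggestionBuilder;

/*
 * @Description: Autocompletion: suggest likely words or phrases based on what the user has typed so far
 * @Method: cSuggest
 * @Param: [commonEntity]
 * @Update:
 * @since: 1.0.0
 * @Return: java.util.List<java.lang.String>
 * >>>>>>>>>>>> Outline of the implementation >>>>>>>>>>>>>
 * 1. Create the search request
 * 2. Create the source builder (sorted by score)
 * 3. Create the completion suggestion builder (field, prefix, size from the front end)
 * 4. Add the completion suggestion builder to the source builder
 * 5. Attach the source builder to the search request
 * 6. Extract the suggested values from the response
 */
public static List<String> cSuggest(CommonEntity commonEntity) throws Exception {
    // Result list
    List<String> suggestList = new ArrayList<>();
    // Search request against the corpus index
    SearchRequest searchRequest = new SearchRequest(commonEntity.getIndexName());
    // Source builder, sorted by score in descending order
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.sort(new ScoreSortBuilder().order(SortOrder.DESC));
    // Completion suggester on the configured field
    CompletionSuggestionBuilder completionSuggestionBuilder =
            new CompletionSuggestionBuilder(commonEntity.getSuggestFileld());
    // Text typed by the user
    completionSuggestionBuilder.prefix(commonEntity.getSuggestValue());
    // Drop duplicate suggestions
    completionSuggestionBuilder.skipDuplicates(true);
    // Maximum number of suggestions
    completionSuggestionBuilder.size(commonEntity.getSuggestCount());
    // "common-suggest" is the key under which the suggestions are returned; it can be hard-coded
    searchSourceBuilder.suggest(new SuggestBuilder().addSuggestion("common-suggest", completionSuggestionBuilder));
    searchRequest.source(searchSourceBuilder);
    // Run the search
    SearchResponse suggestResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    // Read the completion suggestion from the response
    CompletionSuggestion completionSuggestion = suggestResponse.getSuggest().getSuggestion("common-suggest");
    // Guard against a response without entries before calling get(0)
    if (completionSuggestion == null || completionSuggestion.getEntries().isEmpty()) {
        return suggestList;
    }
    List<CompletionSuggestion.Entry.Option> optionsList = completionSuggestion.getEntries().get(0).getOptions();
    // Copy the suggested texts into the result
    if (!CollectionUtils.isEmpty(optionsList)) {
        optionsList.forEach(item -> suggestList.add(item.getText().toString()));
    }
    return suggestList;
}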

public static void main(String[] args) throws Exception {
    // Autocompletion
    CommonEntity suggestEntity = new CommonEntity();
    suggestEntity.setIndexName("product_completion_index"); // index name
    suggestEntity.setSuggestFileld("searchkey"); // field used for completion
    suggestEntity.setSuggestValue("小米"); // text typed by the user
    suggestEntity.setSuggestCount(5); // number of suggestions to return

    System.out.println(cSuggest(suggestEntity));
    // Result: [小米11, 小米9, 小米手机, 小米手环, 小米摄像头]
    // Duplicate suggestions are removed automatically (skip_duplicates)
}

3. Pinyin autocompletion in Java

// The same entity is reused; only the input text changes between the three cases.
CommonEntity suggestEntity = new CommonEntity();
suggestEntity.setIndexName("product_completion_index"); // index name
suggestEntity.setSuggestFileld("searchkey"); // field used for completion
suggestEntity.setSuggestCount(5); // number of suggestions to return

// (1) Full pinyin
suggestEntity.setSuggestValue("xiaomi");
System.out.println(cSuggest(suggestEntity));
// Result: [小米11, 小米9, 小米摄像头, 小米电视, 小米笔记本]

// (2) Full pinyin, syllables separated by a space
suggestEntity.setSuggestValue("xiao mi");
System.out.println(cSuggest(suggestEntity));
// Result: [小米11, 小米9, 小米摄像头, 小米电视, 小米笔记本]

// (3) First letters only
suggestEntity.setSuggestValue("xm");
System.out.println(cSuggest(suggestEntity));
// Result: [小米11, 小米9, 小米摄像头, 小米电视, 小米笔记本]

IV. Language processing (spelling correction)

1. Example

GET product_completion_index/_search
{
    "suggest": {
        "common-suggestion": {
            "text": "adidaas男鞋",
            "phrase": {
                "field": "name",
                "size": 13
            }
        }
    }
}
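
The phrase suggester scores candidates with an n-gram language model, and its candidate generators propose alternatives within a small edit distance of the input terms. The sketch below is not that algorithm; it only illustrates the edit-distance idea by picking the closest entry from a small in-memory list. SpellingSketch, distance, and closest are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Simplified illustration of edit-distance based correction; not the phrase suggester itself.
final class SpellingSketch {

    // Levenshtein distance: minimum number of single-character inserts, deletes, or substitutions
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the corpus entry closest to the input, or null if none is within maxEdits
    public static String closest(String input, List<String> corpus, int maxEdits) {
        String best = null;
        int bestDistance = maxEdits + 1;
        for (String candidate : corpus) {
            int d = distance(input, candidate);
            if (d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("adidas男鞋", "adidas女鞋", "adidas外套", "adidas裤子");
        System.out.println(closest("adidaas男鞋", names, 2));
    }
}
```

Capping the number of edits keeps unrelated entries from being offered as corrections, which is why the sketch returns null instead of always returning its nearest match.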


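2. Spelling correction in Java

// Elasticsearch imports used by this method (7.x high-level REST client);
// CollectionUtils is whichever utility class the project already uses.
import java.util.List;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.ScoreSortBuilder;
import org.elasticsearch.search.sort.SortOrder;
import org.elasticsearch.search.suggest.SuggestBuilder;
import org.elasticsearch.search.suggest.phrase.PhraseSuggestion;
import org.elasticsearch.search.suggest.phrase.PhraseSuggestionBuilder;

/*
 * @Description: Spelling correction
 * @Method: pSuggest
 * @Param: [commonEntity]
 * @Update:
 * @since: 1.0.0
 * @Return: java.lang.String
 * >>>>>>>>>>>> Outline of the implementation >>>>>>>>>>>>>
 * 1. Create the search request
 * 2. Create the source builder (sorted by score)
 * 3. Create the phrase suggestion builder (field and text from the front end)
 * 4. Add the phrase suggestion builder to the source builder
 * 5. Attach the source builder to the search request
 * 6. Extract the corrected value from the response
 */
public static String pSuggest(CommonEntity commonEntity) throws Exception {
    // Result; stays empty when no correction is found
    String pSuggestString = "";
    // Search request
    SearchRequest searchRequest = new SearchRequest(commonEntity.getIndexName());
    // Source builder
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // Sort by score in descending order
    searchSourceBuilder.sort(new ScoreSortBuilder().order(SortOrder.DESC));
    // Phrase suggester on the configured field
    PhraseSuggestionBuilder pSuggestionBuilder = new PhraseSuggestionBuilder(commonEntity.getSuggestFileld());
    // Text to be corrected
    pSuggestionBuilder.text(commonEntity.getSuggestValue());
    // Only the best correction is needed
    pSuggestionBuilder.size(1);
    searchSourceBuilder.suggest(new SuggestBuilder().addSuggestion("common-suggest", pSuggestionBuilder));
    searchRequest.source(searchSourceBuilder);
    // Run the search
    SearchResponse suggestResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    // Read the phrase suggestion from the response
    PhraseSuggestion phraseSuggestion = suggestResponse.getSuggest().getSuggestion("common-suggest");
    // Guard against a response without entries before calling get(0)
    if (phraseSuggestion == null || phraseSuggestion.getEntries().isEmpty()) {
        return pSuggestString;
    }
    List<PhraseSuggestion.Entry.Option> optionsList = phraseSuggestion.getEntries().get(0).getOptions();
    // Take the first option and strip the spaces the suggester puts between terms
    if (!CollectionUtils.isEmpty(optionsList) && optionsList.get(0).getText() != null) {
        pSuggestString = optionsList.get(0).getText().string().replaceAll(" ", "");
    }
    return pSuggestString;
}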


public static void main(String[] args) throws Exception {
    CommonEntity suggestEntity = new CommonEntity();
    suggestEntity.setIndexName("product_completion_index"); // index name
    suggestEntity.setSuggestFileld("name"); // field used for correction
    suggestEntity.setSuggestValue("adidaas男鞋"); // misspelled input
    System.out.println(pSuggest(suggestEntity)); // Result: adidas男鞋
}

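V. Summary

  1. Keep a dedicated search term corpus in its own index, separate from the business indices, so that the corpus is easy to maintain and upgrade.
  2. Query the corpus with the analyzed input and any other search conditions, and return a handful of records (JD.com returns 13, Taobao/Tmall 10, Baidu 4).
  3. To improve precision, suggestions are usually matched by prefix.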


Reposted from blog.csdn.net/A_art_xiang/article/details/132259599