Study Notes | Chinese Word Segmentation in ElasticSearch

These are notes from an earlier hands-on test of ES. They mainly record the test results.

Chinese word segmentation package

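Segmentation here uses the IK analyzer. Its results are quite good, and adding your own special phrases to its configuration is enough to have them segmented accurately.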

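Download the IK package and unzip it into the plugins directory. ES 5.5.1 loads it automatically, so nothing needs to be added to the configuration file.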

The plugin's GitHub repository has detailed instructions and a package matching each ES version.
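
As a rough sketch of what "adding your own phrases to the configuration" can look like: the snippet below writes a small custom dictionary and an `IKAnalyzer.cfg.xml` that points at it through the `ext_dict` entry. The `IK_CONFIG_DIR` variable, the `custom/mydict.dic` file name, and the two sample phrases are assumptions made for illustration. The script defaults to a scratch directory so it can be tried without touching a real node. On a real install, edit the existing config file in the plugin's config directory instead of overwriting it, and expect to restart ES before the new dictionary takes effect.

```shell
#!/bin/sh
# Hypothetical location of the IK config directory; adjust to your install.
# Defaults to a scratch directory so this sketch does not modify a real node.
IK_CONFIG_DIR="${IK_CONFIG_DIR:-$(mktemp -d)}"

mkdir -p "$IK_CONFIG_DIR/custom"

# Custom dictionary: one phrase per line, UTF-8 encoded (sample phrases only).
cat > "$IK_CONFIG_DIR/custom/mydict.dic" <<'EOF'
战狼
三生三世
EOF

# Point IK at the extra dictionary; the path is relative to the config directory.
cat > "$IK_CONFIG_DIR/IKAnalyzer.cfg.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer extension configuration</comment>
  <entry key="ext_dict">custom/mydict.dic</entry>
</properties>
EOF

echo "wrote $IK_CONFIG_DIR/IKAnalyzer.cfg.xml"
```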

Example

  1. Create the index

     curl -XPUT http://localhost:9200/index
    
  2. Create the mapping

     curl -XPOST http://localhost:9200/index/fulltext/_mapping -H 'Content-Type: application/json' -d'
     {
       "properties": {
           "content": {
               "type": "text",
               "analyzer": "ik_max_word",
               "search_analyzer": "ik_max_word"
           }
       }
     }'
    
  3. Add data

     curl -XPOST http://localhost:9200/index/fulltext/1 -H 'Content-Type: application/json' -d'
     {"content":"战狼2真是个好电影啊"}
     '
     curl -XPOST http://localhost:9200/index/fulltext/2 -H 'Content-Type: application/json' -d'
     {"content":"战狼良心之作啊"}
     '
     curl -XPOST http://localhost:9200/index/fulltext/3 -H 'Content-Type: application/json' -d'
     {"content":"三生三世锁场"}
     '
    
  4. Query with match

     curl -XPOST http://localhost:9200/index/fulltext/_search -H 'Content-Type: application/json' -d'
     {
         "query" : { "match" : { "content" : "战狼2" }},
         "highlight" : {
             "pre_tags" : ["<tag1>", "<tag2>"],
             "post_tags" : ["</tag1>", "</tag2>"],
             "fields" : {
                 "content" : {}
             }
         }
     }
     '
    
     Result:
     {
         "took": 36, 
         "timed_out": false, 
         "_shards": {
             "total": 5, 
             "successful": 5, 
             "failed": 0
         }, 
         "hits": {
             "total": 2, 
             "max_score": 0.854655, 
             "hits": [
                 {
                     "_index": "index", 
                     "_type": "fulltext", 
                     "_id": "1", 
                     "_score": 0.854655, 
                     "_source": {
                         "content": "战狼2真是个好电影啊"
                     }, 
                     "highlight": {
                         "content": [
                             "<tag1>战</tag1><tag1>狼</tag1><tag1>2</tag1>真是个好电影啊"
                         ]
                     }
                 }, 
                 {
                     "_index": "index", 
                     "_type": "fulltext", 
                     "_id": "2", 
                     "_score": 0.5716521, 
                     "_source": {
                         "content": "战狼良心之作啊"
                     }, 
                     "highlight": {
                         "content": [
                             "<tag1>战</tag1><tag1>狼</tag1>良心之作啊"
                         ]
                     }
                 }
             ]
         }
     }
    
  5. Query with match_phrase

     curl -XGET 'localhost:9200/index/fulltext/_search?pretty' -H 'Content-Type: application/json' -d'
     {
         "query": {
             "match_phrase" : {
                 "content" : "战狼2"
             }
         }
     }
     '       
    
     Result
     {
       "took" : 1,
       "timed_out" : false,
       "_shards" : {
         "total" : 5,
         "successful" : 5,
         "failed" : 0
       },
       "hits" : {
         "total" : 1,
         "max_score" : 0.85465515,
         "hits" : [
           {
             "_index" : "index",
             "_type" : "fulltext",
             "_id" : "1",
             "_score" : 0.85465515,
             "_source" : {
               "content" : "战狼2真是个好电影啊"
             }
           }
         ]
       }
     }
    
  6. Inspect the analyzer output

    Format (my_ik is a placeholder for the tokenizer name): http://localhost:9200/your_index/_analyze?text=中华人民共和国MN&tokenizer=my_ik

    Example: http://localhost:9200/index/_analyze?text=中华人民共和国MN&tokenizer=ik_max_word
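
Steps 4 and 5 send the same text with two different query types. `match` returned two hits because a document only needs to contain some of the analyzed terms, while `match_phrase` returned only document 1 because all of the terms must appear next to each other and in order. To compare the two side by side, a small helper along these lines may be convenient. The function names and the `ES_URL` variable are made up for this sketch, and the body builder does no JSON escaping, so the query text must not contain quotes or backslashes.

```shell
# Build a search body for a given query type ("match" or "match_phrase").
# $1 = query type, $2 = text to search for in the "content" field.
search_body() {
  printf '{"query":{"%s":{"content":"%s"}}}' "$1" "$2"
}

# Send the query to the example index from the steps above.
# ES_URL is an assumed variable; it defaults to the local node.
search() {
  curl -s -XGET "${ES_URL:-http://localhost:9200}/index/fulltext/_search?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(search_body "$1" "$2")"
}
```

With a node running, `search match '战狼2'` and `search match_phrase '战狼2'` repeat the two queries from steps 4 and 5 (without highlighting).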

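The `_analyze` check in step 6 can also be sent with a JSON request body instead of URL parameters, which avoids URL-encoding the Chinese text. The following is a minimal sketch under the same assumptions as the rest of these notes (local node, index named `index`, IK plugin installed). The function names and the `ES_URL` variable are hypothetical, and the text is not JSON-escaped, so it must not contain quotes or backslashes.

```shell
# Build the JSON body for the _analyze API.
# $1 = analyzer name (for example ik_max_word), $2 = text to analyze.
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Ask a running node how the given analyzer splits the text.
# ES_URL is an assumed variable; it defaults to the local node.
analyze() {
  curl -s -XGET "${ES_URL:-http://localhost:9200}/index/_analyze?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1" "$2")"
}
```

Running `analyze ik_max_word '中华人民共和国MN'` and then `analyze ik_smart '中华人民共和国MN'` should make it easy to compare the two analyzers the IK plugin provides, or to check whether a phrase such as 战狼 comes out as a single token.
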
Reposted from blog.csdn.net/weixin_34110749/article/details/87196232