Implementing Term Frequency Statistics in Elasticsearch

Installing the IK and pinyin analyzers

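Run the following from the Elasticsearch installation directory. The plugin version must match the Elasticsearch version exactly (the URL below targets 7.2.0).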

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.2.0/elasticsearch-analysis-ik-7.2.0.zip

For an offline install, download the plugin zips beforehand and unpack them by hand with the following commands:

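cd plugins/
mkdir ik
mkdir pinyin
# Paths are now relative to plugins/, so extract straight into ik
unzip ../plugin-zips/elasticsearch-analysis-ik-7.5.1.zip -d ik
# Unpack the pinyin plugin zip into pinyin/ the same way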

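About the IK analyzer

What is the difference between ik_max_word and ik_smart?

ik_max_word splits text at the finest granularity. For example, it splits “中华人民共和国国歌” into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌”, exhausting every possible combination.

ik_smart splits text at the coarsest granularity. For example, it splits “中华人民共和国国歌” into “中华人民共和国, 国歌”.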
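To compare the two analyzers on your own text, the `_analyze` API returns the tokens an analyzer produces. The following is a minimal Python sketch using only the standard library; the helper names (`build_analyze_body`, `tokens_from_response`, `analyze`) are made up for this illustration, and `SAMPLE_IK_SMART` is a hand-written fragment based on the ik_smart split quoted above, not captured output.

```python
import json
import urllib.request


def build_analyze_body(analyzer, text):
    """Body for the _analyze API: which analyzer to run on which text."""
    return {"analyzer": analyzer, "text": text}


def tokens_from_response(response):
    """Pull the token strings out of an _analyze response, in order."""
    return [entry["token"] for entry in response.get("tokens", [])]


def analyze(base_url, analyzer, text):
    """POST the text to <base_url>/_analyze and return the token list."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=json.dumps(build_analyze_body(analyzer, text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_from_response(json.load(reply))


# Hand-written fragment built from the ik_smart split quoted above,
# only to show the shape that tokens_from_response expects.
SAMPLE_IK_SMART = {"tokens": [{"token": "中华人民共和国"}, {"token": "国歌"}]}
```

Assuming an unsecured node at `http://localhost:9200` with the IK plugin installed, `analyze("http://localhost:9200", "ik_max_word", "中华人民共和国国歌")` would return the fine-grained token list, and passing `"ik_smart"` the coarse one.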

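The example below uses ik_max_word. Because message is a text field, fielddata must be enabled on it before it can be used in a terms aggregation. Fielddata is loaded into JVM heap, so it can be memory-hungry on large indices.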


PUT message_index
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "ik_max_word",
        "term_vector": "with_positions_offsets",
        "boost": 8,
        "fielddata": true
      }
    }
  }
}
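The mapping also stores term vectors (`term_vector`). Once documents are indexed, the `_termvectors` API can read them back and report how often each term occurs inside a single document, which the terms aggregation does not. Below is a small Python sketch of the request body and of reading the reply; the helper names are invented for illustration, and the numbers in `SAMPLE_TERMVECTORS` are made up purely to show the response shape.

```python
def termvectors_path(index, doc_id):
    """URL path of the term vectors endpoint for one document."""
    return "/{}/_termvectors/{}".format(index, doc_id)


def build_termvectors_body(field):
    """Ask for per-term statistics on one field, skipping positions and offsets."""
    return {
        "fields": [field],
        "term_statistics": True,
        "field_statistics": False,
        "positions": False,
        "offsets": False,
    }


def term_frequencies(response, field):
    """Map each term of the field to its in-document frequency (term_freq)."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: stats["term_freq"] for term, stats in terms.items()}


# Made-up numbers, only to illustrate the response layout.
SAMPLE_TERMVECTORS = {
    "term_vectors": {
        "message": {
            "terms": {
                "原": {"term_freq": 2, "doc_freq": 7},
                "神": {"term_freq": 4, "doc_freq": 8},
            }
        }
    }
}
```

Here `term_freq` is the count within the requested document, while `doc_freq` (returned because `term_statistics` is on) is the number of documents containing the term.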

POST message_index/_doc/1
{
  "message":"《原神》霄宫角色PV——「鸣神岛夏天的象征」"
}

POST message_index/_doc/2
{
  "message":"原神神里和霄宫该如何选择?全网最强评测"
}

POST message_index/_doc/3
{
  "message":"原神:雷神心口拔刀,一刀斩败主角,最后还嫌我太慢抽完万叶抽神里,没有人比我更懂原神保底"
}

POST message_index/_doc/4
{
  "message":"原神:神里怎么会加血?雷神稳稳的了,常驻池五虎上将齐了"
}

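# This request reuses ID 4, so it overwrites the document indexed just above; the index ends up with 9 documents
POST message_index/_doc/4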
{
  "message":"将会出现雷神和心海,还会有个神秘的5星角色原神"
}

POST message_index/_doc/5
{
  "message":"氪金原神2.0,脸黑无下限!亏到自闭!"
}

POST message_index/_doc/6
{
  "message":"我宣布原神氪金不再适合我,歪到大气层外面的万叶不抽也罢"
}

POST message_index/_doc/7
{
  "message":"联合参展视频烟绯生日快乐哦"
}

POST message_index/_doc/8
{
  "message":"可莉的生日礼物《原神》拾枝杂谈"
}

POST message_index/_doc/9
{
  "message":"神里怎么会加血?雷神稳稳的了,常驻池五虎上将齐了"
}

Run the aggregation and inspect the result


POST message_index/_search
{
  "size": 0,
  "aggs": {
    "messages": {
      "terms": {
        "size": 15,
        "field": "message"
      }
    }
  }
}
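In application code the same request is usually built as a dictionary and the buckets flattened into a ranked list. A minimal Python sketch follows; the function names are hypothetical, and `SAMPLE_RESPONSE` is a trimmed copy of the first three buckets of the response to this query. The outer `"size": 0` suppresses the search hits, while the inner `size` caps the number of buckets.

```python
def build_terms_agg(field, size=15, agg_name="messages"):
    """Search body that returns no hits, only the top `size` terms of `field`."""
    return {
        "size": 0,
        "aggs": {agg_name: {"terms": {"field": field, "size": size}}},
    }


def top_terms(response, agg_name="messages"):
    """Flatten the aggregation buckets into (term, doc_count) pairs."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


# Trimmed to the first three buckets of the response to this query.
SAMPLE_RESPONSE = {
    "aggregations": {
        "messages": {
            "buckets": [
                {"key": "神", "doc_count": 8},
                {"key": "原", "doc_count": 7},
                {"key": "的", "doc_count": 4},
            ]
        }
    }
}
```

`build_terms_agg("message")` produces the same body as the Console request above, and `top_terms(SAMPLE_RESPONSE)` yields the pairs in the order Elasticsearch returned them, which is by descending `doc_count`.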

## Response
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 9,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "messages" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 91,
      "buckets" : [
        {
          "key" : "神",
          "doc_count" : 8
        },
        {
          "key" : "原",
          "doc_count" : 7
        },
        {
          "key" : "的",
          "doc_count" : 4
        },
        {
          "key" : "里",
          "doc_count" : 3
        },
        {
          "key" : "雷",
          "doc_count" : 3
        },
        {
          "key" : "万",
          "doc_count" : 2
        },
        {
          "key" : "叶",
          "doc_count" : 2
        },
        {
          "key" : "和",
          "doc_count" : 2
        },
        {
          "key" : "宫",
          "doc_count" : 2
        },
        {
          "key" : "氪",
          "doc_count" : 2
        },
        {
          "key" : "生日",
          "doc_count" : 2
        },
        {
          "key" : "角色",
          "doc_count" : 2
        },
        {
          "key" : "金",
          "doc_count" : 2
        },
        {
          "key" : "霄",
          "doc_count" : 2
        },
        {
          "key" : "2.0",
          "doc_count" : 1
        }
      ]
    }
  }
}
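Note that `doc_count` is the number of documents containing a term, not the number of times the term occurs. The Python sketch below makes the distinction concrete with a deliberately naive tokenizer that treats every character (punctuation included) as a token; this is an assumption for illustration and is not how ik_max_word works. `MESSAGES` holds the nine documents left in the index after the second write to ID 4 replaces the first.

```python
from collections import Counter

# The nine documents that remain in message_index (IDs 1 to 9).
MESSAGES = [
    "《原神》霄宫角色PV——「鸣神岛夏天的象征」",
    "原神神里和霄宫该如何选择?全网最强评测",
    "原神:雷神心口拔刀,一刀斩败主角,最后还嫌我太慢抽完万叶抽神里,没有人比我更懂原神保底",
    "将会出现雷神和心海,还会有个神秘的5星角色原神",
    "氪金原神2.0,脸黑无下限!亏到自闭!",
    "我宣布原神氪金不再适合我,歪到大气层外面的万叶不抽也罢",
    "联合参展视频烟绯生日快乐哦",
    "可莉的生日礼物《原神》拾枝杂谈",
    "神里怎么会加血?雷神稳稳的了,常驻池五虎上将齐了",
]


def document_frequency(messages):
    """Count, per character, how many messages contain it at least once."""
    counts = Counter()
    for message in messages:
        counts.update(set(message))  # set() so repeats in one message count once
    return counts


def total_frequency(messages):
    """Count, per character, every occurrence across all messages."""
    counts = Counter()
    for message in messages:
        counts.update(message)
    return counts


df = document_frequency(MESSAGES)
tf = total_frequency(MESSAGES)
print(df["神"], tf["神"])  # 8 16
print(df["原"], tf["原"])  # 7 8
```

The per-document counts for 神 and 原 (8 and 7) agree with the buckets in the response, while the total occurrence counts are higher. Other keys can differ from this toy count because ik_max_word also emits multi-character words.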

Reposted from blog.csdn.net/mini_snow/article/details/119457707