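ElasticSearch in Practice, Part 3 (Tokenization and Mapping)

ElasticSearch's document mapping mechanism (mapping) determines field types: every field is bound to exactly one data type.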

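1 ES field types

① Basic field types

Strings: text, keyword

text is for full-text content and is analyzed (tokenized) by default; keyword is for exact values and is not analyzed.

Numbers: long, integer, short, double, float

Dates: date

Booleans: boolean

② Complex data types

Object: object

Array: array (ES has no dedicated array type; any field can hold multiple values of the same type)

Geo: geo_point, geo_shape

When we index a document into a new index, we do not have to declare which type each field belongs to; ES picks a data type for each field according to its default rules. The fields id, name, birthday, salary, and sex below carry values of different types.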

PUT /crm/user/1
{
  "id":1,
  "name":"gosaint",
  "birthday":"2018-11-03",
  "salary":1000.890,
  "sex":true
}

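2 Default mapping

To view the mapping of a type in an index: GET {indexName}/_mapping/{typeName}

When a document is added without a configured mapping, ES tries to guess each field's type and dynamically generates the field-to-type mapping.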

JSON type	Field type
Boolean: true or false	"boolean"
Whole number: 123	"long"
Floating point: 123.45	"float"
String, valid date: "2014-09-15"	"date"
String: "foo bar"	"text" (with a "keyword" sub-field)
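
The table above can be imitated in a few lines of Python. This is only a minimal sketch of the guessing rules, not how ES implements them: the function name guess_field_type is hypothetical, and date detection is reduced to ISO calendar dates, which is an assumption made for brevity.

```python
from datetime import date


def guess_field_type(value):
    """Return the field type the table above would assign to a JSON value."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            # Assumption: only ISO calendar dates count as dates in this sketch.
            date.fromisoformat(value)
            return "date"
        except ValueError:
            return "text"
    raise TypeError("unsupported JSON value: %r" % (value,))


# The sample document indexed with PUT /crm/user/1.
doc = {
    "id": 1,
    "name": "gosaint",
    "birthday": "2018-11-03",
    "salary": 1000.890,
    "sex": True,
}
inferred = {field: guess_field_type(value) for field, value in doc.items()}
print(inferred)
```

The result can be compared field by field with what GET /crm/_mapping/user returns.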

For the example above, GET /crm/_mapping/user shows the type assigned to each field:

birthday: date

id: long

name: text (full text, analyzed by default)

salary: float

sex: boolean

{
  "crm": {
    "mappings": {
      "user": {
        "properties": {
          "birthday": {
            "type": "date"
          },
          "id": {
            "type": "long"
          },
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "salary": {
            "type": "float"
          },
          "sex": {
            "type": "boolean"
          }
        }
      }
    }
  }
}

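3 Simple type mapping

type	Type: a basic data type such as integer, long, date, boolean, keyword, text...
enabled	Whether the field is enabled: defaults to true. false: the field is not indexed and cannot be searched or filtered; it is only kept in _source.
boost	Boost factor: weights this field when the final score is computed at query time.
format	Format: usually used to specify a date format, e.g. yyyy-MM-dd HH:mm:ss.SSS
ignore_above	Length limit: strings longer than this value are not indexed or stored.
ignore_malformed	Ignore conversion errors: true means a value that fails format conversion is ignored; an ignored value is neither stored nor indexed.
include_in_all	Whether the field's value is included in _all.
null_value	Substitute for null values: e.g. index a null string as "NULL" and a null number as -1.
store	Whether to store the field separately: defaults to false. true is rarely useful because the data is already in _source.
index	Index mode: analyzed (indexed and analyzed, the behavior of text), not_analyzed (indexed without analysis, the behavior of keyword), no (not indexed). These are the ES 2.x values; with the text and keyword types of ES 5.x and later, index takes true or false.
analyzer	Index analyzer: the analyzer used when the field is indexed, e.g. ik_smart, ik_max_word, standard
search_analyzer	Search analyzer: the analyzer applied to the query text when this field is searched.
fields	Multi-fields: used when one field needs to be indexed in more than one way. E.g. for searching a city such as New York: "city": { "type": "text", "analyzer": "ik_smart", "fields": { "raw": { "type": "keyword" } } } Filtering and sorting can then use the field name city.raw.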
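
To make the fields row more concrete, here is a small Python sketch that builds the city multi-field mapping from the table and a search body that queries the analyzed field while sorting on the raw sub-field. The helper name multi_field is hypothetical, and the dictionaries are plain request bodies; nothing here talks to a cluster.

```python
import json


def multi_field(analyzer, raw_name="raw"):
    """Build a text field that also exposes an un-analyzed keyword sub-field."""
    return {
        "type": "text",
        "analyzer": analyzer,
        "fields": {raw_name: {"type": "keyword"}},
    }


# Mapping fragment equivalent to the "city" example in the table.
mapping = {"properties": {"city": multi_field("ik_smart")}}

# The sub-field is addressed as <field>.<sub-field>.
sort_field = "city." + next(iter(mapping["properties"]["city"]["fields"]))

# Full-text match on the analyzed field, sorting on the exact value.
search_body = {
    "query": {"match": {"city": "New York"}},
    "sort": [{sort_field: "asc"}],
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search_body, indent=2))
```

The point of the design is that one source value is indexed twice: once tokenized for full-text search, once verbatim for sorting and filtering.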

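4 Custom type mapping

In the example above, ES has already assigned a type to each field by default, so the type of an existing field cannot be changed; for instance, we cannot change id to integer. Newly added fields, however, can still be given an explicit mapping.

① Configuring the mapping of a single type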

POST {indexName}/{typeName}/_mapping
{
    "{typeName}": {
        "properties": {
            "id": {
                "type": "long"
            },
            "content": {
                "type": "text",
                "analyzer": "ik_smart",
                "search_analyzer": "ik_smart"
            }
        }
    }
}

POST /crm/user/_mapping
{
    "user": {
        "properties": {
            "id": {
                "type": "integer"
            },
            "name": {
                "type": "keyword"
            },
            "birthday": {
                "type": "text"
            },
            "salary": {
                "type": "double"
            },
            "sex": {
                "type": "boolean"
            }
        }
    }
}

The response:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "mapper [name] of different type, current_type [text], merged_type [keyword]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "mapper [name] of different type, current_type [text], merged_type [keyword]"
  },
  "status": 400
}

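This shows that once a mapping exists for a field, assigning it a different type fails. To reassign the types, the existing index has to be deleted first.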
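
The conflict can be spotted before sending the request by comparing the existing mapping with the proposed one. The following Python sketch does only that comparison; the function name find_type_conflicts is hypothetical, and unlike the response above, which names a single field, it lists every field whose type differs.

```python
def find_type_conflicts(existing, proposed):
    """Return {field: (current_type, requested_type)} for every mismatch."""
    conflicts = {}
    for field, spec in proposed.items():
        current = existing.get(field)
        # Fields that do not exist yet can be added freely.
        if current is not None and current["type"] != spec["type"]:
            conflicts[field] = (current["type"], spec["type"])
    return conflicts


# Types ES generated dynamically for /crm/user.
existing = {
    "id": {"type": "long"},
    "name": {"type": "text"},
    "birthday": {"type": "date"},
    "salary": {"type": "float"},
    "sex": {"type": "boolean"},
}

# Types requested by the custom mapping.
proposed = {
    "id": {"type": "integer"},
    "name": {"type": "keyword"},
    "birthday": {"type": "text"},
    "salary": {"type": "double"},
    "sex": {"type": "boolean"},
}

conflicts = find_type_conflicts(existing, proposed)
for field, (current, requested) in conflicts.items():
    print("%s: current_type [%s], requested [%s]" % (field, current, requested))
```

Only sex keeps its type here, so a request like this can only succeed against a fresh index.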

Delete the crm index with DELETE /crm, recreate it with PUT /crm, and then run the custom mapping request above again.

{
  "crm": {
    "mappings": {
      "user": {
        "_all": {
          "enabled": false
        },
        "dynamic_templates": [
          {
            "string_as_text": {
              "match": "*_text",
              "match_mapping_type": "string",
              "mapping": {
                "analyzer": "ik_max_word",
                "fields": {
                  "raw": {
                    "ignore_above": 256,
                    "type": "keyword"
                  }
                },
                "search_analyzer": "ik_max_word",
                "type": "text"
              }
            }
          },
          {
            "string_as_keyword": {
              "match_mapping_type": "string",
              "mapping": {
                "type": "keyword"
              }
            }
          }
        ],
        "properties": {
          "birthday": {
            "type": "text"
          },
          "id": {
            "type": "integer"
          },
          "name": {
            "type": "keyword"
          },
          "salary": {
            "type": "double"
          },
          "sex": {
            "type": "boolean"
          }
        }
      }
    }
  }
}

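As we can see, the custom mapping has been applied successfully. The _all and dynamic_templates entries in the response match the global template described in section 5 and were presumably inherited from it.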

② Configuring the mappings of several types at once (recommended)

PUT {indexName}
{
  "mappings": {
    "user": {
      "properties": {
        "id": {
          "type": "integer"
        },
        "info": {
          "type": "text",
          "analyzer": "ik_smart",
          "search_analyzer": "ik_smart"
        }
      }
    },
    "dept": {
      "properties": {
        "id": {
          "type": "integer"
        },
        ....more field mappings
      }
    }
  }
}

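5 Global mapping

Global mappings can be implemented in two ways: dynamic templates and default settings.

Default settings: _default_

All type mappings in an index inherit the configuration of _default_, for example: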

PUT {indexName}
{
  "mappings": {
    "_default_": { 
      "_all": {
        "enabled": false
      }
    },
    "user": {}, 
    "dept": { 
      "_all": {
        "enabled": true
      }
    }
  }
}

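In the example above, both user and dept inherit the _default_ configuration. Documents of type user will not have their fields merged into _all, while dept overrides the default and will.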
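
The inheritance described above behaves like a recursive dictionary merge in which a type's own settings win over _default_. The Python sketch below illustrates that reading; deep_merge and resolve_type_mappings are hypothetical helpers, not ES code, and the merge rules are an assumption that fits this example.

```python
import copy


def deep_merge(base, override):
    """Merge override into a copy of base; values in override take precedence."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_type_mappings(mappings):
    """Apply _default_ to every concrete type in a "mappings" section."""
    defaults = mappings.get("_default_", {})
    return {
        name: deep_merge(defaults, config)
        for name, config in mappings.items()
        if name != "_default_"
    }


# The "mappings" section from the request above.
mappings = {
    "_default_": {"_all": {"enabled": False}},
    "user": {},
    "dept": {"_all": {"enabled": True}},
}

resolved = resolve_type_mappings(mappings)
print(resolved)
```

user ends up with _all disabled because it defines nothing of its own, while dept keeps its explicit override.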

Dynamic templates: dynamic_templates

Note: by default ES maps a string field to the text type (using the standard analyzer) plus a keyword sub-field, for example:

"name": {
     "type": "text",
     "fields": {
         "keyword": {
             "type": "keyword",
             "ignore_above": 256
         }
     }
}

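In real applications, only a few of an object's properties need full-text search and most strings do not need to be analyzed, so a global template is used to override the built-in default: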

PUT _template/global_template  //create a template named global_template
{
  "template":   "*",  //match all indices
  "settings": { "number_of_shards": 1 }, //matched indices get only 1 primary shard
  "mappings": {
    "_default_": {
      "_all": {
        "enabled": false //disable the _all field for all types
      },
      "dynamic_templates": [
        {
          "string_as_text": {
            "match_mapping_type": "string", //match values of type string
            "match":   "*_text", //match field names ending in _text
            "mapping": {
              "type": "text", //map matching string fields to text
              "analyzer": "ik_max_word",
              "search_analyzer": "ik_max_word",
              "fields": {
                "raw": {
                  "type":  "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        {
          "string_as_keyword": {
            "match_mapping_type": "string", //match values of type string
            "mapping": {
              "type": "keyword" //map matching string fields to keyword
            }
          }
        }
      ]
    }
  }
}

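Explanation: the example above defines two dynamic mapping templates, string_as_text and string_as_keyword.

When a field's type is actually being mapped, the following are tried in order:

① The field's own custom configuration

② Global dynamic_templates [string_as_text, string_as_keyword]

③ Index-level dynamic_templates [...]

④ ES's built-in string mapping; the first one that matches wins.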
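
The first-match-wins rule for the template stage can be sketched in Python as follows. This covers only the dynamic_templates lookup, not explicit field configuration or the built-in fallback. pick_template is a hypothetical name, and using fnmatchcase for the match pattern is an approximation of ES's simple * wildcard.

```python
from fnmatch import fnmatchcase

# The two templates from global_template, in declaration order.
TEMPLATES = [
    {
        "string_as_text": {
            "match_mapping_type": "string",
            "match": "*_text",
            "mapping": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word",
                "fields": {"raw": {"type": "keyword", "ignore_above": 256}},
            },
        }
    },
    {
        "string_as_keyword": {
            "match_mapping_type": "string",
            "mapping": {"type": "keyword"},
        }
    },
]


def pick_template(field_name, json_type, templates):
    """Return (template_name, mapping) of the first matching template, or None."""
    for entry in templates:
        for name, rule in entry.items():
            # A rule without match_mapping_type accepts any JSON type.
            if rule.get("match_mapping_type", json_type) != json_type:
                continue
            # A rule without "match" accepts any field name.
            if not fnmatchcase(field_name, rule.get("match", "*")):
                continue
            return name, rule["mapping"]
    return None


for field in ("title_text", "name"):
    name, mapping = pick_template(field, "string", TEMPLATES)
    print(field, "->", name, mapping["type"])
```

Because string_as_text is listed first, a field such as title_text never reaches the catch-all string_as_keyword rule; swapping the two entries would map every string to keyword.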

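Note: an index inherits the dynamic_templates that are current at the time it is created. Modifying the dynamic templates after an index has been created does not affect existing indices.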

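6 Best practices

Mapping configuration affects how data is subsequently indexed, so a real project should follow this order:

① Configure the global dynamic template mapping (overriding the default string mapping)

② Configure field mappings (since basic types are mainly used for filtering and plain queries, field mappings mostly need to be configured for fields that require full-text search)

③ Create, update, and delete documents

④ Search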