Learning Elasticsearch (2): Document

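Common terms

Document: a piece of data that a user stores in ES. It is comparable to a row in a database table and is the smallest unit of data in ES.

Index: a collection of documents that share the same fields. It is comparable to a table in a database.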

Document

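A Document is a JSON object. Each field can be given a type, and every Document also carries metadata.

Field types

Overview of field types
Category	Subcategory	Types
Core types	String	text, keyword
Core types	Integer	integer, long, short, byte
Core types	Floating point	double, float, half_float, scaled_float
Core types	Boolean	boolean
Core types	Date	date
Core types	Range	range
Core types	Binary	binary
Complex types	Array	array
Complex types	Object	object
Complex types	Nested	nested
Geo types	Geo point	geo_point
Geo types	Geo shape	geo_shape
Specialized types	IP	ip
Specialized types	Completion	completion
Specialized types	Token count	token_count
Specialized types	Attachment	attachment
Specialized types	Percolator	percolator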

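text type

Use the text type for fields that need full-text search, such as the body of an email or a product description. The content of a text field is analyzed: before the inverted index is built, an analyzer splits the string into individual terms. text fields are not used for sorting and are seldom used for aggregations (the significant terms aggregation being a notable exception).

The following Mapping sets the full_name field to the text type: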

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "full_name": {
          "type":  "text"
        }
      }
    }
  }
}

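keyword type

The keyword type is for indexing structured content such as email addresses, hostnames, status codes, and tags. Use it for fields that need filtering (for example, finding all blog posts whose status is published), sorting, or aggregations. A keyword field can only be found by its exact value.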

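Numeric types

Elasticsearch supports the following numeric types.

Type	Range
long	-2^63 to 2^63-1
integer	-2^31 to 2^31-1
short	-32,768 to 32,767
byte	-128 to 127
double	64-bit double-precision IEEE 754 floating point
float	32-bit single-precision IEEE 754 floating point
half_float	16-bit half-precision IEEE 754 floating point
scaled_float	Floating point backed by a long and a fixed scaling factor (for example, a price only needs to be accurate to the cent, so with a scaling factor of 100 a price of 57.34 is stored as 5734)

For double, float, and half_float, -0.0 and +0.0 are different values. A term query for -0.0 does not match +0.0; likewise, a range query whose upper bound is -0.0 does not match +0.0, and one whose lower bound is +0.0 does not match -0.0.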

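Points to keep in mind when choosing a numeric type:

Pick the smallest type that satisfies the requirement. If a field never exceeds 100, for instance, byte is enough. The longest verified human lifespan is 122 years, so short is more than enough for an age field. The smaller the type, the more efficient indexing and searching are.
For floating-point values, prefer scaled_float with a scaling factor where it fits.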
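
The scaling idea behind scaled_float can be illustrated with a few lines of Python. This is only a sketch of the arithmetic, not Elasticsearch's implementation; the helper names to_scaled and from_scaled are made up for this example.

```python
def to_scaled(value: float, scaling_factor: int) -> int:
    """Encode a float as an integer by multiplying and rounding to the nearest integer."""
    return round(value * scaling_factor)


def from_scaled(stored: int, scaling_factor: int) -> float:
    """Decode the stored integer back into a float."""
    return stored / scaling_factor


stored = to_scaled(57.34, 100)
print(stored)                    # 5734
print(from_scaled(stored, 100))  # 57.34
```

Storing an integer instead of a float is what makes the type compact; the trade-off is that any precision finer than 1/scaling_factor is lost.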

Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_of_bytes": {
          "type": "integer"
        },
        "time_in_seconds": {
          "type": "float"
        },
        "price": {
          "type": "scaled_float",
          "scaling_factor": 100
        }
      }
    }
  }
}

object type

JSON is hierarchical by nature, so a document may contain inner objects:

PUT my_index/my_type/1
{ 
  "region": "US",
  "manager": { 
    "age":     30,
    "name": { 
      "first": "John",
      "last":  "Smith"
    }
  }
}

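The document above is a JSON object that contains a manager object, which in turn contains a name object. Internally, the document is indexed as a flat list of key-value pairs:

{
  "region":             "US",
  "manager.age":        30,
  "manager.name.first": "John",
  "manager.name.last":  "Smith"
}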

The Mapping for this document structure is:

PUT my_index
{
  "mappings": {
    "my_type": { 
      "properties": {
        "region": {
          "type": "keyword"
        },
        "manager": { 
          "properties": {
            "age":  { "type": "integer" },
            "name": { 
              "properties": {
                "first": { "type": "text" },
                "last":  { "type": "text" }
              }
            }
          }
        }
      }
    }
  }
}
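
To make the flattening of inner objects into dotted keys more concrete, here is a small Python sketch. It only mimics the naming scheme described in this section; the flatten function is a hypothetical helper, not part of any Elasticsearch client.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into a single dict with dot-separated keys."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into inner objects, carrying the path built so far.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


doc = {
    "region": "US",
    "manager": {"age": 30, "name": {"first": "John", "last": "Smith"}},
}
print(flatten(doc))
# {'region': 'US', 'manager.age': 30, 'manager.name.first': 'John', 'manager.name.last': 'Smith'}
```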

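date type

JSON has no date type, so in Elasticsearch a date can be any of the following:

A string containing a formatted date, e.g. "2015-01-01" or "2015/01/01 12:10:30"
A long number representing milliseconds-since-the-epoch
An integer representing seconds-since-the-epoch

Date formats can be customized. If no format is specified, the default is: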

"strict_date_optional_time||epoch_millis"

Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": {
          "type": "date" 
        }
      }
    }
  }
}
 
PUT my_index/my_type/1
{ "date": "2015-01-01" } 
 
PUT my_index/my_type/2
{ "date": "2015-01-01T12:10:30Z" } 
 
PUT my_index/my_type/3
{ "date": 1420070400001 } 
 
GET my_index/_search
{
  "sort": { "date": "asc"} 
}

The three documents, each using a different date representation:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "date": "2015-01-01T12:10:30Z"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "date": "2015-01-01"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "date": 1420070400001
        }
      }
    ]
  }
}
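
The three representations used in this section all denote a point in time that can be expressed as milliseconds since the epoch. The Python sketch below normalizes them for comparison. It is an illustration only: to_epoch_millis is a made-up helper, and it assumes that a string without a time zone is in UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(value) -> int:
    """Normalize an ISO date string or an epoch-millis integer to epoch millis."""
    if isinstance(value, int):
        return value  # already milliseconds since the epoch
    # Replace a trailing "Z" so that older Python versions can parse the string.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: no zone means UTC
    return (parsed - EPOCH) // timedelta(milliseconds=1)


for value in ["2015-01-01", "2015-01-01T12:10:30Z", 1420070400001]:
    print(value, "->", to_epoch_millis(value))
```

Once every value is a plain integer, sorting or range comparison on dates reduces to integer comparison.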

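Array type

Elasticsearch has no dedicated array type. By default, any field can hold one or more values, but all values in an array must be of the same type. For example:

An array of strings: [ "one", "two" ]
An array of integers: [ 1, 3 ]
A nested array: [ 1, [ 2, 3 ] ], which is equivalent to [ 1, 2, 3 ]
An array of objects: [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 } ]

Notes:

When a field is added dynamically, the first value in the array determines the type of the whole array.
Arrays with mixed types are not supported, for example [ 1, "abc" ].
An array may contain null values. An empty array [ ] is treated as a missing field.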

binary type

The binary type accepts a base64-encoded string. By default the field is neither stored nor searchable.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "text"
        },
        "blob": {
          "type": "binary"
        }
      }
    }
  }
}
 
PUT my_index/my_type/1
{
  "name": "Some binary blob",
  "blob": "U29tZSBiaW5hcnkgYmxvYg==" 
}

Searching the blob field:

GET my_index/_search
{
  "query": {
    "match": {
      "blob": "test" 
    }
  }
}
 
The search is rejected with an error:
{
  "error": {
    "root_cause": [
      {
        "type": "query_shard_exception",
        "reason": "Binary fields do not support searching",
        "index_uuid": "fgA7UM5XSS-56JO4F4fYug",
        "index": "my_index"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "my_index",
        "node": "3dQd1RRVTMiKdTckM68nPQ",
        "reason": {
          "type": "query_shard_exception",
          "reason": "Binary fields do not support searching",
          "index_uuid": "fgA7UM5XSS-56JO4F4fYug",
          "index": "my_index"
        }
      }
    ]
  },
  "status": 400
}
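
The value stored in the blob field above is ordinary base64. As a quick illustration, Python's standard base64 module reproduces it from the raw bytes and decodes it back:

```python
import base64

# Encode the raw bytes to the base64 string used in the index request.
encoded = base64.b64encode(b"Some binary blob").decode("ascii")
print(encoded)  # U29tZSBiaW5hcnkgYmxvYg==

# Decoding gives back the original bytes.
decoded = base64.b64decode(encoded)
print(decoded)  # b'Some binary blob'
```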

ip type

An ip field stores an IPv4 or IPv6 address.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "ip_addr": {
          "type": "ip"
        }
      }
    }
  }
}
 
PUT my_index/my_type/1
{
  "ip_addr": "192.168.1.1"
}
 
GET my_index/_search
{
  "query": {
    "term": {
      "ip_addr": "192.168.0.0/16"
    }
  }
}
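
The term query above uses CIDR notation, so it matches any address inside the 192.168.0.0/16 network. The same membership test can be sketched with Python's standard ipaddress module; in_cidr is a made-up helper for this illustration and says nothing about how Elasticsearch evaluates the query internally.

```python
import ipaddress


def in_cidr(address: str, cidr: str) -> bool:
    """Return True if the address belongs to the network given in CIDR notation."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


print(in_cidr("192.168.1.1", "192.168.0.0/16"))  # True
print(in_cidr("10.0.0.1", "192.168.0.0/16"))     # False
```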

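range type

The following range types are supported:

Type	Range
integer_range	-2^31 to 2^31-1
float_range	32-bit IEEE 754
long_range	-2^63 to 2^63-1
double_range	64-bit IEEE 754
date_range	64-bit integer, in milliseconds

Typical use cases for range types include a time-range picker or an age-range selector in a front-end form.

Example: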

PUT range_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "expected_attendees": {
          "type": "integer_range"
        },
        "time_frame": {
          "type": "date_range", 
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
      }
    }
  }
}
 
PUT range_index/my_type/1
{
  "expected_attendees" : { 
    "gte" : 10,
    "lte" : 20
  },
  "time_frame" : { 
    "gte" : "2015-10-31 12:00:00", 
    "lte" : "2015-11-01"
  }
}

The requests above create the range_index index and add a document whose expected_attendees is 10 to 20 and whose time_frame runs from 2015-10-31 12:00:00 to 2015-11-01.

Query:

POST range_index/_search
{
  "query" : {
    "range" : {
      "time_frame" : { 
        "gte" : "2015-08-01",
        "lte" : "2015-12-01",
        "relation" : "within" 
      }
    }
  }
}

Result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "range_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "expected_attendees": {
            "gte": 10,
            "lte": 20
          },
          "time_frame": {
            "gte": "2015-10-31 12:00:00",
            "lte": "2015-11-01"
          }
        }
      }
    ]
  }
}
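
The relation parameter in the query above controls how the stored range is compared with the query range. The Python sketch below shows the three relations (within, contains, intersects) on closed intervals. It is a simplification for illustration: range_matches is a made-up function, and details such as date rounding are ignored.

```python
from datetime import datetime


def range_matches(doc_gte, doc_lte, query_gte, query_lte, relation="intersects"):
    """Compare a stored closed range with a query range under the given relation."""
    if relation == "within":      # stored range lies entirely inside the query range
        return query_gte <= doc_gte and doc_lte <= query_lte
    if relation == "contains":    # stored range entirely covers the query range
        return doc_gte <= query_gte and query_lte <= doc_lte
    if relation == "intersects":  # the two ranges overlap somewhere
        return doc_gte <= query_lte and query_gte <= doc_lte
    raise ValueError(f"unknown relation: {relation}")


doc = (datetime(2015, 10, 31, 12, 0, 0), datetime(2015, 11, 1))
query = (datetime(2015, 8, 1), datetime(2015, 12, 1))

print(range_matches(*doc, *query, relation="within"))      # True
print(range_matches(*doc, *query, relation="contains"))    # False
print(range_matches(*doc, *query, relation="intersects"))  # True
```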

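nested type

The nested type is a specialized version of the object type that allows each object in an array to be indexed and queried independently. The plain object type can give surprising results here. Consider the document my_index/my_type/1: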

PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [ 
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

The user field is dynamically added as an object field, and the document is flattened internally into the following form:

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

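user.first and user.last are flattened into multi-value fields, so the association between Alice and White is lost. As a result, the document incorrectly matches the following query, even though no user named Alice Smith exists: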

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}
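
Why the flattened form produces a false match can be shown with a short Python sketch. Both functions are hypothetical models written for this illustration: flattened_match treats each field as an independent bag of values, while nested_match checks each inner object on its own.

```python
users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]


def flattened_match(users, first, last):
    """Model of the object type: each field is a separate set of values."""
    firsts = {u["first"] for u in users}
    lasts = {u["last"] for u in users}
    return first in firsts and last in lasts


def nested_match(users, first, last):
    """Model of the nested type: both conditions must hold within one object."""
    return any(u["first"] == first and u["last"] == last for u in users)


print(flattened_match(users, "Alice", "Smith"))  # True  (false positive)
print(nested_match(users, "Alice", "Smith"))     # False
print(nested_match(users, "Alice", "White"))     # True
```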

Using the nested field type avoids this shortcoming of the object type:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "user": {
          "type": "nested" 
        }
      }
    }
  }
}
 
PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}
 
GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }} 
          ]
        }
      }
    }
  }
}
 
GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "White" }} 
          ]
        }
      },
      "inner_hits": { 
        "highlight": {
          "fields": {
            "user.first": {}
          }
        }
      }
    }
  }
}

token_count type

A token_count field counts the number of tokens in a string (not how often a term occurs):

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "name": { 
          "type": "text",
          "fields": {
            "length": { 
              "type":     "token_count",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
 
PUT my_index/my_type/1
{ "name": "John Smith" }
 
PUT my_index/my_type/2
{ "name": "Rachel Alice Williams" }
 
GET my_index/_search
{
  "query": {
    "term": {
      "name.length": 3 
    }
  }
}
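
What name.length holds for the two documents above can be approximated in Python. The regular expression below is only a rough stand-in for the standard analyzer, which follows more elaborate tokenization rules; token_count here is a made-up helper.

```python
import re


def token_count(text: str) -> int:
    """Count word-like tokens; a rough stand-in for the standard analyzer."""
    return len(re.findall(r"\w+", text))


print(token_count("John Smith"))             # 2
print(token_count("Rachel Alice Williams"))  # 3
```

Under this approximation only the second document has three tokens, which is what the term query on name.length looks for.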

geo_point type

A geo_point field stores the latitude and longitude of a location:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}
 
PUT my_index/my_type/1
{
  "text": "Geo-point as an object",
  "location": { 
    "lat": 41.12,
    "lon": -71.34
  }
}
 
PUT my_index/my_type/2
{
  "text": "Geo-point as a string",
  "location": "41.12,-71.34" 
}
 
PUT my_index/my_type/3
{
  "text": "Geo-point as a geohash",
  "location": "drm3btev3e86" 
}
 
PUT my_index/my_type/4
{
  "text": "Geo-point as an array",
  "location": [ -71.34, 41.12 ] 
}
 
GET my_index/_search
{
  "query": {
    "geo_bounding_box": { 
      "location": {
        "top_left": {
          "lat": 42,
          "lon": -72
        },
        "bottom_right": {
          "lat": 40,
          "lon": -74
        }
      }
    }
  }
}
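
The four documents above describe the same location in different formats. One detail that is easy to miss is the order of the coordinates: the string form is "lat,lon", while the array form is [ lon, lat ]. The Python sketch below normalizes three of the formats to a (lat, lon) tuple. parse_geo_point is a made-up helper for illustration and does not handle geohash strings.

```python
def parse_geo_point(value):
    """Return (lat, lon) for an object, a "lat,lon" string, or a [lon, lat] array."""
    if isinstance(value, dict):
        return float(value["lat"]), float(value["lon"])
    if isinstance(value, str):
        lat, lon = value.split(",")  # string form: latitude first
        return float(lat), float(lon)
    if isinstance(value, (list, tuple)):
        lon, lat = value             # array form: longitude first
        return float(lat), float(lon)
    raise TypeError(f"unsupported geo point format: {value!r}")


print(parse_geo_point({"lat": 41.12, "lon": -71.34}))  # (41.12, -71.34)
print(parse_geo_point("41.12,-71.34"))                 # (41.12, -71.34)
print(parse_geo_point([-71.34, 41.12]))                # (41.12, -71.34)
```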

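Meta-fields (metadata)

_all (disabled by default)

The _all field is a catch-all field that concatenates the values of all other fields into one string, separated by spaces. It is analyzed and indexed but not stored. It is useful when you want to find documents containing a keyword without specifying which field to search. Because it analyzes the content of every field a second time, it takes up considerable space, so it is now disabled by default.

Example: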

PUT my_index/blog/1 
{
  "title":    "Master Java",
  "content":     "learn java",
  "author": "Tom"
}

The _all field contains the terms: [ "master", "java", "learn", "tom" ]

_field_names

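The _field_names field indexes the names of every field in a document that contains a non-null value. It is commonly used by the exists query. For example: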

PUT my_index/my_type/1
{
  "title": "This is a document"
}
 
PUT my_index/my_type/2?refresh=true
{
  "title": "This is another document",
  "body": "This document has a body"
}
 
GET my_index/_search
{
  "query": {
    "terms": {
      "_field_names": [ "body" ] 
    }
  }
}

Only the second document is returned, because the first document has no body field. The same result can be obtained with an exists query:

GET my_index/_search
{
    "query": {
        "exists" : { "field" : "body" }
    }
}

_id

Every indexed document has a _type and an _id. The _id field can be used in term, terms, match, query_string, and simple_query_string queries, but not in aggregations, scripts, or sorting. For example:

PUT my_index/my_type/1
{
  "text": "Document with ID 1"
}
 
PUT my_index/my_type/2
{
  "text": "Document with ID 2"
}
 
GET my_index/_search
{
  "query": {
    "terms": {
      "_id": [ "1", "2" ] 
    }
  }
}

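_index

When searching across multiple indices, it is sometimes useful to restrict a query to documents from specific indices. The _index field makes this possible: the index name can be used in term and terms queries, aggregations, scripts, and sorting.

_index is a virtual field and is not actually added to the Lucene index. It supports term and terms queries (as well as queries that are rewritten to a term query, such as match, query_string, or simple_query_string), but not prefix, wildcard, regexp, or fuzzy queries.

For example, with two documents in two indices: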

 
PUT index_1/my_type/1
{
  "text": "Document in index 1"
}
 
PUT index_2/my_type/2
{
  "text": "Document in index 2"
}

The following request queries, aggregates, and sorts on the index name, and uses a script to return it as an extra field:

GET index_1,index_2/_search
{
  "query": {
    "terms": {
      "_index": ["index_1", "index_2"] 
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "_index", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_index": { 
        "order": "asc"
      }
    }
  ],
  "script_fields": {
    "index_name": {
      "script": {
        "lang": "painless",
        "inline": "doc['_index']" 
      }
    }
  }
}

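_parent

_parent establishes a parent-child relationship between documents in the same index. The example below first declares the relationship in the mapping, then indexes a parent document, indexes child documents with the parent's id, and finally finds the parent by querying its children.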

PUT my_index
{
  "mappings": {
    "my_parent": {},
    "my_child": {
      "_parent": {
        "type": "my_parent" 
      }
    }
  }
}
 
 
PUT my_index/my_parent/1 
{
  "text": "This is a parent document"
}
 
PUT my_index/my_child/2?parent=1 
{
  "text": "This is a child document"
}
 
PUT my_index/my_child/3?parent=1&refresh=true 
{
  "text": "This is another child document"
}
 
 
GET my_index/my_parent/_search
{
  "query": {
    "has_child": { 
      "type": "my_child",
      "query": {
        "match": {
          "text": "child document"
        }
      }
    }
  }
}

_routing

The routing value decides which shard a document is stored on. Elasticsearch computes the shard with the following formula:

shard_num = hash(_routing) % num_primary_shards
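
The formula above can be sketched in Python as follows. Note the assumptions: shard_for is a made-up helper, and CRC32 is used purely as a stand-in hash, not the hash function Elasticsearch actually applies, so the shard numbers it prints will not match a real cluster.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number: hash(routing) % num_primary_shards."""
    # Stand-in hash (assumption): CRC32 of the UTF-8 bytes of the routing value.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# The same routing value always lands on the same shard.
print(shard_for("user1", 5) == shard_for("user1", 5))  # True
print(0 <= shard_for("user2", 5) < 5)                  # True
```

Because the result depends on num_primary_shards, changing the number of primary shards would send existing routing values to different shards.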

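By default, the _routing value is the document's _id or, if it has one, its _parent id. A custom value can be supplied through the routing parameter. For example, to keep all blog posts published by user1 on the same shard, specify routing when indexing and use the same value when retrieving the document: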

PUT my_index/my_type/1?routing=user1&refresh=true 
{
  "title": "This is a document"
}
 
GET my_index/my_type/1?routing=user1

Searches can use the routing value as well, either by querying the _routing field or by passing the routing parameter:

GET my_index/_search
{
  "query": {
    "terms": {
      "_routing": [ "user1" ] 
    }
  }
}
 
GET my_index/_search?routing=user1,user2 
{
  "query": {
    "match": {
      "title": "document"
    }
  }
}

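Routing can be made mandatory in the Mapping. The index request below omits the routing value and is therefore rejected with a routing_missing_exception: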

PUT my_index2
{
  "mappings": {
    "my_type": {
      "_routing": {
        "required": true 
      }
    }
  }
}
 
PUT my_index2/my_type/1 
{
  "text": "No routing value provided"
}

_source

The _source field holds the original JSON body of the document. It is enabled by default but can be disabled:

PUT tweets
{
  "mappings": {
    "tweet": {
      "_source": {
        "enabled": false
      }
    }
  }
}

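In general, do not disable it unless you are sure you will never need the following:

The update, update_by_query, and reindex APIs
Highlighting
Backing up data, changing the mapping, or upgrading the index
Debugging queries or aggregations by inspecting the original document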

_type

Every indexed document has a _type and an _id. The _type field can be used in queries, aggregations, scripts, and sorting. For example:

PUT my_index/type_1/1
{
  "text": "Document with type 1"
}
 
PUT my_index/type_2/2?refresh=true
{
  "text": "Document with type 2"
}
 
GET my_index/_search
{
  "query": {
    "terms": {
      "_type": [ "type_1", "type_2" ] 
    }
  },
  "aggs": {
    "types": {
      "terms": {
        "field": "_type", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_type": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "type": {
      "script": {
        "lang": "painless",
        "inline": "doc['_type']" 
      }
    }
  }
}

_uid

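_uid is the combination of _type and _id, in the form {type}#{id}. Like _type, it can be used in queries, aggregations, scripts, and sorting. Later versions of Elasticsearch removed types, so there _uid is equivalent to _id.

For example: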

PUT my_index/my_type/1
{
  "text": "Document with ID 1"
}
 
PUT my_index/my_type/2?refresh=true
{
  "text": "Document with ID 2"
}
 
GET my_index/_search
{
  "query": {
    "terms": {
      "_uid": [ "my_type#1", "my_type#2" ] 
    }
  },
  "aggs": {
    "UIDs": {
      "terms": {
        "field": "_uid", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_uid": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "UID": {
      "script": {
         "lang": "painless",
         "inline": "doc['_uid']" 
      }
    }
  }
}

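Summary

This article covered the data types a Document can contain and the metadata used to describe a document. Together they play a role similar to a table schema in a database: they define the format in which data is stored.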

微笑看你哭
