Elasticsearch Basics: Common Operations with curl Commands and Their Python Equivalents

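Preface

Elasticsearch exposes a set of RESTful APIs for storing, retrieving, and querying data. We can call them with tools such as curl, but the command line is not always convenient, so alongside each curl command this post also shows how to do the same thing from Python.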

Connecting to Elasticsearch from Python

Python talks to Elasticsearch through a library of the same name, which is very easy to install:

pip3 install elasticsearch

The official English documentation is at: URL

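Index operations

Creating an index

If all we need is an index that does not require tokenization, there is no need to specify a mapping. For example, let's create an index named test. (What a mapping is gets explained below.)

curl command

In a terminal, run: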

curl -X PUT 'localhost:9200/test'

Python

from elasticsearch import Elasticsearch
 
es = Elasticsearch()
result = es.indices.create(index='test', ignore=400)
print(result)

If the index is created successfully, the following result is returned:

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'test'}

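Note, however, that if an index named test already exists in Elasticsearch, the request fails with status code 400, which raises an exception and stops the program.

That is why the ignore parameter is worth using: it lets us tolerate such expected failures so the program keeps running instead of being interrupted. In the Python code above, ignore=400 ignores the error caused by creating an index that already exists.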
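As an alternative to ignore=400, the index can be checked for existence before it is created. The following is only a sketch: ensure_index is a hypothetical helper name, and es is assumed to be an elasticsearch.Elasticsearch client of the same generation used in this post, exposing indices.exists() and indices.create().

```python
def ensure_index(es, name):
    """Create index `name` only if it does not exist yet.

    Returns True when the index was created and False when it already
    existed. `es` is assumed to be an elasticsearch.Elasticsearch client.
    """
    if es.indices.exists(index=name):
        return False
    es.indices.create(index=name)
    return True


# Usage (requires a running Elasticsearch node):
#   from elasticsearch import Elasticsearch
#   created = ensure_index(Elasticsearch(), 'test')
```

Unlike ignore=400, this does not hide other causes of a 400 response, such as an invalid index name. The trade-off is a small race window between the check and the create call.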

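What is a mapping?

Notice that although the index has been created, we never specified its structure, that is, which fields it has.

In Elasticsearch, a mapping acts as a constraint on the data. When creating an index we can declare a type for it: the mapping states which fields the type has, what data type each field holds, and which analyzer is used for each field.

For example:

curl command

Using curl, first create an index named accounts that contains a type named person. person has three fields: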

$ curl -X PUT 'localhost:9200/accounts' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "person": {
      "properties": {
        "user": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "desc": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}'

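The three fields are:

user
title
desc

All three hold Chinese text and have the type text, so a Chinese analyzer must be specified instead of the default English analyzer.
Elastic calls its tokenizers analyzers, and we specify one for each field.

Taking the user field as an example: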

"user": {
  "type": "text",
  "analyzer": "ik_max_word",
  "search_analyzer": "ik_max_word"
}

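In the code above, analyzer is the analyzer applied to the field's text, and search_analyzer is the analyzer applied to the search terms. The ik_max_word analyzer is provided by the ik plugin and splits text into the largest possible number of tokens.

Besides ik_max_word, ik also provides ik_smart.
When indexing, we want sentences split as finely as possible so that they are easier to find, so ik_max_word is mostly used at index time. At search time, however, we usually want more precise results for the query the user typed. For example, when searching for "无花果" (fig), we want it treated as a single word rather than recalled as the three separate characters "无", "花", and "果". ik_smart is therefore more commonly used to analyze the search input.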
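The split described here (ik_max_word at index time, ik_smart at search time) can be written directly into a mapping. The following is a minimal sketch that only builds the request body: text_field and build_mapping are hypothetical helper names, the ik plugin is assumed to be installed, and the layout with a type name under mappings assumes an Elasticsearch version that still has mapping types, as in the rest of this post.

```python
import json


def text_field(analyzer='ik_max_word', search_analyzer='ik_smart'):
    """Field definition: fine-grained tokens when indexing, coarser ones when searching."""
    return {
        'type': 'text',
        'analyzer': analyzer,
        'search_analyzer': search_analyzer,
    }


def build_mapping(doc_type, field_names):
    """Build a create-index body with one text field per name under `doc_type`."""
    properties = {name: text_field() for name in field_names}
    return {'mappings': {doc_type: {'properties': properties}}}


body = build_mapping('person', ['user', 'title', 'desc'])
print(json.dumps(body, indent=2))
```

The resulting dictionary has the same shape as the JSON body of the PUT request for accounts, so it could be sent with curl or passed as body to es.indices.create().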

For how to install the ik analyzer, see my earlier blog post.

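Python

In Python, just define the mapping as a dictionary and pass it in as the body of the create request.

from elasticsearch import Elasticsearch

# Field definitions: title is analyzed with the ik_max_word analyzer
mapping = {
    'properties': {
        'title': {
            'type': 'text',
            'analyzer': 'ik_max_word',
            'search_analyzer': 'ik_max_word'
        }
    }
}

es = Elasticsearch()
# The create-index body nests the field definitions under "mappings" and a
# type name; 'person' is used here only to mirror the curl example above.
body = {'mappings': {'person': mapping}}
result = es.indices.create(index='test', body=body, ignore=400)
print(result)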

Deleting an index

curl command

The curl command is simple: send a DELETE request with the index name.

curl -X DELETE 'localhost:9200/weather'

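Python

from elasticsearch import Elasticsearch

es = Elasticsearch()


def delete_index(index_name):
    """Delete the index named index_name and report the outcome."""
    try:
        es.indices.delete(index=index_name)
        print('Delete index [%s] successful!' % index_name)
    except Exception as e:
        print('Delete failed: ', e)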

Listing all indices

curl command:

curl -X GET 'http://localhost:9200/_cat/indices?v'

The following command lists the types contained in each index:

 curl 'localhost:9200/_mapping?pretty=true'

Python
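The Python client exposes the same two endpoints through es.cat.indices() and es.indices.get_mapping(). The following is a sketch under assumptions: es is an elasticsearch.Elasticsearch client, the helper names are made up for illustration, and the shape of the mapping response assumes an Elasticsearch version that still has mapping types.

```python
def list_index_names(es):
    """Return the names of all indices, via the _cat/indices endpoint."""
    rows = es.cat.indices(format='json')  # one dict per index
    return sorted(row['index'] for row in rows)


def types_per_index(es):
    """Return {index name: [type names]} from the _mapping endpoint."""
    mappings = es.indices.get_mapping()  # no index argument: all indices
    return {
        index: sorted(info['mappings'])
        for index, info in mappings.items()
    }


# Usage (requires a running Elasticsearch node):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch()
#   print(list_index_names(es))
#   print(types_per_index(es))
```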

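Document operations

Adding data

A record (document) can be added to an index either with an explicit Id or without one, in which case the system generates a random Id for it.

curl command:

Adding a record with a specified Id

Sending a PUT request to /Index/Type/Id adds a record with that Id to the index. For example, a request to /accounts/person/1 adds a person record with id = 1.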

$ curl -X PUT 'localhost:9200/accounts/person/1' -H 'Content-Type: application/json' -d '
{
  "user": "张三",
  "title": "工程师",
  "desc": "数据库管理"
}'

The JSON object returned by the server contains the Index, Type, Id, Version, and other information.

{
  "_index":"accounts",
  "_type":"person",
  "_id":"1",
  "_version":1,
  "result":"created",
  "_shards":{"total":2,"successful":1,"failed":0},
  "created":true
}

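If you look closely, the request path is /accounts/person/1, and the trailing 1 is the Id of this record. It does not have to be a number; any string (for example abc) will do.

Adding a record without specifying an Id

When adding a record you can also leave out the Id, in which case the request must be changed to POST.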

$ curl -X POST 'localhost:9200/accounts/person' -H 'Content-Type: application/json' -d '
{
  "user": "李四",
  "title": "工程师",
  "desc": "系统管理"
}'

In the code above, a POST request is sent to /accounts/person to add a record. This time the _id field in the JSON object returned by the server is a random string.

{
  "_index":"accounts",
  "_type":"person",
  "_id":"AV3qGfrC6jMbsbXb6k1p",
  "_version":1,
  "result":"created",
  "_shards":{"total":2,"successful":1,"failed":0},
  "created":true
}

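Note that if the Index (accounts in this example) has not been created first, running the command above does not produce an error; Elastic simply creates the specified Index. So type carefully and make sure the Index name is spelled correctly.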
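The rule of PUT with an Id versus POST without one can be sketched with the standard library alone. build_index_request is a hypothetical helper, and the base URL http://localhost:9200 is an assumption matching the curl examples. The request is only constructed here, not sent.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_index_request(index, doc_type, doc, doc_id=None,
                        base_url='http://localhost:9200'):
    """Build the HTTP request that adds `doc` to /index/doc_type.

    With an Id the request is PUT /index/type/id; without one it is
    POST /index/type and Elasticsearch generates the Id.
    """
    path = '/%s/%s' % (quote(index), quote(doc_type))
    method = 'POST'
    if doc_id is not None:
        path += '/' + quote(str(doc_id), safe='')
        method = 'PUT'
    return Request(
        base_url + path,
        data=json.dumps(doc).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method=method,
    )


doc = {'user': '张三', 'title': '工程师', 'desc': '数据库管理'}
with_id = build_index_request('accounts', 'person', doc, doc_id=1)
without_id = build_index_request('accounts', 'person', doc)
print(with_id.get_method(), with_id.full_url)
print(without_id.get_method(), without_id.full_url)
```

Passing either object to urllib.request.urlopen() would send it to a running node.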

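Python

Adding a record with a specified Id

Like MongoDB, Elasticsearch lets you insert structured dictionary data directly. To insert data, call the create() method. For example, here we insert a news item: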

from elasticsearch import Elasticsearch
 
es = Elasticsearch()
es.indices.create(index='news', ignore=400)
 
data = {'title': '美国留给伊拉克的是个烂摊子吗', 'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm'}
result = es.create(index='news', doc_type='politics', id=1, body=data)
print(result)

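Here we first define a news item with a title and a link, then insert it by calling create(). We pass four arguments: index is the index name, doc_type is the document type, body is the document content, and id is the unique identifier of the document.

The output is: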

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}

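The result field is created, which means the document was inserted successfully.

Adding a record without specifying an Id

We can also insert data with the index() method. The difference is that create() requires an id to uniquely identify the document, whereas index() does not: if no id is given, one is generated automatically. The call looks like this: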

es.index(index='news', doc_type='politics', body=data)

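As this suggests, create() is essentially a wrapper around the index operation (in some client versions it literally calls index() internally). The difference is that create() fails if a document with the same id already exists.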

Deleting a record

To delete a record you must specify its Id.

curl command

$ curl -X DELETE 'localhost:9200/accounts/person/1'

Here 1 is the Id.

Python

from elasticsearch import Elasticsearch
 
es = Elasticsearch()
result = es.delete(index='news', doc_type='politics', id=1)
print(result)

The output is:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 3, 'result': 'deleted', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1}

The result field is deleted, which means the deletion succeeded, and _version has become 3, having increased by 1 again.
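Deleting an Id that does not exist returns a 404, which the client raises as an exception. One way to tolerate that is to pass ignore=404 and inspect the result field. This is a sketch only: delete_document is a hypothetical helper, and es is assumed to be an elasticsearch.Elasticsearch client of the same generation used above, which accepts the ignore parameter.

```python
def delete_document(es, index, doc_type, doc_id):
    """Delete one document; return True if it existed and was removed."""
    # ignore=404 makes the client return the response body instead of
    # raising when the document is missing.
    result = es.delete(index=index, doc_type=doc_type, id=doc_id, ignore=404)
    return result.get('result') == 'deleted'


# Usage (requires a running Elasticsearch node):
#   from elasticsearch import Elasticsearch
#   removed = delete_document(Elasticsearch(), 'news', 'politics', 1)
```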

Updating a record

curl command:

To update a record, send the data again with a PUT request.

curl -X PUT 'localhost:9200/accounts/person/1' -H 'Content-Type: application/json' -d '
{
    "user" : "张三",
    "title" : "工程师",
    "desc" : "数据库管理，软件开发"
}' 

The response:

{
  "_index":"accounts",
  "_type":"person",
  "_id":"1",
  "_version":2,
  "result":"updated",
  "_shards":{"total":2,"successful":1,"failed":0},
  "created":false
}

In the code above, we changed the original data from "数据库管理" to "数据库管理，软件开发". Several fields in the response changed.

"_version" : 2,
"result" : "updated",
"created" : false

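As you can see, the record's Id is unchanged, but the version went from 1 to 2, the operation type (result) went from created to updated, and the created field became false because this time no new record was created.

Python

Updating data is also simple. We again specify the document's id and content and call the update() method: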

from elasticsearch import Elasticsearch

es = Elasticsearch()
data = {
    'title': '美国留给伊拉克的是个烂摊子吗',
    'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
    'date': '2011-12-16'
}
# The update API expects the fields to change wrapped in a "doc" key
result = es.update(index='news', doc_type='politics', body={'doc': data}, id=1)
print(result)

Result:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 2, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 1, '_primary_term': 1}

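The result field in the response is updated, which means the update succeeded. Notice also the _version field, which is the version number after the update. 2 means this is the second version: the data was inserted once before, so that first insert was version 1 (see the output of the earlier example), and this update brings the version to 2. Each later update increases the version by 1.

The update can also be done with the index() method: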

es.index(index='news', doc_type='politics', body=data, id=1)

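As you can see, index() handles both cases for us: if the document does not exist it is inserted, and if it already exists it is overwritten with the new content, which is very convenient.

Querying records

The operations above are all simple ones that an ordinary database such as MongoDB can also do, so they may not look like much. What sets Elasticsearch apart is its exceptionally powerful search capability.

For Chinese text we need to install the ik analyzer; see the blog post mentioned above for the installation steps.

Once the ik analyzer is specified, we can run fuzzy full-text queries against a given field.

First, let's look at the command that returns all records:

Returning all records in an index

curl command

Using the GET method, a direct request to /Index/Type/_search returns all records.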

curl 'localhost:9200/accounts/person/_search'

{
  "took":2,
  "timed_out":false,
  "_shards":{"total":5,"successful":5,"failed":0},
  "hits":{
    "total":2,
    "max_score":1.0,
    "hits":[
      {
        "_index":"accounts",
        "_type":"person",
        "_id":"AV3qGfrC6jMbsbXb6k1p",
        "_score":1.0,
        "_source": {
          "user": "李四",
          "title": "工程师",
          "desc": "系统管理"
        }
      },
      {
        "_index":"accounts",
        "_type":"person",
        "_id":"1",
        "_score":1.0,
        "_source": {
          "user" : "张三",
          "title" : "工程师",
          "desc" : "数据库管理，软件开发"
        }
      }
    ]
  }
}

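In the response above, the took field is the time the operation took (in milliseconds), the timed_out field says whether it timed out, and the hits field holds the matching records. Its subfields mean the following.

total: the number of matching records, 2 in this example.
max_score: the highest relevance score, 1.0 in this example.
hits: an array of the returned records.

Each returned record has a _score field indicating how well it matches; by default, results are sorted by this field in descending order.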
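The fields just described can be read from the response with plain dictionary access. A minimal sketch: summarize_hits is a hypothetical helper, and the sample dictionary is a trimmed copy of the response shown above. The isinstance check is an extra, since newer Elasticsearch versions report total as an object with a value key rather than a plain number.

```python
def summarize_hits(response):
    """Return (total, rows), where rows are (_id, _score, _source) tuples."""
    hits = response['hits']
    total = hits['total']
    if isinstance(total, dict):  # newer versions wrap the count in an object
        total = total['value']
    rows = [(h['_id'], h['_score'], h['_source']) for h in hits['hits']]
    return total, rows


# Trimmed copy of the response shown above
response = {
    'took': 2,
    'timed_out': False,
    'hits': {
        'total': 2,
        'max_score': 1.0,
        'hits': [
            {'_id': 'AV3qGfrC6jMbsbXb6k1p', '_score': 1.0,
             '_source': {'user': '李四', 'title': '工程师', 'desc': '系统管理'}},
            {'_id': '1', '_score': 1.0,
             '_source': {'user': '张三', 'title': '工程师', 'desc': '数据库管理，软件开发'}},
        ],
    },
}

total, rows = summarize_hits(response)
for doc_id, score, source in rows:
    print(doc_id, score, source['user'])
```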

Python:

result = es.search(index='news', doc_type='politics')
print(result)

Full-text search with conditions

curl

$ curl 'localhost:9200/accounts/person/_search' -H 'Content-Type: application/json' -d '
{
  "query" : { "match" : { "desc" : "软件" }}
}'

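The code above uses a match query. Here accounts is the index name and person is the type within that index. The match condition is that the desc field contains the word "软件" (software).
The response is: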

{
  "took":3,
  "timed_out":false,
  "_shards":{"total":5,"successful":5,"failed":0},
  "hits":{
    "total":1,
    "max_score":0.28582606,
    "hits":[
      {
        "_index":"accounts",
        "_type":"person",
        "_id":"1",
        "_score":0.28582606,
        "_source": {
          "user" : "张三",
          "title" : "工程师",
          "desc" : "数据库管理，软件开发"
        }
      }
    ]
  }
}

By default Elastic returns 10 results at a time; the size field changes this.

$ curl 'localhost:9200/accounts/person/_search' -H 'Content-Type: application/json' -d '
{
  "query" : { "match" : { "desc" : "管理" }},
  "size": 1
}'

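The code above asks for only one result per request.

The from field specifies the offset of the first result to return (the default is 0).

$ curl 'localhost:9200/accounts/person/_search' -H 'Content-Type: application/json' -d '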
{
  "query" : { "match" : { "desc" : "管理" }},
  "from": 1,
  "size": 1
}'
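Together, from and size give simple pagination: page n starts at offset (n - 1) * size. A small sketch that only builds the query body; match_page is a hypothetical helper name, and the 1-based page numbering is a choice made for this example.

```python
def match_page(field, text, page, page_size=10):
    """Build a match query body for the given 1-based page number."""
    if page < 1:
        raise ValueError('page must be >= 1')
    return {
        'query': {'match': {field: text}},
        'from': (page - 1) * page_size,
        'size': page_size,
    }


# Second page with one result per page: the same body as the curl call above
body = match_page('desc', '管理', page=2, page_size=1)
print(body)
```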

Python

For full-text search in Python, just pass the query DSL as an argument.
Before testing, first add some data:

datas = [
    {
        'title': '美国留给伊拉克的是个烂摊子吗',
        'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
        'date': '2011-12-16'
    },
    {
        'title': '公安部：各地校车将享最高路权',
        'url': 'http://www.chinanews.com/gn/2011/12-16/3536077.shtml',
        'date': '2011-12-16'
    },
    {
        'title': '中韩渔警冲突调查：韩警平均每天扣1艘中国渔船',
        'url': 'https://news.qq.com/a/20111216/001044.htm',
        'date': '2011-12-17'
    },
    {
        'title': '中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首',
        'url': 'http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml',
        'date': '2011-12-18'
    }
]
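The list still has to be written to the index before it can be searched. One way to do that, sketched under assumptions: index_all is a hypothetical helper, es is an elasticsearch.Elasticsearch client, and the index and type names (news, politics) are the ones used elsewhere in this post.

```python
def index_all(es, index, doc_type, docs):
    """Insert every document in `docs`, letting Elasticsearch pick the Ids."""
    ids = []
    for doc in docs:
        result = es.index(index=index, doc_type=doc_type, body=doc)
        ids.append(result['_id'])
    return ids


# Usage (requires a running Elasticsearch node):
#   from elasticsearch import Elasticsearch
#   ids = index_all(Elasticsearch(), 'news', 'politics', datas)
```

Newly indexed documents become searchable only after the next index refresh, which by default happens about once per second.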

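Then run the following code:

import json

from elasticsearch import Elasticsearch

# match query: full-text search on the title field
dsl = {
    'query': {
        'match': {
            'title': '中国 领事馆'
        }
    }
}

es = Elasticsearch()
result = es.search(index='news', doc_type='politics', body=dsl)
print(json.dumps(result, indent=2, ensure_ascii=False))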

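Here we query with the DSL that Elasticsearch supports. match requests a full-text search, the field searched is title, and the query text is "中国 领事馆" (China, consulate). The search results are: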

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 2.546152,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5G9mQBD9BuE5fdHOUm",
        "_score": 2.546152,
        "_source": {
          "title": "中国驻洛杉矶领事馆遭亚裔男子枪击，嫌犯已自首",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dU5G9mQBD9BuE5fdHOUj",
        "_score": 0.2876821,
        "_source": {
          "title": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      }
    ]
  }
}

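In hits, total = 2 shows that two documents match the condition.

Query conditions can of course be combined into more complex queries. To learn more about the Python search API, see the blog post listed below or the official documentation.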
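As one illustration of combining conditions, a bool query can pair a full-text match with a date range filter. This is a sketch that only builds the DSL body: match_in_date_range is a hypothetical helper, and it assumes the date field of the news documents was indexed as a date, which dynamic mapping normally does for values like 2011-12-17.

```python
import json


def match_in_date_range(field, text, date_field, start, end):
    """Build a bool query: full-text match on `field`, filtered by a date range."""
    return {
        'query': {
            'bool': {
                # must: contributes to the relevance score
                'must': [{'match': {field: text}}],
                # filter: yes/no condition that does not affect the score
                'filter': [{'range': {date_field: {'gte': start, 'lte': end}}}],
            }
        }
    }


dsl = match_in_date_range('title', '中国', 'date', '2011-12-17', '2011-12-18')
print(json.dumps(dsl, indent=2, ensure_ascii=False))
```

The resulting dictionary can be passed as body to es.search() in the same way as the match query above.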

Reference resources

Official Elasticsearch Python API documentation
Ruan Yifeng: full-text search
Cui Qingcai: working with ES from Python
The ES search Python API