How to work around the Elasticsearch result-size limit

Preface

We use the search API all the time. By default it returns 10 hits, and the from and size parameters change the offset and page size to implement pagination. But when a very large result set is needed, we must use scan and scroll instead; together they retrieve huge numbers of results from Elasticsearch efficiently, without paying the cost of deep pagination.
Details: https://es.xiaoleilu.com/060_Distributed_Search/20_Scan_and_scroll.html
Unlike that link, this post describes a Python implementation.
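To see why deep pagination is so costly, consider what a page at offset `from_` requires: every shard must score and keep `from_ + size` candidates, and the coordinating node then merges all of them just to discard the first `from_`. The numbers below are a rough illustrative cost model, not a measurement:

```python
# Illustrative cost model for deep from/size pagination (a sketch, not measured):
# each shard builds a priority queue of (from_ + size) candidates, and the
# coordinating node merges shards * (from_ + size) entries to serve one page.
def deep_page_cost(shards, from_, size):
    # Entries the coordinating node must sort for a single page.
    return shards * (from_ + size)

# Page 1 (from=0, size=10) on a 5-shard index is cheap...
print(deep_page_cost(5, 0, 10))      # 50
# ...but the page at offset 10000 forces 50,050 entries to be merged.
print(deep_page_cost(5, 10000, 10))  # 50050
```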

The test data

The index hz contains 29,999 documents, shown below. The bulk-import code is available at:
http://blog.csdn.net/xsdxs/article/details/72849796
(screenshot of the indexed data)

Code examples

The ES client code:

# -*- coding: utf-8 -*-

import elasticsearch

ES_SERVERS = [{ 'host': 'localhost', 'port': 9200 }]

es_client = elasticsearch.Elasticsearch( hosts=ES_SERVERS )

Searching via the search API:

# -*- coding: utf-8 -*-
from es_client import es_client


def search(search_offset, search_size):
    es_search_options = set_search_optional()
    es_result = get_search_result(es_search_options, search_offset, search_size)
    final_result = get_result_list(es_result)
    return final_result


def get_result_list(es_result):
    final_result = []
    result_items = es_result['hits']['hits']
    for item in result_items:
        final_result.append(item['_source'])
    return final_result


def get_search_result(es_search_options, search_offset, search_size, index='hz', doc_type='xyd'):
    es_result = es_client.search(
        index=index,
        doc_type=doc_type,
        body=es_search_options,
        from_=search_offset,
        size=search_size
    )
    return es_result


def set_search_optional():
    # search options
    es_search_options = {
        "query": {
            "match_all": {}
        }
    }
    return es_search_options


if __name__ == '__main__':
    final_results = search(0, 1000)
    print(len(final_results))

So far everything looks fine: 1000 is printed as expected. But now change the requirement and try to fetch 20,000 of the documents.

if __name__ == '__main__':
    final_results = search(0, 20000)

This produces the following error:

elasticsearch.exceptions.TransportError: TransportError(500, u'search_phase_execution_exception', u'Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.')

Explanation: by default, from + size may not exceed 10,000 hits (the index.max_result_window index setting), so this search call fails.
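As the error message hints, one workaround is to raise index.max_result_window through the index-settings API. This only makes deeper from/size pages legal; they remain expensive in memory, so scroll is still the right tool for full exports. A minimal sketch (put_settings is a standard elasticsearch-py call, but applying it of course requires a live cluster):

```python
# Sketch: raising the per-index result window above the 10,000 default.
# This legalizes deeper from/size pages but does not make them cheap.
new_window = 20000
settings_body = {"index": {"max_result_window": new_window}}
# With a running cluster and the es_client defined above, you would apply it with:
# es_client.indices.put_settings(index='hz', body=settings_body)
```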
No more preamble; here is the implementation based on scan and scroll:

# -*- coding: utf-8 -*-
from es_client import es_client
from elasticsearch import helpers


def search():
    es_search_options = set_search_optional()
    es_result = get_search_result(es_search_options)
    final_result = get_result_list(es_result)
    return final_result


def get_result_list(es_result):
    final_result = []
    for item in es_result:
        final_result.append(item['_source'])
    return final_result


def get_search_result(es_search_options, scroll='5m', index='hz', doc_type='xyd', timeout="1m"):
    es_result = helpers.scan(
        client=es_client,
        query=es_search_options,
        scroll=scroll,
        index=index,
        doc_type=doc_type,
        timeout=timeout
    )
    return es_result


def set_search_optional():
    # search options
    es_search_options = {
        "query": {
            "match_all": {}
        }
    }
    return es_search_options


if __name__ == '__main__':
    final_results = search()
    print(len(final_results))

The output (a screenshot in the original post) shows 29999:
all 29,999 documents are retrieved.
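helpers.scan hides the raw scroll protocol. For reference, here is a sketch of the same loop written directly against the client API: an initial search with a scroll keep-alive, repeated scroll calls until a batch comes back empty, then clear_scroll to free the server-side context. The function is generic; es_client and the hz index are the ones defined above, and the call signatures match the old-style elasticsearch-py client used in this post:

```python
def scroll_all(client, index, query, scroll='5m', size=1000):
    """Fetch every matching _source by paging through the scroll API."""
    # Open the scroll context with the first batch of hits.
    resp = client.search(index=index, body=query, scroll=scroll, size=size)
    scroll_id = resp['_scroll_id']
    docs = [hit['_source'] for hit in resp['hits']['hits']]
    while True:
        resp = client.scroll(scroll_id=scroll_id, scroll=scroll)
        hits = resp['hits']['hits']
        if not hits:
            break  # an empty batch means the scroll is exhausted
        docs.extend(hit['_source'] for hit in hits)
        scroll_id = resp['_scroll_id']
    client.clear_scroll(scroll_id=scroll_id)  # release server-side resources
    return docs

# Usage (requires a running cluster):
# docs = scroll_all(es_client, 'hz', {"query": {"match_all": {}}})
```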



Reposted from www.cnblogs.com/fengzaoye/p/8991261.html