Elasticsearch scroll

元禛慎独
Published 2017/06/27 11:30

Source: Elasticsearch Reference [2.0], Search APIs, Request Body Search, Scroll

Scroll

While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.

Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.

Client support for scrolling and reindexing

Some of the officially supported clients provide helpers to assist with scrolled searches and reindexing of documents from one index to another:

Perl

See Search::Elasticsearch::Bulk and Search::Elasticsearch::Scroll

Python

See elasticsearch.helpers.*

Note

The results that are returned from a scroll request reflect the state of the index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.

To use scrolling, the initial search request should specify the scroll parameter in the query string, which tells Elasticsearch how long it should keep the “search context” alive (see Keeping the search context alive), e.g. ?scroll=1m.

curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
{
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}
'

The result from the above request includes a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results.

curl -XGET 'localhost:9200/_search/scroll' -d '
{
    "scroll" : "1m",
    "scroll_id" : "c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1"
}
'

Note

Added in 2.0.0-beta1. Body-based parameters were added in 2.0.0.

In the request above:

  • GET or POST can be used.
  • The URL should not include the index or type name; these are specified in the original search request instead.
  • The scroll parameter tells Elasticsearch to keep the search context open for another 1m.
  • The scroll_id parameter carries the _scroll_id returned by the previous request.

Each call to the scroll API returns the next batch of results until there are no more results left to return, i.e. the hits array is empty.

For backwards compatibility, scroll_id and scroll can be passed in the query string, and the scroll_id can be passed on its own as the request body:

curl -XGET 'localhost:9200/_search/scroll?scroll=1m' -d 'c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1'

Important

The initial search request and each subsequent scroll request return a new _scroll_id. Only the most recent _scroll_id should be used.
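The scroll loop described above can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names (http_post_json, scroll_hits) and the injectable send parameter are assumptions made for this example, and the officially supported clients listed earlier provide their own helpers.

```python
import json
from urllib.request import Request, urlopen


def http_post_json(url, body):
    """Send a JSON body with POST and return the parsed JSON response."""
    request = Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def scroll_hits(base_url, index_path, query, keep_alive="1m", send=http_post_json):
    """Yield every hit of a scrolled search, batch by batch.

    `send` is a hypothetical hook (url, body) -> dict so that the loop can be
    exercised without a running cluster.
    """
    # Initial search: the scroll parameter goes in the query string.
    page = send(f"{base_url}/{index_path}/_search?scroll={keep_alive}", query)
    # An empty hits array means there is nothing left to return.
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Always pass the most recent _scroll_id; the URL has no index or type.
        page = send(
            f"{base_url}/_search/scroll",
            {"scroll": keep_alive, "scroll_id": page["_scroll_id"]},
        )
```

Note that this loop stops as soon as a response has an empty hits array, so it fits a standard scroll, where the initial search response already contains the first batch.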

Note

If the request specifies aggregations, only the initial search response will contain the aggregation results.

Efficient scrolling with Scroll-Scan

Deep pagination with from and size (e.g. ?size=10&from=10000) is very inefficient, as (in this example) 10,010 sorted results (from + size) have to be retrieved from each shard and re-sorted in order to return just 10 results. This process has to be repeated for every page requested.
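The cost of deep pagination is simple arithmetic: to serve one page, each shard has to produce its top from + size hits. A small Python sketch of that calculation follows; the function name and the shard count of 5 are assumptions chosen only for illustration.

```python
def hits_fetched_for_page(from_, size, shards):
    """Return (per_shard, total) sorted hits collected to serve one page."""
    per_shard = from_ + size          # every shard must return its top from + size hits
    return per_shard, per_shard * shards


# ?size=10&from=10000 against a hypothetical index with 5 shards
per_shard, total = hits_fetched_for_page(from_=10000, size=10, shards=5)
print(per_shard, total)  # 10010 50050
```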

The scroll API keeps track of which results have already been returned and so is able to return sorted results more efficiently than with deep pagination. However, sorting results (which happens by default) still has a cost.

Normally, you just want to retrieve all results and the order doesn’t matter. Scrolling can be combined with the scan search type to disable any scoring or sorting and to return results in the most efficient way possible. All that is needed is to add search_type=scan to the query string of the initial search request:

curl 'localhost:9200/twitter/tweet/_search?scroll=1m&search_type=scan'  -d '
{
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}
'

Setting search_type to scan disables sorting and makes scrolling very efficient.

A scanning scroll request differs from a standard scroll request in four ways:

  • No score is calculated and sorting is disabled. Results are returned in the order they appear in the index.
  • Aggregations are not supported.
  • The response of the initial search request will not contain any results in the hits array. The first results will be returned by the first scroll request.
  • The size parameter controls the number of results per shard, not per request, so a size of 10 which hits 5 shards will return a maximum of 50 results per scroll request.

If you want the scoring to happen, even without sorting on it, set the track_scores parameter to true.
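As a rough illustration of the points above, the following Python sketch builds the URL of an initial scan request and computes the upper bound on hits per scroll request. Both function names are hypothetical, and the host and index path are taken from the curl example.

```python
from urllib.parse import urlencode


def scan_search_url(base_url, index_path, keep_alive="1m", size=10):
    """Build the URL of the initial search request for a scanning scroll."""
    params = urlencode({"scroll": keep_alive, "search_type": "scan", "size": size})
    return f"{base_url}/{index_path}/_search?{params}"


def max_hits_per_scan_batch(size, shards):
    """With search_type=scan, size applies per shard, not per request."""
    return size * shards


print(scan_search_url("http://localhost:9200", "twitter/tweet"))
print(max_hits_per_scan_batch(size=10, shards=5))  # 50
```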

Keeping the search context alive

The scroll parameter (passed to the search request and to every scroll request) tells Elasticsearch how long it should keep the search context alive. Its value (e.g. 1m, see the section called “Time units”) does not need to be long enough to process all data; it just needs to be long enough to process the previous batch of results. Each scroll request (with the scroll parameter) sets a new expiry time.
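Because every scroll request sets a new expiry time, what matters is the gap between consecutive requests, not the total duration of the scroll. The toy model below expresses this in Python; it is a simplification for illustration only (the function name and the use of plain seconds are assumptions) and does not talk to Elasticsearch.

```python
def context_survives(request_times, keep_alive_seconds):
    """Return True if every request arrives before the expiry set by the previous one.

    request_times: ascending times (in seconds) of the initial search and of
    each following scroll request.
    """
    return all(
        later - earlier <= keep_alive_seconds
        for earlier, later in zip(request_times, request_times[1:])
    )


# 150 seconds of scrolling in total, but never more than 50 seconds between requests.
print(context_survives([0, 50, 100, 150], keep_alive_seconds=60))  # True
# One batch took 70 seconds to process, longer than the 60 second keep-alive.
print(context_survives([0, 50, 120], keep_alive_seconds=60))       # False
```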

Normally, the background merge process optimizes the index by merging together smaller segments to create new bigger segments, at which time the smaller segments are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted while they are still in use. This is how Elasticsearch is able to return the results of the initial search request, regardless of subsequent changes to documents.

Tip

Keeping older segments alive means that more file handles are needed. Ensure that you have configured your nodes to have ample free file handles. See the section called “File Descriptors”.

You can check how many search contexts are open with the nodes stats API:

curl -XGET 'localhost:9200/_nodes/stats/indices/search?pretty'
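Once the response of the request above has been parsed as JSON, the number of open search contexts per node can be read out as in the Python sketch below. It assumes the response nests the counter under nodes, then the node ID, then indices, search, open_contexts; the function name and the sample data are made up for illustration.

```python
def open_search_contexts(nodes_stats):
    """Map each node ID to its number of open search contexts."""
    return {
        node_id: node["indices"]["search"]["open_contexts"]
        for node_id, node in nodes_stats["nodes"].items()
    }


# Made-up sample shaped like the assumed response structure.
sample = {
    "nodes": {
        "node-a": {"indices": {"search": {"open_contexts": 2}}},
        "node-b": {"indices": {"search": {"open_contexts": 0}}},
    }
}
per_node = open_search_contexts(sample)
print(per_node)                 # {'node-a': 2, 'node-b': 0}
print(sum(per_node.values()))   # 2
```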

Clear scroll API

Search contexts are automatically removed when the scroll timeout has been exceeded. However, keeping scrolls open has a cost, as discussed in the previous section, so scrolls should be explicitly cleared with the clear-scroll API as soon as they are no longer being used:

curl -XDELETE localhost:9200/_search/scroll -d '
{
    "scroll_id" : ["c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1"]
}'

Note

Added in 2.0.0-beta1. Body-based parameters were added in 2.0.0.

Multiple scroll IDs can be passed as an array:

curl -XDELETE localhost:9200/_search/scroll -d '
{
    "scroll_id" : ["c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1", "aGVuRmV0Y2g7NTsxOnkxaDZ"]
}'

All search contexts can be cleared with the _all parameter:

curl -XDELETE localhost:9200/_search/scroll/_all

The scroll_id can also be passed as a query string parameter or in the request body. Multiple scroll IDs can be passed as comma-separated values:

curl -XDELETE localhost:9200/_search/scroll \
     -d 'c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1,aGVuRmV0Y2g7NTsxOnkxaDZ'
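A client can prepare the JSON-body form of the clear-scroll call as in the following Python sketch, which relies only on the standard library. The function name is hypothetical, and the request is only constructed here, not sent.

```python
import json
from urllib.request import Request


def clear_scroll_request(base_url, scroll_ids):
    """Build a DELETE request that clears the given scroll IDs."""
    body = json.dumps({"scroll_id": list(scroll_ids)}).encode("utf-8")
    return Request(
        f"{base_url}/_search/scroll",
        data=body,
        headers={"Content-Type": "application/json"},
        method="DELETE",
    )


request = clear_scroll_request(
    "http://localhost:9200",
    ["c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1", "aGVuRmV0Y2g7NTsxOnkxaDZ"],
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen from a finally block is one way to make sure a scroll is cleared even when processing a batch fails.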

For a translated version, see http://www.jianshu.com/p/14aa8b09c789

© Copyright belongs to the author.