文档章节

Elasticsearch 的GC 理解

Faye_Cai
 Faye_Cai
发布于 2016/06/21 17:07
字数 1888
阅读 572
收藏 3

看了点ES的GC,记录下来自己的理解,用英文写的,懒得再翻译成中文,见谅

GC is required to keep an eye on it  since Elasticsearch is based on JVM

1) Avoid long GC for old gen .

A cluster that is frequently experiencing long GC will be a cluster that is under heavy load with not enough memory. These long GCs will make nodes drop off the cluster for brief periods in ES

discovery.zen.ping.timeout: 3s is the default config , if one ES node has a longer gc , since it is "stop the world" when gc, the master can't get ping response , if it exceeded the time (please also extend the discovery.zen.ping.timeout setting appropriately in your config) , then the node will be removed from the cluster .This instability causes shards to relocate frequently as Elasticsearch tries to keep the cluster balanced and enough replicas available. This in turn increases network traffic and disk I/O, all while your cluster is attempting to service the normal indexing and query load.

In short, long GCs are bad and need to be minimized as much as possible.  Hence, set appropriate ES heap is good instead of the better the bigger since larger heap will have longer gc time .

When one of node left ES cluster , except checking the network issue , whether longer gc happened is also a check point .

 

Garbage Collection Primer :

When memory is allocated to a JVM process, it is allocated in a big chunk called the heap. The JVM then breaks the heap into two groups, referred to as generations:

Young (or Eden)

The space where newly instantiated objects are allocated. The young generation space is often quite small, usually 100 MB–500 MB. The young-gen also contains two survivor spaces.

Old

The space where older objects are stored. These objects are expected to be long-lived and persist for a long time. The old-gen is often much larger than the young-gen, and Elasticsearch nodes can see old-gens as large as 30 GB.

When an object is instantiated, it is placed into young-gen. When the young generation space is full, a young-gen garbage collection (GC) is started. Objects that are still "alive" are moved into one of the survivor spaces, and "dead" objects are removed. If an object has survived several young-gen GCs, it will be "tenured" into the old generation.

A similar process happens in the old generation: when the space becomes full, a garbage collection is started and dead objects are removed.

Nothing comes for free, however. Both the young- and old-generation garbage collectors have phases that "stop the world." During this time, the JVM literally halts execution of the program so it can trace the object graph and collect dead objects. During this stop-the-world phase, nothing happens. Requests are not serviced, pings are not responded to, shards are not relocated. The world quite literally stops.

This isn’t a big deal for the young generation; its small size means GCs execute quickly. But the old-gen is quite a bit larger, and a slow GC here could mean 1s or even 15s of pausing—which is unacceptable for server software.

refer to : https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_jvm_section

 

2) How to monitor GC of your ES ?

1. By New Relic monitor 

2. By GC log

(we won't touch how to set gc log here )

A segment from my gc.log with old gen gc and young gen gc :

2016-06-20T13:30:21.737+0000: 14764.454: Total time for which application threads were stopped: 0.0018031 seconds, Stopping threads took: 0.0015621 seconds
2016-06-20T13:30:21.878+0000: 14764.595: Total time for which application threads were stopped: 0.0049363 seconds, Stopping threads took: 0.0046086 seconds
2016-06-20T13:30:21.988+0000: 14764.705: Total time for which application threads were stopped: 0.0016491 seconds, Stopping threads took: 0.0012934 seconds
2016-06-20T13:30:22.381+0000: 14765.098: [GC (CMS Initial Mark) [1 CMS-initial-mark: 45998K(1756416K)] 217877K(2063104K), 0.0292245 secs] [Times: user=0.12 sys=0.00, real=0.03 secs # old generation gc start
2016-06-20T13:30:22.410+0000: 14765.127: Total time for which application threads were stopped: 0.0297773 seconds, Stopping threads took: 0.0001067 seconds
2016-06-20T13:30:22.410+0000: 14765.127: [CMS-concurrent-mark-start]
2016-06-20T13:30:22.475+0000: 14765.192: [CMS-concurrent-mark: 0.065/0.065 secs] [Times: user=0.21 sys=0.00, real=0.06 secs]
2016-06-20T13:30:22.475+0000: 14765.192: [CMS-concurrent-preclean-start]
2016-06-20T13:30:22.484+0000: 14765.201: [CMS-concurrent-preclean: 0.008/0.009 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2016-06-20T13:30:22.484+0000: 14765.201: [CMS-concurrent-abortable-preclean-start]
2016-06-20T13:30:22.658+0000: 14765.374: Total time for which application threads were stopped: 0.0035706 seconds, Stopping threads took: 0.0020646 seconds
2016-06-20T13:30:22.660+0000: 14765.377: Total time for which application threads were stopped: 0.0007947 seconds, Stopping threads took: 0.0000374 seconds
2016-06-20T13:30:23.660+0000: 14766.377: Total time for which application threads were stopped: 0.0004326 seconds, Stopping threads took: 0.0000737 seconds
2016-06-20T13:30:23.914+0000: 14766.631: Total time for which application threads were stopped: 0.0003030 seconds, Stopping threads took: 0.0000469 seconds
2016-06-20T13:30:23.915+0000: 14766.632: Total time for which application threads were stopped: 0.0001413 seconds, Stopping threads took: 0.0000431 seconds
2016-06-20T13:30:25.915+0000: 14768.632: Total time for which application threads were stopped: 0.0003705 seconds, Stopping threads took: 0.0000994 seconds
 CMS: abort preclean due to time 2016-06-20T13:30:27.501+0000: 14770.218: [CMS-concurrent-abortable-preclean: 2.280/5.017 secs] [Times: user=2.82 sys=0.00, real=5.02 secs]
2016-06-20T13:30:27.501+0000: 14770.218: [GC (CMS Final Remark) [YG occupancy: 236848 K (306688 K)]14770.218: [Rescan (parallel) , 0.0396891 secs]14770.258: [weak refs processing, 0.0015708 secs]14770.260: [class unloading, 0.0378481 secs]14770.297: [scrub symbol table, 0.0048712 secs]14770.302: [scrub string table, 0.0011226 secs][1 CMS-remark: 45998K(1756416K)] 282847K(2063104K), 0.0997761 secs] [Times: user=0.17 sys=0.00, real=0.10 secs]
2016-06-20T13:30:27.601+0000: 14770.318: Total time for which application threads were stopped: 0.1001697 seconds, Stopping threads took: 0.0000782 seconds
2016-06-20T13:30:27.601+0000: 14770.318: [CMS-concurrent-sweep-start]
2016-06-20T13:30:27.629+0000: 14770.346: [CMS-concurrent-sweep: 0.028/0.028 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
2016-06-20T13:30:27.629+0000: 14770.346: [CMS-concurrent-reset-start]
2016-06-20T13:30:27.634+0000: 14770.351: [CMS-concurrent-reset: 0.005/0.005 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]   # old generation gc end
2016-06-20T13:30:28.352+0000: 14771.069: Total time for which application threads were stopped: 0.0008748 seconds, Stopping threads took: 0.0000578 seconds
2016-06-20T13:30:29.352+0000: 14772.069: Total time for which application threads were stopped: 0.0003125 seconds, Stopping threads took: 0.0000779 seconds
2016-06-20T13:30:31.987+0000: 14774.704: Total time for which application threads were stopped: 0.0003919 seconds, Stopping threads took: 0.0000630 seconds
2016-06-20T13:30:32.916+0000: 14775.633: Total time for which application threads were stopped: 0.0004499 seconds, Stopping threads took: 0.0002200 seconds
2016-06-20T13:30:32.917+0000: 14775.634: Total time for which application threads were stopped: 0.0001433 seconds, Stopping threads took: 0.0000426 seconds
2016-06-20T13:30:33.849+0000: 14776.566: Total time for which application threads were stopped: 0.0005113 seconds, Stopping threads took: 0.0000774 seconds
2016-06-20T13:30:34.861+0000: 14777.578: Total time for which application threads were stopped: 0.0003934 seconds, Stopping threads took: 0.0000911 seconds
2016-06-20T13:30:35.017+0000: 14777.734: [GC (Allocation Failure) 14777.734: [ParNew     # young generation gc start
Desired survivor size 17432576 bytes, new threshold 6 (max 6)
- age   1:    6642824 bytes,    6642824 total
- age   2:     740016 bytes,    7382840 total
- age   3:     543608 bytes,    7926448 total
- age   4:    1035336 bytes,    8961784 total
- age   5:     977896 bytes,    9939680 total
- age   6:     743032 bytes,   10682712 total
: 282272K->14724K(306688K), 0.0161740 secs] 313056K->46031K(2063104K), 0.0163286 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]    # young generation gc end
2016-06-20T13:30:35.034+0000: 14777.751: Total time for which application threads were stopped: 0.0415288 seconds, Stopping threads took: 0.0247301 seconds
2016-06-20T13:30:36.034+0000: 14778.751: Total time for which application threads were stopped: 0.0004305 seconds, Stopping threads took: 0.0001093 seconds

Old gen GC:

The default GC for Elasticsearch is Concurrent-Mark and Sweep (CMS). the CMS Collector handles objects in the old generation , this GC runs concurrently with the execution of the application so that it can minimize pauses. It does, however, have two stop-the-world phases. It also has trouble collecting large heaps. 

The two phases time should be noticed , if they are too longer , it means your ES load is heavy :

[GC (CMS Initial Mark) [1 CMS-initial-mark: 45998K(1756416K)] 217877K(2063104K), 0.0292245 secs] [Times: user=0.12 sys=0.00, real=0.03 secs] 

[1 CMS-remark: 45998K(1756416K)] 282847K(2063104K), 0.0997761 secs] [Times: user=0.17 sys=0.00, real=0.10 secs]

refer : https://www.elastic.co/guide/en/elasticsearch/guide/current/_don_8217_t_touch_these_settings.html

 

CMS (Concurrent-Mark and Sweep) Primer :

The CMS Collector is designed to run mostly concurrent to the application threads and only cause few (and short) pause times. The downside of running GC concurrently to the application is that various synchronization and data inconsistency issues may arise. In order to achieve safe and correct concurrent execution, a GC cycle of the CMS Collector is divided into a number of consecutive phases.

Phases of the CMS Collector :

A GC cycle of the CMS Collector consists of six phases. Four of the phases (the names of which start with „Concurrent“) are run concurrently to the actual application while the other two phases need to stop the application threads.

  1. Initial Mark: The application threads are paused in order to collect their object references. When this is finished, the application threads are started again.
  2. Concurrent Mark: Starting from the object references collected in phase 1, all other referenced objects are traversed.
  3. Concurrent Preclean: Changes to object references made by the application threads while phase 2 was running are used to update the results from phase 2.
  4. Remark: As phase 3 is concurrent as well, further changes to object references may have happened. Therefore, the application threads are stopped once more to take any such updates into account and ensure a correct view of referenced objects before the actual cleaning takes place. This step is essential because it must be avoided to collect any objects that are still referenced.
  5. Concurrent Sweep: All objects that are not referenced anymore get removed from the heap.
  6. Concurrent Reset: The collector does some housekeeping work so that there is a clean state when the next GC cycle starts.

refer to : https://blog.codecentric.de/en/2013/10/useful-jvm-flags-part-7-cms-collector/

Young gen GC:

It should be noted that, even though the CMS Collector offers a mostly concurrent solution for old generation GCs, young generation GCs are still handled using a stop-the-world approach. The rationale behind this is that young generation GCs are typically short enough so that the resulting pause times are satisfactory even for interactive applications.

No need to pay attention to young gen gc . just one thing to let you aware :

"Allocation Failure" is a cause of GC cycle to kick. "Allocation Failure" means what no more space left in Eden to allocate object. So, it is normal cause of young GC. older JVM were not printing GC cause for minor GC cycles.
"Allocation Failure" is almost only possible cause for minor GC.

refer to : http://stackoverflow.com/questions/28342736/java-gc-allocation-failure

© 著作权归作者所有

Faye_Cai
粉丝 0
博文 28
码字总数 5590
作品 0
海淀
高级程序员
私信 提问
SearchPhaseExecutionException: all shards failed path: /.kibana/doc/_count, params: {index=.kibana, type=doc}

[2019-06-19T17:02:29,804][WARN ][r.suppressed ] [node-1] path: /.kibana/doc/_count, params: {index=.kibana, type=doc} org.elasticsearch.action.search.SearchPhaseExecutionExcepti......

打杂uu
06/19
560
0
如何优化100s的Elasticsearch 查询到1s以内

在SQL的世界里, 查询优化是相当成熟并且是可以理解的, 另外一方面, 分布式数据库系统是新出现的, 并且不太成熟. 理解查询是如何工作的将是一件非常重要的事情. Elstaticsearch的查询有时候将...

leon_lu
2017/10/31
1K
0
Linux安装Elastic Search 小记

前言 趁着夏日的到来,趁着自己还年轻,折腾一下自己的阿里云ECS服务器。 正文 在/usr/local目录下,创建一个elasticsearch的文件夹,并进入这个文件夹 首先我们需要下载Elasticsearch,下载地...

luckyHarry4
05/17
0
0
centos 7( linux )下搭建elasticsearch踩坑记

目录 概述 环境准备 elasticsearch配置 启动踩坑记 彩蛋 概述 公司最近在做全文检索的项目,发现elasticsearch踩了不少坑,百度点进去又是坑,在此记录一下自己的踩坑历程。 本文旨在单机版的e...

java_龙
2018/10/17
491
4
k8s helm efk 问题

[root@master03 ~]# kubectl logs -n efk efk-es-elasticsearch-client-786d9b46c9-cmn2w OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and......

weir2016
09/10
168
2

没有更多内容

加载失败,请刷新页面

加载更多

Handler简解

Handler 这里简化一下代码 以便理解 Handler不一定要在主线程建 但如Handler handler = new Handler(); 会使用当前的Looper的, 由于要更新UI 所以最好在主线程 new Handler() { mLooper = Lo...

shzwork
23分钟前
3
0
h5获取摄像头拍照功能

完整代码展示: <!DOCTYPE html> <head> <title>HTML5 GetUserMedia Demo</title> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum......

诗书易经
26分钟前
3
0
正向代理和反向代理

文章来源 运维公会:正向代理和反向代理 1、正向代理 (1)服务对象不同 正向代理服务器的服务对象是客户端,可以将客户端和代理服务器看作一个整体。 (2)配置方法不同 需要在客户端配置代...

运维团
42分钟前
4
0
5个避免意外论文重复率高的方法

即使你不是故意抄袭,但你可能在无意中抄袭了别人的论文, 这个叫做意外抄袭,它可能正发生在你身上,如果你不熟悉学术 道德规范,这里将告诉你5个基本的方法来避免意外抄袭。 Tip1 熟悉其他...

论文辅导员
44分钟前
4
0
Maven通过profiles标签读取不同的配置

<profiles> <profile> <id>dev</id> <properties> <profiles.active>dev</profiles.active> </properties> ......

时刻在奔跑
49分钟前
2
0

没有更多内容

加载失败,请刷新页面

加载更多

返回顶部
顶部