A Follow-up to the HDFS JN Lag Investigation: Where the Sudden RM Traffic Increase Came From

2021/01/26 07:47

Preface


In the previous article (An Investigation of an HDFS JournalNode Transaction Lag Issue), I walked through the analysis of an HDFS JN lag problem in detail. At the end of that article I noted that the root cause was a sudden increase in traffic from the ResourceManager (RM) deployed on the same node as the JN. At the time, however, I had only identified the RM traffic spike as the likely factor and had not analyzed where that extra RM traffic actually came from. Since then, we have dug further into what this traffic consisted of and why it grew, and finally found the real source. In this article I briefly describe that follow-up to the JN lag problem: the cause of the sudden RM traffic increase.

Background


First, here is the trend chart of the sudden RM traffic increase, shown below (it also appeared in the previous JN lag article).
[Figure: trend chart of the sudden increase in RM traffic]
Our hypothesis at the time was that workload adjustments by some business teams had caused this change in traffic. To verify that guess, we went on to analyze the RM traffic in more detail.

Analysis with the iftop command

Among Linux command-line tools, iftop shows near-real-time bandwidth usage of a node, broken down by connection. So we ran iftop on the RM machine; the output is shown below.

Command: sudo iftop -B -P (-B reports rates in bytes rather than bits, -P shows port numbers)

rm-node:8031            => nm-node1:42810   47.2KB  94.4KB  94.4KB
                        <=                    996B  2.15KB  2.15KB
rm-node:8031            => nm-node2:37624   47.2KB  82.6KB  82.6KB
                        <=                  1.33KB  1.95KB  1.95KB
rm-node:8031            => nm-node3:46594    141KB  82.5KB  82.5KB
                        <=                  2.96KB  1.97KB  1.97KB
rm-node:8031            => nm-node4:59782   47.2KB  82.6KB  82.6KB
                        <=                  1.12KB  1.90KB  1.90KB
rm-node:8031            => nm-node5:41700   47.2KB  82.6KB  82.6KB
                        <=                  1.18KB  1.79KB  1.79KB
rm-node:8031            => nm-node6:50672   47.2KB  82.6KB  82.6KB
                        <=                    966B  1.79KB  1.79KB
rm-node:8031            => nm-node7:55306   94.4KB  82.6KB  82.6KB
                        <=                  1.88KB  1.71KB  1.71KB
rm-node:8031            => nm-node8:51870   47.2KB  82.6KB  82.6KB
                        <=                  1.71KB  1.68KB  1.68KB
rm-node:8031            => nm-node9:49730   47.3KB  82.6KB  82.6KB
                        <=                    907B  1.62KB  1.62KB

The data above is a snapshot of the RM traffic captured at a single point in time. It gives us several key pieces of information:

  • The RM's traffic is concentrated on data transferred over port 8031.
  • Within the port 8031 communication, the volume sent from RM to NM is far larger than the volume sent in the NM-to-RM direction.

Digging further into what RM port 8031 is for: it is the port used for heartbeat communication between the RM and the NMs. Mapped back onto the iftop output above, each NM sends a node heartbeat to the RM that carries very little data, while the heartbeat response the RM sends back is many times larger.
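For reference, port 8031 is the RM's resource-tracker address (yarn.resourcemanager.resource-tracker.address defaults to port 8031) and serves the ResourceTracker RPC protocol. A simplified view of that protocol is sketched below, paraphrased from the Hadoop source; the exact method list and annotations vary across versions:

import java.io.IOException;

import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.server.api.protocolrecords.NodeHeartbeatRequest;
import org.apache.hadoop.yarn.server.api.protocolrecords.NodeHeartbeatResponse;
import org.apache.hadoop.yarn.server.api.protocolrecords.RegisterNodeManagerRequest;
import org.apache.hadoop.yarn.server.api.protocolrecords.RegisterNodeManagerResponse;

// Simplified sketch of the RPC protocol served on RM port 8031.
public interface ResourceTracker {

  // Called once when an NM starts up and registers itself with the RM.
  RegisterNodeManagerResponse registerNodeManager(
      RegisterNodeManagerRequest request) throws YarnException, IOException;

  // Called on every heartbeat interval: the small request carries the node's
  // status, while the (potentially large) response carries the RM's
  // instructions back to the NM; this is the payload that dominates the
  // RM to NM direction in the iftop output above.
  NodeHeartbeatResponse nodeHeartbeat(
      NodeHeartbeatRequest request) throws YarnException, IOException;
}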

Concurrent verification with tcpdump


At the same time, we picked one NM and used tcpdump to capture the live traffic on that machine, filtering with src and dst set to the RM machine respectively.

The results are as follows:

NM->RM
Command: tcpdump dst rm-node

20:16:48.028768 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13088574:13092670, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327940], length 4096
20:16:48.028778 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13092670:13096766, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028791 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13096766:13100862, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028800 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13100862:13104958, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028809 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13104958:13109054, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028822 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13109054:13113150, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028832 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13113150:13117246, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028851 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13117246:13121342, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028862 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13121342:13125438, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028875 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13125438:13129534, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028886 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13129534:13130257, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 723
20:16:48.037348 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4234513, win 2889, options [nop,nop,TS val 561229068 ecr 498327949], length 0
20:16:48.037407 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4238857, win 2877, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037453 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4241257, win 2885, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037461 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4242705, win 2881, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037499 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4247049, win 2881, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037627 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4248497, win 2889, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037679 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4252841, win 2881, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037729 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4258633, win 2877, options [nop,nop,TS val 561229069 ecr 498327949], length 0

RM->NM
Command: tcpdump src rm-node

20:49:59.617672 IP rm-node.8030 > nm-node.58940: Flags [P.], seq 2156:2854, ack 1746598, win 5955, options [nop,nop,TS val 500319529 ecr 2277606464], length 698
20:49:59.672596 IP rm-node.8031 > nm-node.58058: Flags [.], ack 3590, win 901, options [nop,nop,TS val 500319583 ecr 2277606545], length 0
20:49:59.674288 IP rm-node.8031 > nm-node.58058: Flags [.], seq 288621:290069, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606545], length 1448
20:49:59.674336 IP rm-node.8031 > nm-node.58058: Flags [P.], seq 290069:296813, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606545], length 6744
20:49:59.674337 IP rm-node.8031 > nm-node.58058: Flags [.], seq 296813:301157, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606545], length 4344
20:49:59.674380 IP rm-node.8031 > nm-node.58058: Flags [.], seq 301157:302605, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606545], length 1448
20:49:59.674588 IP rm-node.8031 > nm-node.58058: Flags [.], seq 302605:304053, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606547], length 1448
20:49:59.674641 IP rm-node.8031 > nm-node.58058: Flags [.], seq 304053:319981, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606547], length 15928
20:49:59.674689 IP rm-node.8031 > nm-node.58058: Flags [.], seq 319981:331565, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606547], length 11584
20:49:59.674822 IP rm-node.8031 > nm-node.58058: Flags [.], seq 331565:333013, ack 3590, win 901, options [nop,nop,TS val 500319586 ecr 2277606547], length 1448
20:49:59.674881 IP rm-node.8031 > nm-node.58058: Flags [.], seq 333013:366317, ack 3590, win 901, options [nop,nop,TS val 500319586 ecr 2277606547], length 33304
20:49:59.674923 IP rm-node.8031 > nm-node.58058: Flags [.], seq 366317:376453, ack 3590, win 901, options [nop,nop,TS val 500319586 ecr 2277606547], length 10136
20:49:59.674974 IP rm-node.8031 > nm-node.58058: Flags [P.], seq 376453:384828, ack 3590, win 901, options [nop,nop,TS val 500319586 ecr 2277606547], length 8375
20:50:00.015461 IP rm-node.8030 > nm-node.58940: Flags [.], ack 1750751, win 5955, options [nop,nop,TS val 500319926 ecr 2277606868], length 0
20:50:00.015503 IP rm-node.8030 > nm-node.58940: Flags [.], ack 1754847, win 5955, options [nop,nop,TS val 500319926 ecr 2277606888], length 0
20:50:00.015505 IP rm-node.8030 > nm-node.58940: Flags [.], ack 1767135, win 5955, options [nop,nop,TS val 500319926 ecr 2277606888], length 0
20:50:00.015550 IP rm-node.8030 > nm-node.58940: Flags [.], ack 1822079, win 5806, options [nop,nop,TS val 500319927 ecr 2277606888], length 0

From the output above we can see:

  • Packets sent from the NM to the RM have fairly uniform and comparatively small lengths.
  • Packets returned from the RM to the NM have uneven lengths, and many of them are large.

These capture results are consistent with the iftop analysis in the previous section.

A closer look at the RM NodeHeartbeat response


The command-line analysis above narrowed the source of the RM traffic down to the NodeHeartbeat response object, but we still did not know what inside it made the response so large. Neither iftop nor tcpdump can capture that level of detail, so at this point the only option was to look at the code.

Below is the definition of the heartbeat response interface:

public interface NodeHeartbeatResponse {

  int getResponseId();
  NodeAction getNodeAction();

  List<ContainerId> getContainersToCleanup();
  List<ContainerId> getContainersToBeRemovedFromNM();

  List<ApplicationId> getApplicationsToCleanup();

  void setResponseId(int responseId);
  void setNodeAction(NodeAction action);

  MasterKey getContainerTokenMasterKey();
  void setContainerTokenMasterKey(MasterKey secretKey);
  
  MasterKey getNMTokenMasterKey();
  void setNMTokenMasterKey(MasterKey secretKey);

  void addAllContainersToCleanup(List<ContainerId> containers);

  // This tells NM to remove finished containers from its context. Currently, NM
  // will remove finished containers from its context only after AM has actually
  // received the finished containers in a previous allocate response
  void addContainersToBeRemovedFromNM(List<ContainerId> containers);
  
  void addAllApplicationsToCleanup(List<ApplicationId> applications);

  long getNextHeartBeatInterval();
  void setNextHeartBeatInterval(long nextHeartBeatInterval);
  
  String getDiagnosticsMessage();

  void setDiagnosticsMessage(String diagnosticsMessage);

  // Credentials (i.e. hdfs tokens) needed by NodeManagers for application
  // localizations and logAggregations.
  Map<ApplicationId, ByteBuffer> getSystemCredentialsForApps();  

  void setSystemCredentialsForApps(
      Map<ApplicationId, ByteBuffer> systemCredentials);  // <====
  
  boolean getAreNodeLabelsAcceptedByRM();
  void setAreNodeLabelsAcceptedByRM(boolean areNodeLabelsAcceptedByRM);
}

A quick pass over NodeHeartbeatResponse shows that one member, SystemCredentialsForApps, carries a more complex map payload than simple fields such as the ContainerId and ApplicationId lists. According to the comment above it, these credentials (i.e. HDFS tokens) are returned by the RM to the NMs for application localization and log aggregation. This implies a hidden proportional relationship: as the number of applications running on an NM grows, the amount of credentials it receives from the RM grows accordingly.
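To make that scaling concrete, the following is a hypothetical helper (not part of Hadoop) that sums how many bytes the credentials map contributes to a single heartbeat response. Every running application whose tokens the RM tracks adds its full serialized token blob, so the payload grows roughly linearly with the number of live applications:

import java.nio.ByteBuffer;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ApplicationId;

// Hypothetical helper, for illustration only: estimates the size of the
// credentials portion of one NodeHeartbeatResponse.
public final class CredentialsPayloadEstimator {

  private CredentialsPayloadEstimator() {
  }

  // Sums the serialized token bytes across all applications in the map that
  // the RM would attach to the response via setSystemCredentialsForApps().
  public static long estimateBytes(Map<ApplicationId, ByteBuffer> credentials) {
    long total = 0;
    for (ByteBuffer tokens : credentials.values()) {
      total += tokens.remaining();
    }
    return total;
  }
}

The more applications are running, the bigger this sum gets, which lines up with the large per-NM response sizes observed in the iftop output.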

This seems to explain how the workload adjustments by the business teams mentioned earlier drove the RM traffic up. The causal chain is as follows (a sketch of the corresponding RM-side behaviour follows the list):

1) A business team adjusted its workload and submitted many new applications to the RM.
2) Because localization and log aggregation for these new applications require the corresponding tokens, the RM returned more credentials to the NMs in its nodeHeartbeat responses.
3) The growing credentials made the nodeHeartbeat responses larger, which in turn drove up the traffic on the RM machine.
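The sketch below illustrates the pre-YARN-6523 behaviour implied by step 2. It is written for this article and is not the actual Hadoop code: whenever the RM-side credentials map is non-empty, the whole map, covering every running application, is attached to every node's heartbeat response, regardless of whether anything changed since the previous heartbeat.

import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.server.api.protocolrecords.NodeHeartbeatResponse;

// Illustrative sketch (not the real ResourceTrackerService code) of how the
// heartbeat response ends up carrying the full credentials map.
public class PreYarn6523HeartbeatSketch {

  // Stand-in for the RM-wide map of application credentials (hdfs tokens).
  private final Map<ApplicationId, ByteBuffer> systemCredentialsForApps =
      new ConcurrentHashMap<>();

  // Invoked for every node heartbeat; the same full map is attached to every
  // response as long as it is non-empty.
  void fillResponse(NodeHeartbeatResponse response) {
    if (!systemCredentialsForApps.isEmpty()) {
      response.setSystemCredentialsForApps(systemCredentialsForApps);
    }
  }
}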

Fixing the oversized RM NodeHeartbeat responses


We then searched the community for a fix for NodeHeartbeat responses bloated by credentials, and found a JIRA improvement that addresses exactly this problem: YARN-6523 (Optimize system credentials sent in node heartbeat responses). The scenario described in that JIRA is very close to what we ran into in production.

YARN-6523's strategy is to add a sequence number to the node heartbeat that tracks token update state: the RM only sets the credentials in the NM's response when the NM actually needs a delta token update. Once the NM has no need for additional tokens, the RM stops sending the same token information to it over and over again. We are actively trying out this fix and expect it to deliver the optimization we are looking for.
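As a rough illustration of that idea (a sketch of the approach, not the YARN-6523 patch itself): the RM bumps a sequence number whenever the credentials change, the NM echoes back the last sequence number it has applied, and the RM only includes tokens in the response when the NM is behind.

import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ApplicationId;

// Illustrative sketch of the delta-update idea behind YARN-6523.
public class DeltaCredentialsSketch {

  // Bumped on the RM side every time the token set changes.
  private volatile long tokenSequenceNo = 0;
  private volatile Map<ApplicationId, ByteBuffer> credentials = Collections.emptyMap();

  // Called when new or renewed tokens arrive for applications.
  synchronized void updateCredentials(Map<ApplicationId, ByteBuffer> latest) {
    credentials = latest;
    tokenSequenceNo++;
  }

  // Called while building a heartbeat response; nmSeenSequenceNo is the value
  // the NM reported in its heartbeat request. In the real patch the response
  // also carries the current sequence number so the NM can echo it next time.
  Map<ApplicationId, ByteBuffer> credentialsForResponse(long nmSeenSequenceNo) {
    if (nmSeenSequenceNo < tokenSequenceNo) {
      return credentials;          // NM is stale: ship the current tokens once
    }
    return Collections.emptyMap(); // NM is up to date: ship nothing
  }
}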
