RabbitMQ服务检查
RabbitMQ服务检查
Xuanyuan92 发表于2个月前
RabbitMQ服务检查
  • 发表于 2个月前
  • 阅读 9
  • 收藏 0
  • 点赞 0
  • 评论 0

腾讯云实验室 1小时搭建人工智能应用,让技术更容易入门 免费体验 >>>   

RabbitMQ服务检查

最近刚好工作上需要整理一下材料,所以把RabbitMQ的服务检查做了个简单的文档出来,留作备份也算。

1、各RabbitMQ节点上基本服务状态检查

登录到各个RabbitMQ节点上,执行

rabbitmqctl status

正常状态如下:

# Status of node rabbit@devxyz ...
# [{pid,13505},
#  {running_applications,
#      [{rabbitmq_management,"RabbitMQ Management Console","3.6.5"},
#       {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.5"},
#       {rabbit,"RabbitMQ","3.6.5"},
#       {os_mon,"CPO  CXC 138 46","2.4"},
#       {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.5"},
#       {webmachine,"webmachine","1.10.3"},
#       {mochiweb,"MochiMedia Web Server","2.13.1"},
#       {amqp_client,"RabbitMQ AMQP Client","3.6.5"},
#       {rabbit_common,[],"3.6.5"},
#       {mnesia,"MNESIA  CXC 138 12","4.13.4"},
#       {compiler,"ERTS  CXC 138 10","6.0.3"},
#       {ssl,"Erlang/OTP SSL application","7.3.3.1"},
#       {ranch,"Socket acceptor pool for TCP protocols.","1.2.1"},
#       {public_key,"Public key infrastructure","1.1.1"},
#       {xmerl,"XML parser","1.3.10"},
#       {inets,"INETS  CXC 138 49","6.2.4"},
#       {asn1,"The Erlang ASN1 compiler version 4.0.2","4.0.2"},
#       {crypto,"CRYPTO","3.6.3"},
#       {syntax_tools,"Syntax tools","1.7"},
#       {sasl,"SASL  CXC 138 11","2.7"},
#       {stdlib,"ERTS  CXC 138 10","2.8"},
#       {kernel,"ERTS  CXC 138 10","4.2"}]},
#  {os,{unix,linux}},
#  {erlang_version,
#      "Erlang/OTP 18 [erts-7.3.1.2] [source] [64-bit] [smp:8:8] [async-threads:128] [hipe] [kernel-poll:true]\n"},
#  {memory,
#      [{total,119288000},
#       {connection_readers,491304},
#       {connection_writers,33944},
#       {connection_channels,115312},
#       {connection_other,563312},
#       {queue_procs,510368},
#       {queue_slave_procs,0},
#       {plugins,1254560},
#       {other_proc,18328184},
#       {mnesia,160320},
#       {mgmt_db,2527968},
#       {msg_index,66840},
#       {other_ets,1641160},
#       {binary,55247472},
#       {code,27655723},
#       {atom,992409},
#       {other_system,9699124}]},
#  {alarms,[]},
#  {listeners,[{clustering,25672,"::"},{amqp,5672,"::"}]},
#  {vm_memory_high_watermark,0.4},
#  {vm_memory_limit,6663295795},
#  {disk_free_limit,50000000},
#  {disk_free,53003800576},
#  {file_descriptors,
#      [{total_limit,1948},
#       {total_used,23},
#       {sockets_limit,1751},
#       {sockets_used,21}]},
#  {processes,[{limit,1048576},{used,498}]},
#  {run_queue,0},
#  {uptime,47953},
#  {kernel,{net_ticktime,60}}]


得到rabbitmq服务的状态,得到的结果中显示服务正在running,且不存在nodedown、error等字样,并且
running_applications中包含了rabbitmq_management等应用名(如果开启了rabbitmq_management等插件)

1.1、如果running状态,但是没有rabbitmq_management字样,类似如下结果:

# Status of node rabbit@devxyz ...
# [{pid,13505},
#  {running_applications,[{compiler,"ERTS  CXC 138 10","6.0.3"},
#                         {ssl,"Erlang/OTP SSL application","7.3.3.1"},
#                         {ranch,"Socket acceptor pool for TCP protocols.",
#                                "1.2.1"},
#                         {public_key,"Public key infrastructure","1.1.1"},
#                         {xmerl,"XML parser","1.3.10"},
#                         {inets,"INETS  CXC 138 49","6.2.4"},
#                         {asn1,"The Erlang ASN1 compiler version 4.0.2",
#                               "4.0.2"},
#                         {crypto,"CRYPTO","3.6.3"},
#                         {syntax_tools,"Syntax tools","1.7"},
#                         {sasl,"SASL  CXC 138 11","2.7"},
#                         {stdlib,"ERTS  CXC 138 10","2.8"},
#                         {kernel,"ERTS  CXC 138 10","4.2"}]},
#  {os,{unix,linux}},
#  {erlang_version,"Erlang/OTP 18 [erts-7.3.1.2] [source] [64-bit] [smp:8:8] [async-threads:128] [hipe] [kernel-poll:true]\n"},
#  {memory,[{total,58267544},
#           {connection_readers,0},
#           {connection_writers,0},
#           {connection_channels,0},
#           {connection_other,0},
#           {queue_procs,0},
#           {queue_slave_procs,0},
#           {plugins,0},
#           {other_proc,18771312},
#           {mnesia,0},
#           {mgmt_db,0},
#           {msg_index,0},
#           {other_ets,1218464},
#           {binary,29984},
#           {code,27655723},
#           {atom,992409},
#           {other_system,9599652}]},
#  {alarms,[]},
#  {listeners,[]},
#  {processes,[{limit,1048576},{used,73}]},
#  {run_queue,0},
#  {uptime,48363},
#  {kernel,{net_ticktime,60}}]


则说明rabbitmq应用没有启动,只启动了基础服务,则执行
rabbitmqctl start_app
得到:
# Starting node rabbit@devxyz ...

然后
rabbitmqctl status 再次验证服务状态

1.2、如果存在error,例如:

# Status of node rabbit@devxyz ...
# Error: unable to connect to node rabbit@devxyz: nodedown
# 
# DIAGNOSTICS
# ===========
# 
# attempted to contact: [rabbit@devxyz]
# 
# rabbit@devxyz:
#   * connected to epmd (port 4369) on devxyz
#   * epmd reports: node 'rabbit' not running at all
#                   no other nodes on devxyz
#   * suggestion: start the node
# 
# current node details:
# - node name: 'rabbitmq-cli-07@devxyz'
# - home dir: /var/lib/rabbitmq
# - cookie hash: duuNopvOx1ChRdjrRHPo+A==

说明rabbitmq的基础服务都没有启动起来,首先尝试如下命令看是否可以启动:

rabbitmq-server -detached
得到:
# Warning: PID file not written; -detached was passed.

rabbitmqctl start_app
得到
# Starting node rabbit@devxyz ...

rabbitmqctl status验证
如果无法得到正常状态,则需要根据报错信息进行判断再进行相应操作

2、检查RabbitMQ集群的状态

登录到任意一个存活的RabbitMQ节点上,执行

rabbitmqctl cluster_status
得到:
# Cluster status of node rabbit@HYRBT001 ...
# [{nodes,[{disc,[rabbit@HYRBT001,rabbit@HYRBT002,rabbit@HYRBT003]}]},
#  {running_nodes,[rabbit@HYRBT003,rabbit@HYRBT002,rabbit@HYRBT001]},
#  {cluster_name,<<"HYRBT001">>},
#  {partitions,[]},
#  {alarms,[{rabbit@HYRBT003,[]},{rabbit@HYRBT002,[]},{rabbit@HYRBT001,[]}]}]

得到集群的状态信息
nodes: 后面会显示所有的rabbitmq节点
running_nodes: 后面会显示所有的rabbitmq节点
cluster_name:后面会显示集群名称
partitions之后为空
alarms之后跟的节点之后的[]中为空


2.1、如果nodes后面的rabbitmq节点不全,说明存在节点没有加入到集群中

例如:
# Cluster status of node rabbit@HYCTL001 ...
# [{nodes,[{disc,[rabbit@HYCTL001,rabbit@HYCTL002]}]},
#  {running_nodes,[rabbit@HYCTL002,rabbit@HYCTL001]},
#  {cluster_name,<<"rabbit@HYCTL001">>},
#  {partitions,[]},
#  {alarms,[{rabbit@HYCTL002,[]},{rabbit@HYCTL001,[]}]}]

但实际上应该有三个节点上,则登录到未加入到集群中的节点3上

首先验证此节点与已经加入集群的节点的连通性,通过ping测试
然后验证.erlang.cookie是否相同
.erlang.cookie位于/var/lib/rabbitmq/下
如果不同,则将集群中节点的内容复制到此节点上

验证都通过后,查看rabbitmq服务是否已经开启,具体见步骤1

服务正常之后,执行如下命令加入集群:

rabbitmqctl stop_app
得到:
# Stopping node rabbit@HYCTL003 ...

rabbitmqctl reset
得到:
# Resetting node rabbit@HYCTL003 ...

rabbitmqctl join_cluster rabbit@集群节点名
得到:
# Clustering node rabbit@HYCTL003 with rabbit@HYCTL001 ...

rabbitmqctl start_app
得到:
# Starting node rabbit@HYCTL003 ...

rabbitmqctl cluster_status验证节点已经加入到nodes、running_nodes及alarms之后
得到:
# Cluster status of node rabbit@HYCTL003 ...
# [{nodes,[{disc,[rabbit@HYCTL001,rabbit@HYCTL002,rabbit@HYCTL003]}]},
#  {running_nodes,[rabbit@HYCTL001,rabbit@HYCTL002,rabbit@HYCTL003]},
#  {cluster_name,<<"rabbit@HYCTL001">>},
#  {partitions,[]},
#  {alarms,[{rabbit@HYCTL001,[]},{rabbit@HYCTL002,[]},{rabbit@HYCTL003,[]}]}]

2.2、如果running_nodes之后未显示所有的节点,说明部分节点上的rabbitmq服务未正常,例如:

# Cluster status of node rabbit@HYCTL001 ...
# [{nodes,[{disc,[rabbit@HYCTL001,rabbit@HYCTL002,rabbit@HYCTL003]}]},
#  {running_nodes,[rabbit@HYCTL002,rabbit@HYCTL001]},
#  {cluster_name,<<"rabbit@HYCTL001">>},
#  {partitions,[]},
#  {alarms,[{rabbit@HYCTL002,[]},{rabbit@HYCTL001,[]}]}]

发现节点3没有running,则登录到节点3
参考步骤1进行处理,处理完成后,执行
rabbitmqctl cluster_status验证

# Cluster status of node rabbit@HYCTL003 ...
# [{nodes,[{disc,[rabbit@HYCTL001,rabbit@HYCTL002,rabbit@HYCTL003]}]},
#  {running_nodes,[rabbit@HYCTL001,rabbit@HYCTL002,rabbit@HYCTL003]},
#  {cluster_name,<<"rabbit@HYCTL001">>},
#  {partitions,[]},
#  {alarms,[{rabbit@HYCTL001,[]},{rabbit@HYCTL002,[]},{rabbit@HYCTL003,[]}]}]

2.3、如果partitions中存在节点,则说明发生了脑裂(一般为网络问题,导致节点之间通信异常),集群服务处于异常状态。
例如:
# Cluster status of node rabbit@HYCTL001 ...
# [{nodes,[{disc,[rabbit@HYCTL001,rabbit@HYCTL002,rabbit@HYCTL00]}]},
#  {running_nodes,[rabbit@HYCTL001,rabbit@HYCTL002,rabbit@HYCTL003]},
#  {cluster_name,<<"rabbit@HYCTL001">>},
#  {partitions,[{rabbit@HYCTL001,rabbit@HYCTL002,[rabbit@HYCTL001]},
#                {rabbit@HYCTL003,[rabbit@HYCTL001,rabbit@HYCTL002]}]},
#  {alarms,[{rabbit@HYCTL001,[]},{rabbit@HYCTL002,[]},{rabbit@HYCTL003,[]}]}]

需要确定一个主节点进行保留,然后把另外partition中节点进行服务重启。


主节点的确定主要分两种情况:
	2.3.1、如果使用了haproxy来对rabbitmq集群进行负载均衡,并且设置了主备模式,则可以通过查看haproxy的配置
	文件来确定:
	登录到某一台控制节点,查看haproxy配置文件:
	cat /etc/haproxy/conf.d/100-rabbitmq.cfg
	得到:

	# listen rabbitmq
	#   bind 192.168.0.10:5672
	#   balance  roundrobin
	#   mode  tcp
	#   option  tcpka
	#   timeout client  48h
	#   timeout server  48h
	#   server HYCTL001 192.168.0.11:5673   check inter 5000 rise 2 fall 3
	#   server HYCTL002 192.168.0.12:5673  backup check inter 5000 rise 2 fall 3
	#   server HYCTL003 192.168.0.13:5673  backup check inter 5000 rise 2 fall 3

	配置文件中存在backup的是备节点,无backup的是主节点,由此可见,对于本环境,HYCTL001为主节点,处理业务

	确定好主节点之后,登录到其他非主节点的rabbitmq节点进行rabbitmq服务的重启

	执行如下命令:
	rabbitmqctl stop
	得到:
	# Stopping and halting node rabbit@HYCTL003 ...

	rabbitmq-server -detached
	得到:
	# Warning: PID file not written; -detached was passed.

	rabbitmqctl start_app
	得到:
	# Starting node rabbit@HYCTL003 ...

	使用rabbitmqctl status检查状态
	使用rabbitmqctl cluster_status检查集群状态,如果依然存在其他脑裂的节点,则partitions中主节点所在元组
	会增加刚刚重启的节点,其他元组中该节点被移除。如果所有的脑裂节点都已经处理完毕,则partitions后无节点存在,
	得到:
	# Cluster status of node rabbit@HYCTL003 ...
	# [{nodes,[{disc,[rabbit@HYCTL001,rabbit@HYCTL002,rabbit@HYCTL003]}]},
	#  {running_nodes,[rabbit@HYCTL001,rabbit@HYCTL002,rabbit@HYCTL003]},
	#  {cluster_name,<<"rabbit@HYCTL001">>},
	#  {partitions,[]},
	#  {alarms,[{rabbit@HYCTL001,[]},{rabbit@HYCTL002,[]},{rabbit@HYCTL003,[]}]}]

	2.3.2 如果没有设置主备模式,则需要确定下当前连接数最多的节点,以此节点为主
	通过查看连接数来进行判断,在任意一个RabbitMQ节点上执行:
	rabbitmqctl list_connections pid | grep HYCTL001(节点名) | wc -l
	
	对所有的节点名进行连接数个数的选取,最终选择连接数目最多的那个partition元组作为主元组,对其他元组中的节点
	进行RabbitMQ服务的重启,重启步骤与2.3.1相同

2.4、如果alarms中存在节点,说明内存或者磁盘占用过多,例如:

	# Cluster status of node rabbit@HYCTL003 ...
	# [{nodes,[{disc,[rabbit@HYCTL001,rabbit@HYCTL002,rabbit@HYCTL003]}]},
	#  {running_nodes,[rabbit@HYCTL001,rabbit@HYCTL002,rabbit@HYCTL003]},
	#  {cluster_name,<<"rabbit@HYCTL001">>},
	#  {partitions,[]},
	#  {alarms,[{rabbit@HYCTL001,[]},
    #   {rabbit@HYCTL002,[]},
    #   {rabbit@HYCTL003,[disk,memory]}]}]

	说明节点3上内存和磁盘都出现了报警,说明有大量的消息堆积在了节点3上,原因可能是后端消费消息的服务异常或者存在无效的队列
	一直在接收消息,但是并没有消费者进行消费。
	RabbitMQ报警的参数是可以设置的,具体的值通过rabbitmqctl status可以看到,如下:

	# {vm_memory_high_watermark,0.4},内存使用阈值
	# {vm_memory_limit,81016840192},内存使用限值
	# {disk_free_limit,50000000},空闲磁盘限值
	# {disk_free,553529729024},磁盘空余量
	# {file_descriptors,
	#     [{total_limit,102300},
	#      {total_used,2040},
	#      {sockets_limit,92068},
	#      {sockets_used,2038}]},文件描述符和socket的使用及阈值
	# {processes,[{limit,1048576},{used,31681}]},进程数使用及阈值


	消息堆积数目的确定可以通过如下命令:
	rabbitmqctl list_queues messages_ready | awk 'NR>=2{print }'| awk '{sum+=$1}END{print sum}'
	得到最终消息堆积数目

	rabbitmqctl list_queues message_bytes_ram | awk 'NR>=2{print }'| awk '{sum+=$1}END{print sum}'
	可以得到消息堆积占用的内存

	rabbitmqctl list_queues message_bytes_persistent | awk 'NR>=2{print }'| awk '{sum+=$1}END{print sum}'
	可以得到消息堆积占用的磁盘

	消息堆积时需要先检查是那些队列堆积消息过多
	rabbitmqctl list_queues message_bytes_ram name | awk 'NR>=2{print }'|sort -rn|less
	得到消息堆积数目的从大到小的排序,并显示队列名称,然后根据队列名进行不同节点服务的排查,如果是服务状态异常,则
	对服务进行处理,如果是无效队列(前期使用当前已经不再使用的服务产生的队列),则进行删除,队列的删除需要登录到
	RabbitMQ的管理页面上进行处理,后面会写管理页面的操作。

2.5、检查RabbitMQ的队列或者连接是否处于流控状态

	当RabbitMQ的消费者端的处理能力远低于消息的生产速度时,RabbitMQ会自动进行流控,避免消息过度堆积且导致消息从
	产生到被消费时间间隔过长。
	是否发生了流控可以通过命令行查看,登录到任意一个RabbitMQ节点,执行

	rabbitmqctl list_queus name state | grep flow
	如果得到结果,说明对应的队列产生了流控,需要对队列的生产进程和消费进程进行检查,参考2.4

	rabbitmqctl list_connections name state|grep flow
	如果得到结果,说明对应的连接产生了流控,此时队列中也一定会有流控,对队列的生产进程和消费进程进行检查,参考2.4

3、RabbitMQ管理页面开启

RabbitMQ管理页面的开启需要先启用rabbitmq_management插件
登录到任意一台RabbitMQ节点,首先查看是否启用了rabbitmq_management插件:

rabbitmq-plugins list -v -E  |grep -A5 rabbitmq_management
得到:
# [E*] rabbitmq_management
#      Version:     	3.6.5
#      Dependencies:	[rabbitmq_web_dispatch,amqp_client,
#                          rabbitmq_management_agent]
#      Description: 	RabbitMQ Management Console

说明rabbitmq_management插件已经启用

如果未启用,则通过如下命令开启:

rabbitmq-plugins enable rabbitmq_management
得到:
# The following plugins have been enabled:
#   mochiweb
#   webmachine
#   rabbitmq_web_dispatch
#   amqp_client
#   rabbitmq_management_agent
#   rabbitmq_management
# 
# Applying plugin configuration to rabbit@devxyz... started 6 plugins.

rabbitmq_management插件启用以后,需要开通15672端口的防火墙规则,rabbitmq_management插件默认使用15672端口
进行访问

iptables -I INPUT -p tcp --dport 15672 -j ACCEPT
service iptables save
开启iptables规则并保存

然后使用此节点的ip:15672登录到管理页面
输入用户名密码

用户名可以通过
rabbitmqctl list_users来获取,对应的密码为之前用户设置的密码,使用非guest用户登录

登录过后可以看到RabbitMQ整个集群的状态,各节点的状态,是否存在脑裂,是否存在报警,当前的消息堆积数目等,
如果需要对队列进行删除,需要点击Queues标签,然后再Filter后输入队列名,点击进入队列,拉到页面下方,点击
Delete/purge栏,点击Delete可删除队列,purge可清空队列
共有 人打赏支持
粉丝 0
博文 1
码字总数 2853
×
Xuanyuan92
如果觉得我的文章对您有用,请随意打赏。您的支持将鼓励我继续创作!
* 金额(元)
¥1 ¥5 ¥10 ¥20 其他金额
打赏人
留言
* 支付类型
微信扫码支付
打赏金额:
已支付成功
打赏金额: