【原创】记录几个最近遇到的未解问题(resolved)
【原创】记录几个最近遇到的未解问题(resolved)
摩云飞 发表于1年前
【原创】记录几个最近遇到的未解问题(resolved)
  • 发表于 1年前
  • 阅读 650
  • 收藏 1
  • 点赞 1
  • 评论 5

腾讯云实验室 1小时搭建人工智能应用,让技术更容易入门 免费体验 >>>   

摘要: 最近遇到的未解问题增多了,需要时间消化下~~

问题一:ejabberd 持续 crashdump

[root@upucore_105 logs]# ls *.dump
erl_crash_20160420-023053.dump  erl_crash_20160420-142431.dump  erl_crash_20160421-012212.dump  erl_crash_20160421-051153.dump  erl_crash_20160421-070845.dump  erl_crash_20160421-122015.dump
erl_crash_20160420-024120.dump  erl_crash_20160420-142731.dump  erl_crash_20160421-012310.dump  erl_crash_20160421-052453.dump  erl_crash_20160421-071828.dump  erl_crash_20160421-123153.dump
erl_crash_20160420-024331.dump  erl_crash_20160420-143215.dump  erl_crash_20160421-013453.dump  erl_crash_20160421-052753.dump  erl_crash_20160421-074435.dump  erl_crash_20160421-124029.dump
erl_crash_20160420-024823.dump  erl_crash_20160420-143324.dump  erl_crash_20160421-013552.dump  erl_crash_20160421-053627.dump  erl_crash_20160421-081136.dump  erl_crash_20160421-131853.dump
erl_crash_20160420-032253.dump  erl_crash_20160420-150153.dump  erl_crash_20160421-014439.dump  erl_crash_20160421-054022.dump  erl_crash_20160421-084953.dump  erl_crash_20160421-132339.dump
erl_crash_20160420-034503.dump  erl_crash_20160420-160153.dump  erl_crash_20160421-014537.dump  erl_crash_20160421-054153.dump  erl_crash_20160421-085503.dump  erl_crash_20160421-134852.dump
erl_crash_20160420-040853.dump  erl_crash_20160420-161253.dump  erl_crash_20160421-014636.dump  erl_crash_20160421-055355.dump  erl_crash_20160421-085701.dump  erl_crash_20160421-140313.dump
erl_crash_20160420-041253.dump  erl_crash_20160420-163453.dump  erl_crash_20160421-014953.dump  erl_crash_20160421-060534.dump  erl_crash_20160421-085953.dump  erl_crash_20160421-141056.dump
erl_crash_20160420-050153.dump  erl_crash_20160420-174102.dump  erl_crash_20160421-020359.dump  erl_crash_20160421-060753.dump  erl_crash_20160421-090453.dump  erl_crash_20160421-141453.dump
erl_crash_20160420-074907.dump  erl_crash_20160420-180053.dump  erl_crash_20160421-021247.dump  erl_crash_20160421-061418.dump  erl_crash_20160421-090838.dump  erl_crash_20160421-151653.dump
erl_crash_20160420-080959.dump  erl_crash_20160420-184253.dump  erl_crash_20160421-022715.dump  erl_crash_20160421-061714.dump  erl_crash_20160421-092406.dump  erl_crash_20160421-152315.dump
erl_crash_20160420-085042.dump  erl_crash_20160420-191953.dump  erl_crash_20160421-023503.dump  erl_crash_20160421-062011.dump  erl_crash_20160421-092756.dump  erl_crash_20160421-153753.dump
erl_crash_20160420-091353.dump  erl_crash_20160420-194153.dump  erl_crash_20160421-024154.dump  erl_crash_20160421-062307.dump  erl_crash_20160421-092953.dump  erl_crash_20160421-154453.dump
erl_crash_20160420-093153.dump  erl_crash_20160420-223553.dump  erl_crash_20160421-025234.dump  erl_crash_20160421-062953.dump  erl_crash_20160421-094516.dump  erl_crash_20160421-160301.dump
erl_crash_20160420-102753.dump  erl_crash_20160420-231953.dump  erl_crash_20160421-025332.dump  erl_crash_20160421-063853.dump  erl_crash_20160421-102153.dump  erl_crash_20160421-160653.dump
erl_crash_20160420-103453.dump  erl_crash_20160421-003149.dump  erl_crash_20160421-031945.dump  erl_crash_20160421-064523.dump  erl_crash_20160421-103326.dump  erl_crash_20160421-171823.dump
erl_crash_20160420-104653.dump  erl_crash_20160421-003253.dump  erl_crash_20160421-033617.dump  erl_crash_20160421-064820.dump  erl_crash_20160421-111837.dump  erl_crash_20160421-173053.dump
erl_crash_20160420-112753.dump  erl_crash_20160421-004228.dump  erl_crash_20160421-041015.dump  erl_crash_20160421-065116.dump  erl_crash_20160421-112953.dump  erl_crash_20160421-173953.dump
erl_crash_20160420-115008.dump  erl_crash_20160421-005857.dump  erl_crash_20160421-042247.dump  erl_crash_20160421-065803.dump  erl_crash_20160421-115902.dump  erl_crash_20160421-180145.dump
erl_crash_20160420-134303.dump  erl_crash_20160421-005956.dump  erl_crash_20160421-042642.dump  erl_crash_20160421-070059.dump  erl_crash_20160421-120853.dump  erl_crash_20160421-181715.dump
erl_crash_20160420-140853.dump  erl_crash_20160421-010153.dump  erl_crash_20160421-044117.dump  erl_crash_20160421-070356.dump  erl_crash_20160421-121233.dump
[root@upucore_105 logs]# 
[root@upucore_105 logs]# ls *.dump|wc -l
125
[root@upucore_105 logs]# 
[root@upucore_105 logs]# date
Thu Apr 21 18:21:54 CST 2016
[root@upucore_105 logs]# 
[root@upucore_105 logs]# for I in *.dump; do grep "Slogan" $I; echo "----"; done     
Slogan: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
----
Slogan: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
----
Slogan: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
----
...
----
Slogan: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
----
[root@upucore_105 logs]# 

[root@upucore_105 logs]# ps -A -o args,stime,etime |grep ejabberd
/usr/local/mo_ejabberd/bin/ Apr20  1-15:42:20
...
可以看到,ejabberd 是 4 月 20 日启动的,持续运行了一天多,生成了 125 个 crashdump 文件,但 ejabberd 进程还在。
除了上述错误信息外,之前还看到下面这种
Slogan: init terminating in do_boot ()
结论:  可以参考 erlang 手册中关于 erl_crash.dump 的相关说明,截图如下:

怀疑运行环境中存在版本不一致问题。

问题二:redis 服务被不断 shutdown

_._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 2.8.18 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in stand alone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 204410
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

[204410] 19 Apr 18:27:35.131 # Server started, Redis version 2.8.18
[204410] 19 Apr 18:27:35.132 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[204410] 19 Apr 18:27:35.132 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[204410] 19 Apr 18:27:35.132 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[204410] 19 Apr 18:27:35.161 - Accepted 127.0.0.1:16364
[204410] 19 Apr 18:27:35.166 * DB loaded from disk: 0.034 seconds
[204410] 19 Apr 18:27:35.166 * The server is now ready to accept connections on port 6379
[204410] 19 Apr 18:27:35.166 - Client closed connection
[204410] 19 Apr 18:27:35.166 - DB 0: 13573 keys (0 volatile) in 16384 slots HT.
[204410] 19 Apr 18:27:35.167 - 0 clients connected (0 slaves), 4797704 bytes in use
[204410] 19 Apr 18:27:35.223 - Accepted 172.16.186.205:17311
[204410] 19 Apr 18:27:36.078 - Accepted 172.16.186.203:16992
[204410] 19 Apr 18:27:36.078 * Slave 172.16.186.203:6379 asks for synchronization
[204410] 19 Apr 18:27:36.079 * Partial resynchronization not accepted: Runid mismatch (Client asked for 'f88ffa4c476425b22a0c1b56932937669b795c0f', I'm '0c4731011b0b911b000c1d70fdc3f907f76ce180')
[204410] 19 Apr 18:27:36.079 * Starting BGSAVE for SYNC with target: disk
[204410] 19 Apr 18:27:36.080 * Background saving started by pid 204415
[204415] 19 Apr 18:27:36.137 * DB saved on disk
[204415] 19 Apr 18:27:36.137 * RDB: 10 MB of memory used by copy-on-write
[204410] 19 Apr 18:27:36.168 * Background saving terminated with success
[204410] 19 Apr 18:27:36.190 * Synchronization with slave 172.16.186.203:6379 succeeded
[204410] 19 Apr 18:27:38.167 - Accepted 127.0.0.1:16418
[204410] 19 Apr 18:27:38.168 - Client closed connection
[204410] 19 Apr 18:27:38.172 - Accepted 172.16.186.205:10391
[204410] 19 Apr 18:27:38.172 - Client closed connection
[204410] 19 Apr 18:27:40.173 - DB 0: 13573 keys (0 volatile) in 16384 slots HT.
[204410] 19 Apr 18:27:40.173 - 1 clients connected (1 slaves), 5924936 bytes in use
[204410] 19 Apr 18:27:45.180 - DB 0: 13573 keys (0 volatile) in 16384 slots HT.
[204410] 19 Apr 18:27:45.180 - 1 clients connected (1 slaves), 5924648 bytes in use
[204410] 19 Apr 18:27:50.190 - DB 0: 13575 keys (0 volatile) in 16384 slots HT.
[204410] 19 Apr 18:27:50.191 - 1 clients connected (1 slaves), 5925024 bytes in use
[204410] 19 Apr 18:27:55.199 - DB 0: 13575 keys (0 volatile) in 16384 slots HT.
[204410] 19 Apr 18:27:55.199 - 1 clients connected (1 slaves), 5926848 bytes in use
[204410] 19 Apr 18:27:58.194 - Client closed connection
[204410] 19 Apr 18:27:58.194 # Connection with slave 172.16.186.203:6379 lost.
[204410] 19 Apr 18:28:00.208 - DB 0: 13575 keys (0 volatile) in 16384 slots HT.
[204410] 19 Apr 18:28:00.208 - 1 clients connected (0 slaves), 5870736 bytes in use
[204410] 19 Apr 18:28:02.248 - Accepted 172.16.186.203:17514
[204410] 19 Apr 18:28:02.248 * Slave 172.16.186.203:6379 asks for synchronization
[204410] 19 Apr 18:28:02.248 * Full resync requested by slave 172.16.186.203:6379
[204410] 19 Apr 18:28:02.248 * Starting BGSAVE for SYNC with target: disk
[204410] 19 Apr 18:28:02.250 * Background saving started by pid 205002
[205002] 19 Apr 18:28:02.307 * DB saved on disk
[205002] 19 Apr 18:28:02.308 * RDB: 12 MB of memory used by copy-on-write
[204410] 19 Apr 18:28:02.311 * Background saving terminated with success
[204410] 19 Apr 18:28:02.332 * Synchronization with slave 172.16.186.203:6379 succeeded
[204410] 19 Apr 18:28:05.216 - DB 0: 13575 keys (0 volatile) in 16384 slots HT.
[204410] 19 Apr 18:28:05.216 - 1 clients connected (1 slaves), 5891664 bytes in use
[204410] 19 Apr 18:28:10.225 - DB 0: 13575 keys (0 volatile) in 16384 slots HT.
[204410] 19 Apr 18:28:10.225 - 1 clients connected (1 slaves), 5891664 bytes in use
[204410] 19 Apr 18:28:15.233 - DB 0: 13575 keys (0 volatile) in 16384 slots HT.
[204410] 19 Apr 18:28:15.233 - 1 clients connected (1 slaves), 5891664 bytes in use
[204410] 19 Apr 18:28:20.239 - DB 0: 13575 keys (0 volatile) in 16384 slots HT.
[204410] 19 Apr 18:28:20.239 - 1 clients connected (1 slaves), 5891664 bytes in use
[204410] 19 Apr 18:28:25.246 - DB 0: 13575 keys (0 volatile) in 16384 slots HT.
[204410] 19 Apr 18:28:25.246 - 1 clients connected (1 slaves), 5891664 bytes in use
[204410] 19 Apr 18:28:30.254 - DB 0: 13575 keys (0 volatile) in 16384 slots HT.
[204410] 19 Apr 18:28:30.254 - 1 clients connected (1 slaves), 5891664 bytes in use
[204410] 19 Apr 18:28:33.507 - Accepted 127.0.0.1:17448
[204410] 19 Apr 18:28:33.507 # User requested shutdown...
[204410] 19 Apr 18:28:33.508 * Saving the final RDB snapshot before exiting.
[204410] 19 Apr 18:28:33.568 * DB saved on disk
[204410] 19 Apr 18:28:33.568 * Removing the pid file.
[204410] 19 Apr 18:28:33.568 # Redis is now ready to exit, bye bye...
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 2.8.18 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in stand alone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 206040
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

[206040] 19 Apr 18:28:33.580 # Server started, Redis version 2.8.18
[206040] 19 Apr 18:28:33.580 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[206040] 19 Apr 18:28:33.580 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[206040] 19 Apr 18:28:33.580 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[206040] 19 Apr 18:28:33.605 - Accepted 127.0.0.1:17454
[206040] 19 Apr 18:28:33.610 * DB loaded from disk: 0.030 seconds
[206040] 19 Apr 18:28:33.610 * The server is now ready to accept connections on port 6379
[206040] 19 Apr 18:28:33.610 - Client closed connection
[206040] 19 Apr 18:28:33.610 - DB 0: 13575 keys (0 volatile) in 16384 slots HT.
[206040] 19 Apr 18:28:33.611 - 0 clients connected (0 slaves), 4798104 bytes in use
[206040] 19 Apr 18:28:33.671 - Accepted 172.16.186.205:16672
[206040] 19 Apr 18:28:34.341 - Accepted 172.16.186.203:18122
[206040] 19 Apr 18:28:34.342 * Slave 172.16.186.203:6379 asks for synchronization
[206040] 19 Apr 18:28:34.342 * Partial resynchronization not accepted: Runid mismatch (Client asked for '0c4731011b0b911b000c1d70fdc3f907f76ce180', I'm '2ae45a0020a36a175b290e23f04672c54fc7fdef')
[206040] 19 Apr 18:28:34.342 * Starting BGSAVE for SYNC with target: disk
[206040] 19 Apr 18:28:34.344 * Background saving started by pid 206049
[206049] 19 Apr 18:28:34.397 * DB saved on disk
[206049] 19 Apr 18:28:34.398 * RDB: 10 MB of memory used by copy-on-write
[206040] 19 Apr 18:28:34.413 * Background saving terminated with success
[206040] 19 Apr 18:28:34.436 * Synchronization with slave 172.16.186.203:6379 succeeded
[206040] 19 Apr 18:28:36.610 - Accepted 127.0.0.1:17513
[206040] 19 Apr 18:28:36.611 - Client closed connection
[206040] 19 Apr 18:28:36.615 - Accepted 172.16.186.205:11486
[206040] 19 Apr 18:28:36.615 - Client closed connection
[206040] 19 Apr 18:28:38.621 - DB 0: 13575 keys (0 volatile) in 16384 slots HT.
[206040] 19 Apr 18:28:38.621 - 1 clients connected (1 slaves), 5889264 bytes in use
[206040] 19 Apr 18:28:43.627 - DB 0: 13575 keys (0 volatile) in 16384 slots HT.
[206040] 19 Apr 18:28:43.627 - 1 clients connected (1 slaves), 5888208 bytes in use
[206040] 19 Apr 18:28:48.635 - DB 0: 13575 keys (0 volatile) in 16384 slots HT.
[206040] 19 Apr 18:28:48.636 - 1 clients connected (1 slaves), 5888208 bytes in use
[206040] 19 Apr 18:28:53.577 - Client closed connection
[206040] 19 Apr 18:28:53.577 # Connection with slave 172.16.186.203:6379 lost.
[206040] 19 Apr 18:28:53.645 - DB 0: 13572 keys (0 volatile) in 16384 slots HT.
[206040] 19 Apr 18:28:53.645 - 1 clients connected (0 slaves), 5870128 bytes in use
[206040] 19 Apr 18:28:57.632 - Accepted 172.16.186.203:18604
[206040] 19 Apr 18:28:57.632 * Slave 172.16.186.203:6379 asks for synchronization
[206040] 19 Apr 18:28:57.632 * Full resync requested by slave 172.16.186.203:6379
[206040] 19 Apr 18:28:57.633 * Starting BGSAVE for SYNC with target: disk
[206040] 19 Apr 18:28:57.634 * Background saving started by pid 206569
[206569] 19 Apr 18:28:57.690 * DB saved on disk
[206569] 19 Apr 18:28:57.691 * RDB: 12 MB of memory used by copy-on-write
[206040] 19 Apr 18:28:57.752 * Background saving terminated with success
[206040] 19 Apr 18:28:57.773 * Synchronization with slave 172.16.186.203:6379 succeeded
[206040] 19 Apr 18:28:58.653 - DB 0: 13572 keys (0 volatile) in 16384 slots HT.
[206040] 19 Apr 18:28:58.653 - 1 clients connected (1 slaves), 5929296 bytes in use
[206040] 19 Apr 18:29:03.661 - DB 0: 13572 keys (0 volatile) in 16384 slots HT.
[206040] 19 Apr 18:29:03.661 - 1 clients connected (1 slaves), 5930992 bytes in use
[206040] 19 Apr 18:29:08.670 - DB 0: 13572 keys (0 volatile) in 16384 slots HT.
[206040] 19 Apr 18:29:08.670 - 1 clients connected (1 slaves), 5930992 bytes in use
[206040] 19 Apr 18:29:13.679 - DB 0: 13572 keys (0 volatile) in 16384 slots HT.
[206040] 19 Apr 18:29:13.679 - 1 clients connected (1 slaves), 5930992 bytes in use
[206040] 19 Apr 18:29:18.689 - DB 0: 13572 keys (0 volatile) in 16384 slots HT.
[206040] 19 Apr 18:29:18.689 - 1 clients connected (1 slaves), 5930992 bytes in use
[206040] 19 Apr 18:29:23.698 - DB 0: 13572 keys (0 volatile) in 16384 slots HT.
[206040] 19 Apr 18:29:23.698 - 1 clients connected (1 slaves), 5930992 bytes in use
[206040] 19 Apr 18:29:28.706 - DB 0: 13572 keys (0 volatile) in 16384 slots HT.
[206040] 19 Apr 18:29:28.706 - 1 clients connected (1 slaves), 5930992 bytes in use
[206040] 19 Apr 18:29:33.715 - DB 0: 13572 keys (0 volatile) in 16384 slots HT.
[206040] 19 Apr 18:29:33.715 - 1 clients connected (1 slaves), 5930992 bytes in use
[206040] 19 Apr 18:29:38.580 - Accepted 127.0.0.1:18538
[206040] 19 Apr 18:29:38.581 # User requested shutdown...
[206040] 19 Apr 18:29:38.581 * Saving the final RDB snapshot before exiting.
[206040] 19 Apr 18:29:38.636 * DB saved on disk
[206040] 19 Apr 18:29:38.636 * Removing the pid file.
[206040] 19 Apr 18:29:38.636 # Redis is now ready to exit, bye bye...
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 2.8.18 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in stand alone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 207901
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

[207901] 19 Apr 18:29:38.649 # Server started, Redis version 2.8.18
通过针对抓包,日志,和网络连接等进行分析,目前得出如下结论
  • shutdown 命令来自于 127.0.0.1 <-> 127.0.0.1 的 TCP 连接;
  • redis 接收 shutdown 后会自行关闭 socket ,所以 TIME_WAIT 状态在 redis 侧;
  • 每 55 秒左右 shutdown 一次;
结论:之前就怀疑是由于运维人员的脚本检测导致的问题,结果不幸命中~~

该脚本用于进行配置信息检测和变更
...
if [ "$backup_status"x = "0"x ];then
	grep -q "^bind 127.0.0.1 $RedisLocalInnerIp" $path2 || {
		sed -i "/^bind /c\bind 127.0.0.1 $RedisLocalInnerIp" $path2
		./start.sh	 
	}
else
	grep -q "^bind $VIP 127.0.0.1 $RedisLocalInnerIp" $path2 || {
		sed -i "/^bind /c\bind $VIP 127.0.0.1 $RedisLocalInnerIp" $path2
		./start.sh
	}
fi
...
该脚本用于检测 redis 进程运行情况(在某些检测状态下进行强杀)
...
cmd="ps aux|grep "/usr/local/redis/bin/redis-server"|grep -v grep|wc -l"
proc=$(eval $cmd)
if [ $proc == "1" ]; then
	/usr/local/redis/bin/redis-cli shutdown
elif [ $proc == "0" ]; then
	continue
else
	redis_pids=$(pidof /usr/local/redis/bin/redis-server)
	[ -z "$redis_pids" ] && echo "redis is not running" || (kill -9 $redis_pids && echo "$date redis is killed by stop.sh" >> $logpath)
fi
...
原因:在第一个脚本中针对配置检测的命令存在错误(上面已修正,错误太低级就不贴了),导致一直认为配置存在问题,进而在一定检测周期之后,重启 redis 。
[root@xnu_205 redis]#
[root@xnu_205 redis]# strace -tt -s 1024 -p 12936
Process 12936 attached
22:50:14.894121 epoll_wait(3, {}, 10128, 3) = 0
22:50:14.897436 open("/proc/12936/stat", O_RDONLY) = 7
22:50:14.897566 read(7, "12936 (redis-server) R 1 12936 12936 0 -1 4202816 768 0 0 0 6 1 0 0 20 0 3 0 508818644 41508864 2406 18446744073709551615 4194304 5108532 140726224740688 140726224735808 241015252957 0 0 4097 17610 18446744073709551615 0 0 17 12 0 0 0 0 0\n", 4096) = 239
22:50:14.897691 close(7)                = 0
...
22:50:56.525098 open("/usr/log/redis/redis.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 7
22:50:56.525176 fstat(7, {st_mode=S_IFREG|0644, st_size=1320763367, ...}) = 0
22:50:56.525245 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdd373fd000
22:50:56.525345 fstat(7, {st_mode=S_IFREG|0644, st_size=1320763367, ...}) = 0
22:50:56.525429 lseek(7, 1320763367, SEEK_SET) = 1320763367
22:50:56.525494 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=388, ...}) = 0
22:50:56.525594 write(7, "[12936] 15 Apr 22:50:56.525 - 1 clients connected (0 slaves), 3429072 bytes in use\n", 83) = 83
22:50:56.525732 close(7)                = 0
22:50:56.525803 munmap(0x7fdd373fd000, 4096) = 0
22:50:56.525892 epoll_ctl(3, EPOLL_CTL_MOD, 6, {EPOLLIN|EPOLLOUT, {u32=6, u64=6}}) = 0
22:50:56.525986 epoll_wait(3, {{EPOLLOUT, {u32=6, u64=6}}}, 10128, 100) = 1
22:50:56.526056 write(6, "*3\r\n$8\r\nREPLCONF\r\n$3\r\nACK\r\n$5\r\n51047\r\n", 38) = 38
22:50:56.526148 epoll_ctl(3, EPOLL_CTL_MOD, 6, {EPOLLIN, {u32=6, u64=6}}) = 0
22:50:56.526228 epoll_wait(3, {{EPOLLIN, {u32=6, u64=6}}}, 10128, 99) = 1
22:50:56.563906 read(6, "*1\r\n$5\r\nMULTI\r\n*1\r\n$4\r\nEXEC\r\n", 16384) = 29
22:50:56.564249 epoll_wait(3, {{EPOLLIN, {u32=6, u64=6}}}, 10128, 61) = 1
22:50:56.565874 read(6, "*1\r\n$5\r\nMULTI\r\n*2\r\n$3\r\nDEL\r\n$33\r\ncollector:00:0C:29:DA:3B:48:timer\r\n*1\r\n$4\r\nEXEC\r\n", 16384) = 82
22:50:56.566160 epoll_wait(3, {{EPOLLIN, {u32=6, u64=6}}}, 10128, 59) = 1
22:50:56.567651 read(6, "*1\r\n$5\r\nMULTI\r\n*2\r\n$3\r\nDEL\r\n$32\r\ncollector:00:0C:29:DA:3B:48:info\r\n*1\r\n$4\r\nEXEC\r\n", 16384) = 81
22:50:56.567953 epoll_wait(3, {{EPOLLIN, {u32=6, u64=6}}}, 10128, 58) = 1
22:50:56.569163 read(6, "*1\r\n$5\r\nMULTI\r\n*3\r\n$4\r\nSREM\r\n$9\r\ncollector\r\n$17\r\n00:0C:29:DA:3B:48\r\n*1\r\n$4\r\nEXEC\r\n", 16384) = 82
22:50:56.569281 epoll_wait(3, {}, 10128, 56) = 0
22:50:56.625464 open("/proc/12936/stat", O_RDONLY) = 7
22:50:56.625566 read(7, "12936 (redis-server) R 1 12936 12936 0 -1 4202816 790 0 0 0 9 7 0 0 20 0 3 0 508818644 41508864 2410 18446744073709551615 4194304 5108532 140726224740688 140726224735808 241015252957 0 0 4097 17610 18446744073709551615 0 0 17 0 0 0 0 0 0\n", 4096) = 238
22:50:56.625656 close(7)                = 0
22:50:56.625765 epoll_wait(3, {}, 10128, 100) = 0
22:50:56.725989 open("/proc/12936/stat", O_RDONLY) = 7
22:50:56.726090 read(7, "12936 (redis-server) R 1 12936 12936 0 -1 4202816 790 0 0 0 9 7 0 0 20 0 3 0 508818644 41508864 2410 18446744073709551615 4194304 5108532 140726224740688 140726224735808 241015252957 0 0 4097 17610 18446744073709551615 0 0 17 0 0 0 0 0 0\n", 4096) = 238
22:50:56.726185 close(7)                = 0

22:50:56.726261 epoll_wait(3, {{EPOLLIN, {u32=5, u64=5}}}, 10128, 100) = 1
22:50:56.731228 accept(5, {sa_family=AF_INET, sin_port=htons(13397), sin_addr=inet_addr("127.0.0.1")}, [16]) = 7
22:50:56.731359 open("/usr/log/redis/redis.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 8
22:50:56.731438 fstat(8, {st_mode=S_IFREG|0644, st_size=1320763450, ...}) = 0
22:50:56.731502 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdd373fd000
22:50:56.731565 fstat(8, {st_mode=S_IFREG|0644, st_size=1320763450, ...}) = 0
22:50:56.731620 lseek(8, 1320763450, SEEK_SET) = 1320763450
22:50:56.731693 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=388, ...}) = 0
22:50:56.731800 write(8, "[12936] 15 Apr 22:50:56.731 - Accepted 127.0.0.1:13397\n", 55) = 55
22:50:56.731896 close(8)                = 0
22:50:56.731951 munmap(0x7fdd373fd000, 4096) = 0
22:50:56.732024 fcntl(7, F_GETFL)       = 0x2 (flags O_RDWR)
22:50:56.732076 fcntl(7, F_SETFL, O_RDWR|O_NONBLOCK) = 0
22:50:56.732126 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
22:50:56.732183 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
22:50:56.732246 accept(5, 0x7ffd60a2de70, [128]) = -1 EAGAIN (Resource temporarily unavailable)
22:50:56.732343 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 10128, 94) = 1

收到 shutdown 命令
22:50:56.732405 read(7, "*1\r\n$8\r\nshutdown\r\n", 16384) = 18
22:50:56.732477 open("/usr/log/redis/redis.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 8
22:50:56.732538 fstat(8, {st_mode=S_IFREG|0644, st_size=1320763505, ...}) = 0
22:50:56.732592 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdd373fd000
22:50:56.732648 fstat(8, {st_mode=S_IFREG|0644, st_size=1320763505, ...}) = 0
22:50:56.732735 lseek(8, 1320763505, SEEK_SET) = 1320763505
22:50:56.732827 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=388, ...}) = 0

22:50:56.732925 write(8, "[12936] 15 Apr 22:50:56.732 # User requested shutdown...\n", 57) = 57
22:50:56.733010 close(8)                = 0
22:50:56.733064 munmap(0x7fdd373fd000, 4096) = 0
22:50:56.733134 open("/usr/log/redis/redis.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 8
22:50:56.733200 fstat(8, {st_mode=S_IFREG|0644, st_size=1320763562, ...}) = 0
22:50:56.733261 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdd373fd000
22:50:56.733338 fstat(8, {st_mode=S_IFREG|0644, st_size=1320763562, ...}) = 0
22:50:56.733403 lseek(8, 1320763562, SEEK_SET) = 1320763562
22:50:56.733490 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=388, ...}) = 0
22:50:56.733583 write(8, "[12936] 15 Apr 22:50:56.733 * Saving the final RDB snapshot before exiting.\n", 76) = 76
22:50:56.733664 close(8)                = 0
22:50:56.733734 munmap(0x7fdd373fd000, 4096) = 0
22:50:56.733805 open("temp-12936.rdb", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 8
22:50:56.733928 fstat(8, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
22:50:56.733982 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdd373fd000

开始生成 RDB snapshot
22:50:56.734648 write(8, "REDIS0006\376\0\r\37terminal:1211110000225:baseinfo@~~\0\0\0s\0\0\0\10\0\0\4moid\6$5c97551e-9315-4b4b-b1b8-abd99bad5ca1&\vdomain_moid\r\30fya1iwz59u7s7xpzapyrq9an\32\4name\6\340a\6\310\373\31\1\0\0\n\4e164\6\340a\6\310\373\31\1\0\0\377\r\37terminal:3333330000803:baseinfo@~~\0\0\0s\0\0\0\10\0\0\4moid\6$7f1531e2-03a7-4ec9-a9c1-ad643c960c0c&\vdomain_moid\r\0302ttyq3nfet50d7m2iubntqmd\32\4name\6\340\243[\363\31\10\3\0\0\n\4e164\6\340\243[\363\31\10\3\0\0\377\r\37terminal:3333330000534:baseinfo@~~\0\0\0s\0\0\0\10\0\0\4moid\6$f9f27425-4321-45d0-b152-e97231326e16&\vdomain_moid\r\0302ttyq3nfet50d7m2iubntqmd\32\4name\6\340\226Z\363\31\10\3\0\0\n\4e164\6\340\226Z\363\31\10\3\0\0\377\r6terminal:caa5e471-ec55-41da-8d60-c39edfb52ba8:baseinfo@~~\0\0\0s\0\0\0\10\0\0\4moid\6$caa5e471-ec55-41da-8d60-c39edfb52ba8&\vdomain_moid\r\30fya1iwz59u7s7xpzapyrq9an\32\4name\6\340\217\5\310\373\31\1\0\0\n\4e164\6\340\217\5\310\373\31\1\0\0\377\r6terminal:14a22ddd-1cee-445c-b2b3-1fe78e434d9e:baseinfo@~~\0\0\0s\0\0\0\10\0\0\4moid\6$14a22ddd-1cee-445c-b2b3-1fe78e434d9e&\vdomain_moid\r\30fya1iwz59u7s7xpzapyrq9an\32\4name\6\340)\6\310\373\31\1\0\0\n\4e164\6\340)\6\310\373\31\1\0\0\377\r\37terminal:1211110004968:baseinfo@~~\0\0\0s\0\0\0\10\0\0\4moid\6$33c39589-7a57-4b4f-8400-16fc464cd286&\vdomain_moid\r\30fya1iwz59u7s7xpzapyrq9an\32\4name\6\340\350\30\310\373\31\1\0\0\n\4e164\6\340\350\30\310\373\31\1\0\0\377\r"..., 4096) = 4096
...
22:51:09.970594 write(8, "10001101:baseinfo@~~\0\0\0s\0\0\0\10\0\0\4moid\6$c53442cc-43ce-48e5-a48e-9b27143245aa&\vdomain_moid\r\30fya1iwz59u7s7xpzapyrq9an\32\4name\6\340\315\t\310\373\31\1\0\0\n\4e164\6\340\315\t\310\373\31\1\0\0\377\r6terminal:5c31d1df-5d74-443d-aa4a-6c39851be016:baseinfo@~~\0\0\0s\0\0\0\10\0\0\4moid\6$5c31d1df-5d74-443d-aa4a-6c39851be016&\vdomain_moid\r\0302ttyq3nfet50d7m2iubntqmd\32\4name\6\340<\\\363\31\10\3\0\0\n\4e164\6\340<\\\363\31\10\3\0\0\377\r6terminal:398b8203-8220-47cf-8ac5-ee4f84c3eea7:baseinfo@~~\0\0\0s\0\0\0\10\0\0\4moid\6$398b8203-8220-47cf-8ac5-ee4f84c3eea7&\vdomain_moid\r\30fya1iwz59u7s7xpzapyrq9an\32\4name\6\340 \n\310\373\31\1\0\0\n\4e164\6\340 \n\310\373\31\1\0\0\377\r\37terminal:1211110000206:baseinfo@~~\0\0\0s\0\0\0\10\0\0\4moid\6$fedd4c38-64ae-4123-87f4-c843fb0dafab&\vdomain_moid\r\30fya1iwz59u7s7xpzapyrq9an\32\4name\6\340N\6\310\373\31\1\0\0\n\4e164\6\340N\6\310\373\31\1\0\0\377\r\37terminal:1211110001987:baseinfo@~~\0\0\0s\0\0\0\10\0\0\4moid\6$b44341f7-e1f0-4748-9de0-df938d9ce39f&\vdomain_moid\r\30fya1iwz59u7s7xpzapyrq9an\32\4name\6\340C\r\310\373\31\1\0\0\n\4e164\6\340C\r\310\373\31\1\0\0\377\r6terminal:a8728c20-8af2-462d-9728-3a168a30e574:baseinfo@~~\0\0\0s\0\0\0\10\0\0\4moid\6$a8728c20-8af2-462d-9728-3a168a30e574&\vdomain_moid\r\0302ttyq3nfet50d7m2iubntqmd\32\4name\6\340\241\\\363\31\10\3\0\0\n\4e164\6\340\241\\\363\31\10\3\0\0\377\377|\345\264*"..., 1028) = 1028
22:51:09.970741 fsync(8)                = 0
22:51:09.979056 close(8)                = 0
22:51:09.979157 munmap(0x7fdd373fd000, 4096) = 0
22:51:09.979251 rename("temp-12936.rdb", "dump.rdb") = 0
22:51:09.980247 open("/usr/log/redis/redis.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 8
22:51:09.980338 fstat(8, {st_mode=S_IFREG|0644, st_size=1320763638, ...}) = 0
22:51:09.980410 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdd373fd000
22:51:09.980481 fstat(8, {st_mode=S_IFREG|0644, st_size=1320763638, ...}) = 0
22:51:09.980545 lseek(8, 1320763638, SEEK_SET) = 1320763638
22:51:09.980611 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=388, ...}) = 0
22:51:09.980717 write(8, "[12936] 15 Apr 22:51:09.980 * DB saved on disk\n", 47) = 47
22:51:09.980822 close(8)                = 0
22:51:09.980885 munmap(0x7fdd373fd000, 4096) = 0
22:51:09.980964 open("/usr/log/redis/redis.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 8
22:51:09.981037 fstat(8, {st_mode=S_IFREG|0644, st_size=1320763685, ...}) = 0
22:51:09.981101 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdd373fd000
22:51:09.981167 fstat(8, {st_mode=S_IFREG|0644, st_size=1320763685, ...}) = 0
22:51:09.981229 lseek(8, 1320763685, SEEK_SET) = 1320763685
22:51:09.981292 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=388, ...}) = 0
22:51:09.981380 write(8, "[12936] 15 Apr 22:51:09.981 * Removing the pid file.\n", 53) = 53
22:51:09.981462 close(8)                = 0
22:51:09.981522 munmap(0x7fdd373fd000, 4096) = 0
22:51:09.981599 unlink("/var/run/redis.pid") = 0
22:51:09.981756 close(4)                = 0
22:51:09.981847 close(5)                = 0
22:51:09.981939 open("/usr/log/redis/redis.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 4
22:51:09.982038 fstat(4, {st_mode=S_IFREG|0644, st_size=1320763738, ...}) = 0
22:51:09.982115 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdd373fd000
22:51:09.982198 fstat(4, {st_mode=S_IFREG|0644, st_size=1320763738, ...}) = 0
22:51:09.982270 lseek(4, 1320763738, SEEK_SET) = 1320763738
22:51:09.982345 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=388, ...}) = 0
22:51:09.982477 write(4, "[12936] 15 Apr 22:51:09.982 # Redis is now ready to exit, bye bye...\n", 69) = 69
22:51:09.982575 close(4)                = 0
22:51:09.982636 munmap(0x7fdd373fd000, 4096) = 0
22:51:09.982764 exit_group(0)           = ?
22:51:09.984301 +++ exited with 0 +++
[root@xnu_205 redis]# 
[root@xnu_205 redis]#

问题三:终端设备通过 HTTP 协议经由 nginx 访问后端的 api 服务器时,TCP 连接行为诡异

...
Apr 20 18:21:48 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:24:37 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:25:50 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:27:02 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:29:01 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:30:14 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:31:28 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:32:44 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:35:33 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:37:06 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:37:52 localhost ntpd_intres[1732]: host name not found: 0.centos.pool.ntp.org
Apr 20 18:38:12 localhost ntpd_intres[1732]: host name not found: 1.centos.pool.ntp.org
Apr 20 18:38:20 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:38:32 localhost ntpd_intres[1732]: host name not found: 2.centos.pool.ntp.org
Apr 20 18:38:52 localhost ntpd_intres[1732]: host name not found: 3.centos.pool.ntp.org
Apr 20 18:39:29 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:40:43 localhost kernel: possible SYN flooding on port 80. Sending cookies.
...
详情参见:《【原创】线上环境 SYN flooding 问题排查
 


共有 人打赏支持
粉丝 358
博文 352
码字总数 952596
评论 (5)
superdakevin
飞哥飞哥 是ejabberd的erlang版本和其他程序的erlang版本不一致导致的么
摩云飞

引用来自“superdakevin”的评论

飞哥飞哥 是ejabberd的erlang版本和其他程序的erlang版本不一致导致的么
怀疑是 .beam 文件的版本和 rel 目录下的 .boot 的版本不一致
superdakevin

引用来自“superdakevin”的评论

飞哥飞哥 是ejabberd的erlang版本和其他程序的erlang版本不一致导致的么

引用来自“摩云飞”的评论

怀疑是 .beam 文件的版本和 rel 目录下的 .boot 的版本不一致
飞哥,原因查到了。是同一时间多个进程调用ejabberdctl,导致erlang中-name的名字相同,产生的dump
摩云飞

引用来自“superdakevin”的评论

飞哥飞哥 是ejabberd的erlang版本和其他程序的erlang版本不一致导致的么

引用来自“摩云飞”的评论

怀疑是 .beam 文件的版本和 rel 目录下的 .boot 的版本不一致

引用来自“superdakevin”的评论

飞哥,原因查到了。是同一时间多个进程调用ejabberdctl,导致erlang中-name的名字相同,产生的dump
如果我没记错的话,调用 ejabberdctl 的时候使用的名字应该带随机数后缀的,不应该名字相同才对啊
superdakevin

引用来自“superdakevin”的评论

飞哥飞哥 是ejabberd的erlang版本和其他程序的erlang版本不一致导致的么

引用来自“摩云飞”的评论

怀疑是 .beam 文件的版本和 rel 目录下的 .boot 的版本不一致

引用来自“superdakevin”的评论

飞哥,原因查到了。是同一时间多个进程调用ejabberdctl,导致erlang中-name的名字相同,产生的dump

引用来自“摩云飞”的评论

如果我没记错的话,调用 ejabberdctl 的时候使用的名字应该带随机数后缀的,不应该名字相同才对啊
之前不是,是ejabberd-ctl@${ERLANG_NODE#*@}这个名字,后来我在后面加了一个随机数,就已经基本上没有dump文件了
×
摩云飞
如果觉得我的文章对您有用,请随意打赏。您的支持将鼓励我继续创作!
* 金额(元)
¥1 ¥5 ¥10 ¥20 其他金额
打赏人
留言
* 支付类型
微信扫码支付
打赏金额:
已支付成功
打赏金额: