Docker in Practice, Part 3: Diagnosing and Troubleshooting Docker Virtualization Failures

Original · 2019/10/31 17:34

Overview

Keeping Dockerized applications in production running efficiently and stably is critical. For Docker beginners, however, it is often unclear where to start when a container or an application misbehaves. This article analyzes and resolves some common failures Docker runs into in real production environments.

Docker Virtualization Failures

Docker virtualization failures fall into three broad categories:

  • Application failures: the application's runtime behavior does not match expectations;
  • Container failures: containers cannot be created, stopped, or updated correctly;
  • Cluster failures: cluster creation or updates fail, or the cluster cannot be reached.

Troubleshooting Docker Virtualization Failures

All Docker troubleshooting and diagnosis can be done either with the Docker command-line tools or through a web console.

A web console has to be self-hosted, so for Docker operators the command-line tools are usually the more practical troubleshooting companion.
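Command-line triage usually starts with the same handful of commands. The helper below is only a sketch, and `triage_cmds` is a hypothetical name of my own: it simply prints, for a given container name, the diagnostic commands you would typically run first (`nginx` here is just an example container name).

```shell
#!/bin/sh
# Hypothetical dry-run triage helper: prints the usual first diagnostic
# commands for a container, in the order you would typically run them.
triage_cmds() {
    name="$1"
    echo "docker ps -a --filter name=${name}"       # container state / exit code
    echo "docker logs --tail 50 ${name}"            # application-level errors
    echo "docker inspect ${name}"                   # config, mounts, networking
    echo "journalctl -u docker --no-pager -n 50"    # daemon-level errors
}

triage_cmds nginx
```

Printing instead of executing keeps the sketch safe to run anywhere; on a real host you would run each listed command directly.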

Common Failures: Analysis and Resolution

A round-up of 8 common failures

https://mp.weixin.qq.com/s/2GNKmRJtBGHhUyVBRbRgeA

Derived failures

Error 1: error initializing graphdriver

    [root@docker ~]# systemctl start docker
    Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
    [root@docker ~]# systemctl status docker
    ● docker.service - Docker Application Container Engine
       Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
       Active: failed (Result: start-limit) since Sun 2018-04-22 20:52:39 CST; 5s ago
         Docs: https://docs.docker.com
      Process: 4810 ExecStart=/usr/bin/dockerd (code=exited, status=1/FAILURE)
     Main PID: 4810 (code=exited, status=1/FAILURE)
    
    Apr 22 20:52:39 docker.cgy.com systemd[1]: Failed to start Docker Application Container Engine.
    Apr 22 20:52:39 docker.cgy.com systemd[1]: Unit docker.service entered failed state.
    Apr 22 20:52:39 docker.cgy.com systemd[1]: docker.service failed.
    Apr 22 20:52:39 docker.cgy.com systemd[1]: docker.service holdoff time over, scheduling restart.
    Apr 22 20:52:39 docker.cgy.com systemd[1]: start request repeated too quickly for docker.service
    Apr 22 20:52:39 docker.cgy.com systemd[1]: Failed to start Docker Application Container Engine.
    Apr 22 20:52:39 docker.cgy.com systemd[1]: Unit docker.service entered failed state.
    Apr 22 20:52:39 docker.cgy.com systemd[1]: docker.service failed.

The error output above still does not reveal the root cause. Running dockerd directly in the foreground, however, prints an error at the very bottom of its output:

    [root@docker ~]# dockerd
    INFO[2018-04-22T21:12:46.111704443+08:00] libcontainerd: started new docker-containerd process  pid=5903
    INFO[0000] starting containerd                           module=containerd revision=773c489c9c1b21a6d78b5c538cd395416ec50f88 version=v1.0.3
    
    ...... part of the output omitted ......
    
    INFO[0000] loading plugin "io.containerd.grpc.v1.introspection"...  module=containerd type=io.containerd.grpc.v1
    INFO[0000] serving...                                    address="/var/run/docker/containerd/docker-containerd-debug.sock" module="containerd/debug"
    INFO[0000] serving...                                    address="/var/run/docker/containerd/docker-containerd.sock" module="containerd/grpc"
    INFO[0000] containerd successfully booted in 0.002763s   module=containerd
    Error starting daemon: error initializing graphdriver: overlay: the backing xfs filesystem is formatted without d_type support, which leads to incorrect behavior. Reformat the filesystem with ftype=1 to enable d_type support. Backing filesystems without d_type support are not supported.

Searching for the final error, Error starting daemon: error initializing graphdriver, led to the following post, which resolved the issue: https://blog.csdn.net/liu9718214/article/details/79134900
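Before reformatting anything, you can confirm whether the backing XFS filesystem really lacks d_type support. A minimal sketch (the `has_dtype` helper name and the sample xfs_info line are my own illustration; on a real host you would pipe `xfs_info /var/lib/docker` into the helper):

```shell
#!/bin/sh
# Sketch: report whether xfs_info output indicates d_type support (ftype=1).
has_dtype() {
    if grep -q 'ftype=1'; then
        echo "d_type supported"
    else
        echo "d_type NOT supported: reformat with mkfs.xfs -n ftype=1"
    fi
}

# Simulated xfs_info "naming" line from a filesystem formatted without d_type:
echo "naming   =version 2    bsize=4096   ascii-ci=0 ftype=0" | has_dtype
```

On the broken host from the log above this reports the NOT-supported case, which matches the daemon's complaint and implies the filesystem must be recreated with `mkfs.xfs -n ftype=1`.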

Error 2: iptables failed

FirewallD

CentOS 7 introduced firewalld. Under the hood, firewalld still filters traffic with iptables, i.e. it is built on top of iptables, and this can conflict with Docker.

When firewalld starts or restarts, it removes the DOCKER rules from iptables, which breaks Docker's networking.

Under systemd, firewalld starts before Docker. But if you start or restart firewalld after Docker is already running, you must restart the Docker daemon as well.

System:

    [root@controller ~]# cat /etc/redhat-release 
    CentOS Linux release 7.2.1511 (Core)

The error looks like this:

    [root@controller ~]# docker run -it -P docker.io/nginx
    /usr/bin/docker-current: Error response from daemon: driver failed programming external connectivity on endpoint gloomy_kirch (10289e7a87e65771da90cda531951b7339bee9cb5953474460451cd48013aff0): iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 32810 -j DNAT --to-destination 172.17.0.2:80 ! -i docker0: iptables: No chain/target/match by that name.
     (exit status 1).

The cause here: the container had been started successfully once before, but a firewall problem kept Nginx from being reachable, so the iptables filter table was flushed and iptables was restarted. Running the container again after that fails with the error above.
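The error boils down to the DOCKER chain having been removed from the nat table. A minimal way to check is to look for the chain declaration in an `iptables-save` dump; the sketch below fakes a flushed ruleset with printf for illustration (`has_docker_chain` is a made-up helper name), while on a real host you would run `iptables-save -t nat | has_docker_chain`:

```shell
#!/bin/sh
# Sketch: detect whether the nat table still declares the DOCKER chain.
# iptables-save prints chain declarations as lines like ":DOCKER - [0:0]".
has_docker_chain() {
    if grep -q '^:DOCKER '; then
        echo "DOCKER chain present"
    else
        echo "DOCKER chain missing: restart docker to recreate it"
    fi
}

# Simulated `iptables-save -t nat` output after the rules were flushed:
printf '*nat\n:PREROUTING ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\nCOMMIT\n' | has_docker_chain
```

When the chain is missing, restarting the Docker daemon recreates it, which is exactly the fix applied below.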

Solution

Restart the firewall, then Docker:

    # Run on CentOS 7
    [root@controller ~]# systemctl restart firewalld
    [root@controller ~]# systemctl restart docker

Running an nginx container again now succeeds:

    [root@controller ~]# docker run -it --name nginx -p 80:80 -v /www:/wwwroot docker.io/nginx /bin/bash
    root@a8a92c8f7760:/# 

Error 3: Unable to take ownership of thin-pool

The Docker daemon fails to start with: Unable to take ownership of thin-pool

    Apr 27 13:51:59 master systemd: Started Docker Storage Setup.
    Apr 27 13:51:59 master systemd: Starting Docker Application Container Engine...
    Apr 27 13:51:59 master dockerd-current: time="2018-04-27T13:51:59.088441356+08:00" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
    Apr 27 13:51:59 master dockerd-current: time="2018-04-27T13:51:59.091166189+08:00" level=info msg="libcontainerd: new containerd process, pid: 20930"
    Apr 27 13:52:00 master dockerd-current: Error starting daemon: error initializing graphdriver: devmapper: Unable to take ownership of thin-pool (docker--vg-docker--pool) that already has used data blocks
    Apr 27 13:52:00 master systemd: docker.service: main process exited, code=exited, status=1/FAILURE
    Apr 27 13:52:00 master systemd: Failed to start Docker Application Container Engine.
    Apr 27 13:52:00 master systemd: Unit docker.service entered failed state.
    Apr 27 13:52:00 master systemd: docker.service failed

Cause: the metadata under /var/lib/docker/devicemapper/metadata/ was lost.

Workaround: https://bugzilla.redhat.com/show_bug.cgi?id=1321640#c5

    I feel like the kcs kinda misses telling users the actual problem. Nor does it really make it clear the solution.
    
    IF you are using device mapper (instead of loopback) /var/lib/docker contains metadata informing docker about the contents of the device mapper storage area. If you delete /var/lib/docker that metadata is lost. Docker is then able to detect that the thin pool has data but docker is unable to make use of that information. The only solution is to delete the thin pool and recreate it so that both the thin pool and the metadata in /var/lib/docker will be empty.

Solution:

    # 1. Remove the stale metadata and the old storage config
    rm -rf /var/lib/docker/*
    rm -rf /etc/sysconfig/docker-storage
    
    # 2. Delete the existing thin pool
    lvremove /dev/docker-vg/docker-pool
    
    # 3. Keep using the existing docker-vg LVM volume group
    cat /etc/sysconfig/docker-storage-setup
    VG=docker-vg
    
    # 4. Recreate the storage configuration and restart Docker
    docker-storage-setup
    systemctl start docker

The problems a newcomer may hit while operating Docker, especially in production, go well beyond these. This article analyzed the most common ones and will be updated as new issues come up; hopefully it helps you resolve some everyday Docker operations problems.
