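Background:
Just before the holiday, the platform we had deployed for a customer started misbehaving: project deployments hung at 5% and never completed.
The root cause was that the customer's machines run on the AWS public cloud, and their policy is to delete and recreate the whole machine as soon as the application becomes unreachable, so every component deployed on it was lost.
After redeploying the components, the services still would not start, so the next step was to look at the error messages.
- Log in to the customer's machine and check the service status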
[root@10-251-180-180 ~]# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[root@10-251-180-180 system]# systemctl status builder.service
● earth-builder.service - builder Container
Loaded: loaded (/usr/lib/systemd/system/builder.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Thu 2021-09-30 15:27:13 CST; 1 weeks 0 days ago
Process: 57932 ExecStopPost=/bin/docker rm -f builder.service (code=exited, status=0/SUCCESS)
Process: 57922 ExecStart=/bin/docker run --rm --privileged --net=host --name builder.service -e BUILDER_PROFILE=dev -v /var/run/docker.sock:/var/run/docker.sock -v /bin/docker:/bin/docker -v /data/server/builder/log:/data/server/builder/log -v /data/server/builder/conf/builder.cfg:/data/server/builder/conf/builder/dev/builder.cfg -v /data/server/builder/conf/certs/dev/ssl:/data/server/builder/conf/certs/dev/rabbitmq/ssl -v /data/server/builder/tmp/workspace:/data/server/builder/tmp/workspace -v /data/server/earth-service-template:/data/server/earth-service-template -v /etc/localtime:/etc/localtime registry-poc.cnnol.uds-qa.lenovo.com/xcloud-product/earth-builder:1.0.35 (code=exited, status=127)
Process: 57914 ExecStartPre=/bin/docker rm -f builder.service (code=exited, status=0/SUCCESS)
Process: 57904 ExecStartPre=/bin/docker stop builder.service (code=exited, status=1/FAILURE)
Main PID: 57922 (code=exited, status=127)
Sep 30 15:27:13 10-251-180-180.earth-paas systemd[1]: Unit builder.service entered failed state.
Sep 30 15:27:13 10-251-180-180.earth-paas systemd[1]: builder.service failed.
Sep 30 15:27:13 10-251-180-180.earth-paas systemd[1]: builder.service holdoff time over, scheduling restart.
Sep 30 15:27:13 10-251-180-180.earth-paas systemd[1]: Stopped builder Container.
Sep 30 15:27:13 10-251-180-180.earth-paas systemd[1]: start request repeated too quickly for builder.service
Sep 30 15:27:13 10-251-180-180.earth-paas systemd[1]: Failed to start builder Container.
Sep 30 15:27:13 10-251-180-180.earth-paas systemd[1]: Unit builder.service entered failed state.
Sep 30 15:27:13 10-251-180-180.earth-paas systemd[1]: builder.service failed.
No containers were running at all, so the only option was to look at the service's logs.
[root@10-251-180-180 system]# journalctl -f -u builder.service
-- Logs begin at Mon 2021-09-20 23:00:31 CST. --
Oct 08 10:24:42 10-251-180-180.earth-paas systemd[1]: Starting builder Container...
Oct 08 10:24:42 10-251-180-180.earth-paas docker[71609]: Error response from daemon: No such container: builder.service
Oct 08 10:24:42 10-251-180-180.earth-paas docker[71617]: Error: No such container: builder.service
Oct 08 10:24:42 10-251-180-180.earth-paas systemd[1]: Started builder Container.
Oct 08 10:24:42 10-251-180-180.earth-paas docker[71627]: Unable to find image 'xxxxx/builder:1.0.35' locally
Oct 08 10:24:42 10-251-180-180.earth-paas docker[71627]: 1.0.35: Pulling from xxxxx/builder
Oct 08 10:24:42 10-251-180-180.earth-paas docker[71627]: 534e72e7cedc: Pulling fs layer
Oct 08 10:24:43 10-251-180-180.earth-paas docker[71627]: docker: open /data/docker/tmp/GetImageBlob205343402: no such file or directory.
Oct 08 10:24:43 10-251-180-180.earth-paas docker[71627]: See 'docker run --help'.
Oct 08 10:24:43 10-251-180-180.earth-paas systemd[1]: builder.service: main process exited, code=exited, status=127/n/a
[root@10-251-180-180 system]# docker pull registry-poc.cnnol.uds-qa.lenovo.com/xcloud-product/earth-builder:1.0.35
1.0.35: Pulling from xcloud-product/earth-builder
534e72e7cedc: Pulling fs layer
924d479f8494: Pulling fs layer
530c8d5bb194: Pulling fs layer
25f403377f83: Pulling fs layer
fefb0ce67cb3: Waiting
open /data/docker/tmp/GetImageBlob691109094: no such file or directory
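The problem is now narrowed down to the image pull, which fails with: open /data/docker/tmp/GetImageBlob691109094: no such file or directory
Next we checked whether Docker itself was running normally, and it turned out to be fine.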
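The error path points into `/data/docker/tmp`, the directory the pull writes its temporary blobs to. A minimal sketch of a check for whether that directory is visible on the host is given here; the function name `check_docker_tmp` is hypothetical, and the default path is simply taken from the error message.

```shell
#!/usr/bin/env bash
# Report whether the temporary download directory exists under a Docker data root.
# $1: Docker data root (assumed default: /data/docker, taken from the error message).
check_docker_tmp() {
    local root="${1:-/data/docker}"
    if [ -d "$root/tmp" ]; then
        echo "ok: $root/tmp exists"
    else
        echo "missing: $root/tmp"
        return 1
    fi
}
```

Calling `check_docker_tmp` with no argument checks `/data/docker/tmp`. A "missing" result would be consistent with the `no such file or directory` error, although it does not by itself explain why the directory is gone.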
[root@10-251-180-180 system]# docker info
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Build with BuildKit (Docker Inc., v0.6.1-docker)
scan: Docker Scan (Docker Inc., v0.8.0)
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: 19.03.15
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e25210fe30a0a703442421b0f60afac609f950a3
runc version: v1.0.1-0-g4144b63
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-1160.31.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.22GiB
Name: 10-251-180-180.earth-paas
ID: CXRK:PGI5:L7UF:5ZPB:42AV:34PJ:HCV7:NNCH:UNPQ:5T2Q:HW4Y:QLOM
Docker Root Dir: /data/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
We then deleted Docker's data, but the service still could not start.
[root@10-251-180-180 system]# rm -rf /data/docker/*
Check the details of the Docker data disk
[root@10-251-180-180 system]# mount -n |grep data # (two mounts on /data, what is going on here?)
/dev/nvme0n1p1 on /data type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/nvme1n1 on /data type ext3 (rw,relatime,seclabel,data=ordered)
[root@10-251-180-180 system]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 7.6G 0 7.6G 0% /dev
tmpfs 7.7G 0 7.7G 0% /dev/shm
tmpfs 7.7G 25M 7.6G 1% /run
tmpfs 7.7G 0 7.7G 0% /sys/fs/cgroup
/dev/nvme0n1p1 20G 2.4G 18G 12% /
/dev/nvme1n1 197G 61M 187G 1% /data
tmpfs 1.6G 0 1.6G 0% /run/user/1001
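The `mount` output lists two filesystems on `/data`, which is easy to miss by eye. A minimal sketch for spotting mount points that appear more than once is given here, assuming the mount table is in the `/proc/mounts` format (mount point in the second field); the function name `stacked_mounts` is hypothetical.

```shell
#!/usr/bin/env bash
# List mount points that appear more than once in a mount table.
# $1: path to a table in /proc/mounts format (defaults to /proc/mounts).
stacked_mounts() {
    local table="${1:-/proc/mounts}"
    # Field 2 is the mount point; print it the second time it is seen.
    awk '++seen[$2] == 2 { print $2 }' "$table"
}
```

Running `stacked_mounts` with no argument reads the live mount table. On a machine in the state shown in the `mount` output, it would be expected to report `/data`.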
The fix in the end was to unmount /dev/nvme0n1p1 and then restart Docker.
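Because `builder.service` had already hit systemd's start limit (`Result: start-limit` in the status output), it may also need its failed state cleared before it will start again. The sketch below shows one possible follow-up sequence after the unmount. It is an assumption layered on the fix described here, not a record of what was run; the function name `recover_builder` and the `APPLY` switch are hypothetical. The unmount itself is deliberately left out, since the right target depends on how the mounts are stacked.

```shell
#!/usr/bin/env bash
# Print (default) or run the follow-up steps after the duplicate mount is removed.
# Set APPLY=1 to actually execute them (requires root); otherwise this is a dry run.
# $1: systemd unit to bring back (defaults to builder.service).
recover_builder() {
    local unit="${1:-builder.service}"
    local cmd
    local steps=(
        "systemctl restart docker"
        "systemctl reset-failed $unit"
        "systemctl start $unit"
        "systemctl is-active $unit"
    )
    for cmd in "${steps[@]}"; do
        if [ "${APPLY:-0}" = "1" ]; then
            # Word splitting is intentional: each step is a simple command line.
            $cmd || return 1
        else
            echo "would run: $cmd"
        fi
    done
}
```

Calling `recover_builder` with no arguments only prints the commands, which makes it safe to review them first. `APPLY=1 recover_builder` runs them in order and stops at the first failure.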