Etcd troubleshooting -- recovering backend from snapshot error: failed to find database snapshot file (snap: snaps
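1. Symptoms
The server lost power unexpectedly. Afterwards Harbor was down and k8s would not start.
Harbor could not be reached and was reporting errors.
The Harbor repair is covered in detail here:
https://blog.csdn.net/qq_29974229/article/details/125257797
With the Harbor errors cleared, k8s still could not retrieve any cluster information.
Connecting to the etcd nodes showed that etcd had failed to start.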
journalctl -xe
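The key error was:
recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
The etcd data files had evidently been damaged by the power loss. With the error identified, the rest is straightforward.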
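To confirm which nodes are affected without paging through the whole journal, the log can be filtered for the error string. This is a minimal sketch, assuming etcd runs as a systemd unit named etcd; the helper name has_snapshot_error is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads log text on stdin and succeeds (exit 0)
# if it contains the snapshot-recovery error quoted above.
has_snapshot_error() {
    grep -q "failed to find database snapshot file"
}

# Intended use on each node (assumes the unit is called "etcd"):
#   journalctl -u etcd --no-pager | has_snapshot_error && echo "affected"
```

Running this on every etcd node gives a quick yes/no per node before deciding which data directory to trust.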
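2. Analysis
Further checking showed that etcd1 and etcd2 both reported this error, while etcd3 did not.
[Screenshots: etcd1, etcd2, etcd3]
So at least etcd3's data was intact, which gave me a bold idea: remove the data on etcd1 and etcd2, copy etcd3's data over, and start them.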
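Before trusting one node's data over another's, it can help to compare what is actually on disk. An etcd data directory keeps its state under member/snap (the db file plus *.snap files) and member/wal (*.wal files). The sketch below summarises one member directory; inspect_member_dir is a hypothetical name, and the counts only hint at differences between nodes rather than prove that a node is healthy.

```shell
#!/bin/sh
# Hypothetical helper: summarise an etcd member directory.
# Prints "<number of .snap files> <number of .wal files> <db present: yes|no>".
inspect_member_dir() {
    dir="$1"
    snaps=$(find "$dir/snap" -maxdepth 1 -type f -name '*.snap' 2>/dev/null | wc -l | tr -d ' ')
    wals=$(find "$dir/wal" -maxdepth 1 -type f -name '*.wal' 2>/dev/null | wc -l | tr -d ' ')
    if [ -f "$dir/snap/db" ]; then db=yes; else db=no; fi
    echo "$snaps $wals $db"
}

# Intended use on each node:
#   inspect_member_dir /var/lib/etcd/member
```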
3. Fix
3.1 Back up first
On etcd1 and etcd2, run:
mv /var/lib/etcd/member /opt/
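A plain mv into /opt works once, but a second attempt would collide with the first backup. The sketch below moves the directory to a timestamped name instead. The function name and the naming scheme are my own choices, and stopping the unit beforehand is an assumption about the setup rather than something the steps above show.

```shell
#!/bin/sh
# Hypothetical helper: move a member directory into a timestamped
# backup location and print the new path.
backup_member_dir() {
    src="$1"          # e.g. /var/lib/etcd/member
    dest_root="$2"    # e.g. /opt
    target="$dest_root/etcd-member-$(date +%Y%m%d_%H%M%S)"
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    mkdir -p "$dest_root" && mv "$src" "$target" && echo "$target"
}

# Intended use on etcd1 and etcd2:
#   systemctl stop etcd
#   backup_member_dir /var/lib/etcd/member /opt
```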
3.2 Copy the data
Copy etcd3's intact data to etcd1 and etcd2. Run these on etcd3 (member is a directory, so scp needs -r):
scp -r /var/lib/etcd/member 192.168.31.106:/var/lib/etcd/
scp -r /var/lib/etcd/member 192.168.31.107:/var/lib/etcd/
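After copying, it is worth checking that the copy matches the source. The sketch below prints a single checksum for everything under a directory, so the value from etcd3 can be compared with the value on etcd1 and etcd2. dir_checksum is a hypothetical helper, and the comparison is only meaningful while etcd is not writing to either directory.

```shell
#!/bin/sh
# Hypothetical helper: print one checksum covering all regular files under
# a directory, so two copies can be compared. Assumes file names without
# whitespace, which holds for etcd's snap/ and wal/ files.
dir_checksum() {
    (cd "$1" && find . -type f | LC_ALL=C sort | xargs cksum | cksum)
}

# Intended use, on etcd3 and then on each target node:
#   dir_checksum /var/lib/etcd/member
```

Because the checksum is computed over relative paths, the two directories do not need to live at the same absolute location.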
3.3 Start etcd1 and etcd2
Start etcd on etcd1 and etcd2:
systemctl start etcd
etcd1 and etcd2 started without any trouble.
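A successful systemctl start only shows that the process is running, so a health check against the client endpoints is a useful next step. The sketch below builds the endpoint list and calls etcdctl endpoint health and etcdctl member list. The http scheme and client port 2379 are assumptions, both function names are hypothetical, and a TLS-enabled cluster would also need the --cacert, --cert and --key flags.

```shell
#!/bin/sh
# Build a comma-separated endpoint list from a scheme and host names or IPs.
# Port 2379 is an assumption (etcd's default client port).
etcd_endpoints() {
    scheme="$1"; shift
    out=""
    for host in "$@"; do
        out="${out:+$out,}${scheme}://${host}:2379"
    done
    printf '%s\n' "$out"
}

# Hypothetical wrapper; it only runs when called on a machine with etcdctl.
check_cluster() {
    eps="$(etcd_endpoints "$@")"
    ETCDCTL_API=3 etcdctl --endpoints="$eps" endpoint health
    ETCDCTL_API=3 etcdctl --endpoints="$eps" member list
}

# Intended use:
#   check_cluster http 192.168.31.106 192.168.31.107
```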
3.4 Start etcd3
etcd3, however, refused to start…
Since etcd1 and etcd2 were now healthy, I removed etcd3's data as well and let it resync from etcd1 and etcd2.
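The exact commands for this step are not recorded here. If a member with an emptied data directory does not rejoin on its own, one documented route is to remove it from the cluster and add it back. The sketch below is that route under stated assumptions: the member name etcd3 and the peer URL are placeholders, and member_id_by_name is a hypothetical helper that parses the default comma-separated output of etcdctl member list.

```shell
#!/bin/sh
# Hypothetical helper: read `etcdctl member list` output (default
# comma-separated format) on stdin and print the ID of the named member.
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Sketch of the sequence, run from a healthy member. The member name and
# peer URL are placeholders, not values taken from this cluster:
#   id=$(ETCDCTL_API=3 etcdctl member list | member_id_by_name etcd3)
#   ETCDCTL_API=3 etcdctl member remove "$id"
#   ETCDCTL_API=3 etcdctl member add etcd3 --peer-urls=http://<etcd3-ip>:2380
#   # then, on etcd3, start etcd with an empty data directory and
#   # --initial-cluster-state=existing
```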
k8s returned to normal.
4. Follow-up
Restoring data from a backup:
export ETCDCTL_API=3
etcdctl snapshot restore 20220530_161001_snapshot.db --data-dir /var/lib/etcd
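As far as I know, etcdctl snapshot restore writes a fresh data directory and, depending on the version, refuses a target that already exists or is not empty. The sketch below wraps the command above with that check. Both function names are hypothetical, and in a multi-member cluster the restore usually also needs --name, --initial-cluster and --initial-advertise-peer-urls, which are left out here.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the path does not exist or is an empty
# directory, i.e. it can be used as a restore target.
dir_is_free() {
    [ ! -e "$1" ] || { [ -d "$1" ] && [ -z "$(ls -A "$1")" ]; }
}

# Hypothetical wrapper around the restore command shown above.
restore_snapshot() {
    snapshot="$1"; data_dir="$2"
    dir_is_free "$data_dir" || { echo "refusing: $data_dir is not empty" >&2; return 1; }
    ETCDCTL_API=3 etcdctl snapshot restore "$snapshot" --data-dir "$data_dir"
}

# Intended use, with etcd stopped and the old data moved away:
#   restore_snapshot 20220530_161001_snapshot.db /var/lib/etcd
```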
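etcd backups need to be added going forward:
#!/bin/sh
# Archive /var/lib/etcd into a dated, gzip-compressed tarball.
cd /var/lib || exit 1
name="etcd-bak$(date "+%Y%m%d")"
# -z enables gzip so the content matches the .tar.gz extension
tar -czvf "/backup/etcd/${name}.tar.gz" etcd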
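A tarball of a live data directory may capture files mid-write, so a backup taken with etcdctl snapshot save is likely the safer companion to the restore command above. The sketch below writes a timestamped snapshot in the same naming style as 20220530_161001_snapshot.db and keeps only the newest few. The function names, the /backup/etcd location and the retention count are assumptions, and a TLS-enabled cluster would need the usual --endpoints and certificate flags.

```shell
#!/bin/sh
# Hypothetical helper: keep only the newest $2 snapshot files in $1.
# Relies on the timestamped names sorting chronologically.
prune_snapshots() {
    dir="$1"; keep="$2"
    ls -1 "$dir"/*_snapshot.db 2>/dev/null | LC_ALL=C sort -r |
        awk -v k="$keep" 'NR > k' |
        while IFS= read -r f; do rm -f -- "$f"; done
}

# Hypothetical backup routine; it only runs when called on an etcd node.
backup_etcd() {
    dir="$1"; keep="$2"
    mkdir -p "$dir" || return 1
    ETCDCTL_API=3 etcdctl snapshot save "$dir/$(date +%Y%m%d_%H%M%S)_snapshot.db" || return 1
    prune_snapshots "$dir" "$keep"
}

# Intended use, for example from a daily cron job:
#   backup_etcd /backup/etcd 7
```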