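Troubleshooting Pods That Fail to Mount Storageclass Volumes in a K8S Cluster
Fault description:
Jenkins is deployed in a K8S cluster. All of the resources Jenkins needs have already been created, yet the Jenkins Pod still fails to start and stays in the Pending state.
Troubleshooting steps:
1) First find out why the Pod is stuck in Pending: look at the Pod's detailed information and pick out the key messages from its events.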
Warning FailedMount 34s (x3 over 98s) kubelet (combined from similar events): MountVolume.SetUp failed for volume "pvc-3ed2c605-b7da-4266-a882-4527ed949c34" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/a794f97d-7e67-47f5-9ab2-05f6352b9352/volumes/kubernetes.io~nfs/pvc-3ed2c605-b7da-4266-a882-4527ed949c34 --scope -- mount -t nfs 192.168.16.105:/data/k8s/storageclass/kube-system-prometheus-data-prometheus-0-pvc-3ed2c605-b7da-4266-a882-4527ed949c34 /var/lib/kubelet/pods/a794f97d-7e67-47f5-9ab2-05f6352b9352/volumes/kubernetes.io~nfs/pvc-3ed2c605-b7da-4266-a882-4527ed949c34
Output: Running scope as unit run-2832.scope.
mount.nfs: mounting 192.168.16.105:/data/k8s/storageclass/kube-system-prometheus-data-prometheus-0-pvc-3ed2c605-b7da-4266-a882-4527ed949c34 failed, reason given by server: No such file or directory
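Events like the ones above appear in the Events section of `kubectl describe pod`. The following is a minimal sketch for pulling out only the mount failures; `show_mount_failures` is a hypothetical helper name, and the Pod and namespace are supplied by the caller rather than taken from the original incident.

```shell
# Print the FailedMount events of one Pod, plus the lines that follow each one.
# $1 = Pod name, $2 = namespace (both are assumptions supplied by the caller)
show_mount_failures() {
    # "kubectl describe pod" ends with an Events section;
    # keep each FailedMount line and the 5 lines after it.
    kubectl describe pod "$1" -n "$2" | grep -A 5 'FailedMount'
}
```

A call would look like `show_mount_failures <pod-name> <namespace>`. The five lines of trailing context are an arbitrary choice that is usually enough to include the `mount.nfs` reason line.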
2) The events contain a large number of errors about mounting the PVC. Start with this line:
MountVolume.SetUp failed for volume "pvc-3ed2c605-b7da-4266-a882-4527ed949c34" : mount failed: exit status 32
It says that the volume pvc-3ed2c605-b7da-4266-a882-4527ed949c34 cannot be mounted, even though the PVC itself already exists.
Next, look at this line:
mounting 192.168.16.105:/data/k8s/storageclass/kube-system-prometheus-data-prometheus-0-pvc-3ed2c605-b7da-4266-a882-4527ed949c34 failed, reason given by server: No such file or directory
It says that the volume pvc-3ed2c605-b7da-4266-a882-4527ed949c34 maps to the path /data/k8s/storageclass/kube-system-prometheus-data-prometheus-0-pvc-3ed2c605-b7da-4266-a882-4527ed949c34 on the NFS server, but that path no longer exists there.
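To check the path on the NFS server, it helps to split the server address and the export path out of the error line. This is a small sketch that assumes the message has the same `mounting HOST:/PATH failed` shape as the one above; `parse_nfs_source` is a hypothetical helper name.

```shell
# Extract "server path" from a mount.nfs error line of the form
#   mount.nfs: mounting HOST:/PATH failed, reason given by server: ...
# $1 = the error line; prints nothing if the line does not match.
parse_nfs_source() {
    printf '%s\n' "$1" |
        sed -n 's/.*mounting \([^: ]*\):\([^ ]*\) failed.*/\1 \2/p'
}
```

Running `ls -ld` on the printed path on the NFS server then shows directly whether the directory is still there.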
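3) From these events we can more or less pinpoint the problem. The Pod needs to mount a PVC volume created through the Storageclass. That volume was probably created at some earlier point, but its backing storage directory has since been deleted, so it can no longer be mounted.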
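4) In most cases this situation has one of the following causes:
- Jenkins was previously deployed in the K8S cluster, also using a Storageclass for persistent storage. Later, for whatever reason, the Jenkins service was removed from the cluster, and the PVC's persistent storage path on NFS was deleted along with it.
- An operator deleted the storage path of the Jenkins PVC on NFS, so Jenkins cannot find its storage when it restarts.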
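5) The most likely cause is that Jenkins was deployed before, and the storage path was deleted as well when Jenkins was removed and recreated.
6) If everything was deleted, why doesn't the Storageclass simply create a new PVC? The reason is simple: only the directory that holds the data was deleted, not the PVC that had been created through the Storageclass. When Jenkins is deployed again, the PVC previously created for Jenkins is still there and is simply reused, so there is no need to create a new one and take up more system resources.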
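Whether the old claim is still present and bound can be checked directly. The following is a small sketch, assuming the PVC name and namespace are known; `pvc_bound_volume` is a hypothetical helper name.

```shell
# Print the name of the PV that a PVC is bound to (empty if it is unbound).
# $1 = PVC name, $2 = namespace
pvc_bound_volume() {
    kubectl get pvc "$1" -n "$2" -o jsonpath='{.spec.volumeName}'
}
```

If the call prints a PV name, the claim still points at the old volume, even though the directory behind that volume is gone.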
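7) The fix is simple: delete the PVC that was previously created for Jenkins, then redeploy Jenkins.
The listing below shows many PV resources related to Jenkins. Find the corresponding PVC, delete the PVC first, and then delete the PV.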
# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                      STORAGECLASS              REASON   AGE
nginx-conf-pv                              1Gi        RWX            Retain           Bound      default/nginx-conf-pvc                                                        30d
pvc-246774a7-11d8-4537-8b44-5c067ba80d04   10Gi       RWX            Retain           Released   grafana/grafana-data-grafana-0             grafana-storageclass               11d
pvc-3ed2c605-b7da-4266-a882-4527ed949c34   10Gi       RWX            Retain           Bound      kube-system/prometheus-data-prometheus-0   prometheus-storageclass            11d
pvc-7a013e82-e454-4883-a99e-81d23498b63e   10Gi       RWX            Retain           Released   jenkins/gitlab-data-gitlab-0               gitlab-storageclass                24d
pvc-94284330-69cd-48b5-908d-a6de8c41ffaa   10Gi       RWX            Retain           Released   grafana/grafana-data-grafana-0             grafana-storageclass               11d
pvc-991ee7c5-6752-44ba-9cd7-d55e5018f883   10Gi       RWX            Retain           Bound      jenkins/jenkins-data-jenkins-master-0      jenkins-storageclass               23d
pvc-99342491-daa6-43df-9c4c-827478926ced   10Gi       RWX            Retain           Bound      jenkins/gitlab-data-gitlab-0               gitlab-storageclass                24d
pvc-abc35446-059e-4916-a5f8-327e5cdc4954   1Gi        RWX            Retain           Released   jenkins/gitlab-config-gitlab-0             gitlab-storageclass                24d
pvc-cf1acd07-4a0e-4340-8626-01a857a9cee6   1Gi        RWX            Retain           Released   jenkins/gitlab-config-gitlab-0             gitlab-storageclass                24d
pvc-cf93289a-1cd3-4a3b-aac2-749eea30342a   10Gi       RWX            Retain           Released   jenkins/gitlab-data-gitlab-0               gitlab-storageclass                24d
pvc-ddc811c2-67ff-42cc-a163-c82c3dadc1be   1Gi        RWX            Retain           Bound      jenkins/gitlab-config-gitlab-0             gitlab-storageclass                24d
pvc-e3af12d8-a34e-4110-881c-e070cdb3e847   10Gi       RWX            Retain           Released   grafana/gitlab-data-grafana-0              gitlab-storageclass                11d
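The cleanup from step 7 can be sketched as follows, assuming the text output of `kubectl get pv` is piped in. `released_pvs` and `delete_claim_then_volume` are hypothetical helper names. The field positions rely on the default `kubectl get pv` layout, in which STATUS is the fifth whitespace-separated field of each data row and CLAIM is the sixth.

```shell
# Read "kubectl get pv" output on stdin and print "PV-name claim" for every
# PV whose status is Released. The header row never matches, because its
# fifth field is the word RECLAIM.
released_pvs() {
    awk '$5 == "Released" { print $1, $6 }'
}

# Delete a PVC first and, only if that succeeds, the PV it was bound to.
# $1 = PVC name, $2 = namespace, $3 = PV name (all supplied by the caller)
delete_claim_then_volume() {
    kubectl delete pvc "$1" -n "$2" && kubectl delete pv "$3"
}
```

For example, `kubectl get pv | released_pvs` lists the leftover volumes together with the claims they used to serve. A PVC that is still mounted by a running Pod stays in Terminating until that Pod is gone, so it is usually easier to remove the workload before calling `delete_claim_then_volume`.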