k8s: Orphaned pod <pod_id> found
2021-12-12 · Linux丶晨星
Problem description
An abnormal shutdown of a k8s node caused the business Pods to be redeployed after the node came back up; the state of the Pods from before the shutdown had already been deleted. While reviewing logs today, I found that the Pods that were on this node before the abnormal shutdown had not been removed cleanly, and an error message was being logged continuously, as shown below.
Troubleshooting
Checking the system log /var/log/messages, the kubelet service keeps logging the error below. From the message we can see that this node has an orphaned Pod whose data volume (volume) paths are still present on disk, which blocks the kubelet from reclaiming and cleaning up the orphaned Pod normally.
[root@sss-010xl-n02 ~]# tail -3 /var/log/messages
Dec 12 17:50:17 sss-010xl-n02 bash[470923]: user=root,ppid=454652,from=,pwd=/var/lib/kubelet/pods,command:20211212-175006: ll
Dec 12 17:55:15 sss-010xl-n02 kubelet: E1212 17:55:15.645612 2423 kubelet_volumes.go:154] Orphaned pod "aad90ab1-2f04-11ec-b488-b4055dae3f29" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
Dec 12 17:55:15 sss-010xl-n02 kubelet: E1212 17:55:15.645612 2423 kubelet_volumes.go:154] Orphaned pod "aad90ab1-2f04-11ec-b488-b4055dae3f29" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
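Before touching anything on disk, it is worth confirming that the API server really no longer knows about this pod UID. A minimal check, assuming kubectl access with cluster-wide read permission (the UID is the one from the log above):

# List UID/namespace/name for every pod in the cluster and grep for the
# orphaned UID; empty output means the pod object is truly gone.
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.uid}{"\t"}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}' \
  | grep aad90ab1-2f04-11ec-b488-b4055dae3f29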
Using the pod_id, go into the kubelet directory. Inside it is the container's data, and the etc-hosts file still records the Pod_name:
[root@sss-010xl-n02 ~]# cd /var/lib/kubelet/pods
[root@sss-010xl-n02 pods]# cd aad90ab1-2f04-11ec-b488-b4055dae3f29/
[root@sss-010xl-n02 aad90ab1-2f04-11ec-b488-b4055dae3f29]# ll
total 4
drwxr-x--- 3 root root 30 Dec 10 15:54 containers
-rw-r--r-- 1 root root 230 Dec 10 15:54 etc-hosts
drwxr-x--- 3 root root 37 Dec 10 15:54 plugins
drwxr-x--- 5 root root 82 Dec 10 15:54 volumes
drwxr-x--- 3 root root 49 Dec 10 15:54 volume-subpaths
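To see exactly which volume paths the kubelet is complaining about, one can list what is left under the pod directory and check for live mounts; a minimal sketch, run from inside the pod directory shown above:

# Show the leftover volume paths that block the kubelet's cleanup.
find volumes/ volume-subpaths/ -mindepth 1 -maxdepth 3
# Make sure none of them is still a mounted filesystem before deleting anything:
mount | grep aad90ab1-2f04-11ec-b488-b4055dae3f29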
[root@sss-010xl-n02 aad90ab1-2f04-11ec-b488-b4055dae3f29]# cat etc-hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
172.30.128.2 sss-wanted-010xl-5945fb4885-7gz85   # <- the orphaned Pod
Solving the problem
First, using the pod_name recorded in the etc-hosts file, confirm that no matching instance is still running (a quick check is sketched below); then the pod's directory can simply be deleted.
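A minimal way to confirm that, assuming kubectl access (the pod name is the one recorded in etc-hosts):

# Empty output means no pod with this name is running in any namespace.
kubectl get pods --all-namespaces | grep sss-wanted-010xl-5945fb4885-7gz85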
[root@sss-010xl-n02 aad90ab1-2f04-11ec-b488-b4055dae3f29]# cd ..
[root@sss-010xl-n02 pods]# rm -rf aad90ab1-2f04-11ec-b488-b4055dae3f29/
Other blogs online note that this method carries some risk, and it is not fully clear whether it can cause data loss, so only delete the directory once you have confirmed the pod is truly gone; for stateless services it is generally safe. A slightly more defensive variant is sketched below.
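For a safer deletion, refuse to remove the directory while anything underneath it is still mounted; a minimal sketch of a hypothetical cleanup helper, assuming the UID from this incident:

#!/bin/bash
# Hypothetical cleanup helper: abort if any filesystem is still mounted
# under the orphaned pod directory, so live volume data is never wiped.
POD_UID="aad90ab1-2f04-11ec-b488-b4055dae3f29"
POD_DIR="/var/lib/kubelet/pods/${POD_UID}"

if mount | grep -q "${POD_DIR}"; then
    echo "Mounts still present under ${POD_DIR}; unmount them first:" >&2
    mount | grep "${POD_DIR}" >&2
    exit 1
fi

rm -rf "${POD_DIR}"
echo "Removed ${POD_DIR}"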
Checking the log again afterwards, this warning no longer appears.
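For example, watch the log for a minute or two after the cleanup:

# The Orphaned pod message should stop appearing once the stale
# directory has been removed.
tail -f /var/log/messages | grep "Orphaned pod"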