etcd error notes
2022-02-24
开始懂了90
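Problem description:
This morning the physical host running the master node failed. The virtual machine was migrated to another host, and etcd stopped working afterwards. The affected pods looked like this: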
[root@region-master2 ~]# kubectl get po -nkube-system -owide |grep 32.45
etcd-region-master1 0/1 CrashLoopBackOff 69 4m35s 10.39.32.45 region-master1 <none> <none>
kube-apiserver-region-master1 0/1 CrashLoopBackOff 56 4m24s 10.39.32.45 region-master1 <none> <none>
The etcd error log was captured in a screenshot (image.png).
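Solution
Removing and re-adding an etcd member
Remove the member (remove the faulty member so that it can rejoin the cluster and resync its data). In this example the member to remove is https://192.168.1.73:2379.
Run member list to print the IDs of all members: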
$ ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.71:2379,https://192.168.1.72:2379,https://192.168.1.73:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member list -w table
Run member remove with the ID of the faulty member:
$ ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.71:2379,https://192.168.1.72:2379,https://192.168.1.73:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member remove f926bd1d34241ce0
Member f926bd1d34241ce0 removed from cluster 6294eac8c3e80ca
Running member list again now shows that the member is gone from the cluster, and the etcd process on the removed member exits.
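Picking the right ID out of the member list by eye is error-prone, so here is a minimal sketch of a lookup helper. It assumes the default comma-separated output of member list (without -w table) and one client URL per member. The function name member_id_for is hypothetical, and the two IDs for 192.168.1.71 and 192.168.1.72 in the sample are made up for illustration.

```shell
# Print the ID of the member whose client URL equals $1.
# Reads default (comma-separated) `etcdctl member list` output on stdin.
# Assumed columns: ID, status, name, peer URLs, client URLs, is-learner.
member_id_for() {
  awk -F', ' -v url="$1" '$5 == url { print $1 }'
}

# Hypothetical sample output; only f926bd1d34241ce0 comes from this write-up.
sample='1a2b3c4d5e6f7081, started, 192.168.1.71, https://192.168.1.71:2380, https://192.168.1.71:2379, false
2b3c4d5e6f708192, started, 192.168.1.72, https://192.168.1.72:2380, https://192.168.1.72:2379, false
f926bd1d34241ce0, started, 192.168.1.73, https://192.168.1.73:2380, https://192.168.1.73:2379, false'

printf '%s\n' "$sample" | member_id_for https://192.168.1.73:2379
# prints f926bd1d34241ce0
```

A member that never started may show an empty name and no client URL, in which case matching on the peer URL column ($4) would be the safer choice.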
Add the member
Clear the etcd data directory (run this on the host whose etcd is broken):
$ rm -rf /var/lib/etcd/*
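Check the following three flags under spec.containers.command in /etc/kubernetes/manifests/etcd.yaml:
1. --initial-cluster-state=existing
2. --initial-cluster lists every member of the cluster
3. --name is this node's member name
# Note: do not keep a backup copy of etcd.yaml in /etc/kubernetes/manifests/, or two etcd instances will be started, because kubelet loads every manifest file in that directory at startup.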
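The three flags can be pulled out of the manifest for a quick review. This is a rough sketch, assuming each flag is written in the --flag=value form on its own line. The function name check_etcd_manifest is hypothetical.

```shell
# Print the three flags from an etcd static-pod manifest and fail
# if --initial-cluster-state is not set to "existing".
check_etcd_manifest() {
  manifest="${1:-/etc/kubernetes/manifests/etcd.yaml}"
  # -e is needed because the pattern itself starts with "--".
  grep -E -e '--(initial-cluster-state|initial-cluster|name)=' "$manifest"
  if ! grep -q -e '--initial-cluster-state=existing' "$manifest"; then
    echo "WARNING: --initial-cluster-state is not 'existing'" >&2
    return 1
  fi
}

# Usage on the broken master node:
#   check_etcd_manifest
#   check_etcd_manifest /path/to/another/etcd.yaml
```

The pattern does not match --initial-cluster-token, so only the three flags listed above are printed.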
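After making the changes, run the following command on a node where etcd is running normally. Set --endpoints to only the members currently in the cluster. The argument after member add is the value of --name.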
$ ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.71:2379,https://192.168.1.72:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member add 192.168.1.73 --peer-urls=https://192.168.1.73:2380
Member 2226f8cff2cbbfa9 added to cluster 6294eac8c3e80ca
ETCD_NAME="192.168.1.73"
ETCD_INITIAL_CLUSTER="192.168.1.71=https://192.168.1.71:2380,192.168.1.73=https://192.168.1.73:2380,192.168.1.72=https://192.168.1.72:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.1.73:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
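The ETCD_INITIAL_CLUSTER value printed by member add is the member list that --initial-cluster in etcd.yaml should carry. As a small sketch, the output can be turned into that flag as follows. The function name initial_cluster_flag is hypothetical, and the sample text is the output shown above.

```shell
# Read `member add` output on stdin and print the matching --initial-cluster flag.
# The pattern requires ="..." directly after ETCD_INITIAL_CLUSTER, so the
# ETCD_INITIAL_CLUSTER_STATE line is not picked up.
initial_cluster_flag() {
  sed -n 's/^ETCD_INITIAL_CLUSTER="\(.*\)"$/--initial-cluster=\1/p'
}

add_output='Member 2226f8cff2cbbfa9 added to cluster 6294eac8c3e80ca
ETCD_NAME="192.168.1.73"
ETCD_INITIAL_CLUSTER="192.168.1.71=https://192.168.1.71:2380,192.168.1.73=https://192.168.1.73:2380,192.168.1.72=https://192.168.1.72:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.1.73:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"'

printf '%s\n' "$add_output" | initial_cluster_flag
```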
Restart kubelet so that it brings etcd back up:
$ systemctl restart kubelet
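Once etcd is back up, it is worth confirming that the member has rejoined. The sketch below wraps the endpoints and certificate flags used throughout this write-up in one function, then uses it for a health check. The function name etcdctl_tls and the ETCDCTL and ENDPOINTS override variables are hypothetical and exist only in this sketch.

```shell
# Run etcdctl (API v3) with the endpoints and certificate paths used in this write-up.
# ETCDCTL overrides the binary and ENDPOINTS overrides the endpoint list.
etcdctl_tls() {
  ETCDCTL_API=3 "${ETCDCTL:-etcdctl}" \
    --endpoints="${ENDPOINTS:-https://192.168.1.71:2379,https://192.168.1.72:2379,https://192.168.1.73:2379}" \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    "$@"
}

# Usage on a master node:
#   etcdctl_tls endpoint health
#   etcdctl_tls member list -w table
```

All three endpoints should report healthy, and member list should show 192.168.1.73 as started.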