[K8s Featured] CKA - How to Troubleshoot Application Failures
1. Checking the Application Status
#Find pods that are not in the Running state
$kubectl get pod |grep 0/1
deployment-flink-jobmanager-7bc59d769-brqzd 0/1 Evicted 0 5m
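Grepping for 0/1 is a quick check, but it misses multi-container pods (for example 1/2 Ready). A more precise filter uses a standard kubectl field selector:
#List every pod whose phase is not Running (also matches Succeeded; filter further if needed)
$kubectl get pods --field-selector=status.phase!=Running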
#Describe a healthy pod; the key information is in the Events section
$kubectl describe pod deployment-flink-jobmanager-57b59994f8-4lqw6
Name: deployment-flink-jobmanager-57b59994f8-4lqw6
Namespace: default
Priority: 0
Node: node-3/192.168.0.248
Start Time: Wed, 16 Mar 2022 01:46:11 +0000
Labels: app=flink
component=jobmanager
pod-template-hash=57b59994f8
Annotations: metrics.alpha.kubernetes.io/custom-endpoints: [{"api":"prometheus", "path":"/metrics", "port":"8080"}]
Status: Running
IP: 10.244.2.251
IPs:
IP: 10.244.2.251
Controlled By: ReplicaSet/deployment-flink-jobmanager-57b59994f8
Containers:
jobmanager:
Container ID: docker://c90f6cc947e20cddd8b72e99411dc58697f319f10653e1c12aa3dfda3a9a518e
Image: 192.168.0.60:5000/test/flink:2022.0221.1542.00
Image ID: docker-pullable://192.168.0.60:5000/test/flink@sha256:2f31389c4b5ac444ed03e174b2a0fe9c5e23469b0fe4dc31149dd29cb87a2c81
Ports: 8123/TCP, 8124/TCP, 8091/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Command:
/opt/flink/scripts/start.sh
Args:
jobmanager
$(POD_IP)
State: Running
Started: Wed, 16 Mar 2022 01:46:12 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 500m
memory: 1Gi
Requests:
cpu: 500m
memory: 1Gi
Liveness: tcp-socket :8123 delay=30s timeout=10s period=20s #success=1 #failure=5
Readiness: tcp-socket :8123 delay=30s timeout=10s period=20s #success=1 #failure=5
Environment:
POD_IP: (v1:status.podIP)
POD_NAME: deployment-flink-jobmanager-57b59994f8-4lqw6 (v1:metadata.name)
JVM_ARGS: -Xms1024m -Xmx4096m -XX:MetaspaceSize=256M
Mounts:
/opt/flink/conf from flink-config-volume (rw)
/opt/flink/log from flink-jobmanager-log-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from serviceaccount-test-token-8k8z2 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
flink-config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: flink-config
Optional: false
flink-jobmanager-log-dir:
Type: HostPath (bare host directory volume)
Path: /opt/container/flink/jobmanager/logs
HostPathType:
serviceaccount-test-token-8k8z2:
Type: Secret (a volume populated by a Secret)
SecretName: serviceaccount-test-token-8k8z2
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 66s default-scheduler Successfully assigned default/deployment-flink-jobmanager-57b59994f8-4lqw6 to node-3
Normal Pulling 65s kubelet Pulling image "192.168.0.60:5000/test/flink:2022.0221.1542.00"
Normal Pulled 65s kubelet Successfully pulled image "192.168.0.60:5000/test/flink:2022.0221.1542.00" in 113.582103ms
Normal Created 65s kubelet Created container jobmanager
Normal Started 65s kubelet Started container jobmanager
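If you only need the events rather than the full describe output, kubectl can query them directly:
#Events for a single pod, oldest first
$kubectl get events --field-selector involvedObject.name=deployment-flink-jobmanager-57b59994f8-4lqw6 --sort-by=.metadata.creationTimestamp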
2. Handling the Pending State
2.1 Inspecting the Details of a Pending Pod
#List pods in the Pending state
$kubectl get pod |grep Pending
deployment-flink-jobmanager-7c879b9649-2tmj9 0/1 Pending 0 58s
#Check the Events of the abnormal pod
$kubectl describe pod deployment-flink-jobmanager-7c879b9649-2tmj9
Name: deployment-flink-jobmanager-7c879b9649-2tmj9
Namespace: default
Priority: 0
Node: <none>
Labels: app=flink
component=jobmanager
pod-template-hash=7c879b9649
Annotations: metrics.alpha.kubernetes.io/custom-endpoints: [{"api":"prometheus", "path":"/metrics", "port":"8080"}]
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/deployment-flink-jobmanager-7c879b9649
Containers:
jobmanager:
Image: 192.168.0.60:5000/test/flink:2022.0221.1542.00
...(output omitted)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 103s default-scheduler 0/28 nodes are available: 25 node(s) didn't match Pod's node affinity, 3 Insufficient memory.
Warning FailedScheduling 103s default-scheduler 0/28 nodes are available: 25 node(s) didn't match Pod's node affinity, 3 Insufficient memory.
2.2 Common Causes of the Pending State
① Insufficient resources: the cluster, or the Nodes matched by the Pod's node selector labels, do not have enough CPU or memory. In the example above, deployment-flink-jobmanager-7c879b9649-2tmj9 stays Pending because the eligible nodes are short on memory; the fix is to reduce the Pod's memory request, enlarge the Nodes' memory, or even add new nodes (and label them accordingly). See the compute resources documentation for adjusting Pod resources, and the commands after this list for checking node capacity.
② hostPort is used: binding a Pod to a hostPort restricts the set of nodes that can run it. In most cases hostPort is unnecessary; expose the Pod with a Service object instead.
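A quick way to confirm cause ① is to compare each node's allocatable resources with what is already requested (the second command needs metrics-server installed in the cluster):
#Requests/limits already allocated on every node
$kubectl describe nodes |grep -A 8 "Allocated resources"
#Live CPU/memory usage per node (requires metrics-server)
$kubectl top nodes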
3. Handling the ContainerCreating or Waiting State
3.1 Inspecting the Details of a ContainerCreating/Waiting Pod
#List pods in the ContainerCreating or Waiting state
$kubectl get pod |grep ContainerCreating
deployment-flink-jobmanager-7bc59d769-7xqz7 0/1 ContainerCreating 0 37s
#Check the Events of a pod in the ContainerCreating or Waiting state
$kubectl describe pod deployment-flink-jobmanager-7bc59d769-7xqz7
Name: deployment-flink-jobmanager-7bc59d769-7xqz7
Namespace: default
Priority: 0
Node: node-3/192.168.0.248
Start Time: Thu, 17 Mar 2022 07:27:51 +0000
Labels: app=flink
component=jobmanager
pod-template-hash=7bc59d769
Annotations: metrics.alpha.kubernetes.io/custom-endpoints: [{"api":"prometheus", "path":"/metrics", "port":"8080"}]
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/deployment-flink-jobmanager-7bc59d769
Containers:
jobmanager:
Container ID:
Image: 192.168.0.60:5000/test/flink:2022.0221.1542.00
Image ID:
Ports: 8123/TCP, 8124/TCP, 8091/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Command:
/opt/flink/scripts/start.sh
Args:
jobmanager
$(POD_IP)
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 500m
memory: 1Gi
Requests:
cpu: 500m
memory: 1Gi
Liveness: tcp-socket :8123 delay=30s timeout=10s period=20s #success=1 #failure=5
Readiness: tcp-socket :8123 delay=30s timeout=10s period=20s #success=1 #failure=5
...(output omitted)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m40s default-scheduler Successfully assigned default/deployment-flink-jobmanager-7bc59d769-7xqz7 to node-3
Warning FailedMount 66s kubelet Unable to attach or mount volumes: unmounted volumes=[flink-config-volume], unattached volumes=[flink-config-volume rtacomposer-volume flink-jobmanager-log-dir serviceaccount-token-8k8z2]: timed out waiting for the condition
Warning FailedMount 61s (x9 over 3m8s) kubelet MountVolume.SetUp failed for volume "flink-config-volume" : configmap "flink-config" not found
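Here the FailedMount event points directly at the root cause: the ConfigMap referenced by flink-config-volume does not exist. Verifying and fixing it is straightforward (the --from-file path is illustrative):
#Confirm the referenced ConfigMap is missing
$kubectl get configmap flink-config
Error from server (NotFound): configmaps "flink-config" not found
#Recreate it from a directory of Flink configuration files
$kubectl create configmap flink-config --from-file=./flink-conf/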
3.2 Common Causes of the ContainerCreating or Waiting State
① Volume mount failure
For example, mounting a local disk, a ConfigMap, or a Secret fails.
② Disk full
Starting a Pod calls the CRI to create containers, and when the container runtime creates a container it typically creates directories and files for it under its data directory. If the disk holding that data directory is full, creation fails with an error such as (see the df check after this list):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 2m (x4307 over 16h) kubelet, 10.179.80.31 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "apigateway-6dc48bf8b6-l8xrw": Error response from daemon: mkdir /var/lib/docker/aufs/mnt/1f09d6c1c9f24e8daaea5bf33a4230de7dbc758e3b22785e8ee21e3e3d921214-init: no space left on device
③ The Pod's limit is too small or uses the wrong unit
If a limit is set so low that even the sandbox cannot start, the Pod also gets stuck in this state. The common case is a memory limit made tiny by the wrong unit: writing a lowercase m for memory, as one would for a CPU request. For memory, m is the milli suffix, so for example 512m means a fraction of a byte rather than 512 MB; use Mi or M instead (see the resources snippet after this list):
to start sandbox container for pod ... Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"signal: killed\"": unknown
④ CNI network error
A CNI network error usually calls for checking the network plugin's configuration and running state. A plugin that is misconfigured or not running typically manifests as a failure to set up the Pod network or to allocate a Pod IP.
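For cause ②, check the disk holding the runtime's data directory on the node; for cause ③, the fix is a resources block with correct units. Both sketches below are illustrative (the mount point and values depend on your environment):
#On the affected node: check the container runtime's data directory
$df -h /var/lib/docker
#Correctly-united resources block in the Pod spec:
resources:
  limits:
    cpu: 500m      #m is valid for CPU (millicores)
    memory: 1Gi    #memory must use Mi/Gi (or M/G), never lowercase m
  requests:
    cpu: 500m
    memory: 1Gi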
4. Handling Image Error States
4.1 Inspecting the Details of an Image Error
#List pods in an image error state
$kubectl get pod |grep 0/1
deployment-flink-jobmanager-7bc59d769-586rv 0/1 ImagePullBackOff 0 65s
#Check the Events of the pod with the image error
$kubectl describe pod deployment-flink-jobmanager-7bc59d769-586rv
Name: deployment-flink-jobmanager-7bc59d769-586rv
Namespace: default
Priority: 0
Node: node-3/192.168.0.248
Labels: app=flink
component=jobmanager
pod-template-hash=7bc59d769
Annotations: metrics.alpha.kubernetes.io/custom-endpoints: [{"api":"prometheus", "path":"/metrics", "port":"8080"}]
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/deployment-flink-jobmanager-7bc59d769
Containers:
jobmanager:
Image: 192.168.0.60:5000/test/flink:2022.0221.1542.00_x86
...(output omitted)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 72s default-scheduler Successfully assigned default/deployment-flink-jobmanager-7bc59d769-586rv to node-3
Normal Pulling 57s (x3 over 100s) kubelet Pulling image "192.168.0.60:5000/test/flink:2022.0221.1542.00_x86"
Warning Failed 57s (x3 over 100s) kubelet Failed to pull image "192.168.0.60:5000/test/flink:2022.0221.1542.00_x86": rpc error: code = Unknown desc = Error response from daemon: manifest for 192.168.0.60:5000/test/flink:2022.0221.1542.00_x86 not found
Warning Failed 57s (x3 over 100s) kubelet Error: ErrImagePull
Normal BackOff 31s (x4 over 99s) kubelet Back-off pulling image "192.168.0.60:5000/test/flink:2022.0221.1542.00_x86"
Warning Failed 31s (x4 over 99s) kubelet Error: ImagePullBackOff
4.2 Common Causes of Image Errors
① The private registry address is not in insecure-registries
Taking Docker as an example: first log in to the Node where the Pod is scheduled, then edit the daemon.json file (vi /etc/docker/daemon.json) and add the local private registry 192.168.0.60:5000 to the insecure-registries field, and finally reload dockerd for the change to take effect.
{
"registry-mirrors": ["https://r9xxm8z8.mirror.aliyuncs.com","https://registry.docker-cn.com"],
"insecure-registries":["192.168.0.60:5000"],
"default-ulimits": {
"nofile": {
"Name": "nofile",
"Hard": 1000000,
"Soft": 1000000
}
},
"log-driver":"json-file",
"log-opts": {"max-size":"10m", "max-file":"5"}
}
#Reload dockerd to apply the configuration
$sudo systemctl enable docker
$sudo systemctl daemon-reload
$sudo systemctl restart docker
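After the restart, docker info lists the registries Docker treats as insecure, so you can confirm the change took effect:
#Verify the insecure registry was picked up
$docker info |grep -A 2 "Insecure Registries"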
② If the registry address is an HTTPS endpoint with a self-signed certificate, the Node needs the CA certificate
Place the registry's CA certificate at /etc/docker/certs.d/<address>/ca.crt, for example /etc/docker/certs.d/registry.access.test.com/ca.crt.
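A minimal sketch of installing the CA on a node (the certificate filename is illustrative):
#Create the per-registry certificate directory and install the CA
$sudo mkdir -p /etc/docker/certs.d/registry.access.test.com
$sudo cp ca.crt /etc/docker/certs.d/registry.access.test.com/ca.crt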
③ Private registry authentication failure
If the registry requires authentication but the Pod has no imagePullSecrets configured, or the configured Secret does not exist or is wrong, authentication fails; see the article on how to generate and use imagePullSecrets in k8s.
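A sketch of creating such a Secret and wiring it into the Pod (the secret name and credentials are placeholders):
#Create a docker-registry Secret for the private registry
$kubectl create secret docker-registry my-registry-secret --docker-server=192.168.0.60:5000 --docker-username=<user> --docker-password=<password>
#Reference it in the Pod spec:
spec:
  imagePullSecrets:
  - name: my-registry-secret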
④ Corrupted image file
If the image file is corrupted, it is unusable even after being pulled; rebuild the image and push it again.
⑤ Image pull timeout
If too many Pods start on a node at the same time, their image downloads may queue up. If the Pods at the front of the queue pull large images, the downloads can take so long that the Pods further back time out pulling their images. The kubelet flags below control whether images are pulled serially and at what rate.
--serialize-image-pulls    default: true
--registry-qps int32       default: 5
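On clusters where the kubelet is driven by a config file rather than flags, the same knobs exist as KubeletConfiguration fields; a minimal sketch, assuming you manage the kubelet config file yourself:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false   #pull images in parallel instead of one by one
registryPullQPS: 10          #raise the registry pull rate limit (default 5)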
⑥ The image does not exist
Describing the Pod (kubectl describe pod deployment-flink-jobmanager-7bc59d769-586rv) shows the event:
Events:
....
Warning Failed 57s (x3 over 100s) kubelet Failed to pull image "192.168.0.60:5000/test/flink:2022.0221.1542.00_x86": rpc error: code = Unknown desc = Error response from daemon: manifest for 192.168.0.60:5000/test/flink:2022.0221.1542.00_x86 not found
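To confirm which tags actually exist, you can query the Docker Registry v2 HTTP API directly (the sample response is illustrative; here the _x86 tag is indeed missing):
#List the tags available for the repository
$curl http://192.168.0.60:5000/v2/test/flink/tags/list
{"name":"test/flink","tags":["2022.0221.1542.00"]}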
5. Handling the Crashing State
A Pod in the CrashLoopBackOff state did start, but then exited abnormally while running. As long as the Pod's restartPolicy is not Never, it may be restarted, so its RestartCounts is usually greater than 0. You can therefore inspect the container process's exit status to narrow down the problem.
Common causes of Crashing
① The container process exited on its own
② The container was OOM-killed
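Two standard commands narrow the cause down: the previous container instance's logs show why the process exited, and the Last State block in the describe output records the exit reason and code (an OOMKilled reason, or exit code 137, points to cause ②):
#Logs of the previous (crashed) container instance
$kubectl logs deployment-flink-jobmanager-57b59994f8-4lqw6 --previous
#Exit reason and code of the last terminated container
$kubectl describe pod deployment-flink-jobmanager-57b59994f8-4lqw6 |grep -A 5 "Last State"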
6. Running but Not Working Correctly
If a Pod does not behave as expected, there is very likely an error in its manifest (for example mypod.yaml) that was silently ignored when the Pod was created. Typically, mis-nesting a section or misspelling a field name in the Pod definition causes that content to be dropped without any error. For example, if command is misspelled as commnd, the Pod can still be created, but it will not run the command line you expect.
① Validate the deployed yaml with --validate
First delete the running Pod, then recreate it with --validate, e.g. kubectl apply --validate -f mypod.yaml. If command is misspelled as commnd, you will see an error message like:
I0805 10:43:25.129850 46757 schema.go:126] unknown field: commnd
I0805 10:43:25.129973 46757 schema.go:129] this may be a false alarm, see https://github.com/kubernetes/kubernetes/issues/6842
pods/mypod
② Manually compare the local yaml with the one in the cluster
First export the live yaml, e.g. kubectl get pods/mypod -o yaml > mypod-on-k8s.yaml, then compare the two yaml files with a tool such as Beyond Compare.
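If you prefer to stay on the command line, kubectl diff compares a local manifest against the live object and prints what would change if it were applied, which serves the same purpose:
#Diff the local manifest against the cluster state
$kubectl diff -f mypod.yaml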