[K8s Picks] How to Troubleshoot Image and YAML Deployment Problems
2022-02-14
熊本极客
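Step 1: Check the pod events
# If the events show a specific failure, such as insufficient resources, the problem is already located and the remaining steps are unnecessary.
# Method 1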
$kubectl describe pod deployment-flink-jobmanager-f766989b9-dth5v -n iot
# Method 2
$kubectl get events -A |grep deployment-flink-jobmanager-f766989b9-dth5v
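Grepping the cluster-wide event list works, but it matches every line that happens to contain the pod name. As an alternative, the events can be filtered with a field selector and sorted by time. This is a minimal sketch: the function name `pod_events` is hypothetical, and the namespace `iot` is the one that appears in the kubelet log later in this article.

```shell
# pod_events NAMESPACE POD
# Lists only the events whose involved object has the given name,
# sorted by last timestamp (oldest first).
# The function name is illustrative; it is not part of kubectl.
pod_events() {
    kubectl get events -n "$1" \
        --field-selector "involvedObject.name=$2" \
        --sort-by=.lastTimestamp
}

# Usage (requires access to the cluster):
# pod_events iot deployment-flink-jobmanager-f766989b9-dth5v
```

Filtering by namespace and object name keeps events of unrelated pods with similar names out of the result.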
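Step 2: Check the container logs with kubectl logs
# If the logs show a specific failure, such as the startup script exiting abnormally, the problem is already located and the remaining steps are unnecessary.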
$kubectl logs deployment-flink-jobmanager-f766989b9-dth5v -n iot
Error from server (BadRequest): container "jobmanager" in pod "deployment-flink-jobmanager-f766989b9-dth5v" is waiting to start: PodInitializing
# The pod cannot start, so there are no container logs yet; the kubelet logs have to be checked next.
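The `PodInitializing` reason typically means the main container is still waiting for the pod's init containers to finish, so the init containers' own logs are worth a look before moving to the node. The following is a minimal sketch, assuming the pod defines init containers; both helper names are hypothetical.

```shell
# init_container_names NAMESPACE POD
# Prints the names of the pod's init containers, separated by spaces.
init_container_names() {
    kubectl get pod "$2" -n "$1" \
        -o jsonpath='{.spec.initContainers[*].name}'
}

# init_container_logs NAMESPACE POD
# Prints the logs of every init container in turn. If a container has
# already restarted, adding --previous to `kubectl logs` shows the output
# of the prior attempt instead.
init_container_logs() {
    for c in $(init_container_names "$1" "$2"); do
        echo "== init container: $c =="
        kubectl logs "$2" -n "$1" -c "$c"
    done
}

# Usage (requires access to the cluster):
# init_container_logs iot deployment-flink-jobmanager-f766989b9-dth5v
```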
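Step 3: Log in to the node hosting the container and check the kubelet logs with journalctl
# The log below contains no specific error. It only shows that StartContainer failed and the container is in CrashLoopBackOff, which suggests the deployment yaml itself is fine. The next step is therefore to use docker run to track down the problem in the image.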
$journalctl -u kubelet |grep deployment-flink-jobmanager-f766989b9-dth5v
Feb 11 02:08:43 iota-node-3 kubelet[57638]: I0211 02:08:43.239559 57638 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "flink-jobmanager-log-dir" (UniqueName: "kubernetes.io/host-path/db4c58cb-2cec-4c23-b7e9-bd9adfded8b1-flink-jobmanager-log-dir") pod "deployment-flink-jobmanager-f766989b9-dth5v" (UID: "db4c58cb-2cec-4c23-b7e9-bd9adfded8b1")
Feb 11 02:08:43 iota-node-3 kubelet[57638]: I0211 02:08:43.239637 57638 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "flink-tools-dir" (UniqueName: "kubernetes.io/host-path/db4c58cb-2cec-4c23-b7e9-bd9adfded8b1-flink-tools-dir") pod "deployment-flink-jobmanager-f766989b9-dth5v" (UID: "db4c58cb-2cec-4c23-b7e9-bd9adfded8b1")
Feb 11 02:08:43 iota-node-3 kubelet[57638]: I0211 02:08:43.239753 57638 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "flink-config-volume" (UniqueName: "kubernetes.io/configmap/db4c58cb-2cec-4c23-b7e9-bd9adfded8b1-flink-config-volume") pod "deployment-flink-jobmanager-f766989b9-dth5v" (UID: "db4c58cb-2cec-4c23-b7e9-bd9adfded8b1")
Feb 11 02:08:43 iota-node-3 kubelet[57638]: I0211 02:08:43.239801 57638 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "serviceaccount-iota-token-crpqb" (UniqueName: "kubernetes.io/secret/db4c58cb-2cec-4c23-b7e9-bd9adfded8b1-serviceaccount-iota-token-crpqb") pod "deployment-flink-jobmanager-f766989b9-dth5v" (UID: "db4c58cb-2cec-4c23-b7e9-bd9adfded8b1")
Feb 11 02:09:16 iota-node-3 kubelet[57638]: E0211 02:09:16.177804 57638 pod_workers.go:191] Error syncing pod db4c58cb-2cec-4c23-b7e9-bd9adfded8b1 ("deployment-flink-jobmanager-f766989b9-dth5v_iot(db4c58cb-2cec-4c23-b7e9-bd9adfded8b1)"), skipping: failed to "StartContainer" for "jobmanager" with CrashLoopBackOff: "back-off 10s restarting failed container=jobmanager pod=deployment-flink-jobmanager-f766989b9-dth5v_iot(db4c58cb-2cec-4c23-b7e9-bd9adfded8b1)"
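Most of the lines above are informational (`I0211 ...`); only the last one is an error (`E0211 ...`). kubelet writes its messages in the klog format, where the first letter gives the severity, so the noise can be filtered out. This is a sketch that assumes the journalctl line layout shown above; the function name is hypothetical.

```shell
# kubelet_errors
# Reads journalctl output on stdin and keeps only klog lines with severity
# E (error) or F (fatal). klog prefixes each message with the severity
# letter followed by the month and day, e.g. "E0211".
kubelet_errors() {
    grep -E 'kubelet\[[0-9]+\]: [EF][0-9]{4} '
}

# Usage on the node:
# journalctl -u kubelet | grep deployment-flink-jobmanager-f766989b9-dth5v | kubelet_errors
```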
# By contrast, if the log contains a specific error, as in the excerpt below, the fault lies in the yaml's initContainers.
Feb 14 08:10:48 iota-node-3 kubelet[57638]: I0214 08:10:48.685536 57638 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "flink-jobmanager-log-dir" (UniqueName: "kubernetes.io/host-path/ec0429a1-5774-466b-a5b0-699ca9353b61-flink-jobmanager-log-dir") pod "deployment-flink-jobmanager-689bc4b45f-tpvjb" (UID: "ec0429a1-5774-466b-a5b0-699ca9353b61")
Feb 14 08:10:48 iota-node-3 kubelet[57638]: I0214 08:10:48.685574 57638 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "flink-config-volume" (UniqueName: "kubernetes.io/configmap/ec0429a1-5774-466b-a5b0-699ca9353b61-flink-config-volume") pod "deployment-flink-jobmanager-689bc4b45f-tpvjb" (UID: "ec0429a1-5774-466b-a5b0-699ca9353b61")
Feb 14 08:10:48 iota-node-3 kubelet[57638]: I0214 08:10:48.685698 57638 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "serviceaccount-iota-token-crpqb" (UniqueName: "kubernetes.io/secret/ec0429a1-5774-466b-a5b0-699ca9353b61-serviceaccount-iota-token-crpqb") pod "deployment-flink-jobmanager-689bc4b45f-tpvjb" (UID: "ec0429a1-5774-466b-a5b0-699ca9353b61")
Feb 14 08:10:50 iota-node-3 kubelet[57638]: E0214 08:10:50.660663 57638 kuberuntime_container.go:706] failed to remove pod init container "init-flink-home-dir": rpc error: code = Unknown desc = failed to remove container "e594274677f05e878b9133c4ead97d6eb91f240616540f9732d158c07a648cc7": Error response from daemon: removal of container e594274677f05e878b9133c4ead97d6eb91f240616540f9732d158c07a648cc7 is already in progress; Skipping pod "deployment-flink-jobmanager-689bc4b45f-tpvjb_iot(ec0429a1-5774-466b-a5b0-699ca9353b61)"
Feb 14 08:10:50 iota-node-3 kubelet[57638]: E0214 08:10:50.660941 57638 pod_workers.go:191] Error syncing pod ec0429a1-5774-466b-a5b0-699ca9353b61 ("deployment-flink-jobmanager-689bc4b45f-tpvjb_iot(ec0429a1-5774-466b-a5b0-699ca9353b61)"), skipping: failed to "StartContainer" for "init-flink-home-dir" with CrashLoopBackOff: "back-off 10s restarting failed container=init-flink-home-dir pod=deployment-flink-jobmanager-689bc4b45f-tpvjb_iot(ec0429a1-5774-466b-a5b0-699ca9353b61)"
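The two excerpts differ in which container keeps failing: `jobmanager` in the first, the init container `init-flink-home-dir` in the second. A small filter can pull that name out of the kubelet log. This is a sketch that relies on the message wording shown above, which may differ between kubelet versions; the function name is hypothetical.

```shell
# crashloop_containers
# Reads kubelet log lines on stdin and prints the distinct names of
# containers reported as failing to start with CrashLoopBackOff.
crashloop_containers() {
    sed -n 's/.*failed to "StartContainer" for "\([^"]*\)" with CrashLoopBackOff.*/\1/p' | sort -u
}

# Usage on the node:
# journalctl -u kubelet | grep deployment-flink-jobmanager | crashloop_containers
```

If the printed name matches an entry under initContainers in the deployment yaml, that section is the place to look; if it is the main container, the image is the more likely suspect.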
Step 4: Log in to the node hosting the container and run the container with docker run
$docker run -it 192.168.0.60:5000/test/flink:2022.0211.1104.57 bash
bash-5.0$ cd /opt/test/flink/
bash-5.0$ ls -l
total 48
drwxrwxrwx 1 flink flink 4096 Feb 11 06:49 bin
drwxrwxrwx 1 flink flink 4096 Jun 15 2021 conf
drwxrwxrwx 1 flink flink 4096 Jun 15 2021 examples
drwxrwxrwx 1 flink flink 4096 Feb 11 06:49 lib
-rwxrwxrwx 1 flink flink 11558 Apr 29 2021 LICENSE
drwxrwxrwx 1 flink flink 4096 Apr 29 2021 log
drwxrwxrwx 1 flink flink 4096 Jun 15 2021 opt
drwxrwxrwx 1 flink flink 4096 Jun 15 2021 plugins
-rwxrwxrwx 1 flink flink 1341 Apr 29 2021 README.txt
drwxr--r-- 1 flink flink 4096 Feb 11 06:49 scripts
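A listing like this can also be screened mechanically for directories that users other than the owner and group cannot enter. This is a minimal sketch that reads `ls -l` output; the function name is hypothetical, and file names containing spaces are not handled.

```shell
# dirs_closed_to_others
# Reads `ls -l` output on stdin and prints the names of directories whose
# "other" permission bits lack execute (search) permission. Character 10
# of the mode string is the "other" execute bit: x or t means it is set.
dirs_closed_to_others() {
    awk '$1 ~ /^d/ && substr($1, 10, 1) != "x" && substr($1, 10, 1) != "t" { print $NF }'
}

# Usage on the node (assumes the image provides ls, as the session above shows):
# docker run --rm --entrypoint ls 192.168.0.60:5000/test/flink:2022.0211.1104.57 -l /opt/test/flink | dirs_closed_to_others
```

Applied to the listing above, the filter prints only `scripts`.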
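# Note: the image's dockerfile sets USER paas, but inside the container started with docker run the directories are owned by flink, and scripts (drwxr--r--) grants execute permission to no one but its owner, so other users cannot enter it. The image is therefore at fault, and the directory permissions need to be fixed in the dockerfile.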
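The article does not show the actual dockerfile change. One possible fix, offered only as an assumption, is to open the tree for reading and directory traversal with chmod's symbolic mode; changing the owner to the runtime user with chown would be another option. The function name is hypothetical.

```shell
# open_tree DIR
# Grants read access on everything under DIR to all users, plus execute
# (search) access on directories and on files that are already executable
# by someone. That is the meaning of the capital X in chmod's symbolic mode.
open_tree() {
    chmod -R a+rX "$1"
}

# In a dockerfile the same command could run as a build step before the
# USER instruction, for example (hypothetical):
#   RUN chmod -R a+rX /opt/test/flink
```

Compared with `chmod -R 777`, this avoids making every file writable and executable by everyone.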