How to debug a pod on Kubernetes

2019-10-22 Lis_

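How do you debug a failing pod in Kubernetes? A good first step is to list the pods that are not in the Running state with kubectl get pods --all-namespaces | grep -iv Running. The most common error status is CrashLoopBackOff: the container crashes shortly after it starts, Kubernetes restarts it, and it crashes again, so the kubelet waits longer and longer between restart attempts.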
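
The grep filter works on the printed text, so it is easy to adapt. As a minimal sketch, the helper below keeps the header row and every pod whose STATUS column is neither Running nor Completed. The name not_healthy is made up for this example, and the helper assumes the six-column layout printed by kubectl get pods --all-namespaces.

```shell
# Hypothetical helper (not part of kubectl): reads the output of
# `kubectl get pods --all-namespaces` on stdin and keeps the header plus
# every row whose 4th column (STATUS) is not Running or Completed.
not_healthy() {
  awk 'NR == 1 || ($4 != "Running" && $4 != "Completed")'
}

# Usage (requires access to a cluster):
#   kubectl get pods --all-namespaces | not_healthy
#
# A server-side alternative is
#   kubectl get pods --all-namespaces --field-selector=status.phase!=Running
# but a pod in CrashLoopBackOff typically still reports the phase Running,
# so the text filter on the STATUS column tends to catch more.
```

Matching on the column rather than on the whole line avoids dropping a pod just because its name happens to contain the word "running".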

Possible causes of a pod crash

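An error while pulling the image: a wrong or missing image, or wrong or missing pull secrets;
A runtime error in the application, for example missing environment variables, ConfigMaps or Secrets;
A failing liveness probe;
Resource consumption (memory, CPU) that is too high, or resource limits that are too strict;
A PersistentVolume (PV) that was not created or could not be mounted;
A container image that was not updated.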
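In most cases, kubectl logs ... or kubectl describe ... with the appropriate arguments reveals the cause of the failure. kubectl logs --help explains the available options.
Note: even when a pod is in the Running state, a high Restarts count suggests that the pod has an underlying problem.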
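
To make the note about restart counts concrete, here is a small sketch that lists pods whose RESTARTS value exceeds a threshold. The helper name high_restarts and the default threshold of 5 are assumptions chosen for illustration; the helper expects the five-column layout of kubectl get pods for a single namespace.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# "<pod name> <restart count>" for every pod above the given threshold.
# Column 4 is RESTARTS in the single-namespace layout
# (NAME READY STATUS RESTARTS AGE).
high_restarts() {
  threshold="${1:-5}"   # assumed default; pick what suits your workload
  awk -v t="$threshold" 'NR > 1 && $4 + 0 > t { print $1, $4 }'
}

# Usage (requires access to a cluster):
#   kubectl get pods | high_restarts 3
```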

A wrong image name prevents the pod from running

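kubectl describe pod <your-pod> -n <your-namespace> gives more detail.
The Events section shows the error message Failed to pull image... together with Reason: Failed. The pod status is ErrImagePull at first and becomes ImagePullBackOff once the kubelet starts backing off between pull attempts.
Create a Pod (the image name debiann is misspelled on purpose):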

apiVersion: v1
kind: Pod 
metadata:
  name: termination-demo
spec:
  containers:
  - name: termination-demo-container
    image: debiann
    command: ["/bin/sh"]
    args: ["-c", "sleep 10 && echo Sleep expired > /dev/termination-log"]

# kubectl get pods
NAME                               READY   STATUS         RESTARTS   AGE
termination-demo                   0/1     ErrImagePull   0          4s

# kubectl describe pods termination-demo
...
Events:
  Type     Reason     Age                From                     Message
  ----     ------     ----               ----                     -------
  Normal   Scheduled  72s                default-scheduler        Successfully assigned default/termination-demo to 172.16.219.186
  Normal   Pulling    31s (x3 over 71s)  kubelet, 172.16.219.186  pulling image "debiann"
  Warning  Failed     30s (x3 over 70s)  kubelet, 172.16.219.186  Failed to pull image "debiann": rpc error: code = Unknown desc = Error response from daemon: pull access denied for debiann, repository does not exist or may require 'docker login'
  Warning  Failed     30s (x3 over 70s)  kubelet, 172.16.219.186  Error: ErrImagePull
  Normal   BackOff    6s (x4 over 69s)   kubelet, 172.16.219.186  Back-off pulling image "debiann"
  Warning  Failed     6s (x4 over 69s)   kubelet, 172.16.219.186  Error: ImagePullBackOff
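
When only the reason is needed, for example in a script, it can be read directly from the pod status instead of scanning the Events table. The sketch below wraps kubectl get pod with a jsonpath expression; the function name waiting_reason and the default namespace argument are assumptions made for this example.

```shell
# Hypothetical helper: prints the waiting reason of each container in a pod
# (for example ErrImagePull, ImagePullBackOff or CrashLoopBackOff).
# Prints nothing when no container is in the waiting state.
waiting_reason() {
  pod="$1"
  namespace="${2:-default}"   # assumed default namespace
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'
}

# Usage (requires access to a cluster):
#   waiting_reason termination-demo
```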

Missing ConfigMap or Secret

Create a Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: termination-demo
  labels:
    app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        args: ["-c", "sed \"s/foo/bar/\" < $MYFILE"]

# kubectl get pods
NAME                                READY   STATUS             RESTARTS   AGE
termination-demo-6654b86785-vf9bx   0/1     CrashLoopBackOff   2          41s

# kubectl describe pods termination-demo-6654b86785-vf9bx
......
Events:
  Type     Reason     Age                From                     Message
  ----     ------     ----               ----                     -------
  Normal   Scheduled  69s                default-scheduler        Successfully assigned default/termination-demo-6654b86785-vf9bx to 172.16.219.186
  Normal   Pulling    16s (x4 over 68s)  kubelet, 172.16.219.186  pulling image "debian"
  Normal   Pulled     15s (x4 over 63s)  kubelet, 172.16.219.186  Successfully pulled image "debian"
  Normal   Created    14s (x4 over 62s)  kubelet, 172.16.219.186  Created container
  Normal   Started    14s (x4 over 62s)  kubelet, 172.16.219.186  Started container
  Warning  BackOff    1s (x8 over 59s)   kubelet, 172.16.219.186  Back-off restarting failed container

# kubectl logs termination-demo-6654b86785-vf9bx
/bin/sh: 1: cannot open : No such file
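
With a crash-looping container, kubectl logs may be called at a moment when the current instance has not written anything yet. As a hedged sketch, the wrapper below asks for the log of the previous, already terminated instance instead. The function name previous_logs and the default of 50 lines are assumptions; --previous and --tail are regular kubectl logs options.

```shell
# Hypothetical helper: shows the last lines logged by the previous
# (crashed) instance of the container in the given pod.
previous_logs() {
  pod="$1"
  lines="${2:-50}"   # assumed default number of lines
  kubectl logs "$pod" --previous --tail="$lines"
}

# Usage (requires access to a cluster):
#   previous_logs termination-demo-6654b86785-vf9bx
```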

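The events contain no useful error message. The log gives the hint: the shell cannot open a file with an empty name, because $MYFILE is not set. The pod is missing a ConfigMap that provides this variable, so create one and reference it from the Deployment: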

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-env
data:
  MYFILE: "/etc/profile"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: termination-demo
  labels:
    app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        args: ["-c", "sed \"s/foo/bar/\" < $MYFILE"]
        envFrom:
        - configMapRef:
            name: app-env

# kubectl apply -f configmap.yaml
configmap/app-env created
deployment.apps/termination-demo configured

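After the ConfigMap is added, the pod status is still CrashLoopBackOff. The sed command finishes, the container exits, and because this is not a long-running service, the Deployment keeps restarting it. To keep the pod running, replace the command with a loop that never exits: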

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-env
data:
  MYFILE: "/etc/profile"
  SLEEP: "5"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: termination-demo
  labels:
    app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        # args: ["-c", "sed \"s/foo/bar/\" < $MYFILE"]
        args: ["-c", "while true; do sleep $SLEEP; echo sleeping; done;"]
        envFrom:
        - configMapRef:
            name: app-env
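
Once the Deployment references the ConfigMap, it can be worth confirming that the variables actually reach the container. The following sketch runs printenv inside the pod for each requested name. The function name show_env is made up for this example, and it assumes the image ships printenv, which the debian image does.

```shell
# Hypothetical helper: prints NAME=value for each variable as seen inside
# the pod's container, or "(not set)" when printenv reports a failure.
show_env() {
  pod="$1"
  shift
  for name in "$@"; do
    printf '%s=' "$name"
    kubectl exec "$pod" -- printenv "$name" || echo "(not set)"
  done
}

# Usage (requires access to a cluster; substitute the real pod name):
#   show_env <your-pod> MYFILE SLEEP
```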

Resource limits

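When you define a pod, you can specify the resources the application may use, such as memory or CPU. If you do not set them, and the namespace defines no defaults, the container runs without requests or limits (effectively CPU: 0m, in milli-CPU, and RAM: 0Gi), which means it is bounded only by what the node itself can provide.
Kubernetes manages an application's resources through requests and limits. A request is the amount of a resource guaranteed to the container, and the scheduler uses it to pick a node; a limit is the maximum amount the container is allowed to use. Their relationship is 0 <= requests <= limit. For both settings you have to take into account the total resources the available nodes can provide. In the following Deployment the CPU request is more than either node can offer, so the pod cannot be scheduled: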
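
CPU quantities can be written either in whole CPUs ("6", "0.5") or in milli-CPU ("600m"), and mixing the two up is an easy mistake. The sketch below normalizes both notations to milli-CPU and checks the relation requests <= limit. Both function names are hypothetical, and the conversion only covers the plain CPU notations mentioned here.

```shell
# Hypothetical helper: converts a CPU quantity to milli-CPU.
# "600m" stays 600; "6" becomes 6000; "0.5" becomes 500.
cpu_to_milli() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# Hypothetical helper: succeeds when the request does not exceed the limit.
request_within_limit() {
  [ "$(cpu_to_milli "$1")" -le "$(cpu_to_milli "$2")" ]
}

cpu_to_milli 600m   # prints 600
cpu_to_milli 6      # prints 6000
request_within_limit 600m 1 && echo "600m fits under a limit of 1 CPU"
```

The factor of ten between 600m and 6 is exactly the kind of difference that decides whether a pod fits on a node.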

apiVersion: apps/v1
kind: Deployment
metadata:
  name: termination-demo
  labels:
    app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        args: ["-c", "sleep 10 && echo Sleep expired > /dev/termination-log"]
        resources:
          requests:
            cpu: "6"

# kubectl describe po termination-demo-fdb7bb7d9-mzvfw
Name:           termination-demo-fdb7bb7d9-mzvfw
Namespace:      default
...
Containers:
  termination-demo-container:
    Image:      debian
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      -c
      sleep 10 && echo Sleep expired > /dev/termination-log
    Requests:
      cpu:        6
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-t549m (ro)
Conditions:
  Type           Status
  PodScheduled   False
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  9s (x7 over 40s)  default-scheduler  0/2 nodes are available: 2 Insufficient cpu.
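
When the scheduler reports Insufficient cpu, a natural next step is to compare the request with what the nodes can allocate. The sketch below prints the allocatable CPU and memory of every node; the function name node_allocatable is an assumption made for this example. Note that allocatable is the total available to pods on a node, not what is currently free; the "Allocated resources" section of kubectl describe node shows how much is already requested.

```shell
# Hypothetical helper: one line per node with its allocatable CPU and memory.
node_allocatable() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
}

# Usage (requires access to a cluster):
#   node_allocatable
```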

Image not updated

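Suppose you add a fix to your application, rebuild the image and push it to the registry, but after you redeploy, the container still does not come up correctly. Whether the new image is actually used depends on the image pull policy defined in Kubernetes.
If you did not change the image tag, the default policy IfNotPresent tells Kubernetes to use the image already cached on the node. (When the tag is latest or is omitted, the default policy is Always instead.)
The best practice is to avoid the latest tag and to change the image tag every time you change anything in the image.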
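
The default pull policy follows from the way the image is referenced, which the sketch below tries to capture. The function name implied_pull_policy is made up for this example; it mirrors the defaults Kubernetes applies when imagePullPolicy is omitted at the time the object is created, and it is an illustration rather than something the cluster exposes.

```shell
# Hypothetical helper: prints the pull policy Kubernetes would default to
# for a given image reference when imagePullPolicy is not set.
implied_pull_policy() {
  name="${1##*/}"   # drop the registry/repository path, which may contain a port
  case "$name" in
    *@*)      echo IfNotPresent ;;   # pinned by digest
    *:latest) echo Always ;;
    *:*)      echo IfNotPresent ;;   # any other explicit tag
    *)        echo Always ;;         # no tag is treated like :latest
  esac
}

implied_pull_policy debian      # prints Always
implied_pull_policy debian:11   # prints IfNotPresent
```

With a fixed tag such as debian:11, a rebuilt image pushed under the same tag is not pulled again on nodes that already cache it, which is why a new tag per build is the safer habit.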