[Solved] k8s Cronjob.spec.failedJobs

2020-11-03 · 王小奕

Background

As the YAML below shows, .spec.failedJobsHistoryLimit is explicitly set to 1, yet 7 Pods in the Error state were left behind:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: mycronjob
  namespace: prod
  labels:
    task: processor
spec:
  failedJobsHistoryLimit: 1
  successfulJobsHistoryLimit: 3
……

kubectl get pod -n prod -l task=processor
NAME                      READY   STATUS   RESTARTS   AGE
mycronjob-16043364027mpp   0/1     Error    0          9h
mycronjob-16043364098q8q   0/1     Error    0          9h
mycronjob-160433640hc2ch   0/1     Error    0          9h
mycronjob-160433640nrdqb   0/1     Error    0          9h
mycronjob-160433640r49cq   0/1     Error    0          8h
mycronjob-160433640tnfvw   0/1     Error    0          9h
mycronjob-160433640vhdsc   0/1     Error    0          9h

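So the question is: why does CronJob.spec.successfulJobsHistoryLimit take effect, while CronJob.spec.failedJobsHistoryLimit seems not to?

Analysis

Before answering, we first need to be clear about what a CronJob actually does.
The official documentation says: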

A CronJob creates Jobs on a repeating schedule.

One CronJob object is like one line of a crontab (cron table) file. It runs a job periodically on a given schedule, written in Cron format.

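As the definition makes clear, a CronJob manages Jobs, and it is the Job that actually creates Pods. So to find out why CronJob.spec.failedJobsHistoryLimit appears to be ignored, we need to look at the configuration of the Jobs that the CronJob creates on schedule.
Run: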

kubectl get job -n prod -l task=processor -o yaml

Output:

apiVersion: v1
items:
- apiVersion: batch/v1
  kind: Job
  metadata:
    labels:
      task: processor
    name: processor-1604336400
    namespace: prod
    ownerReferences:
    - apiVersion: batch/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: CronJob
      name: processor
  spec:
    backoffLimit: 6
    completions: 1
    parallelism: 1
  status:
    conditions:
    - message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      type: Failed

Note the spec.backoffLimit field. The official documentation explains it as follows:

There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes. The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.

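In other words, when a Pod created by a Job fails, the Job controller by default recreates it up to 6 times before marking the Job as failed. If we do not want that many retries, we can change .spec.backoffLimit.
That accounts for what we saw: a single failed Job with backoffLimit: 6 can leave up to 7 failed Pods behind (the first attempt plus 6 retries), which matches the 7 Error Pods listed above.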

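Summary

The CronJob creates Jobs and, per our configuration, keeps at most 1 failed Job and 3 successful Jobs in its history. When a Job counts as failed, however, is decided by Job.spec.backoffLimit. CronJob.spec.failedJobsHistoryLimit therefore only limits the number of retained Jobs, which can be checked with kubectl get job -n prod -l task=processor. To limit the number of failed Pods that end up being kept, we have to control Job.spec.backoffLimit, which in a CronJob manifest lives at .spec.jobTemplate.spec.backoffLimit.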
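
As a rough sketch of that fix, the fragment below extends the manifest from the Background section with a backoffLimit under jobTemplate.spec. The value 0 is only an assumed example (no retries, so each failed Job should leave a single failed Pod), not the setting the author used; pick whatever retry count suits the workload. The fields elided in the original manifest stay elided here.

```yaml
# Matches the original manifest; clusters on Kubernetes 1.21+ serve CronJob as batch/v1.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: mycronjob
  namespace: prod
  labels:
    task: processor
spec:
  failedJobsHistoryLimit: 1        # keep at most 1 failed Job
  successfulJobsHistoryLimit: 3    # keep at most 3 successful Jobs
  jobTemplate:
    spec:
      backoffLimit: 0              # assumed example value: fail the Job after the first failed Pod
      # remaining fields (schedule, Pod template, ...) elided, as in the original manifest
```

After applying a change like this, kubectl get job -n prod -l task=processor -o yaml should show the new backoffLimit on Jobs created from then on; Jobs that already exist keep the value they were created with.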

References

Running Automated Tasks with a CronJob
Jobs
Pod Lifecycle

Food for thought

If CronJob.spec.failedJobsHistoryLimit is set to 2 and Job.spec.backoffLimit is set to 5, what is the maximum number of Pods in the Error state that will be kept?
