
[Solved] k8s CronJob.spec.failedJobsHistoryLimit

2020-11-03  王小奕

Tags

kubernetes, CronJob, pod

Background

As the YAML below shows, .spec.failedJobsHistoryLimit is clearly set to 1, yet 7 Pods ended up in the Error state:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: mycronjob
  namespace: prod
  labels:
    task: processor
spec:
  failedJobsHistoryLimit: 1
  successfulJobsHistoryLimit: 3
……
kubectl get pod -n prod -l task=processor
NAME                      READY   STATUS   RESTARTS   AGE
mycronjob-16043364027mpp   0/1     Error    0          9h
mycronjob-16043364098q8q   0/1     Error    0          9h
mycronjob-160433640hc2ch   0/1     Error    0          9h
mycronjob-160433640nrdqb   0/1     Error    0          9h
mycronjob-160433640r49cq   0/1     Error    0          8h
mycronjob-160433640tnfvw   0/1     Error    0          9h
mycronjob-160433640vhdsc   0/1     Error    0          9h

So the question is: why does CronJob.spec.successfulJobsHistoryLimit take effect, while CronJob.spec.failedJobsHistoryLimit apparently does not?

Analysis

To understand this, we first need to be clear about what a CronJob actually does. The official documentation says:

A CronJob creates Jobs on a repeating schedule.

One CronJob object is like one line of a crontab (cron table) file. It runs a job periodically on a given schedule, written in Cron format.

From this definition it is clear that a CronJob manages Jobs, and it is the Job that actually creates Pods. So to find out why CronJob.spec.failedJobsHistoryLimit appears ineffective, we need to look at the configuration of the Jobs the CronJob periodically creates.
Run:

kubectl get job -n prod -l task=processor -o yaml

which returns:

apiVersion: v1
items:
- apiVersion: batch/v1
  kind: Job
  metadata:
    labels:
      task: processor
    name: processor-1604336400
    namespace: prod
    ownerReferences:
    - apiVersion: batch/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: CronJob
      name: processor
  spec:
    backoffLimit: 6
    completions: 1
    parallelism: 1
  status:
    conditions:
    - message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      type: Failed

Note the spec.backoffLimit field. The official explanation is:

There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes. The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.

In other words, while a Job is running, if a Pod it created fails, the Job controller will by default retry up to 6 times, creating a new Pod for each retry, before marking the Job as failed. If we don't want that many retries, we can lower .spec.backoffLimit.
At this point it should be clear where the problem lies.
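To cap the number of failed Pods, backoffLimit can be set inside the CronJob's jobTemplate. A minimal sketch under assumptions (the schedule, container name, and image below are illustrative, not taken from the original manifest):

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: mycronjob
  namespace: prod
  labels:
    task: processor
spec:
  schedule: "0 * * * *"           # illustrative schedule
  failedJobsHistoryLimit: 1
  successfulJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 0             # fail the Job after the first Pod failure; no retries
      template:
        spec:
          restartPolicy: Never    # each failure surfaces as a distinct failed Pod
          containers:
          - name: processor
            image: processor:latest   # hypothetical image
```

With backoffLimit: 0, a failing run leaves a single Error Pod per retained failed Job instead of seven.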

Summary

The CronJob creates Jobs, and per our configuration it retains at most 1 failed and 3 successful Jobs in history. But when a Job counts as failed is governed by Job.spec.backoffLimit. So CronJob.spec.failedJobsHistoryLimit only limits the number of retained Jobs, which you can check with kubectl get job -n prod -l task=processor. To limit the final number of failed Pods, you must also control Job.spec.backoffLimit.
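The observed count also checks out arithmetically: with restartPolicy: Never, each retry creates a fresh Pod, so a failed Job leaves up to backoffLimit + 1 Error Pods, and the CronJob keeps the Pods of up to failedJobsHistoryLimit failed Jobs. A quick sanity check (this simple product is an upper-bound approximation, not a formula from the Kubernetes docs):

```python
def max_error_pods(failed_jobs_history_limit: int, backoff_limit: int) -> int:
    """Upper bound on retained Error Pods, assuming restartPolicy: Never,
    so every retry of a failing Job produces a distinct failed Pod."""
    # One initial attempt plus backoff_limit retries per failed Job
    pods_per_failed_job = backoff_limit + 1
    return failed_jobs_history_limit * pods_per_failed_job

# The case from this article: failedJobsHistoryLimit=1, default backoffLimit=6
print(max_error_pods(1, 6))  # → 7, matching the 7 Error Pods observed above
```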

References

Running Automated Tasks with a CronJob
Jobs
Pod Lifecycle

Food for Thought

If CronJob.spec.failedJobsHistoryLimit is set to 2 and Job.spec.backoffLimit to 5, at most how many Pods in the Error state will be retained?
