[Solved] k8s CronJob.spec.failedJobsHistoryLimit not taking effect
Tags: kubernetes, CronJob, Pod
Background
As the YAML below shows, .spec.failedJobsHistoryLimit is clearly set to 1, yet seven Pods in the Error state were left behind:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: mycronjob
  namespace: prod
  labels:
    task: processor
spec:
  failedJobsHistoryLimit: 1
  successfulJobsHistoryLimit: 3
  ……
kubectl get pod -n prod -l task=processor
NAME READY STATUS RESTARTS AGE
mycronjob-16043364027mpp 0/1 Error 0 9h
mycronjob-16043364098q8q 0/1 Error 0 9h
mycronjob-160433640hc2ch 0/1 Error 0 9h
mycronjob-160433640nrdqb 0/1 Error 0 9h
mycronjob-160433640r49cq 0/1 Error 0 8h
mycronjob-160433640tnfvw 0/1 Error 0 9h
mycronjob-160433640vhdsc 0/1 Error 0 9h
So the question is: why does CronJob.spec.successfulJobsHistoryLimit take effect, while CronJob.spec.failedJobsHistoryLimit apparently does not?
Analysis
Before tackling the question, we first need to understand what a CronJob actually does. The official documentation says:
A CronJob creates Jobs on a repeating schedule.
One CronJob object is like one line of a crontab (cron table) file. It runs a job periodically on a given schedule, written in Cron format.
From this definition it is clear that a CronJob manages Jobs, and it is the Job that actually spawns Pods. So, to find out why CronJob.spec.failedJobsHistoryLimit appears ineffective, we need to inspect the configuration of the Jobs that the CronJob creates on schedule.
Run:
kubectl get job -n prod -l task=processor -o yaml
which returns:
apiVersion: v1
items:
- apiVersion: batch/v1
  kind: Job
  metadata:
    labels:
      task: processor
    name: processor-1604336400
    namespace: prod
    ownerReferences:
    - apiVersion: batch/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: CronJob
      name: processor
  spec:
    backoffLimit: 6
    completions: 1
    parallelism: 1
  status:
    conditions:
    - message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      type: Failed
Note the spec.backoffLimit field. The official explanation:
There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes. The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.
In other words, while a Job is running, if a Pod it created fails, the Job controller will by default recreate the Pod up to 6 more times, with an exponential back-off delay between attempts. If we don't want that many retries, we can change the .spec.backoffLimit setting.
By now the cause should be clear: the single failed Job retained by failedJobsHistoryLimit left behind its initial Pod plus 6 retries, which is exactly the 7 Error Pods we observed.
Summary
The CronJob creates Jobs and, per our configuration, keeps at most 1 failed and 3 successful Jobs in its history. But when a Job counts as failed is governed by Job.spec.backoffLimit. CronJob.spec.failedJobsHistoryLimit therefore only limits the number of retained Jobs, which you can verify with kubectl get job -n prod -l task=processor; to limit the number of failed Pods in the end, you have to control Job.spec.backoffLimit as well.
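For a CronJob, backoffLimit lives inside jobTemplate.spec. A hedged sketch of the fix, assuming the CronJob from the background section (the schedule and container spec are assumptions, since they were elided from the original manifest):

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: mycronjob
  namespace: prod
  labels:
    task: processor
spec:
  schedule: "0 * * * *"          # hypothetical schedule
  failedJobsHistoryLimit: 1
  successfulJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 0            # fail the Job after the first Pod failure, no retries
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: processor
            image: example/processor   # placeholder image
```

With backoffLimit: 0, each failed Job leaves exactly one Error Pod, and failedJobsHistoryLimit: 1 then caps the retained Error Pods at one.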
References
Running Automated Tasks with a CronJob
Jobs
Pod Lifecycle
Food for thought
If CronJob.spec.failedJobsHistoryLimit is set to 2 and Job.spec.backoffLimit to 5, at most how many Pods in the Error state will be kept?