[Solved] k8s CronJob.spec.failedJobs
Tags
kubernetes, CronJob, pod
Background
As the YAML below shows, .spec.failedJobsHistoryLimit is clearly set to 1, yet 7 Pods in the Error state were still left behind:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: mycronjob
  namespace: prod
  labels:
    task: processor
spec:
  failedJobsHistoryLimit: 1
  successfulJobsHistoryLimit: 3
  ……
kubectl get pod -n prod -l task=processor
NAME                       READY   STATUS   RESTARTS   AGE
mycronjob-16043364027mpp   0/1     Error    0          9h
mycronjob-16043364098q8q   0/1     Error    0          9h
mycronjob-160433640hc2ch   0/1     Error    0          9h
mycronjob-160433640nrdqb   0/1     Error    0          9h
mycronjob-160433640r49cq   0/1     Error    0          8h
mycronjob-160433640tnfvw   0/1     Error    0          9h
mycronjob-160433640vhdsc   0/1     Error    0          9h
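So here is the question: why does CronJob.spec.successfulJobsHistoryLimit take effect, while CronJob.spec.failedJobsHistoryLimit appears not to?
Analysis
Before tackling this question, we first need to be clear about what a CronJob actually does.
Official description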
A CronJob creates Jobs on a repeating schedule.
One CronJob object is like one line of a crontab (cron table) file. It runs a job periodically on a given schedule, written in Cron format.
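From the definition it is easy to see that a CronJob manages Jobs, and it is the Job that actually creates Pods. So to find out why CronJob.spec.failedJobsHistoryLimit seems to be ignored, we need to look at the configuration of the Jobs that the CronJob creates on schedule:
Run the command: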
kubectl get job -n prod -l task=processor -o yaml
Output:
apiVersion: v1
items:
- apiVersion: batch/v1
  kind: Job
  metadata:
    labels:
      task: processor
    name: processor-1604336400
    namespace: prod
    ownerReferences:
    - apiVersion: batch/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: CronJob
      name: processor
  spec:
    backoffLimit: 6
    completions: 1
    parallelism: 1
  status:
    conditions:
    - message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      type: Failed
Pay attention to the spec.backoffLimit setting. The official explanation is:
There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes. The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.
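In other words, if a Pod created by the Job fails, the Job by default retries up to 6 times, creating a new Pod each time. If we don't want that many retries, we can change .spec.backoffLimit. Assuming the Pod template uses restartPolicy: Never, one failed Job therefore leaves behind 1 initial Pod plus 6 retries, 7 failed Pods in total, which matches the 7 Error Pods listed above.
At this point the cause of the problem should be clear.
Summary
The CronJob creates Jobs and, following our configuration, limits the retained history of failed and successful Jobs to 1 and 3 respectively. But when a Job counts as failed is decided by Job.spec.backoffLimit. So CronJob.spec.failedJobsHistoryLimit only limits the number of Jobs, which can be checked with the command kubectl get job -n prod -l task=processor. To limit the final number of failed Pods, you have to control Job.spec.backoffLimit.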
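For a CronJob, the Job-level setting lives under .spec.jobTemplate.spec.backoffLimit. The manifest below is a minimal sketch of where it goes, not the original configuration: the schedule, the container name and image, and the backoffLimit value of 2 are all assumptions made for illustration.

```yaml
# Sketch only: schedule, image and backoffLimit value are assumed, not from the original manifest.
apiVersion: batch/v1beta1      # as in the original; clusters on v1.21+ also serve batch/v1
kind: CronJob
metadata:
  name: mycronjob
  namespace: prod
  labels:
    task: processor
spec:
  schedule: "0 * * * *"        # assumed: run once an hour
  failedJobsHistoryLimit: 1    # keep at most 1 failed Job
  successfulJobsHistoryLimit: 3
  jobTemplate:
    metadata:
      labels:
        task: processor
    spec:
      backoffLimit: 2          # assumed value: at most 2 retries per Job
      template:
        metadata:
          labels:
            task: processor
        spec:
          restartPolicy: Never # each retry creates a new Pod instead of restarting the container
          containers:
          - name: processor                      # hypothetical container name
            image: example.com/processor:latest  # hypothetical image
```

With these assumed values, a failed Job would leave at most 1 + 2 = 3 failed Pods, and because only 1 failed Job is retained, roughly 3 Error Pods should remain once older failed Jobs are cleaned up.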
References
Running Automated Tasks with a CronJob
Jobs
Pod Lifecycle
Exercise
If CronJob.spec.failedJobsHistoryLimit is set to 2 and Job.spec.backoffLimit is set to 5, what is the maximum number of Pods in the Error state that will be retained?