KubeJobFailed #
Meaning #
[This alert][KubeJobFailed] is triggered for the case that the number of job
execution attempts exceeds the backoffLimit
. A job can therefore create one or
many pods for its tasks.
Impact #
A task has not finished correctly. Depending on the task, this has a different impact.
Diagnosis #
The alert should contain the job name and the namespace where that job failed. Follow the particular mitigation steps according to that. For example:
- alertname = KubeJobFailed
...
- job_name = elasticsearch-delete-app-1600903800
- namespace = openshift-logging
...
- message = Job openshift-logging/elasticsearch-delete-app-1600855200 failed to complete.
Mitigation #
Find the pods that belong to that job:
$ kubectl get pod -n $NAMESPACE -l job-name=$JOBNAME
If you see an Error
pod with a subsequent Completed
pod of
the same base name, the error was transient, and the Error
pod can safely be deleted.
Have a look at the jobs themselves:
$ kubectl get jobs -n $NAMESPACE
If there is a healthy job for every failed one, it is safe to delete the failed jobs. The alert should resolve itself after a few minutes.