Add --pod-pending-timeout config option #832
Merged
Pending pod timeouts allow the controller to fail a Kubernetes job that has been pending for the configured timeout duration, for example due to mismatched node selectors or resource constraints. Uses pod.CreatedTime to determine when a pending pod should be removed and the associated Buildkite job failed by the controller. Adds a new config option, --pod-pending-timeout, with a default of 5 minutes.
to cover pod-pending-timeout config option and update default to 15 minutes from 5.
DrJosh9000
reviewed
Mar 12, 2026
Contributor
DrJosh9000
left a comment
This is very good! I have some stylistic comments.
Co-authored-by: Josh Deprez <jd@buildkite.com>
Contributor
Author
Thank you @DrJosh9000! Applied the suggestions and learned some things 💪 Also removed a test case that is no longer needed due to the change removing the explicit check for |
In this particular case, we were testing for IsZero() returning `true`. However, with the change to evaluate time.Since directly, that check is no longer needed: time.Since on a zero time always returns the max duration value, which is always greater than or equal to the timeout. In practice, a pod also shouldn't ever have a zero timestamp:

1. Zero timestamps never occur in real-life scenarios (the Kubernetes API server always sets CreationTimestamp)
2. The scenario only exists in manually constructed test objects
3. If it did somehow happen, treating it as timed out (current behavior) is reasonable
Currently, the controller has a handful of different levers to control when a pod will get terminated, specifically when a pod has been stuck in `Pending` for a long time:

- `EmptyJobGracePeriod` is set for a job that has started but never created a pod
- `ImagePullBackOffGracePeriod` is for when the pods are created but stuck in the `ImagePullBackOff` state
- `JobActiveDeadlineSeconds` would handle pretty much every other case, like those pods that are generally pending due to insufficient cluster resources, mismatched node selectors, and so on.

This means that the closest thing to handling pods stuck in the `Pending` state is `JobActiveDeadlineSeconds`, which accounts for every stage of the job lifecycle on the Buildkite side (including pods that successfully start and are running a job). It also has a default of 6 hours - ample time for a running job, but if you've got pods stuck and you want them to fail faster, you'll be stuck with pending pods for quite a while.

This PR aims to introduce a `PodPendingTimeout` configuration option, which lets the controller fail a job whose pod has been `Pending`, without starting the job, for the configured amount of time (default 15 minutes).

Potential Caveat: This does introduce a new goroutine to the `podWatcher` in order to handle watching the pending state. I don't know if that's an ideal way to do this, so I'm very open to suggestions on better ways to do it.

Includes tests, including integration tests to validate the timeouts.
I've also tested this locally by configuring a `pod-pending-timeout` of 30s, and can confirm successful job failure due to the pending timeout.

❓some open questions