Add --pod-pending-timeout config option #832

Merged: DrJosh9000 merged 8 commits into main from jb/add-pod-pending-timeout on Mar 17, 2026

@jeremybumsted
Contributor

Currently, the controller has a handful of different levers to control when a pod will get terminated, specifically when a pod has been stuck in Pending for a long time:

  • EmptyJobGracePeriod is set for a job that has started but never created a pod
  • ImagePullBackOffGracePeriod is for when the pods are created but stuck in the ImagePullBackOff state
  • JobActiveDeadlineSeconds would handle pretty much every other case, like those pods that are generally pending due to insufficient cluster resources, mismatched node selectors, and so on.

This means that the closest thing to handling pods stuck in the Pending state is JobActiveDeadlineSeconds, which accounts for every stage of the job lifecycle on the Buildkite side (including pods that successfully start and are running a job). It also has a default of 6 hours - ample time for a running job, but if you've got pods stuck and you want them to fail faster, you'll be stuck with pending pods for quite a while.

This PR aims to introduce a PodPendingTimeout configuration option, which allows the controller to be configured with a timeout that applies when a pod has been Pending without starting the job for the configured amount of time (default: 15 minutes).

Potential Caveat: This does introduce a new goroutine to the podWatcher in order to handle watching the pending state. I don't know if that's the ideal approach, so I'm very open to suggestions for better ways to do this.

Includes tests, including integration tests to validate the timeouts.

I've also tested this locally by configuring a pod-pending-timeout to 30s:

config:
    debug: true
    pod-pending-timeout: 30s
    queue: kubernetes

and can confirm successful job failure due to the pending timeout 👇
[Screenshot: CleanShot 2026-03-06 at 22 04 38]

❓some open questions

  • I landed on a default timeout period of 15 minutes; that feels like a reasonable amount of time for a pod to start, but perhaps it would be better to be an optional config (no config value set = no pod pending timeout)?
  • Does this feel like I'm introducing a sharp edge into the controller? (maybe removing a default timeout would be better?)

  Pending pod timeouts allow the controller to fail a Kubernetes
  job that has been pending for the configured timeout duration, for
  reasons such as mismatched node selectors, resource constraints, etc.

  Uses pod.CreatedTime to determine when a pending pod should be removed
  and the associated buildkite job failed by the controller.

  Adds a new config option: --pod-pending-timeout, with a default of 5
  minutes.
to cover pod-pending-timeout config option and
update default to 15 minutes from 5.
@jeremybumsted jeremybumsted requested a review from a team as a code owner March 7, 2026 05:12
@jeremybumsted jeremybumsted requested review from a team and removed request for a team March 7, 2026 05:13
Contributor

@DrJosh9000 DrJosh9000 left a comment

This is very good! I have some stylistic comments.

Comment thread internal/controller/scheduler/pod_watcher_test.go Outdated
Comment thread internal/controller/scheduler/pod_watcher.go Outdated
Comment thread internal/controller/scheduler/pod_watcher.go Outdated
@jeremybumsted
Contributor Author

jeremybumsted commented Mar 17, 2026

Thank you @DrJosh9000! Applied the suggestions and learned some things 💪

Also removed a test case that is no longer needed due to the change removing the explicit check for createdAt.IsZero().

In this particular case, we were testing for IsZero() to
return `true`. However, with the change to evaluate time.Since, that
check is redundant: time.Since of a zero time always returns the max
duration value, which is always greater than or equal to the timeout.

In practice, a pod also shouldn't ever have a zero timestamp:
  1. Zero timestamps never occur in real-life scenarios (Kubernetes API server always sets CreationTimestamp)
  2. The scenario only exists in manually constructed test objects
  3. If it did somehow happen, treating it as timed out (current behavior) is reasonable
Contributor

@DrJosh9000 DrJosh9000 left a comment

Good work!

@DrJosh9000 DrJosh9000 merged commit fd4ebba into main Mar 17, 2026
1 check passed
@DrJosh9000 DrJosh9000 deleted the jb/add-pod-pending-timeout branch March 17, 2026 23:26