Add --pod-pending-timeout config option by jeremybumsted · Pull Request #832 · buildkite/agent-stack-k8s

jeremybumsted · 2026-03-07T05:12:00Z

Currently, the controller has a handful of different levers to control when a pod will get terminated, specifically when a pod has been stuck in Pending for a long time:

EmptyJobGracePeriod is set for a job that has started but never created a pod
ImagePullBackOffGracePeriod is for when the pods are created but stuck in the ImagePullBackOff state
JobActiveDeadlineSeconds would handle pretty much every other case, like those pods that are generally pending due to insufficient cluster resources, mismatched node selectors, and so on.

This means that the closest thing to handling pods stuck in the Pending state is JobActiveDeadlineSeconds, which accounts for every stage of the job lifecycle on the Buildkite side (including pods that successfully start and are running a job). It also has a default of 6 hours - ample time for a running job, but if you've got pods stuck and you want them to fail faster, you'll be stuck with pending pods for quite a while.

This PR aims to introduce a PodPendingTimeout configuration option, which lets the controller be configured with a timeout for pods that have been Pending, without starting the job, for the configured amount of time (default 15 minutes).

Potential Caveat: This does introduce a new goroutine to the podWatcher in order to handle watching the pending state. I don't know if that's an ideal way to do this, so I'm very open to suggestions on better ways to do it.

Includes tests, including integration tests to validate the timeouts.

I've also tested this locally by configuring a pod-pending-timeout to 30s:

config:
    debug: true
    pod-pending-timeout: 30s
    queue: kubernetes

and can confirm successful job failure due to the pending timeout 👇

❓some open questions

I landed on a default timeout period of 15 minutes; that feels like a reasonable amount of time for a pod to start, but perhaps it would be better to be an optional config (no config value set = no pod pending timeout)?
Does this feel like I'm introducing a sharp edge into the controller? (maybe removing a default timeout would be better?)

Pending pod timeouts allow the controller to fail a Kubernetes job that has been pending for the configured timeout duration, for reasons such as mismatched node selectors, resource constraints, etc. Uses pod.CreatedTime to determine when a pending pod should be removed and the associated Buildkite job failed by the controller. Adds a new config option: --pod-pending-timeout, with a default of 5 minutes.

Add package and integration tests to cover the pod-pending-timeout config option, and update the default to 15 minutes from 5.

DrJosh9000

This is very good! I have some stylistic comments.

jeremybumsted · 2026-03-17T16:09:42Z

Thank you @DrJosh9000! Applied the suggestions and learned some things 💪

Also removed a test case that is no longer needed due to the change removing the explicit check for createdAt.IsZero().

In this particular case, we were testing for IsZero() to return `true`. However, with the change to evaluate time.Since directly, that check is redundant: time.Since on a zero time always returns the max duration value, which is always greater than or equal to the timeout. In practice, a pod also shouldn't ever have a zero timestamp:

1. Zero timestamps never occur in real-life scenarios (the Kubernetes API server always sets CreationTimestamp)
2. The scenario only exists in manually constructed test objects
3. If it did somehow happen, treating it as timed out (current behavior) is reasonable

DrJosh9000

Good work!

jeremybumsted added 2 commits March 6, 2026 21:22

add package and integration tests

ff39bdc

to cover pod-pending-timeout config option and update default to 15 minutes from 5.

jeremybumsted requested a review from a team as a code owner March 7, 2026 05:12

jeremybumsted requested review from a team and removed request for a team March 7, 2026 05:13

DrJosh9000 reviewed Mar 12, 2026

Comment thread internal/controller/scheduler/pod_watcher_test.go Outdated

Comment thread internal/controller/scheduler/pod_watcher.go Outdated

Comment thread internal/controller/scheduler/pod_watcher.go Outdated

jeremybumsted and others added 5 commits March 17, 2026 09:15

Update internal/controller/scheduler/pod_watcher.go

aa4b70c

Co-authored-by: Josh Deprez <jd@buildkite.com>

Update internal/controller/scheduler/pod_watcher.go

e33fc48

Co-authored-by: Josh Deprez <jd@buildkite.com>

Update internal/controller/scheduler/pod_watcher_test.go

aa994cb

Co-authored-by: Josh Deprez <jd@buildkite.com>

Merge branch 'main' into jb/add-pod-pending-timeout

a10faed

chore(tests): remove unused testify package

306b2b3

jeremybumsted requested a review from DrJosh9000 March 17, 2026 17:01

DrJosh9000 approved these changes Mar 17, 2026

DrJosh9000 merged commit fd4ebba into main Mar 17, 2026
1 check passed

DrJosh9000 deleted the jb/add-pod-pending-timeout branch March 17, 2026 23:26

DrJosh9000 merged 8 commits into main from jb/add-pod-pending-timeout
