A-1202 part 2: handle pod failed before agent start #883
Merged
DrJosh9000 (Contributor) approved these changes on May 7, 2026 and left a comment:
Logic seems good to me. I have one suggestion for the test.
    roko.WithMaxAttempts(10),
    roko.WithStrategy(roko.Exponential(2*time.Second, 0)),
    ).DoWithContext(ctx, func(r *roko.Retrier) error {
    exist, err := hasJobPod(tc, ctx, opts)
Contributor
`hasJobPod` calls the same `List` method that is used below to list the pods, so perhaps this should call `List` directly and save the returned list of pods for deletion?
0dbeec3 to
6c6d982
Problem
When a Kubernetes pod fails before the `buildkite-agent` container has a chance to acquire the Buildkite job (e.g. OOMKill, eviction, node failure, external `kubectl delete pod`), the BK job stays in `reserved` state until the reservation TTL (~15 min) expires. Re-dispatch is delayed by the full TTL.

Fixes #873 (ask #3).

Alternatively, a theoretically more correct solution would be to move the `agent` container to `initContainer` as a sidecar too, but this would affect how the exit signal is handled, so it needs a bit more thought.

Even if we do end up making the `agent` container an `initContainer`, this change does no harm.

Change
Extend `jobWatcher`'s finished-job handling: when the K8s Job ends with `Status.Failed > 0`, query BK for the job state. If it is still `reserved`, call `FailJob` to release the reservation immediately.

This only acts on the `reserved` state; `scheduled` (no reservation made) and any post-acquire state are left to BK.