A-1202 part 2: handle pod failed before agent start by zhming0 · Pull Request #883 · buildkite/agent-stack-k8s

zhming0 · 2026-05-07T04:20:33Z

Problem

When a Kubernetes pod fails before the buildkite-agent container has a chance to acquire the Buildkite job (e.g. OOMKill, eviction, node failure, external kubectl delete pod), the BK job stays in the reserved state until the reservation TTL (~15 min) expires, so re-dispatch is delayed by the full TTL.

Fixes #873 (ask #3).

Alternatively, a theoretically more correct solution would be to move the agent container into initContainers as a sidecar too, but this would affect how the exit signal is handled, so it needs a bit more thought.

Even if we do eventually make the agent an initContainer, this change does no harm.

Change

Extend jobWatcher's finished-job handling: when the K8s Job ends with Status.Failed > 0, query BK for the job state. If the job is still reserved, call FailJob to release the reservation immediately.

This only acts on the reserved state; scheduled (no reservation made) and any post-acquire state are left to BK.

DrJosh9000

Logic seems good to me; I have one suggestion for the test.

DrJosh9000 · 2026-05-07T05:22:48Z

+		roko.WithMaxAttempts(10),
+		roko.WithStrategy(roko.Exponential(2*time.Second, 0)),
+	).DoWithContext(ctx, func(r *roko.Retrier) error {
+		exist, err := hasJobPod(tc, ctx, opts)


hasJobPod calls the same List method as used below to list the pods, so perhaps this should call List and save the returned list of pods for deletion?

zhming0 requested a review from a team May 7, 2026 04:20

zhming0 requested a review from a team as a code owner May 7, 2026 04:20

DrJosh9000 approved these changes May 7, 2026

A-1202 part 2: handle pod failed before agent start

6c6d982

zhming0 force-pushed the ming/a-1202-part-2 branch from 0dbeec3 to 6c6d982 Compare May 7, 2026 06:50

zhming0 enabled auto-merge May 7, 2026 06:54

zhming0 merged commit bbb5e85 into main May 7, 2026
3 checks passed

zhming0 deleted the ming/a-1202-part-2 branch May 7, 2026 06:55

zhming0 mentioned this pull request May 12, 2026

Sidecar containers still misclassified as init-container failures despite v0.42.0 fix #873

Closed

zhming0 merged 1 commit into main from ming/a-1202-part-2


2 participants
