A-1202 part 2: handle pod failed before agent start #883

Merged: zhming0 merged 1 commit into main from ming/a-1202-part-2 on May 7, 2026

@zhming0 zhming0 commented May 7, 2026

Problem

When a Kubernetes pod fails before the buildkite-agent container has a chance to acquire the Buildkite job (e.g. OOMKill, eviction, node failure, or an external kubectl delete pod), the BK job stays in the reserved state until the reservation TTL (~15 min) expires, so re-dispatch is delayed by the full TTL.

Fixes #873 (ask #3).

Alternatively, a theoretically more correct solution would be to also move the agent container into initContainers as a sidecar, but that would affect how the exit signal is handled, so it needs a bit more thought.

Even if we do later run the agent as an init container, this change does no harm.

Change

Extend jobWatcher's finished-job handling: when the K8s Job ends with Status.Failed > 0, query BK for the job state. If the job is still reserved, call FailJob to release the reservation immediately.

This only acts on the reserved state; scheduled (no reservation made) and any post-acquire state are left to BK.

@zhming0 zhming0 requested a review from a team May 7, 2026 04:20
@zhming0 zhming0 requested a review from a team as a code owner May 7, 2026 04:20

@DrJosh9000 DrJosh9000 left a comment

Logic seems good to me. I have one suggestion for the test.

	roko.WithMaxAttempts(10),
	roko.WithStrategy(roko.Exponential(2*time.Second, 0)),
).DoWithContext(ctx, func(r *roko.Retrier) error {
	exist, err := hasJobPod(tc, ctx, opts)

hasJobPod calls the same List method as used below to list the pods, so perhaps this should call List and save the returned list of pods for deletion?

@zhming0 zhming0 force-pushed the ming/a-1202-part-2 branch from 0dbeec3 to 6c6d982 Compare May 7, 2026 06:50
@zhming0 zhming0 enabled auto-merge May 7, 2026 06:54
@zhming0 zhming0 merged commit bbb5e85 into main May 7, 2026
3 checks passed
@zhming0 zhming0 deleted the ming/a-1202-part-2 branch May 7, 2026 06:55

Development

Successfully merging this pull request may close these issues.

Sidecar containers still misclassified as init-container failures despite v0.42.0 fix