Skip to content

PB-862: use Kubernetes native sidecar mechanism#759

Merged
zhming0 merged 1 commit into
mainfrom
ming/pb-862
Nov 4, 2025
Merged

PB-862: use Kubernetes native sidecar mechanism#759
zhming0 merged 1 commit into
mainfrom
ming/pb-862

Conversation

@zhming0
Copy link
Copy Markdown
Contributor

@zhming0 zhming0 commented Oct 27, 2025

Historically, k8s didn't have a main container vs sidecar container concept. As a result we maintain a in-house built sidecar pattern for agent-k8s stack, it means we have to implement many code to make a container sidecar: it includes relying on activeDeadlineSeconds to kill the pod after "main" container is done etc.

It also isn't ideal because it can leave a k8s job in a status of type: Failed with reason: DeadlineExceeded despite it successfully handled the bk job.

This PR changes k8s stack to use Kuberenetes' native sidecar pattern.

@zhming0 zhming0 marked this pull request as ready for review October 27, 2025 05:38
@zhming0 zhming0 requested a review from a team as a code owner October 27, 2025 05:38
@zhming0 zhming0 requested a review from a team October 27, 2025 05:38
Comment on lines +26 to +29
# So Kubernetes can mark this as Started
# In reality a sidecar should not be used like this.
# As this is effectively an ad-hoc initContainer.
sleep 120
Copy link
Copy Markdown
Contributor Author

@zhming0 zhming0 Oct 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a potential breaking change. In the new sidecar implementation, the sidecar container is expected to be long running and never finish.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, true. We'll have to call this out carefully in the release notes.

@zhming0 zhming0 requested a review from DrJosh9000 October 28, 2025 03:56
Comment thread internal/controller/scheduler/completions.go Outdated
Comment thread internal/controller/scheduler/completions.go Outdated
"github.com/buildkite/agent/v3/agent"
"github.com/buildkite/roko"

"log/slog"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually belongs in the first group (standard library packages).

// (our in-house built) sidecar containers.
// Now it's no longer necessary.
//
// I am leaning to keep it because:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote in favour of deleting the whole completion watcher.

If the containers are misbehaving so badly that they are blocking the pod finishing, that sounds like a serious bug we should fix. Leaving the pod running would enable more ways to diagnose.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objection from me. It's gone now 🫡

Comment on lines +26 to +29
# So Kubernetes can mark this as Started
# In reality a sidecar should not be used like this.
# As this is effectively an ad-hoc initContainer.
sleep 120
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, true. We'll have to call this out carefully in the release notes.

@petetomasik
Copy link
Copy Markdown
Contributor

This change also adds a new minimum cluster version requirement of 1.33.0. We will need to ensure this is distinctly called out in our docs, as a controller upgrade in an older K8s cluster would break the customer's deployment. Perhaps we should add a minimum cluster version startup "check" to the controller and exit early?

@scadu
Copy link
Copy Markdown
Contributor

scadu commented Oct 28, 2025

This change also adds a new minimum cluster version requirement of 1.33.0. We will need to ensure this is distinctly called out in our docs, as a controller upgrade in an older K8s cluster would break the customer's deployment. Perhaps we should add a minimum cluster version startup "check" to the controller and exit early?

Aside from that, it'd be nice to have a fallback for a while, even though I'd also like to rip old code. I imagine a lot of broken controllers right now.
If we want to proceed anyway, having a version requirement in the chart to halt deployment early would be welcome.

@DrJosh9000
Copy link
Copy Markdown
Contributor

This change also adds a new minimum cluster version requirement of 1.33.0.

the feature is active by default since Kubernetes v1.29

We will need to ensure this is distinctly called out in our docs, as a controller upgrade in an older K8s cluster would break the customer's deployment. Perhaps we should add a minimum cluster version startup "check" to the controller and exit early?

Yes and yes.

@zhming0
Copy link
Copy Markdown
Contributor Author

zhming0 commented Oct 28, 2025

I am hoping to do a scan to find out version distribution in our existing users. But we don't seem to collect this information.

Nor do we have a way to engage with the crowd en mass.

So we will have to make a small bet. If it helps, 1.31 EOL is yesterday, 1.29 I presume EOL will be half year ago.

1.31
Latest Release:1.31.13 (released: 2025-09-09)
End of Life:2025-10-28

but yeah, 100% on a kubeVersion field.

@zhming0
Copy link
Copy Markdown
Contributor Author

zhming0 commented Oct 29, 2025

ready for another review @DrJosh9000 🙏🏿

Copy link
Copy Markdown
Contributor

@DrJosh9000 DrJosh9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish there were a way to test for individual feature gates in Chart.yaml rather than k8s versions

Anyway, LGTM with some minor comments

Comment thread internal/controller/scheduler/scheduler.go Outdated
Comment thread charts/agent-stack-k8s/Chart.yaml
# So Kubernetes can mark this as Started
# In reality a sidecar should not be used like this.
# As this is effectively an ad-hoc initContainer.
sleep 120
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick play with alpine:latest shows that it supports infinity - maybe we should do this?

Suggested change
sleep 120
sleep infinity

@zhming0 zhming0 enabled auto-merge November 4, 2025 23:26
@zhming0 zhming0 merged commit d59e1fe into main Nov 4, 2025
1 check passed
@zhming0 zhming0 deleted the ming/pb-862 branch November 4, 2025 23:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants