Configurable work queue limit by DrJosh9000 · Pull Request #584 · buildkite/agent-stack-k8s

DrJosh9000 · 2025-05-06T02:54:34Z

What

Apply deduplication within the work queue, and enforce a size limit.

Why

The query reset interval combined with the delay between receiving jobs and the agents starting up means that the monitor query will receive duplicate jobs. This isn't a problem downstream of the limiter (the deduper takes care of it) but currently the duplicates will occupy memory in the limiter's work queue. They can be deduped here too.

If the rate of incoming jobs is high, and the rate of being able to create jobs is constrained (e.g. with max in flight), then the work queue could exhaust the controller's memory limit. Dropping jobs out of the queue to fit within the limit is fine, since they will be picked up again when the query cursor is reset.

swaller-bk

Looks good and I think will acheive the goal.

Below here is theory-crafting / navel-gazing that does not block/suggest changes to this PR!

I wonder if a future model has the controller directly moving jobs from scheduled to active on the bk api and creating a custom resource on the cluster to represent the job (up to a configurable concurrency limit), which is then is reconciled by a controller that imports almost all the logic from the agent.

This would lower the amount of "in-flight but not in-flight jobs" that can be dupes, as well as outsources the state storage, deduping etc. to etcd/kube.

Obvious downsides are:

the controller and agent have some overlapping responsibilities
big changes to the controller as it stands

swaller-bk · 2025-05-06T03:49:48Z

One question that came to mind after I hit approve, how did you arrive at 1000000 jobs as the default limit?

DrJosh9000 · 2025-05-06T04:17:00Z

I wonder if a future model has the controller directly moving jobs from scheduled to active on the bk api and creating a custom resource on the cluster to represent the job (up to a configurable concurrency limit), which is then is reconciled by a controller that imports almost all the logic from the agent.

This would lower the amount of "in-flight but not in-flight jobs" that can be dupes, as well as outsources the state storage, deduping etc. to etcd/kube.

Obvious downsides are:

the controller and agent have some overlapping responsibilities

big changes to the controller as it stands

The custom resource idea is interesting! I had considered the "directly moving jobs from scheduled to active on the bk api" be near-term work for precisely the reason you call out about reducing in-flight-but-not jobs, but hadn't settled on any approach for preventing those intermediate-state jobs from becoming lost.

If the controller is acting like an agent in that sense, it would have to make periodic heartbeat calls to prevent BK considering it "lost", so maybe it's not needed? Still thinking about it.

One question that came to mind after I hit approve, how did you arrive at 1000000 jobs as the default limit?

A guesstimate, really. On the memory side, "around 100 bytes per job" * 1e6 jobs = 100 MB, which seems like a reasonable amount of memory to allow. On the query side, 10 seconds * 1000 jobs per page * 2 pages per second (defaults) is only 20K jobs. Increasing the pages per query to 20 and query reset interval to say 60 seconds brings it to 1.2M jobs.

DrJosh9000 requested a review from a team as a code owner May 6, 2025 02:54

Configurable work queue limit

d36d428

DrJosh9000 force-pushed the prevent-work-queue-memory-exhaustion branch from 5397f0f to d36d428 Compare May 6, 2025 02:55

swaller-bk approved these changes May 6, 2025

View reviewed changes

DrJosh9000 merged commit c60616d into main May 6, 2025
1 check passed

DrJosh9000 deleted the prevent-work-queue-memory-exhaustion branch May 6, 2025 04:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Configurable work queue limit#584

Configurable work queue limit#584
DrJosh9000 merged 1 commit into
mainfrom
prevent-work-queue-memory-exhaustion

DrJosh9000 commented May 6, 2025

Uh oh!

swaller-bk left a comment •

edited

Loading

Uh oh!

swaller-bk commented May 6, 2025

Uh oh!

DrJosh9000 commented May 6, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

DrJosh9000 commented May 6, 2025

What

Why

Uh oh!

swaller-bk left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

swaller-bk commented May 6, 2025

Uh oh!

DrJosh9000 commented May 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

swaller-bk left a comment •

edited

Loading

DrJosh9000 commented May 6, 2025 •

edited

Loading