
[Flow Control] Garbage Collection for Priority Bands #2097

Merged
k8s-ci-robot merged 9 commits into kubernetes-sigs:main from evacchi:issue-2012
Jan 22, 2026

Conversation

@evacchi
Contributor

@evacchi evacchi commented Jan 8, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #2012.

Does this PR introduce a user-facing change?:

NONE

The PR sticks as closely as possible to the existing pattern for flowState GC.

  • Config: we introduce PriorityBandGCTimeout with a defaultPriorityBandGCTimeout. This defaults to 2 * defaultFlowGCTimeout to give a "grace period" during which bands are retained even after their owning flows have been collected, so that a new flow re-created shortly afterwards at the same priority does not have to reinstantiate the band from scratch

  • FlowRegistry: we introduce a priorityBandStates sync.Map to be used similarly to flowStates sync.Map

    • keyed with int priorities
    • valued with priorityBandState structs; this struct includes a becameIdleAt time.Time field
  • pinActivePriorityBand(priority): an almost verbatim copy of pinActiveFlow(key types.FlowKey)

  • pinActiveFlow() now also returns a boolean; when true, we call pinActivePriorityBand(priority)

  • executeGCCycle() now also invokes gcPriorityBands(), which updates the idle timestamps for priority bands, marks candidates, and deletes them

  • isBandActive(priority) checks len(band.queues) and uses band.len and band.byteSize to account for in-flight and buffered requests. There is no equivalent of leaseCount. Does this make sense?

  • gcPriorityBands() follows verifyAndSweepFlows()
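To make the design above concrete, here is a minimal, self-contained sketch of the band state and the isBandActive check. Field names (becameIdleAt, band.len, band.byteSize) follow the PR description, but the struct layouts and the map type for queues are simplifications, not the real registry types:

```go
package main

import (
	"fmt"
	"time"
)

// priorityBandState mirrors flowState: a lease count plus an idle timestamp
// that the GC cycle inspects. Layout simplified for illustration.
type priorityBandState struct {
	leaseCount   int       // flows currently holding this band
	becameIdleAt time.Time // zero while the band is active
}

// band is a simplified stand-in for the per-shard priority band.
type band struct {
	queues   map[string]struct{} // managed queues for this priority
	len      uint64              // buffered request count
	byteSize uint64              // buffered request bytes
}

// isBandActive treats a band as live while any queue exists or any request
// is buffered. There is no per-request lease at this level; in-flight work
// is accounted for via len/byteSize instead.
func isBandActive(b *band) bool {
	return len(b.queues) > 0 || b.len > 0 || b.byteSize > 0
}

func main() {
	idle := &band{queues: map[string]struct{}{}}
	busy := &band{queues: map[string]struct{}{}, len: 3, byteSize: 1024}
	fmt.Println(isBandActive(idle), isBandActive(busy))
}
```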

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Jan 8, 2026
@netlify

netlify Bot commented Jan 8, 2026

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 8650547
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/697277aa15af2f00084f94dc
😎 Deploy Preview https://deploy-preview-2097--gateway-api-inference-extension.netlify.app

@k8s-ci-robot
Contributor

Welcome @evacchi!

It looks like this is your first PR to kubernetes-sigs/gateway-api-inference-extension 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/gateway-api-inference-extension has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 8, 2026
@k8s-ci-robot
Contributor

Hi @evacchi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 8, 2026
@LukeAVanDrie
Contributor

LukeAVanDrie commented Jan 9, 2026

@evacchi, this is an excellent draft! I agree with you that the flowState pattern (leases + surgical locking) is the right model for managing lifecycle here. However, seeing this implementation laid out has helped me realize a critical circular dependency between the JIT Provisioning path (prepareNewFlow) and this Garbage Collection path (verifyAndSweepPriorityBands).

Problem: "Zombie" Band

Currently, we have two competing authorities for the existence of a band:

  1. JIT Path: Checks Config (Read Lock) → Assumes Band exists → Dispatches to Shard.
  2. GC Path: Checks Idleness (Write Lock) → Removes from Config & Shard.

Because the JIT path does not (and currently cannot) acquire the priorityBandState.gcLock, we have a race condition:

  1. GC verifies a band is empty and pauses.
  2. JIT sees the band exists in Config, validates it, and passes the "check".
  3. GC wakes up and deletes the band from the Shards.
  4. JIT proceeds to use the now-deleted band on the Shard, causing a panic or ErrBandNotFound.

Why We Can't Just Add a Lock

If we try to fix this by making the JIT path acquire band.gcLock.RLock() to prevent deletion, we create a deadlock:

  • JIT needs: Registry.mu (to find the band) → band.gcLock (to keep it alive).
  • GC needs: band.gcLock (to verify empty) → Registry.mu (to delete it from config).
  • Result: circular wait

Proposal: Split Admission from Cleanup

To solve this safely, I believe we need to decouple the admission from memory management (GC). This likely makes Issue #2011 (controller-driven lifecycle) a prerequisite, but with a twist:

  1. Soft Delete (Controller / [Flow Control] Refactor: Drive Priority Band lifecycle from InferenceObjective controller. #2011): The Controller removes the Band from the Registry.Config. New requests are rejected (503) or defaulted. This stops any new admissions.
  2. Hard Delete (Your PR): The GC scans for bands that are (a) Missing from Config AND (b) Empty / Idle. It then cleans up the memory.

This breaks the lock hierarchy cycle because the JIT path never fights the GC. The JIT only cares if it's in the Config. The GC only touches it if it's not in the Config.

Next Steps?

I think we have two options:

  1. Pivot to [Flow Control] Refactor: Drive Priority Band lifecycle from InferenceObjective controller. #2011: We implement the Controller-driven logic first, making the "Soft Delete" the authority. Then this PR becomes the future cleanup mechanism.
  2. Adapt this PR: We can try to implement a leaseCount on PriorityBandState (similar to Flows) that the JIT path increments. This effectively "holds" the band in memory even if the Config is gone.

What are your thoughts? I am happy to hop on a call to whiteboard the locking hierarchy if it helps!


This is resolved with #2127.

Comment thread pkg/epp/flowcontrol/registry/registry.go Outdated
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 14, 2026
@evacchi
Contributor Author

evacchi commented Jan 15, 2026

reworking on top of #2127

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 15, 2026
@evacchi
Contributor Author

evacchi commented Jan 16, 2026

Top post updated; reworked this on top of #2127 to follow the same pattern. @LukeAVanDrie let me know what you think 🙏

@evacchi evacchi changed the title [WIP] [Flow Control] Garbage Collection for Priority Bands [Flow Control] Garbage Collection for Priority Bands Jan 16, 2026
@evacchi evacchi marked this pull request as ready for review January 16, 2026 20:44
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 16, 2026
@evacchi evacchi requested a review from LukeAVanDrie January 20, 2026 12:34
@LukeAVanDrie
Contributor

will need to take a look at #2031

Sorry for the conflict here. It actually removes a source of fallibility from the registration path, so it hopefully was not too difficult to address. I am taking a look over this PR this morning. Thanks again!

@evacchi
Contributor Author

evacchi commented Jan 21, 2026

no it was flawless, just a plain rebase, thx for asking 🙏

Contributor

@LukeAVanDrie LukeAVanDrie left a comment


Thanks for picking this up! This PR looks really good, especially the added test coverage. From a correctness point of view, I think this is ready to merge. I have high confidence that this solves the issue without risking any race conditions/leaks.

I added a few comments on ideas for future simplification / deduplication, but I don't think we should tackle them in this PR.

// TODO:(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/1982) revert to 5m once this GC
// race condition is properly resolved.
-	defaultFlowGCTimeout time.Duration = 1 * time.Hour
+	defaultFlowGCTimeout time.Duration = 5 * time.Minute
Contributor


Good catch -- this fix was just merged today, so 5 * time.Minute is safe again.


// Release the band lease if we created the flow.
// If JIT provisioning fails for a new flow, we must release that lease to prevent leaking band leases.
if isNewFlow {
Contributor


nit: This is correct and necessary; however, this highlights a slight leak in our abstraction layers. We are manually unwinding distinct state layers (flowStates delete, then priorityBand release). We may want to consider some resourceLease abstraction that bundles these.

e.g., lease := fr.acquireLease(key) where lease.Release() handles both flow and band decrements.

That being said, I wouldn't make this change in this PR. The current explicitness is safe given the complexity.
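For illustration only, the hypothetical resourceLease bundling could look like the sketch below. Nothing here is existing API — resourceLease and its fields are invented names for the idea of one Release() unwinding both layers, made idempotent so error-unwind paths cannot double-release:

```go
package main

import "fmt"

// resourceLease bundles the two unwind steps (flow-state cleanup, then band
// lease release) behind one Release(). Names are illustrative.
type resourceLease struct {
	releaseFlow func() // e.g. delete from flowStates
	releaseBand func() // e.g. decrement the band lease count
	released    bool
}

// Release runs both decrements exactly once; repeated calls are no-ops,
// which makes it safe to call on every error path.
func (l *resourceLease) Release() {
	if l.released {
		return
	}
	l.released = true
	l.releaseFlow()
	l.releaseBand()
}

func main() {
	flows, bands := 1, 1
	lease := &resourceLease{
		releaseFlow: func() { flows-- },
		releaseBand: func() { bands-- },
	}
	lease.Release()
	lease.Release() // second call is a no-op
	fmt.Println(flows, bands)
}
```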

Contributor


As a general note: priorityBandState is almost structurally identical to flowState, and
pinActivePriorityBand duplicates the pinActiveFlow CAS-loop logic.

I'm not sure how much of an improvement this would be, but we can consider a generic LeasedResource[K comparable] struct or a helper for the Pin --> LoadOrStore --> Lock pattern to reduce the line count and surface area for bugs. I would try to get this PR merged first though as it is already correct and in a good state.
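As a sketch of what that generic helper might look like (the leaseTable/leasedState names below are invented for illustration, not proposed API): one table type keyed by K serves both flows (K = types.FlowKey) and bands (K = int), centralizing the Pin --> LoadOrStore --> Lock pattern:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// leasedState is the shared lease-count + idle-timestamp core that both
// flowState and priorityBandState duplicate today (simplified here).
type leasedState struct {
	mu           sync.Mutex
	leaseCount   int
	becameIdleAt time.Time // zero while active
}

// leaseTable generalizes the sync.Map plus pin/release logic over any
// comparable key type.
type leaseTable[K comparable] struct {
	states sync.Map // K -> *leasedState
}

// Pin loads or creates the state for key and takes a lease, clearing any
// idle timestamp: the Pin --> LoadOrStore --> Lock pattern.
func (t *leaseTable[K]) Pin(key K) *leasedState {
	val, _ := t.states.LoadOrStore(key, &leasedState{})
	s := val.(*leasedState)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.leaseCount++
	s.becameIdleAt = time.Time{}
	return s
}

// Release drops a lease; the last release stamps the idle time that a
// later GC cycle compares against the timeout.
func (t *leaseTable[K]) Release(key K) {
	val, ok := t.states.Load(key)
	if !ok {
		return
	}
	s := val.(*leasedState)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.leaseCount--
	if s.leaseCount == 0 {
		s.becameIdleAt = time.Now()
	}
}

func main() {
	var bands leaseTable[int] // the same type could be keyed by FlowKey
	s := bands.Pin(100)
	bands.Release(100)
	fmt.Println("leases:", s.leaseCount, "idle:", !s.becameIdleAt.IsZero())
}
```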

band := val.(*priorityBand)

// Check queue count under lock
shard.mu.RLock()
Contributor


nit: This acquires a read lock on every shard for every band during the GC cycle. With many bands, this scan mechanism could become heavy.

We can add an atomic.Int64 activeQueues to the priorityBand struct in shard.go, updating it when adding/removing queues. This would make isBandActive lock-free and remove the shard lock dependency from the registry GC loop.

Now, since priorities are tied to InferenceObjective CRDs and we run with a single shard by default, this seems like overkill at the moment. Just something to keep in mind if we hit scaling limits down the road.
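A sketch of that lock-free variant, under the stated assumption that the shard updates the counter wherever queues are added or removed (method names here are illustrative):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// priorityBand carries an atomic queue counter maintained by the shard, so
// the registry GC loop can check activity without shard.mu.RLock().
type priorityBand struct {
	activeQueues atomic.Int64
}

func (b *priorityBand) addQueue()    { b.activeQueues.Add(1) }
func (b *priorityBand) removeQueue() { b.activeQueues.Add(-1) }

// isBandActive becomes a single atomic load instead of a per-shard read lock.
func (b *priorityBand) isBandActive() bool {
	return b.activeQueues.Load() > 0
}

func main() {
	var b priorityBand
	b.addQueue()
	fmt.Println(b.isBandActive())
	b.removeQueue()
	fmt.Println(b.isBandActive())
}
```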

@LukeAVanDrie
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 22, 2026
Comment thread pkg/epp/flowcontrol/registry/shard.go Outdated
s.logger.V(logging.DEBUG).Info("Removed priority band from shard", "priority", priority)
}

// sortPriorityLevels sorts the orderedPriorityLevels slice in descending order (highest priority first).
Contributor


This is unused.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 22, 2026
Comment on lines +590 to +604
// Normally we may assume that only one GC loop is running globally: the following check is defensive.
// Concurrent GC might happen in test cases if a GC cycle is triggered concurrently with a background GC loop.
// In the case of concurrent GC execution, both GC cycles might see the same flow in their Range() snapshots.
// Only the first one to delete it should release the band lease. This prevents double-release bugs.
if _, existed := fr.flowStates.LoadAndDelete(key); existed {
flowsToClean = append(flowsToClean, key.(types.FlowKey))
fr.logger.V(logging.VERBOSE).Info("Garbage collecting flow", "flowKey", key, "becameIdleAt", idleTime)

// 5. Release the band lease.
// Every flow in the map holds exactly one band lease. This flow is being destroyed,
// so decrement the band's flow count.
if bandVal, ok := fr.priorityBandStates.Load(priority); ok {
bandState := bandVal.(*priorityBandState)
fr.releasePriorityBand(bandState)
}
Contributor Author

@evacchi evacchi Jan 22, 2026


@LukeAVanDrie rerunning the tests with -race, I noticed the band lease count may go to -1, because the tests both invoke executeGCCycle() explicitly and spin up a background GC loop in newRegistryTestHarness(). I am not sure this is 100% intentional in the flow tests; it is incorrect for the band tests. So:

  1. I am adding a manualGC flag to harnessOptions and setting it to true in the band tests, to ensure they run deterministically
  2. I am adding a defensive check here to make sure that leaseCount is not decremented twice if executeGCCycle() runs concurrently -- this is actually redundant if concurrent GC is not allowed

However, if executeGCCycle() is not meant to run concurrently, maybe we should add an atomic boolean to FlowRegistry and assert it is false when we enter executeGCCycle() (this will break the flow GC tests!); then we can revert to a plain Delete() here
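The atomic-boolean assertion floated above could be sketched as follows. This is a hypothetical shape, not code from the PR: CompareAndSwap flags an in-progress cycle so overlapping invocations fail loudly instead of silently double-releasing leases:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// FlowRegistry is reduced here to the single-runner guard.
type FlowRegistry struct {
	gcRunning atomic.Bool
}

// executeGCCycle asserts it is the only cycle in flight. With this guard in
// place, the sweep can use a plain Delete() instead of LoadAndDelete().
func (fr *FlowRegistry) executeGCCycle() {
	if !fr.gcRunning.CompareAndSwap(false, true) {
		panic("executeGCCycle invoked concurrently; GC must be single-threaded")
	}
	defer fr.gcRunning.Store(false)
	// ... mark idle flows/bands, sweep candidates ...
}

func main() {
	fr := &FlowRegistry{}
	fr.executeGCCycle() // sequential calls are fine
	fr.executeGCCycle()
	fmt.Println("ok")
}
```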

Contributor


I am adding a manualGC flag to harnessOptions and set it to true in the band tests, to ensure they run deterministically

The way I avoid this in the flow tests is by setting the idle timeout in the config to be very large (e.g., 1hr) effectively disabling the GC loop. Then you use the injected clock to step to the relevant times needed for your test logic.

however, if executeGCCycle() is not meant to run concurrently, maybe we should add an atomic boolean to FlowRegistry and assert it's false when we enter executeGCCycle() (this will break the Flow GC tests!); then we can revert to a plain Delete() here

It is not meant to run concurrently, so this seems reasonable to me.

Contributor


I would much prefer the simpler route here than another defensive check.

@evacchi evacchi requested a review from LukeAVanDrie January 22, 2026 12:02
@evacchi
Contributor Author

evacchi commented Jan 22, 2026

/retest-required

@k8s-ci-robot
Contributor

@evacchi: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kfswain
Collaborator

kfswain commented Jan 22, 2026

/ok-to-test
/approve
/hold

Only holding because I don't want it to autosubmit off my comment; will let @LukeAVanDrie control when this is ready. Thanks all!

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 22, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evacchi, kfswain, LukeAVanDrie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 22, 2026
@LukeAVanDrie
Contributor

/lgtm
/remove-hold

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jan 22, 2026
@k8s-ci-robot k8s-ci-robot merged commit 14279f0 into kubernetes-sigs:main Jan 22, 2026
9 checks passed
elevran pushed a commit to llm-d/llm-d-inference-scheduler that referenced this pull request Apr 23, 2026
…/gateway-api-inference-extension#2097)

* [Flow Control] Garbage Collection for Priority Bands

Signed-off-by: Edoardo Vacchi <[email protected]>

* test cases

Signed-off-by: Edoardo Vacchi <[email protected]>

* Rebuilt on top of main

Signed-off-by: Edoardo Vacchi <[email protected]>

* redundant tests

Signed-off-by: Edoardo Vacchi <[email protected]>

* naming conventions

Signed-off-by: Edoardo Vacchi <[email protected]>

* fix comments

Signed-off-by: Edoardo Vacchi <[email protected]>

* remove unused code

Signed-off-by: Edoardo Vacchi <[email protected]>

* fix config tests

Signed-off-by: Edoardo Vacchi <[email protected]>

* defensive LoadAndDelete() on bands, ensure tests won't GC concurrently

Signed-off-by: Edoardo Vacchi <[email protected]>

---------

Signed-off-by: Edoardo Vacchi <[email protected]>
