[Flow Control] Garbage Collection for Priority Bands #2097
k8s-ci-robot merged 9 commits into kubernetes-sigs:main
Conversation
@evacchi, this is an excellent draft! I agree with you that the

**Problem: "Zombie" Band**

Currently, we have two competing authorities for the existence of a band:

Because the JIT path does not (and currently cannot) acquire the

**Why We Can't Just Add a Lock**

If we try to fix this by making the JIT path acquire

**Proposal: Split Admission from Cleanup**

To solve this safely, I believe we need to decouple admission from memory management (GC). This likely makes Issue #2011 (controller-driven lifecycle) a prerequisite, but with a twist:

This breaks the lock hierarchy cycle because the JIT path never fights the GC. The JIT only cares if it's in the

**Next Steps?**

I think we have two options:

What are your thoughts? I am happy to hop on a call to whiteboard the locking hierarchy if it helps!

This is resolved with #2127.
Reworking on top of #2127.
Top post updated; reworked this on top of #2127 to follow the same pattern. @LukeAVanDrie let me know what you think 🙏
Sorry for the conflict here. It actually removes a source of fallibility from the registration path, so it hopefully was not too difficult to address. I am taking a look over this PR this morning. Thanks again!
no it was flawless, just a plain rebase, thx for asking 🙏 |
LukeAVanDrie
left a comment
Thanks for picking this up! This PR looks really good, especially the added test coverage. From a correctness point of view, I think this is ready to merge. I have high confidence that this solves the issue without risking any race conditions/leaks.
I added a few comments on ideas for future simplification / deduplication, but I don't think we should tackle them in this PR.
```diff
-	// TODO:(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/1982) revert to 5m once this GC
-	// race condition is properly resolved.
-	defaultFlowGCTimeout time.Duration = 1 * time.Hour
+	defaultFlowGCTimeout time.Duration = 5 * time.Minute
```
Good catch -- this fix was just merged today, so `5 * time.Minute` is safe again.
```go
	// Release the band lease if we created the flow.
	// If JIT provisioning fails for a new flow, we must release that lease to prevent leaking band leases.
	if isNewFlow {
```
nit: This is correct and necessary; however, this highlights a slight leak in our abstraction layers. We are manually unwinding distinct state layers (`flowStates` delete, then `priorityBand` release). We may want to consider some `resourceLease` abstraction that bundles these.

e.g., `lease := fr.acquireLease(key)` where `lease.Release()` handles both flow and band decrements.
That being said, I wouldn't make this change in this PR. The current explicitness is safe given the complexity.
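For illustration only, the bundled lease could look roughly like this. Everything here is hypothetical (no `resourceLease` or `acquireLease` exists today); it only reuses the `FlowRegistry`, `flowStates`, `priorityBandState`, and `releasePriorityBand` names discussed in this PR:

```go
// resourceLease is a hypothetical handle that bundles the two state layers,
// so callers release both with a single call instead of unwinding them manually.
type resourceLease struct {
	registry *FlowRegistry
	key      types.FlowKey
	band     *priorityBandState
}

// Release unwinds both layers: drop the flow entry, then decrement the band lease.
func (l *resourceLease) Release() {
	l.registry.flowStates.Delete(l.key)
	l.registry.releasePriorityBand(l.band)
}
```

A JIT failure path could then just `defer lease.Release()` rather than unwinding each layer by hand.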
As a general note: `priorityBandState` is almost structurally identical to `flowState`, and `pinActivePriorityBand` duplicates the `pinActiveFlow` CAS-loop logic.

I'm not sure how much of an improvement this would be, but we can consider a generic `LeasedResource[K comparable]` struct or a helper for the Pin --> LoadOrStore --> Lock pattern to reduce the line count and surface area for bugs. I would try to get this PR merged first though, as it is already correct and in a good state.
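A very rough shape of that helper, purely illustrative. All names, fields, and the exact CAS semantics (a negative count as the "being collected" sentinel) are assumptions, not the current code:

```go
// LeasedResource is an illustrative generic wrapper for the shared
// "pin -> LoadOrStore -> lock" lifecycle used by both flowState and priorityBandState.
type LeasedResource[K comparable] struct {
	Key          K
	mu           sync.Mutex
	leaseCount   atomic.Int64
	becameIdleAt time.Time
}

// Pin takes a lease via a CAS loop; it fails if GC has already begun tearing
// the resource down (signalled here, by assumption, as a negative count).
func (r *LeasedResource[K]) Pin() bool {
	for {
		current := r.leaseCount.Load()
		if current < 0 {
			return false
		}
		if r.leaseCount.CompareAndSwap(current, current+1) {
			return true
		}
	}
}
```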
```go
	band := val.(*priorityBand)
	// ...
	// Check queue count under lock
	shard.mu.RLock()
```
nit: This acquires a read lock on every shard for every band during the GC cycle. With many bands, this scan mechanism could become heavy.
We can add an `atomic.Int64` `activeQueues` field to the `priorityBand` struct in `shard.go`, updating it when adding/removing queues. This would make `isBandActive` lock-free and remove the shard lock dependency from the registry GC loop.
Now, since priorities are tied to InferenceObjective CRDs and we run with a single shard by default, this seems like overkill at the moment. Just something to keep in mind if we hit scaling limits down the road.
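Even so, a sketch of what that could look like. Only `activeQueues` comes from the suggestion above; the surrounding struct and the method names are assumptions about the existing shard code:

```go
// Sketch: an atomic count of active queues on the per-shard band, so the
// registry GC loop can test for activity without taking shard.mu.
type priorityBand struct {
	// ... existing fields ...
	activeQueues atomic.Int64
}

// These would be called from the existing queue add/remove paths.
func (b *priorityBand) queueAdded()   { b.activeQueues.Add(1) }
func (b *priorityBand) queueRemoved() { b.activeQueues.Add(-1) }

// The band-activity check then becomes lock-free.
func (b *priorityBand) hasActiveQueues() bool { return b.activeQueues.Load() > 0 }
```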
/lgtm
```go
		s.logger.V(logging.DEBUG).Info("Removed priority band from shard", "priority", priority)
	}
	// ...

// sortPriorityLevels sorts the orderedPriorityLevels slice in descending order (highest priority first).
```
```go
	// Normally we may assume that only one GC loop is running globally: the following check is defensive.
	// Concurrent GC might happen in test cases if a GC cycle is triggered concurrently with a background GC loop.
	// In the case of concurrent GC execution, both GC cycles might see the same flow in their Range() snapshots.
	// Only the first one to delete it should release the band lease. This prevents double-release bugs.
	if _, existed := fr.flowStates.LoadAndDelete(key); existed {
		flowsToClean = append(flowsToClean, key.(types.FlowKey))
		fr.logger.V(logging.VERBOSE).Info("Garbage collecting flow", "flowKey", key, "becameIdleAt", idleTime)

	// ...

	// 5. Release the band lease.
	// Every flow in the map holds exactly one band lease. This flow is being destroyed,
	// so decrement the band's flow count.
	if bandVal, ok := fr.priorityBandStates.Load(priority); ok {
		bandState := bandVal.(*priorityBandState)
		fr.releasePriorityBand(bandState)
	}
```
@LukeAVanDrie rerunning the tests with `-race` I noticed the band lease count might go to -1, because the tests are both invoking `executeGCCycle()` explicitly and spinning up a background GC loop in `newRegistryTestHarness()`. I am not sure this is 100% intentional in the flow tests; it is incorrect for the bands tests. So:

- I am adding a `manualGC` flag to `harnessOptions` and setting it to true in the band tests, to ensure they run deterministically
- I am adding a defensive check here to make sure that `leaseCount` is not decremented twice if `executeGCCycle()` runs concurrently -- this is actually redundant if concurrent GC is not allowed

However, if `executeGCCycle()` is not meant to run concurrently, maybe we should add an atomic boolean to `FlowRegistry` and assert it's false when we enter `executeGCCycle()` (this will break the Flow GC tests!); then we can revert to a plain `Delete()` here.
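For reference, the "assert it's false" option could look roughly like this; the `gcRunning` field is hypothetical, and the rest is a sketch of the suggestion rather than the actual code:

```go
// Sketch: gcRunning would be a new atomic.Bool on FlowRegistry, asserting that
// GC cycles never overlap; with that guarantee, the LoadAndDelete above could
// go back to a plain Delete.
func (fr *FlowRegistry) executeGCCycle() {
	if !fr.gcRunning.CompareAndSwap(false, true) {
		panic("executeGCCycle must not run concurrently")
	}
	defer fr.gcRunning.Store(false)

	// ... existing sweep logic ...
}
```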
> I am adding a `manualGC` flag to `harnessOptions` and set it to true in the band tests, to ensure they run deterministically

The way I avoid this in the flow tests is by setting the idle timeout in the config to be very large (e.g., 1hr), effectively disabling the GC loop. Then you use the injected clock to step to the relevant times needed for your test logic.

> however, if `executeGCCycle()` is not meant to run concurrently, maybe we should add an atomic boolean to `FlowRegistry` and assert it's false when we enter `executeGCCycle()` (this will break the Flow GC tests!); then we can revert to a plain `Delete()` here

It is not meant to run concurrently, so this seems reasonable to me.
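A sketch of that test pattern, assuming a fake clock such as the one in `k8s.io/utils/clock/testing`; the harness and config field names below are assumptions about the test code, not its actual API:

```go
// Sketch only: a huge GC timeout keeps the background loop from ever firing,
// and the injected fake clock drives time explicitly, so each GC cycle is
// triggered deterministically by the test itself.
func TestBandGC_Deterministic(t *testing.T) {
	fakeClock := clocktesting.NewFakeClock(time.Now())
	h := newRegistryTestHarness(t, harnessOptions{
		flowGCTimeout: time.Hour, // effectively disables the background GC loop (assumed field)
		clock:         fakeClock, // assumed field
	})

	// ... register a flow and let it become idle ...

	fakeClock.Step(2 * time.Hour) // step past the timeout
	h.registry.executeGCCycle()   // exactly one GC cycle, no background race
}
```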
I would much prefer the simpler route here over another defensive check.
/retest-required
/ok-to-test

Only holding because I don't want it to autosubmit off my comment; will let @LukeAVanDrie control when this is ready. Thanks all!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: evacchi, kfswain, LukeAVanDrie.
/lgtm
…/gateway-api-inference-extension#2097)

* [Flow Control] Garbage Collection for Priority Bands
* test cases
* Rebuilt on top of main
* redundant tests
* naming conventions
* fix comments
* remove unused code
* fix config tests
* defensive LoadAndDelete() on bands, ensure tests won't GC concurrently

Signed-off-by: Edoardo Vacchi <[email protected]>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #2012.
Does this PR introduce a user-facing change?:
The PR tries to stick as closely as possible to the same pattern used for `flowState` GC'ing.

- Config: we introduce `PriorityBandGCTimeout` with a `defaultPriorityBandGCTimeout`; this defaults to `2 * defaultFlowGCTimeout` to give a "grace period" where bands are retained even if the owning flows have already been collected, so that if a new flow with the same priority is re-created shortly after, they won't have to be reinstantiated from scratch.
- `FlowRegistry`: we introduce a `priorityBandStates sync.Map`, used similarly to `flowStates sync.Map`, mapping `int` priorities to `priorityBandState` structs; this struct includes a `becameIdleAt time.Time` field.
- `pinActivePriorityBand(priority)` is an almost verbatim copy of `pinActiveFlow(key types.FlowKey)`. `pinActiveFlow()` now also returns a boolean; when true, we call `pinActivePriorityBand(priority)`.
- `executeGCCycle()` now also invokes `gcPriorityBands()`: it updates the idle timestamp for priority bands, marks candidates, and deletes them.
- `isBandActive(priority)` checks `len(band.queues)` and uses `band.len, band.byteSize` to account for in-flight, buffered requests. There is no equivalent to the `leaseCount`. Does this make sense?
- `gcPriorityBands()` follows `verifyAndSweepFlows()`.
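For orientation, a condensed sketch of the shape described above. It is simplified: the exact field layout, the priority key type, the config access, and the `removePriorityBandFromAllShards` helper are assumptions; the real definitions are in the diff:

```go
// Simplified sketch of the new registry-side state and the band GC pass.
type priorityBandState struct {
	leaseCount   int64     // one lease per registered flow at this priority
	becameIdleAt time.Time // zero while the band is considered active (assumed convention)
}

// gcPriorityBands mirrors the flow GC: refresh idle timestamps, mark idle
// bands as candidates, and sweep those idle longer than PriorityBandGCTimeout.
func (fr *FlowRegistry) gcPriorityBands(now time.Time) {
	fr.priorityBandStates.Range(func(key, val any) bool {
		priority := key.(int)
		state := val.(*priorityBandState)
		if fr.isBandActive(priority) {
			state.becameIdleAt = time.Time{} // still in use; reset the idle timestamp
			return true
		}
		if state.becameIdleAt.IsZero() {
			state.becameIdleAt = now // first cycle that observed the band as idle
			return true
		}
		if now.Sub(state.becameIdleAt) < fr.config.PriorityBandGCTimeout {
			return true // idle, but still inside the grace period
		}
		// LoadAndDelete keeps a concurrent (test-only) GC cycle from releasing twice.
		if _, existed := fr.priorityBandStates.LoadAndDelete(key); existed {
			fr.removePriorityBandFromAllShards(priority)
		}
		return true
	})
}
```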