controller: extend flow lease scope to fix orphaned queues #1982 #2131

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from LukeAVanDrie:fix/1982-extend-flow-lease
Jan 21, 2026

Conversation

@LukeAVanDrie
Contributor

What type of PR is this?
/kind bug

What this PR does / why we need it:

This PR fixes a race condition (Issue #1982) where the Flow Registry Garbage Collector could delete a Flow while it still had active requests waiting in a queue.

The Bug

Previously, EnqueueAndWait only acquired a Registry lease during the distribution phase (selecting a shard). Once the request was submitted to a shard's queue, the lease was released. In scenarios like "Scale from Zero," where requests sit in queues for longer than FlowGCTimeout without new incoming traffic, the Registry would see the flow as "Idle" (0 leases) and delete it. This orphaned the queues, causing requests to time out silently.

The Fix

We now extend the scope of WithConnection to cover the entire request lifecycle, from distribution until finalization (dispatch or timeout).

  • The flow remains "pinned" in memory as long as at least one request is active or queued.
  • The GC will correctly skip these flows during its sweep phase.
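
The scoping change above can be illustrated with a minimal, self-contained sketch. It is not the project's code: `flowState`, `withConnection`, and the `queued`/`finalized` channels are hypothetical stand-ins, and the real `WithConnection` signature may differ. The only point is that the lease is taken before distribution and released after finalization.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// flowState is a hypothetical stand-in for the registry's per-flow record.
type flowState struct {
	leaseCount atomic.Int64
}

// withConnection is a simplified analogue of WithConnection (assumed shape):
// it holds one lease on the flow for exactly as long as fn runs.
func withConnection(f *flowState, fn func() error) error {
	f.leaseCount.Add(1)
	defer f.leaseCount.Add(-1)
	return fn()
}

// enqueueAndWait sketches the new scope: distribution and the wait for
// finalization both happen inside the closure, so the lease spans both.
func enqueueAndWait(f *flowState, queued chan<- struct{}, finalized <-chan struct{}) error {
	return withConnection(f, func() error {
		// Distribution (shard selection, submission) would happen here.
		close(queued) // hypothetical hook: the request now sits in a queue
		<-finalized   // block until dispatch or timeout
		return nil
	})
}

// leasesWhileQueued runs one request and reports the lease count observed
// while it is queued and again after it has been finalized.
func leasesWhileQueued() (during, after int64) {
	f := &flowState{}
	queued := make(chan struct{})
	finalized := make(chan struct{})
	done := make(chan error, 1)

	go func() { done <- enqueueAndWait(f, queued, finalized) }()

	<-queued // the request is queued and the closure is still running
	during = f.leaseCount.Load()
	close(finalized)
	<-done // the closure has returned, so the deferred release has run
	after = f.leaseCount.Load()
	return during, after
}

func main() {
	during, after := leasesWhileQueued()
	fmt.Printf("leases while queued: %d, after finalization: %d\n", during, after)
}
```

In this sketch the flow shows one lease for as long as the request is queued and zero once it is finalized, which is the property the GC relies on.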

Safety Note
This change is only safe because of PR #2127 (Optimistic Concurrency).

  • Old Behavior (Mutex): Holding a lease for minutes would have blocked the GC's Write Lock, which would have blocked all subsequent Read Locks, causing a total DoS on the system.
  • New Behavior (Atomic): We can hold the lease indefinitely. The GC simply sees leaseCount > 0 via an atomic load and skips the flow without blocking.
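
As a rough illustration of the atomic behavior described above, the following sketch shows a sweep that skips leased flows without taking any lock that request paths would contend on. The `registry` and `flow` types and the use of `sync.Map` are assumptions made for this example; the actual data structures introduced in #2127 may look quite different.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// flow is a hypothetical per-flow record with an atomic lease counter.
type flow struct {
	leaseCount atomic.Int64
}

// registry is a hypothetical container: flow key (string) -> *flow.
type registry struct {
	flows sync.Map
}

// sweep deletes flows that currently hold no leases and returns their keys.
// Leased flows are skipped after a single atomic load, so a long-lived lease
// never blocks the sweep.
//
// NOTE: this sketch ignores the window between the load and the delete. A
// real implementation has to close that window; how it does so is outside
// the scope of this example.
func (r *registry) sweep() []string {
	var deleted []string
	r.flows.Range(func(key, value any) bool {
		if value.(*flow).leaseCount.Load() > 0 {
			return true // pinned by at least one request: skip
		}
		r.flows.Delete(key)
		deleted = append(deleted, key.(string))
		return true
	})
	return deleted
}

// demoSweep registers one leased flow and one idle flow, then sweeps.
func demoSweep() []string {
	r := &registry{}
	pinned, idle := &flow{}, &flow{}
	pinned.leaseCount.Add(1) // simulates a request still waiting in a queue
	r.flows.Store("pinned", pinned)
	r.flows.Store("idle", idle)
	return r.sweep()
}

func main() {
	fmt.Println(demoSweep()) // prints [idle]
}
```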

Which issue(s) this PR fixes:
Fixes #1982

Reviewer Notes:

  • This PR must not be merged until PR #2127 (registry: switch to fine-grained leasing for flow lifecycle) is merged.
  • Interface Change: Added FlowKey() to the ActiveFlowConnection contract. While not strictly necessary, this allows us to pass just the connection object down the stack, enforcing encapsulation.
  • Testing: Added Regression_LeaseHeldDuringQueueing to controller_test.go, which uses channels to deterministically prove the lease is not released while the processor is blocked.
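
The interface change noted above can be sketched as follows. This is an illustration only: the `FlowKey` fields, the `conn` type, and `describeCandidate` are hypothetical, and the real `ActiveFlowConnection` contract has more methods than the one accessor shown here.

```go
package main

import "fmt"

// FlowKey is a hypothetical two-field key used only for this example.
type FlowKey struct {
	ID       string
	Priority int
}

// ActiveFlowConnection is reduced here to the single accessor this PR adds.
type ActiveFlowConnection interface {
	FlowKey() FlowKey
}

// conn is a hypothetical implementation that remembers the key it was
// opened for.
type conn struct {
	key FlowKey
}

func (c conn) FlowKey() FlowKey { return c.key }

// describeCandidate stands in for an internal helper that would otherwise
// need both a key and a connection as separate parameters. With FlowKey()
// on the interface, the connection alone is enough.
func describeCandidate(c ActiveFlowConnection) string {
	k := c.FlowKey()
	return fmt.Sprintf("flow=%s priority=%d", k.ID, k.Priority)
}

func main() {
	c := conn{key: FlowKey{ID: "tenant-a", Priority: 1}}
	fmt.Println(describeCandidate(c)) // prints flow=tenant-a priority=1
}
```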

Does this PR introduce a user-facing change?:

Fixed a bug where flows could be garbage collected while requests were still queued during long wait times with no new traffic (e.g., scale-from-zero).

/hold
Waiting for PR #2127

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 12, 2026
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 12, 2026
@k8s-ci-robot
Contributor

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@netlify

netlify Bot commented Jan 12, 2026

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 583dc10
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/696ac3a629c6f60008018572
😎 Deploy Preview https://deploy-preview-2131--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 12, 2026
Contributor Author

The diff for EnqueueAndWait looks massive, but this is mostly due to indentation changes. I wrapped the existing distribution loop inside the WithConnection closure to extend the lease scope. Would recommend reviewing this with "Hide Whitespace" enabled.

@LukeAVanDrie
Contributor Author

/cc @aishukamal

@k8s-ci-robot
Contributor

@LukeAVanDrie: GitHub didn't allow me to request PR reviews from the following users: aishukamal.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @aishukamal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ahg-g
Contributor

ahg-g commented Jan 13, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 13, 2026
@LukeAVanDrie
Contributor Author

@ahg-g These tests will probably fail (deadlock) until #2127 is merged. That's expected, as #2127 is a prerequisite (just with no conflicting code overlap).

This commit refactors the FlowController's request lifecycle management
to hold the Flow Registry lease (`WithConnection`) for the entire
duration of the request, including the queueing phase.

Previously, the lease was only held during the instantaneous
distribution phase. If a flow had requests waiting in a queue (e.g.,
during scale-from-zero) but no new incoming traffic, the registry would
incorrectly identify the flow as Idle and garbage collect it, orphaning
the queued requests.

Changes:
- Hoisted `WithConnection` in `EnqueueAndWait` to wrap the retry loop
  and `awaitFinalization`.
- Updated `ActiveFlowConnection` interface to expose `FlowKey()`,
  preventing data clumps in internal signatures.
- Refactored `selectDistributionCandidates` to use the active connection
  instead of re-acquiring it.
- Added a regression test (`Regression_LeaseHeldDuringQueueing`)
  ensuring the lease remains valid while the processor blocks.

This change relies on the optimistic concurrency model introduced in PR
kubernetes-sigs#2127 to ensure that holding long-lived leases does not
block the Garbage Collector or cause writer starvation.
@LukeAVanDrie LukeAVanDrie force-pushed the fix/1982-extend-flow-lease branch from f0ff9e7 to 583dc10 Compare January 16, 2026 23:03
@LukeAVanDrie
Contributor Author

@ahg-g These tests will probably fail (deadlock) until #2127 is merged. That's expected, as #2127 is a prerequisite (just with no conflicting code overlap).

This is ready for review now that #2127 is merged.

@LukeAVanDrie
Contributor Author

/remove-hold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2026
return types.QueueOutcomeRejectedOther, fmt.Errorf("%w: %w", types.ErrRejected, types.ErrFlowControllerNotRunning)
default:
}
var finalOutcome types.QueueOutcome
Contributor

Should we initialize this?

Contributor Author

It is zero-initialized by default, and the zero value is the first iota constant (QueueOutcomeNotYetFinalized), which is the semantically correct value here, so this should be safe.

// QueueOutcome represents the high-level final state of a request's lifecycle within the `controller.FlowController`.
//
// It is returned by `FlowController.EnqueueAndWait()` along with a corresponding error. This enum is designed to be a
// low-cardinality label ideal for metrics, while the error provides fine-grained details for non-dispatched outcomes.
type QueueOutcome int

const (
	// QueueOutcomeNotYetFinalized indicates the request has not yet been finalized by the `controller.FlowController`.
	// This is an internal default value and should never be returned by `FlowController.EnqueueAndWait()`.
	QueueOutcomeNotYetFinalized QueueOutcome = iota
	...
)
...
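
The zero-value argument can be checked with a small standalone sketch. The `outcome` type and its constants below are hypothetical and only mirror the shape of `QueueOutcome`; they are not the elided values from the real enum.

```go
package main

import "fmt"

// outcome mirrors the shape of QueueOutcome: an int-backed enum whose first
// constant is declared with iota and therefore has the value 0.
type outcome int

const (
	outcomeNotYetFinalized outcome = iota // 0: the "nothing decided yet" default
	outcomeOther                          // 1: hypothetical later value
)

// zeroOutcome returns a declared-but-unassigned outcome, as in
// `var finalOutcome types.QueueOutcome`.
func zeroOutcome() outcome {
	var finalOutcome outcome
	return finalOutcome
}

func main() {
	// Go zero-initializes variables, and the zero value of an int-backed
	// type is 0, which is the first iota constant.
	fmt.Println(zeroOutcome() == outcomeNotYetFinalized) // prints true
}
```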

@ahg-g
Contributor

ahg-g commented Jan 21, 2026

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 21, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, LukeAVanDrie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 21, 2026
@k8s-ci-robot k8s-ci-robot merged commit 194cc33 into kubernetes-sigs:main Jan 21, 2026
11 checks passed
elevran pushed a commit to llm-d/llm-d-inference-scheduler that referenced this pull request Apr 23, 2026
…rence-extension#2131)

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Rare Race Condition: Premature Flow GC causes Orphaned Queues and Request Starvation

3 participants