controller: extend flow lease scope to fix orphaned queues #1982#2131
Conversation
|
Hi @LukeAVanDrie. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
✅ Deploy Preview for gateway-api-inference-extension ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
The diff for EnqueueAndWait looks massive, but this is mostly due to indentation changes. I wrapped the existing distribution loop inside the WithConnection closure to extend the lease scope. Would recommend reviewing this with "Hide Whitespace" enabled.
|
/cc @aishukamal |
|
@LukeAVanDrie: GitHub didn't allow me to request PR reviews from the following users: aishukamal. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
This commit refactors the FlowController's request lifecycle management to hold the Flow Registry lease (`WithConnection`) for the entire duration of the request, including the queueing phase. Previously, the lease was only held during the instantaneous distribution phase. If a flow had requests waiting in a queue (e.g., during scale-from-zero) but no new incoming traffic, the registry would incorrectly identify the flow as Idle and garbage collect it, orphaning the queued requests. Changes: - Hoisted `WithConnection` in `EnqueueAndWait` to wrap the retry loop and `awaitFinalization`. - Updated `ActiveFlowConnection` interface to expose `FlowKey()`, preventing data clumps in internal signatures. - Refactored `selectDistributionCandidates` to use the active connection instead of re-acquiring it. - Added a regression test (`Regression_LeaseHeldDuringQueueing`) ensuring the lease remains valid while the processor blocks. This change relies on the optimistic concurrency model introduced in PR kubernetes-sigs#2127 to ensure that holding long-lived leases does not block the Garbage Collector or cause writer starvation.
f0ff9e7 to
583dc10
Compare
|
/remove-hold |
| return types.QueueOutcomeRejectedOther, fmt.Errorf("%w: %w", types.ErrRejected, types.ErrFlowControllerNotRunning) | ||
| default: | ||
| } | ||
| var finalOutcome types.QueueOutcome |
There was a problem hiding this comment.
It initializes by default (it's an iota) to the semantically correct value, so this should be safe.
// QueueOutcome represents the high-level final state of a request's lifecycle within the `controller.FlowController`.
//
// It is returned by `FlowController.EnqueueAndWait()` along with a corresponding error. This enum is designed to be a
// low-cardinality label ideal for metrics, while the error provides fine-grained details for non-dispatched outcomes.
type QueueOutcome int
const (
// QueueOutcomeNotYetFinalized indicates the request has not yet been finalized by the `controller.FlowController`.
// This is an internal default value and should never be returned by `FlowController.EnqueueAndWait()`.
QueueOutcomeNotYetFinalized QueueOutcome = iota
...
)
...
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahg-g, LukeAVanDrie The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…rence-extension#2131) This commit refactors the FlowController's request lifecycle management to hold the Flow Registry lease (`WithConnection`) for the entire duration of the request, including the queueing phase. Previously, the lease was only held during the instantaneous distribution phase. If a flow had requests waiting in a queue (e.g., during scale-from-zero) but no new incoming traffic, the registry would incorrectly identify the flow as Idle and garbage collect it, orphaning the queued requests. Changes: - Hoisted `WithConnection` in `EnqueueAndWait` to wrap the retry loop and `awaitFinalization`. - Updated `ActiveFlowConnection` interface to expose `FlowKey()`, preventing data clumps in internal signatures. - Refactored `selectDistributionCandidates` to use the active connection instead of re-acquiring it. - Added a regression test (`Regression_LeaseHeldDuringQueueing`) ensuring the lease remains valid while the processor blocks. This change relies on the optimistic concurrency model introduced in PR kubernetes-sigs/gateway-api-inference-extension#2127 to ensure that holding long-lived leases does not block the Garbage Collector or cause writer starvation.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a race condition (Issue #1982) where the Flow Registry Garbage Collector could delete a Flow while it still had active requests waiting in a queue.
The Bug
Previously,
EnqueueAndWaitonly acquired a Registry lease during the distribution phase (selecting a shard). Once the request was submitted to a shard's queue, the lease was released. In scenarios like "Scale from Zero," where requests sit in queues for longer thanFlowGCTimeoutwithout new incoming traffic, the Registry would see the flow as "Idle" (0 leases) and delete it. This orphaned the queues, causing requests to time out silently.The Fix
We now extend the scope of
WithConnectionto cover the entire request lifecycle, from distribution until finalization (dispatch or timeout).Safety Note
This change is only safe because of PR #2127 (Optimistic Concurrency).
leaseCount > 0via an atomic load and skips the flow without blocking.Which issue(s) this PR fixes:
Fixes #1982
Reviewer Notes:
FlowKey()to theActiveFlowConnectioncontract. While not strictly necessary, this allows us to pass just the connection object down the stack, enforcing encapsulation.Regression_LeaseHeldDuringQueueingtocontroller_test.go, which uses channels to deterministically prove the lease is not released while the processor is blocked.Does this PR introduce a user-facing change?:
/hold
Waiting for PR #2127