fix(predictedlatency): keep prefillTokensInFlight from drifting negative#2848
Conversation
✅ Deploy Preview for gateway-api-inference-extension ready!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kaushikmitr. The full list of commands accepted by this bot can be found here. The pull request process is described here.
return v.(*atomic.Int64)
}

// decrementPodCounter subtracts delta from the counter at key with a hard
We work with endpoints, not pods; a pod may have one or more endpoints in DP mode, so I suggest renaming this to endpoint.
If there are instances in the predicted-latency set of plugins that use the term pod, please update those to endpoint in a follow-up PR.
Fixed this for now; will clean it up in a follow-up PR.
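For context on the thread above, here is a minimal sketch of the clamp-at-zero decrement being reviewed, assuming the counters live in a sync.Map keyed by endpoint name with *atomic.Int64 values. The helper name follows the endpoint rename suggested above; the signature and store layout are illustrative, not the exact code in this PR.

```go
package sketch

import (
	"sync"
	"sync/atomic"
)

// decrementEndpointCounter subtracts delta from the counter at key, but never
// lets it drift below zero. Decrementing a missing key is a no-op and does not
// create a zero entry. (Sketch only; names and layout are illustrative.)
func decrementEndpointCounter(counters *sync.Map, key string, delta int64) {
	v, ok := counters.Load(key)
	if !ok {
		return // no entry: nothing to decrement, and do not create one
	}
	ctr := v.(*atomic.Int64)
	for {
		cur := ctr.Load()
		next := cur - delta
		if next < 0 {
			next = 0 // clamp: a stray decrement cannot push the counter negative
		}
		if ctr.CompareAndSwap(cur, next) {
			return
		}
	}
}
```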
f0ea31b to 5ea53e9
The predicted-latency producer's prefillTokensInFlight counter can get stuck deeply negative, after which every bulk prediction call fails the sidecar's `greater_than_equal: 0` validation with HTTP 422 and the latency-prediction scheduling pipeline is effectively dead. An EPP restart doesn't help because the same race fires again on the next pod startup.

Trigger (reproduced on a test cluster): At EPP startup, the EPP's gRPC readiness can pass a few ms before the prediction sidecar is actually answering requests. Envoy has a persistent ext-proc connection to the service, so real traffic floods in the instant EPP becomes ready. For the first ~500 ms of the pod's life, every request's PrepareRequestData calls generatePredictions, which sends an HTTP POST to the sidecar. The sidecar's port is bound but the handler isn't answering yet, so the Go HTTP client waits up to HTTPTimeout (10s by default). The director's runPrepareDataPlugins has its own 400 ms deadline, so after 400 ms it returns "timed out" to the request handler and continues — but until kubernetes-sigs#2847 landed, the plugin's goroutine kept running with a live context. The downstream hooks then raced with the plugin:

1. PreRequest runs on the now-scheduled request, fails to read the SLO context (the plugin goroutine has not published it yet), and skips incrementing prefillTokensInFlight.
2. The plugin goroutine eventually returns from its HTTP call and stores the SLO context in the ttlcache.
3. ResponseBody fires at EOS, reads the (late-published) context, and unconditionally decrements prefillTokensInFlight.

Each orphan pair subtracts one request's input length from the endpoint's counter with no matching increment. Reproduced on a test cluster: 19 PrepareData timeouts across a 490 ms startup window drove the counter negative, producing 192 "prefill_tokens_in_flight: negative" log lines and 281 subsequent bulk prediction 422s even though the sidecar was fully healthy from second two onwards. Under real production load the startup burst is much larger (hundreds of requests) and drifts the counter millions of tokens negative in one shot. Steady-state CPU-induced timeouts contribute an additional slow drip on top, but the one-shot startup burst is the dominant source.

kubernetes-sigs#2847 fixed the underlying cancellation hole in the request-control executor so plugins now see ctx.Done() when the deadline fires, but plugins that already commit shared state at the end of PrepareRequestData need to cooperate. Add three layered guards in the predicted-latency producer:

1. PrepareRequestData: check ctx.Err() before publishing the SLO context (see the sketch below). If the director has already cancelled us, the context PreRequest needed never showed up, and storing it now recreates the orphan-decrement race even under cooperative cancellation.
2. ResponseBody: only decrement counters that PreRequest actually incremented. Mirror the prefillTokensAtDispatch{,OnPrefill} > 0 guard the OnEviction handler already uses, instead of decrementing based on prefillTargetMetadata != nil alone. Also tighten the OnEviction guard so each endpoint's decrement is gated on its own dispatch snapshot rather than a combined OR.
3. decrementEndpointCounter helper: a CAS-based clamp at zero that all decrement sites now go through. Decrementing a missing key is a no-op and does not create a zero entry.

This is the last line of defense — if any future bug somehow lands a stray decrement, the counter still cannot drift below zero and the prediction sidecar stays inside its validation contract.

Adds regression tests for each layer:

- TestPrepareRequestData_CancelledContextDoesNotPublish (and a live positive control) verifies the publication guard on a cancelled ctx.
- TestPredictedLatency_ResponseBody_NoOrphanDecrement_WhenPreRequestSkipped reproduces the startup race in a unit test: SLO context is published as if PrepareData raced past the director, PreRequest is skipped, ResponseBody fires at EOS, and the endpoint's counter must not go below zero.
- TestDecrementEndpointCounter_FloorAtZero exercises the helper directly: decrementing a missing key, decrementing past zero, decrementing after delete, and the normal partial-decrement path.
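To make guard (1) concrete, here is a minimal sketch of the publication check, assuming the SLO context is kept in a plain sync.Map rather than the plugin's actual ttlcache; the type, field, and function names are hypothetical, not the code as merged.

```go
package sketch

import (
	"context"
	"fmt"
	"sync"
)

// sloContext stands in for the per-request SLO state the plugin caches for
// PreRequest and ResponseBody; the field is illustrative.
type sloContext struct {
	prefillTokens int64
}

// publishSLOContext refuses to publish once the director's PrepareData
// deadline has already cancelled the plugin: PreRequest never saw this
// context, so storing it now would only hand ResponseBody an orphan
// decrement at end-of-stream.
func publishSLOContext(ctx context.Context, store *sync.Map, requestID string, sc *sloContext) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("skipping SLO context publication: %w", err)
	}
	store.Store(requestID, sc)
	return nil
}
```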
5ea53e9 to 15bdac6
/lgtm
…ive (kubernetes-sigs#2848) (cherry picked from commit c307ce7)

…ive (#2848) (#2853) (cherry picked from commit c307ce7) Minor rebase conflict: upstream/main renamed schedulingtypes.LLMRequest → schedulingtypes.InferenceRequest and the helper createTestLLMRequest → createTestInferenceRequest. release-1.5 still uses the pre-rename names, so the three call sites added by the new regression tests were adjusted back to createTestLLMRequest to compile against release-1.5. No behavior changes.

…ive (kubernetes-sigs/gateway-api-inference-extension#2848)
The predicted-latency producer can be left holding an SLO context whose PreRequest counterpart never ran, when PrepareData times out and the director moves on. The plugin's PrepareRequestData goroutine eventually finishes and publishes the context; ResponseBody later finds it and unconditionally decrements prefillTokensInFlight. Each orphan path subtracts one request's input length from the pod's counter with no matching increment, so under sustained load, and at startup when the EPP is ready but the prediction sidecars are not yet, the counter drifts deeply negative. Once that happens, every subsequent bulk prediction request carries a negative prefill_tokens_in_flight feature and the prediction sidecar rejects it with HTTP 422 "greater_than_equal: 0", killing the latency-prediction scheduling pipeline until the EPP pod is restarted.
#2847 fixed the underlying cancellation hole in the request-control executor so plugins now see ctx.Done() when the deadline fires, but plugins that already commit shared state at the end of PrepareRequestData need to cooperate. Add three layered guards in the predicted-latency producer:
1. PrepareRequestData: check ctx.Err() before publishing the SLO context. If the director has already cancelled us, the context PreRequest needed never showed up, and storing it now would create the orphan-decrement race even with cooperative cancellation.
2. ResponseBody: only decrement counters that PreRequest actually incremented. Mirror the prefillTokensAtDispatch{,OnPrefill} > 0 guard the OnEviction handler already uses, instead of decrementing based on prefillTargetMetadata != nil alone. Also tighten the OnEviction guard so each pod's decrement is gated on its own dispatch snapshot rather than a combined OR (see the sketch after this list).
3. decrementPodCounter helper: a CAS-based clamp at zero that all decrement sites now go through. Decrementing a missing key is a no-op and does not create a zero entry. This is the last line of defense — if any future bug somehow lands a stray decrement, the counter still cannot drift below zero and the prediction sidecar stays inside its validation contract.
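A rough sketch of how guards (2) and (3) fit together at end-of-stream, using hypothetical field names (prefillTokensAtDispatch, prefillTokensAtDispatchOnPrefill) for the dispatch-time snapshots PreRequest records only when it actually increments the counters; this illustrates the shape of the fix, not the merged code.

```go
package sketch

import "sync/atomic"

// requestState stands in for the cached SLO context; all names are illustrative.
type requestState struct {
	prefillTokensAtDispatch          int64 // recorded by PreRequest for the decode endpoint; 0 if PreRequest was skipped
	prefillTokensAtDispatchOnPrefill int64 // recorded by PreRequest for the prefill endpoint; 0 if skipped
	decodeCounter                    *atomic.Int64
	prefillCounter                   *atomic.Int64
}

// onEndOfStream decrements each endpoint's counter only from its own dispatch
// snapshot, mirroring the OnEviction guard, instead of keying off the mere
// presence of prefill target metadata.
func onEndOfStream(s *requestState) {
	if s.prefillTokensAtDispatch > 0 {
		clampedSub(s.decodeCounter, s.prefillTokensAtDispatch)
	}
	if s.prefillTokensAtDispatchOnPrefill > 0 {
		clampedSub(s.prefillCounter, s.prefillTokensAtDispatchOnPrefill)
	}
}

// clampedSub applies the same CAS-based floor at zero as the decrement helper:
// subtract delta but never let the counter go negative.
func clampedSub(c *atomic.Int64, delta int64) {
	for {
		cur := c.Load()
		next := cur - delta
		if next < 0 {
			next = 0
		}
		if c.CompareAndSwap(cur, next) {
			return
		}
	}
}
```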
Adds regression tests for each layer: