fix(predictedlatency): keep prefillTokensInFlight from drifting negative #2848

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from tomatillo-and-multiverse:fix/predictedlatency-counter-hardening
Apr 15, 2026

Conversation

@kaushikmitr
Contributor

@kaushikmitr kaushikmitr commented Apr 15, 2026

When PrepareData times out and the director moves on, the predicted-latency producer can be left holding an SLO context whose PreRequest counterpart never ran. The plugin's PrepareRequestData goroutine eventually finishes and publishes the context; ResponseBody later finds it and unconditionally decrements prefillTokensInFlight. Each orphan path subtracts one request's input length from the pod's counter with no matching increment, so the counter drifts deeply negative under sustained load, and at startup when the EPP is ready but the prediction sidecars are not yet answering. Once that happens, every subsequent bulk prediction request carries a negative prefill_tokens_in_flight feature and the prediction sidecar rejects it with HTTP 422 "greater_than_equal: 0", killing the latency-prediction scheduling pipeline until the EPP pod is restarted.

#2847 fixed the underlying cancellation hole in the request-control executor so plugins now see ctx.Done() when the deadline fires, but plugins that already commit shared state at the end of PrepareRequestData need to cooperate. Add three layered guards in the predicted-latency producer:

  1. PrepareRequestData: check ctx.Err() before publishing the SLO context. If the director has already cancelled us, the context PreRequest needed never showed up, and storing it now would create the orphan-decrement race even with cooperative cancellation.

  2. ResponseBody: only decrement counters that PreRequest actually incremented. Mirror the prefillTokensAtDispatch{,OnPrefill} > 0 guard the OnEviction handler already uses, instead of decrementing based on prefillTargetMetadata != nil alone. Also tighten the OnEviction guard so each pod's decrement is gated on its own dispatch snapshot rather than a combined OR.

  3. decrementPodCounter helper: a CAS-based clamp at zero that all decrement sites now go through (see the sketch after this list). Decrementing a missing key is a no-op and does not create a zero entry. This is the last line of defense: if any future bug somehow lands a stray decrement, the counter still cannot drift below zero and the prediction sidecar stays inside its validation contract.
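
For illustration, here is a minimal Go sketch of guards 1 and 3, not the plugin's actual code: the names (publishIfNotCancelled, decrementCounter, "endpoint-a") and the sync.Map-of-*atomic.Int64 layout are hypothetical stand-ins for the real API.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

// Guard 1 (sketch): do not publish shared state once the director has
// cancelled the PrepareRequestData context, since PreRequest will never see it.
func publishIfNotCancelled(ctx context.Context, publish func()) error {
	if err := ctx.Err(); err != nil {
		return err // publishing now would set up an orphan decrement later
	}
	publish()
	return nil
}

// Guard 3 (sketch): subtract delta from the counter at key, clamping at zero
// with a CAS loop. Decrementing a missing key is a no-op and creates no entry.
func decrementCounter(counters *sync.Map, key string, delta int64) {
	v, ok := counters.Load(key)
	if !ok {
		return // nothing to decrement; don't create a zero-valued entry
	}
	ctr := v.(*atomic.Int64)
	for {
		cur := ctr.Load()
		if cur <= 0 {
			return // already at the floor; a stray decrement cannot push it negative
		}
		next := cur - delta
		if next < 0 {
			next = 0 // clamp instead of going negative
		}
		if ctr.CompareAndSwap(cur, next) {
			return
		}
		// Lost a race with a concurrent update; retry against the fresh value.
	}
}

func main() {
	var counters sync.Map
	c := &atomic.Int64{}
	c.Store(5)
	counters.Store("endpoint-a", c)

	decrementCounter(&counters, "endpoint-a", 3)  // 5 -> 2
	decrementCounter(&counters, "endpoint-a", 10) // clamps at 0 instead of -8
	decrementCounter(&counters, "missing", 4)     // no-op, no entry created
	fmt.Println(c.Load())                         // 0

	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	// With a cancelled context the publish callback is never invoked.
	_ = publishIfNotCancelled(ctx, func() { panic("must not publish after cancellation") })
}
```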

Adds regression tests for each layer:

  • TestPrepareRequestData_CancelledContextDoesNotPublish (and a live positive control) verifies the publication guard on a cancelled ctx.
  • TestPredictedLatency_ResponseBody_NoOrphanDecrement_WhenPreRequestSkipped reproduces the production race in a unit test: SLO context is published as if PrepareData raced past the director, PreRequest is skipped, ResponseBody fires at EOS, and the pod's counter must not go below zero.
  • TestDecrementPodCounter_FloorAtZero exercises the helper directly: decrementing a missing key, decrementing past zero, decrementing after delete, and the normal partial-decrement path; a simplified sketch follows below.
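
Against the hypothetical decrementCounter helper sketched above, a simplified version of the floor-at-zero test could look like the following; the real TestDecrementPodCounter_FloorAtZero exercises the plugin's own helper and also covers the decrement-after-delete case.

```go
package main

import (
	"sync"
	"sync/atomic"
	"testing"
)

func TestDecrementCounter_FloorAtZero(t *testing.T) {
	var counters sync.Map

	// Decrementing a missing key is a no-op and must not create an entry.
	decrementCounter(&counters, "endpoint-a", 7)
	if _, ok := counters.Load("endpoint-a"); ok {
		t.Fatal("decrement of a missing key created an entry")
	}

	// Decrementing past zero clamps at zero instead of going negative.
	c := &atomic.Int64{}
	c.Store(3)
	counters.Store("endpoint-a", c)
	decrementCounter(&counters, "endpoint-a", 10)
	if got := c.Load(); got != 0 {
		t.Fatalf("counter drifted below zero: got %d, want 0", got)
	}

	// The normal partial-decrement path is unchanged.
	c.Store(8)
	decrementCounter(&counters, "endpoint-a", 3)
	if got := c.Load(); got != 5 {
		t.Fatalf("partial decrement: got %d, want 5", got)
	}
}
```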

@netlify

netlify Bot commented Apr 15, 2026

Deploy Preview for gateway-api-inference-extension ready!

Name | Link
🔨 Latest commit | 15bdac6
🔍 Latest deploy log | https://app.netlify.com/projects/gateway-api-inference-extension/deploys/69dfb873914e6c00080edbb3
😎 Deploy Preview | https://deploy-preview-2848--gateway-api-inference-extension.netlify.app

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaushikmitr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 15, 2026
@kaushikmitr
Contributor Author

@BenjaminBraunDev

return v.(*atomic.Int64)
}

// decrementPodCounter subtracts delta from the counter at key with a hard
Contributor


We work with endpoints, not pods; a pod may have one or more endpoints in DP mode, so I suggest renaming this to endpoint.

Contributor


If there are instances in the predicted-latency set of plugins that use the term pod, please update them to endpoint in a follow-up PR.

Contributor Author


Fixed this for now; will clean it up in a follow-up PR.

@kaushikmitr force-pushed the fix/predictedlatency-counter-hardening branch from f0ea31b to 5ea53e9 on April 15, 2026 16:06
The predicted-latency producer's prefillTokensInFlight counter can get
stuck deeply negative, after which every bulk prediction call fails the
sidecar's `greater_than_equal: 0` validation with HTTP 422 and the
latency-prediction scheduling pipeline is effectively dead. An EPP
restart doesn't help because the same race fires again on the next pod
startup.

Trigger (reproduced on a test cluster):

At EPP startup, the EPP's gRPC readiness can pass a few ms before the
prediction sidecar is actually answering requests. Envoy has a
persistent ext-proc connection to the service, so real traffic floods
in the instant EPP becomes ready. For the first ~500 ms of the pod's
life, every request's PrepareRequestData calls generatePredictions,
which sends an HTTP POST to the sidecar. The sidecar's port is bound
but the handler isn't answering yet, so the Go HTTP client waits up to
HTTPTimeout (10s by default). The director's runPrepareDataPlugins has
its own 400 ms deadline, so after 400 ms it returns "timed out" to the
request handler and continues — but until kubernetes-sigs#2847 landed, the plugin's
goroutine kept running with a live context. The downstream hooks then
raced with the plugin:

1. PreRequest runs on the now-scheduled request, fails to read the SLO
   context (the plugin goroutine has not published it yet), and skips
   incrementing prefillTokensInFlight.
2. The plugin goroutine eventually returns from its HTTP call and
   stores the SLO context in the ttlcache.
3. ResponseBody fires at EOS, reads the (late-published) context, and
   unconditionally decrements prefillTokensInFlight.

Each orphan pair subtracts one request's input length from the
endpoint's counter with no matching increment. Reproduced on a test
cluster: 19 PrepareData timeouts across a 490 ms startup window drove
the counter negative, producing 192 "prefill_tokens_in_flight: negative"
log lines and 281 subsequent bulk prediction 422s even though the
sidecar was fully healthy from second two onwards. Under real
production load the startup burst is much larger (hundreds of
requests) and drifts the counter millions of tokens negative in one
shot. Steady-state CPU-induced timeouts contribute an additional
slow drip on top, but the one-shot startup burst is the dominant
source.

kubernetes-sigs#2847 fixed the underlying cancellation hole in the request-control
executor so plugins now see ctx.Done() when the deadline fires, but
plugins that already commit shared state at the end of
PrepareRequestData need to cooperate. Add three layered guards in the
predicted-latency producer:

1. PrepareRequestData: check ctx.Err() before publishing the SLO
   context. If the director has already cancelled us, the context
   PreRequest needed never showed up, and storing it now recreates
   the orphan-decrement race even under cooperative cancellation.

2. ResponseBody: only decrement counters that PreRequest actually
   incremented. Mirror the prefillTokensAtDispatch{,OnPrefill} > 0
   guard the OnEviction handler already uses, instead of decrementing
   based on prefillTargetMetadata != nil alone. Also tighten the
   OnEviction guard so each endpoint's decrement is gated on its own
   dispatch snapshot rather than a combined OR.

3. decrementEndpointCounter helper: a CAS-based clamp at zero that all
   decrement sites now go through. Decrementing a missing key is a
   no-op and does not create a zero entry. This is the last line of
   defense — if any future bug somehow lands a stray decrement, the
   counter still cannot drift below zero and the prediction sidecar
   stays inside its validation contract.

Adds regression tests for each layer:

- TestPrepareRequestData_CancelledContextDoesNotPublish (and a live
  positive control) verifies the publication guard on a cancelled ctx.
- TestPredictedLatency_ResponseBody_NoOrphanDecrement_WhenPreRequestSkipped
  reproduces the startup race in a unit test: SLO context is
  published as if PrepareData raced past the director, PreRequest is
  skipped, ResponseBody fires at EOS, and the endpoint's counter must
  not go below zero.
- TestDecrementEndpointCounter_FloorAtZero exercises the helper
  directly: decrementing a missing key, decrementing past zero,
  decrementing after delete, and the normal partial-decrement path.
@kaushikmitr force-pushed the fix/predictedlatency-counter-hardening branch from 5ea53e9 to 15bdac6 on April 15, 2026 16:10
@ahg-g
Contributor

ahg-g commented Apr 15, 2026

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 15, 2026
@k8s-ci-robot k8s-ci-robot merged commit c307ce7 into kubernetes-sigs:main Apr 15, 2026
13 checks passed
kaushikmitr added a commit to tomatillo-and-multiverse/gateway-api-inference-extension-merged that referenced this pull request Apr 15, 2026
…ive (kubernetes-sigs#2848)

(cherry picked from commit c307ce7)
kaushikmitr added a commit to tomatillo-and-multiverse/gateway-api-inference-extension-merged that referenced this pull request Apr 15, 2026
…ive (kubernetes-sigs#2848)

(cherry picked from commit c307ce7)

Minor rebase conflict: upstream/main renamed
schedulingtypes.LLMRequest → schedulingtypes.InferenceRequest and the
helper createTestLLMRequest → createTestInferenceRequest. release-1.5
still uses the pre-rename names, so the three call sites added by the
new regression tests were adjusted back to createTestLLMRequest to
compile against release-1.5. No behavior changes.
k8s-ci-robot pushed a commit that referenced this pull request Apr 15, 2026
…ive (#2848) (#2853)

(cherry picked from commit c307ce7)

elevran pushed a commit to llm-d/llm-d-inference-scheduler that referenced this pull request Apr 23, 2026
…ive (kubernetes-sigs/gateway-api-inference-extension#2848)
