Token in flight features #2563
Conversation
✅ Deploy Preview for gateway-api-inference-extension ready!
```go
generatedTokenCounts := make([]int, len(candidateEndpoints))
prefixCacheScores := make([]float64, len(candidateEndpoints))
prefillTokensInFlights := make([]int64, len(candidateEndpoints))
decodeTokensInFlights := make([]int64, len(candidateEndpoints)) // always zero; decode load captured by kv_cache_percentage
```
What does this comment mean by "always zero"? If it's truly always 0, then what is the purpose of this field?
Thanks for pointing this out! Yes, having decodeTokensInFlight always set to 0 in the prediction path was confusing, so I removed it from that path entirely. The Python prediction server defaults decode_tokens_in_flight to 0, so omitting it on the Go side is safe.
We've kept DecodeTokensInFlight as a field in the PredictionRequest struct and in the training path intentionally; it's a feature we may want to support in the future. The reason it's 0 right now is that for non-streaming requests (or mixed streaming/non-streaming workloads), we can't accurately track how many decode tokens are in flight, since tokens aren't being streamed back. Rather than send inaccurate data, we default to 0 and let the model learn without that signal for now.
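For context, a minimal sketch of what the Python side's defaulting might look like, assuming a Pydantic request model (only `PredictionRequest`, `kv_cache_percentage`, and the two token-in-flight fields appear in this PR; the other names are illustrative):

```python
from pydantic import BaseModel

class PredictionRequest(BaseModel):
    # Decode load is already captured by this signal (see the review
    # comment above), which is why decode tokens in flight can be omitted.
    kv_cache_percentage: float
    # Illustrative stand-in for the model's other input features.
    input_token_length: int = 0
    # Populated by the Go scheduler in the prediction path.
    prefill_tokens_in_flight: int = 0
    # Defaults to 0, so the Go prediction path can omit it entirely;
    # the training path may still populate it in the future.
    decode_tokens_in_flight: int = 0
```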
Ah, got it. Sounds good to me.
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: BenjaminBraunDev, kaushikmitr

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing `/approve` in a comment.
* latency predictor: add prefill/decode tokens in flight features
* latency predictor: add token in flight features to python servers and tests
* decrement tokens in flight on TTL
* fix fmt error
* remove decode tokens in flight
This pull request adds support for "token-in-flight" features to both the prediction and training servers in the latency predictor codebase. These features (`prefill_tokens_in_flight` and `decode_tokens_in_flight`) are now included in model inputs, feature preparation, and model training when enabled via environment variables. The changes are designed to improve the accuracy of the latency prediction models by incorporating these new signals, while maintaining backward compatibility when the feature is disabled.

Token-in-flight feature integration:
* Added the `ENABLE_TOKEN_IN_FLIGHT_FEATURES` environment variable to the `PredictSettings` and `Settings` classes, controlling the inclusion of token-in-flight features; see the sketch after this list. (`latencypredictor/prediction_server.py`, `latencypredictor/training_server.py`)
* Added `prefill_tokens_in_flight` and `decode_tokens_in_flight` fields to the `PredictionRequest` model, and ensured they are included in batch prediction and feature preparation logic. (`latencypredictor/prediction_server.py`)
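As a rough illustration of the flag wiring, assuming plain `os.getenv` parsing (the class and variable names come from the PR description; the parsing logic is an assumption, not the actual implementation):

```python
import os

def _env_flag(name: str, default: str = "false") -> bool:
    # Hypothetical helper: treats "1"/"true"/"yes" (any case) as enabled.
    return os.getenv(name, default).strip().lower() in ("1", "true", "yes")

class PredictSettings:
    """Prediction-server settings (latencypredictor/prediction_server.py)."""
    def __init__(self) -> None:
        self.enable_token_in_flight_features = _env_flag(
            "ENABLE_TOKEN_IN_FLIGHT_FEATURES"
        )

class Settings:
    """Training-server settings (latencypredictor/training_server.py)."""
    def __init__(self) -> None:
        self.enable_token_in_flight_features = _env_flag(
            "ENABLE_TOKEN_IN_FLIGHT_FEATURES"
        )
```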
Training pipeline updates:

* Token-in-flight features are now included in feature preparation and model training when the flag is enabled. (`latencypredictor/training_server.py`)

These changes ensure that the models can leverage token-in-flight features for improved latency prediction, with a seamless fallback to the previous behavior when the feature is disabled.
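Finally, a sketch of the backward-compatible gating in feature preparation (the feature names and environment variable are from the PR; `BASE_FEATURES`, `feature_columns`, and `prepare_features` are hypothetical helpers):

```python
import os

import pandas as pd

ENABLE_TOKEN_IN_FLIGHT_FEATURES = (
    os.getenv("ENABLE_TOKEN_IN_FLIGHT_FEATURES", "false").lower() == "true"
)

# Hypothetical base feature set; kv_cache_percentage is referenced in the PR.
BASE_FEATURES = ["kv_cache_percentage", "input_token_length"]
TOKEN_IN_FLIGHT_FEATURES = ["prefill_tokens_in_flight", "decode_tokens_in_flight"]

def feature_columns() -> list[str]:
    # With the flag off, the column set is unchanged, so previously
    # trained models keep working (the backward-compatible fallback).
    cols = list(BASE_FEATURES)
    if ENABLE_TOKEN_IN_FLIGHT_FEATURES:
        cols += TOKEN_IN_FLIGHT_FEATURES
    return cols

def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    # Missing token-in-flight columns default to 0, mirroring the
    # prediction server's defaults for omitted fields.
    for col in TOKEN_IN_FLIGHT_FEATURES:
        if col not in df.columns:
            df[col] = 0
    return df[feature_columns()]
```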