Token in flight features#2563

Merged
k8s-ci-robot merged 5 commits into kubernetes-sigs:main from tomatillo-and-multiverse:token-in-flight-features
Mar 13, 2026

Conversation

@kaushikmitr
Contributor

This pull request adds support for "token-in-flight" features to both the prediction and training servers in the latency predictor codebase. These features (prefill_tokens_in_flight and decode_tokens_in_flight) are now included in model inputs, feature preparation, and model training when enabled via environment variables. The changes are designed to improve the accuracy of latency prediction models by incorporating these new signals, while maintaining backward compatibility when the feature is disabled.

Token-in-flight feature integration:

  • Added ENABLE_TOKEN_IN_FLIGHT_FEATURES environment variable to PredictSettings and Settings classes, controlling the inclusion of token-in-flight features. (latencypredictor/prediction_server.py, latencypredictor/training_server.py) [1] [2]
  • Introduced prefill_tokens_in_flight and decode_tokens_in_flight fields to the PredictionRequest model, and ensured they are included in batch prediction and feature preparation logic. (latencypredictor/prediction_server.py) [1] [2] [3] [4] [5]

Training pipeline updates:

  • Modified feature preparation, model training, and metrics emission to conditionally include token-in-flight columns, updating feature lists and default values throughout the training pipeline. (latencypredictor/training_server.py) [1] [2] [3] [4] [5] [6] [7] [8] [9]

These changes ensure that the models can leverage token-in-flight features for improved latency prediction, with seamless fallback to previous behavior when the feature is disabled.
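
The gating pattern described above can be sketched as follows. This is a minimal illustration, not the actual server code: only ENABLE_TOKEN_IN_FLIGHT_FEATURES, prefill_tokens_in_flight, and decode_tokens_in_flight come from the PR; the base feature names and helper functions are assumptions for the sketch.

```python
import os

# Hypothetical base feature names; the real servers define their own lists.
BASE_FEATURES = ["kv_cache_percentage", "input_token_length", "num_request_waiting"]
TOKEN_IN_FLIGHT_FEATURES = ["prefill_tokens_in_flight", "decode_tokens_in_flight"]


def token_in_flight_enabled() -> bool:
    # Feature is controlled by an environment variable, per the PR description.
    return os.environ.get("ENABLE_TOKEN_IN_FLIGHT_FEATURES", "false").lower() == "true"


def feature_columns() -> list[str]:
    # Conditionally extend the feature list; when disabled, the column set
    # is unchanged, preserving backward compatibility.
    cols = list(BASE_FEATURES)
    if token_in_flight_enabled():
        cols += TOKEN_IN_FLIGHT_FEATURES
    return cols


def prepare_features(request: dict) -> dict:
    # Missing token-in-flight values default to 0 so requests from older
    # clients (or with the flag off) still produce valid model inputs.
    return {col: request.get(col, 0) for col in feature_columns()}
```

The key property is that with the flag unset, both the column list and the prepared feature vectors are identical to the pre-PR behavior.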

@netlify

netlify Bot commented Mar 12, 2026

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 1d30a3e
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/69b4581fb8e65700081612de
😎 Deploy Preview https://deploy-preview-2563--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 12, 2026
generatedTokenCounts := make([]int, len(candidateEndpoints))
prefixCacheScores := make([]float64, len(candidateEndpoints))
prefillTokensInFlights := make([]int64, len(candidateEndpoints))
decodeTokensInFlights := make([]int64, len(candidateEndpoints)) // always zero; decode load captured by kv_cache_percentage
Contributor

@BenjaminBraunDev Mar 13, 2026


What does this comment mean by "always zero"? If it's truly always 0, then what is the purpose?

Contributor Author


Thanks for pointing this out! Yes, having decodeTokensInFlight always set to 0 in the prediction path was confusing. I removed it from the prediction path entirely. The Python prediction server defaults decode_tokens_in_flight to 0, so omitting it from the Go side is safe.

We've kept DecodeTokensInFlight as a field in the PredictionRequest struct and in the training path intentionally; it's a feature we may want to support in the future. The reason it's 0 right now is that for non-streaming requests (or mixed streaming/non-streaming workloads), we can't accurately track how many decode tokens are in flight, since tokens aren't being streamed back. Rather than send inaccurate data, we default to 0 and let the model learn without that signal for now.
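
The defaulting behavior being relied on here can be sketched like this. The real Python server likely defines PredictionRequest with a validation framework; this stdlib dataclass sketch only illustrates the contract, and all field names besides the two in-flight counters are assumptions.

```python
from dataclasses import dataclass


@dataclass
class PredictionRequest:
    # Illustrative required fields; the real model has its own schema.
    kv_cache_percentage: float
    input_token_length: int
    prefill_tokens_in_flight: int = 0
    # Defaults to 0 when the Go client omits it, so removing the field from
    # the Go prediction path is safe: the model simply sees no decode-load
    # signal from this feature.
    decode_tokens_in_flight: int = 0
```

Because the default is 0, a client that never sets decode_tokens_in_flight produces exactly the same model input as one that explicitly sends 0.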

Contributor


Ah, got it. Sounds good to me.

Contributor

@BenjaminBraunDev left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 13, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenjaminBraunDev, kaushikmitr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit fd0a926 into kubernetes-sigs:main Mar 13, 2026
11 checks passed
BizerNotNull pushed a commit to BizerNotNull/gateway-api-inference-extension that referenced this pull request Mar 15, 2026
* latency predictor: add prefill/decode tokens in flight features

* latency predictor: add token in flight features to python servers and tests

* decrement tokens in flignt on ttl

* fix fmt error

* remove decode tokens in flight
elevran pushed a commit to llm-d/llm-d-inference-scheduler that referenced this pull request Apr 23, 2026
…sion#2563)

* latency predictor: add prefill/decode tokens in flight features

* latency predictor: add token in flight features to python servers and tests

* decrement tokens in flignt on ttl

* fix fmt error

* remove decode tokens in flight

