Token in flight features #2563
Conversation
✅ Deploy Preview for gateway-api-inference-extension ready!
```go
generatedTokenCounts := make([]int, len(candidateEndpoints))
prefixCacheScores := make([]float64, len(candidateEndpoints))
prefillTokensInFlights := make([]int64, len(candidateEndpoints))
decodeTokensInFlights := make([]int64, len(candidateEndpoints)) // always zero; decode load captured by kv_cache_percentage
```
What does this comment mean by "always zero"? If it's truly always 0, then what is the purpose of this field?
Thanks for pointing this out! Yes, having decodeTokensInFlight always set to 0 in the prediction path was confusing, so I removed it from that path entirely. The Python prediction server defaults decode_tokens_in_flight to 0, so omitting it on the Go side is safe.
We've kept DecodeTokensInFlight as a field in the PredictionRequest struct and in the training path intentionally; it's a feature we may want to support in the future. The reason it's 0 right now is that for non-streaming requests (or mixed streaming/non-streaming workloads), we can't accurately track how many decode tokens are in flight, since tokens aren't being streamed back. Rather than send inaccurate data, we default to 0 and let the model learn without that signal for now.
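For context, a minimal sketch of what the Python side's defaulting might look like, assuming a Pydantic request model (only `PredictionRequest`, `kv_cache_percentage`, and the two token-in-flight fields appear in this PR; the other names are illustrative):

```python
from pydantic import BaseModel

class PredictionRequest(BaseModel):
    # Decode load is already captured by this signal (see the review
    # comment above), which is why decode tokens in flight can be omitted.
    kv_cache_percentage: float
    # Illustrative stand-in for the model's other input features.
    input_token_length: int = 0
    # Populated by the Go scheduler in the prediction path.
    prefill_tokens_in_flight: int = 0
    # Defaults to 0, so the Go prediction path can omit it entirely;
    # the training path may still populate it in the future.
    decode_tokens_in_flight: int = 0
```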
Ah, got it. Sounds good to me.
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: BenjaminBraunDev, kaushikmitr

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing `/approve` in a comment.
* latency predictor: add prefill/decode tokens in flight features
* latency predictor: add token in flight features to python servers and tests
* decrement tokens in flight on TTL
* fix fmt error
* remove decode tokens in flight
This pull request adds support for "token-in-flight" features to both the prediction and training servers in the latency predictor codebase. These features (`prefill_tokens_in_flight` and `decode_tokens_in_flight`) are now included in model inputs, feature preparation, and model training when enabled via environment variables. The changes are designed to improve the accuracy of the latency prediction models by incorporating these new signals, while maintaining backward compatibility when the feature is disabled.

Token-in-flight feature integration:
* Added the `ENABLE_TOKEN_IN_FLIGHT_FEATURES` environment variable to the `PredictSettings` and `Settings` classes, controlling the inclusion of token-in-flight features; see the sketch after this list. (`latencypredictor/prediction_server.py`, `latencypredictor/training_server.py`)
* Added `prefill_tokens_in_flight` and `decode_tokens_in_flight` fields to the `PredictionRequest` model, and ensured they are included in batch prediction and feature preparation logic. (`latencypredictor/prediction_server.py`)
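As a rough illustration of the flag wiring, assuming plain `os.getenv` parsing (the class and variable names come from the PR description; the parsing logic is an assumption, not the actual implementation):

```python
import os

def _env_flag(name: str, default: str = "false") -> bool:
    # Hypothetical helper: treats "1"/"true"/"yes" (any case) as enabled.
    return os.getenv(name, default).strip().lower() in ("1", "true", "yes")

class PredictSettings:
    """Prediction-server settings (latencypredictor/prediction_server.py)."""
    def __init__(self) -> None:
        self.enable_token_in_flight_features = _env_flag(
            "ENABLE_TOKEN_IN_FLIGHT_FEATURES"
        )

class Settings:
    """Training-server settings (latencypredictor/training_server.py)."""
    def __init__(self) -> None:
        self.enable_token_in_flight_features = _env_flag(
            "ENABLE_TOKEN_IN_FLIGHT_FEATURES"
        )
```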
Training pipeline updates:

* Token-in-flight features are now included in feature preparation and model training when the flag is enabled. (`latencypredictor/training_server.py`)

These changes ensure that the models can leverage token-in-flight features for improved latency prediction, with a seamless fallback to the previous behavior when the feature is disabled.
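Finally, a sketch of the backward-compatible gating in feature preparation (the feature names and environment variable are from the PR; `BASE_FEATURES`, `feature_columns`, and `prepare_features` are hypothetical helpers):

```python
import os

import pandas as pd

ENABLE_TOKEN_IN_FLIGHT_FEATURES = (
    os.getenv("ENABLE_TOKEN_IN_FLIGHT_FEATURES", "false").lower() == "true"
)

# Hypothetical base feature set; kv_cache_percentage is referenced in the PR.
BASE_FEATURES = ["kv_cache_percentage", "input_token_length"]
TOKEN_IN_FLIGHT_FEATURES = ["prefill_tokens_in_flight", "decode_tokens_in_flight"]

def feature_columns() -> list[str]:
    # With the flag off, the column set is unchanged, so previously
    # trained models keep working (the backward-compatible fallback).
    cols = list(BASE_FEATURES)
    if ENABLE_TOKEN_IN_FLIGHT_FEATURES:
        cols += TOKEN_IN_FLIGHT_FEATURES
    return cols

def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    # Missing token-in-flight columns default to 0, mirroring the
    # prediction server's defaults for omitted fields.
    for col in TOKEN_IN_FLIGHT_FEATURES:
        if col not in df.columns:
            df[col] = 0
    return df[feature_columns()]
```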