rename slo-aware-router to predicted-latency#2183
rename slo-aware-router to predicted-latency#2183k8s-ci-robot merged 11 commits intokubernetes-sigs:mainfrom
Conversation
✅ Deploy Preview for gateway-api-inference-extension ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
related to #2032 Once this merges will update the docs in a follow up PR. |
ahg-g
left a comment
There was a problem hiding this comment.
Why did we need to change tpot to itl? is itl the more popular term now? note that we use normalized_time_per_output_token in one of the metrics we currently report.
It would have been easier to review this PR if we did the slo -> predicted latency rename and then follow that up with the tpot -> itl.
| itlSLOHeaderKey = "x-slo-itl-ms" | ||
|
|
||
| // Sheddable header string | ||
| sheddableHeaderKey = "x-request-sheddable" |
|
|
||
| // parseBoolHeader retrieves a header by name, parses it as a bool, | ||
| // and returns the value or an error if the header is missing or invalid. | ||
| func parseBoolHeader(request schedulingtypes.LLMRequest, headerName string) (bool, error) { |
There was a problem hiding this comment.
why do we need this? if this is renaming PR, no new logic should be expected in this PR.
| limitations under the License. | ||
| */ | ||
|
|
||
| package predicted_latency |
There was a problem hiding this comment.
It seems this is a net new file with net new logic, is this related to the renaming?
There was a problem hiding this comment.
yes, it was some shedding related changes i was working on and does not belong ot this PR. I removed those. Now this PR just renames slo-aware-router to predicted-latency
Yes let me split this PR into 2 focusing first on renaming the scorer. There is a subtle difference between itl and TPOT as reported by vLLM (ITL is per-token, and TPOT is request-weighted average of ITL). What we are measuring and optimizing for in the model is ITL. |
| targetPod := sloCtx.targetMetadata | ||
| prefix_cache_score := sloCtx.prefixCacheScoresForEndpoints[targetPod.String()] | ||
| targetPod := predictedLatencyCtx.targetMetadata | ||
| prefix_cache_score := predictedLatencyCtx.prefixCacheScoresForEndpoints[targetPod.NamespacedName.Name] |
There was a problem hiding this comment.
Was this a bug?
We should use targetPod.NamespacedName.String(), a pod name is not unique across namespaces, but we can do as a followup in a separate PR since we should do this in all places where the prefix cache is used.
There was a problem hiding this comment.
yes this was a bug, targetPod.String() broke the prefix cache scoring logic in the latency predictor. I think it happened during the renaming of pods to endpoints (not sure, need to look back at the history)
There was a problem hiding this comment.
So we have a gap in tests then that we should address in a followup PR; pls also update the key to use the full namespaced name in another follow up PR
|
|
||
| // sheddable indicates if the request can be shed if no valid endpoint is available. | ||
| sheddable bool | ||
|
|
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahg-g, kaushikmitr The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…-api-inference-extension#2183) * rename slo-aware-router to predicted-latency and tpot to itl * fix fmt error * remove unused files * fix test * fix lint errors * remove tpot renaming * remmove shedding related changes * remmove shedding related changes 2 * fix sidecar renaming * removed sheddable var
This pull request replaces the old SLO-aware router plugin with a new predicted latency plugin in the scheduling framework. The main focus is on renaming, refactoring, and updating all relevant code and registration logic to use the new
predicted_latencyplugin, ensuring consistency across the codebase.Plugin replacement and registration:
slo_aware_router) is replaced with the predicted latency plugin (predicted_latency) in the plugin registration logic withinrunner.go. All imports and registrations now reference the new plugin. [1] [2]Refactoring and renaming for consistency:
slo_aware_routerare renamed topredicted_latency, including package names, struct names, and variable names (e.g.,SLOAwareRouter→PredictedLatency,sloRequestContext→predictedLatencyCtx). [1] [2] [3] [4] [5] [6] [7]Function and variable updates:
predictedLatencyCtxand related identifiers instead of the old SLO-aware router names. [1] [2] [3] [4] [5]These changes modernize the latency prediction plugin and ensure naming consistency throughout the codebase.