support dynamic adaptation of scheduler plugins configuration #1992

@nirrozenbaum

Description

What would you like to be added:
Start from the basic building block, scorers. Scorers usually fall into one of two main categories:

Category 1:
Context-aware scorers include session-aware, estimated prefix-cache aware, and KV-cache–aware scorers. These scorers use different levels of knowledge to estimate KV-cache locality on serving pods, with precision improving as information becomes more granular. For example, a session-aware strategy may rely on a session-id header that maps to a growing chat, while a KV-cache–aware strategy consumes direct cache events from vLLM pods.

Category 2:
Load-aware scorers focus on metrics such as queue lengths, active request counts, or KV-cache memory utilization, aiming to distribute inference requests evenly, prevent hotspots, and maximize throughput.

These categories often pull in opposite directions: context-aware scorers bias toward sticky routing to maximize cache hits, while load-aware scorers spread requests to minimize queuing delays. Striking the right balance is critical for minimizing latency and maximizing efficiency — but static weights are fragile and often suboptimal.
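To make the tension concrete, here is a minimal sketch of a static weighted sum over one context-aware and one load-aware scorer. It is an illustration only, not the EPP scheduler API: `PodState`, `context_score`, `load_score`, `pick_pod`, and the pod values are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodState:
    """Hypothetical per-pod snapshot; field names are illustrative only."""
    name: str
    cache_hit_ratio: float  # estimated fraction of the prompt already cached, 0..1
    queue_len: int          # requests currently waiting on the pod


def context_score(pod: PodState) -> float:
    # Context-aware: favor the pod most likely to hold the prefix in cache.
    return pod.cache_hit_ratio


def load_score(pod: PodState, max_queue: int = 10) -> float:
    # Load-aware: favor short queues; clamp so the score stays in [0, 1].
    return 1.0 - min(pod.queue_len, max_queue) / max_queue


def pick_pod(pods, w_context: float, w_load: float) -> PodState:
    # Static weighted sum of the two scorer outputs; highest total wins.
    return max(
        pods,
        key=lambda p: w_context * context_score(p) + w_load * load_score(p),
    )


pods = [
    PodState("pod-a", cache_hit_ratio=0.9, queue_len=8),  # warm cache, busy
    PodState("pod-b", cache_hit_ratio=0.1, queue_len=1),  # cold cache, idle
]

sticky = pick_pod(pods, w_context=0.8, w_load=0.2)  # selects pod-a
spread = pick_pod(pods, w_context=0.2, w_load=0.8)  # selects pod-b
```

The same pods produce opposite routing decisions depending only on the fixed weights, which is why a single static setting is unlikely to suit both a lightly loaded and a saturated cluster.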

We should introduce dynamic adaptation of the active scorers and/or their weights to optimize performance.
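One possible shape for such adaptation, sketched under loud assumptions: a small controller that shifts weight toward load-aware scorers as queue pressure rises and back toward context-aware scorers as it falls. The `AdaptiveWeights` class, its pressure signal (fill level of the busiest queue), and all default values are hypothetical. The issue does not prescribe a policy, and any real policy would need the benchmarks requested below.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveWeights:
    """Hypothetical controller that shifts weight between two scorer categories."""
    w_load: float = 0.5     # weight of load-aware scorers; context weight is 1 - w_load
    smoothing: float = 0.3  # EMA factor: higher reacts faster but oscillates more
    w_min: float = 0.1      # lower bound, so neither category is fully disabled
    w_max: float = 0.9      # upper bound for the load-aware weight

    def update(self, queue_lens, queue_capacity: int) -> None:
        # Pressure signal: how full the busiest pod's queue is, clamped to [0, 1].
        pressure = min(max(queue_lens, default=0) / queue_capacity, 1.0)
        target = self.w_min + (self.w_max - self.w_min) * pressure
        # Move a fraction of the way toward the target (exponential moving average).
        self.w_load += self.smoothing * (target - self.w_load)

    @property
    def w_context(self) -> float:
        return 1.0 - self.w_load


weights = AdaptiveWeights()

# Sustained saturation: weight drifts toward load-aware scoring.
for _ in range(20):
    weights.update(queue_lens=[9, 10, 8], queue_capacity=10)
loaded = weights.w_load

# Queues drain: weight drifts back toward context-aware (sticky) scoring.
for _ in range(20):
    weights.update(queue_lens=[0, 1, 0], queue_capacity=10)
idle = weights.w_load
```

The smoothing factor is there to damp oscillation: without it, a burst that briefly fills one queue would flip routing from sticky to spread and back within a few requests, discarding cache locality for little gain.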

Why is this needed:
To optimize the performance of the EPP. We should provide benchmarks to prove that performance indeed improves when this capability is introduced.

Metadata

Labels

triage/accepted: Indicates an issue or PR is ready to be actively worked on.
