What would you like to be added:
Starting from the basic building block of scorers: scorers are usually defined in one of two main categories (a rough sketch of both follows the list below):
Category 1:
Context-aware - which includes scorers such as session-aware, estimated prefix-cache aware, and KV-cache–aware. These scorers use different levels of knowledge to estimate KV-cache locality on serving pods, with precision improving as information becomes more granular. For example, a session-aware strategy may rely on a session-id header that maps to a growing chat, while a KV-cache–aware strategy consumes direct cache events from vLLM pods.
Category 2:
Load-aware - which focuses on metrics such as queue lengths, active request counts, or KV-cache memory utilization, aiming to evenly distribute inference requests, prevent hotspots, and maximize throughput.
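To make the two categories concrete, here is a minimal Go sketch of how such scorers might look behind a common interface. This is not the actual EPP API; the type names, fields, and heuristics (`Pod`, `ContextAwareScorer`, `LoadAwareScorer`, `WeightedScore`) are hypothetical and only illustrate how static weights combine the two signals today:

```go
package scoring

// Pod holds the per-pod signals a scorer might consult (hypothetical fields).
type Pod struct {
	Name            string
	QueueDepth      int     // pending requests on the pod
	KVCacheUsage    float64 // fraction of KV-cache memory in use
	PrefixCacheHits int     // estimated cached prefix blocks for this request
}

// Scorer returns a score in [0, 1]; higher means a better placement.
type Scorer interface {
	Score(pod Pod) float64
}

// ContextAwareScorer favors pods likely to already hold relevant KV-cache state.
type ContextAwareScorer struct{}

func (ContextAwareScorer) Score(p Pod) float64 {
	// Toy heuristic: more estimated prefix-cache hits -> higher score.
	return float64(p.PrefixCacheHits) / float64(p.PrefixCacheHits+1)
}

// LoadAwareScorer favors lightly loaded pods to avoid hotspots.
type LoadAwareScorer struct{}

func (LoadAwareScorer) Score(p Pod) float64 {
	return 1.0 / float64(p.QueueDepth+1)
}

// WeightedScore combines scorer outputs with static weights (the status quo).
func WeightedScore(p Pod, scorers []Scorer, weights []float64) float64 {
	total := 0.0
	for i, s := range scorers {
		total += weights[i] * s.Score(p)
	}
	return total
}
```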
These categories often pull in opposite directions: context-aware scorers bias toward sticky routing to maximize cache hits, while load-aware scorers spread requests to minimize queuing delays. Striking the right balance is critical for minimizing latency and maximizing efficiency — but static weights are fragile and often suboptimal.
We should introduce dynamic adaptation of the active scorers and/or their weights to optimize performance.
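As a sketch of what such dynamic adaptation could look like, the loop below nudges the weight split between the two categories based on observed queue depth and cache hit rate: under queuing pressure it shifts weight toward load-aware spreading, and when queues are short but cache locality is poor it shifts weight toward context-aware (sticky) routing. The signal names, thresholds, and step size are all hypothetical, chosen only to illustrate the idea:

```go
package scoring

// ClusterSignals summarizes recent observations across serving pods (assumed inputs).
type ClusterSignals struct {
	MeanQueueDepth float64 // average pending requests per pod
	CacheHitRate   float64 // fraction of requests that hit a warm KV-cache
}

// Weights holds the relative weight of each scorer category.
type Weights struct {
	ContextAware float64
	LoadAware    float64
}

// Adapt nudges the balance between the two categories based on observed signals.
func Adapt(w Weights, s ClusterSignals) Weights {
	const step = 0.05
	switch {
	case s.MeanQueueDepth > 8: // hypothetical threshold: queues are building up
		w.ContextAware -= step // spread load more aggressively
	case s.CacheHitRate < 0.5: // headroom available, but cache locality is poor
		w.ContextAware += step // route more stickily to warm caches
	}
	// Clamp and renormalize so the weights stay a valid convex combination.
	w.ContextAware = clamp(w.ContextAware, 0.1, 0.9)
	w.LoadAware = 1.0 - w.ContextAware
	return w
}

func clamp(v, lo, hi float64) float64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}
```

The same adaptation hook could also enable or disable scorers entirely rather than only reweighting them; which signals drive the adaptation is exactly what the benchmarks below should help decide.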
Why is this needed:
To optimize the performance of the EPP. We should provide benchmarks to prove that performance is indeed improved when this capability is introduced.