Throughput Analyzer

A Throughput Analyzer could make scaling more proactive. Based on the available metrics, it would assess the maximum throughput of a vLLM replica, in tokens per second, for the current request characteristics (number of prompt tokens, number of output tokens, and request rate per second). The existing Saturation Analyzer relies on current KV cache utilization and queue depth, so it can only detect overload after it has occurred. The Throughput Analyzer, by contrast, can detect when the current demand trajectory will exhaust capacity, before saturation occurs.

To achieve this, we need to develop an algorithm that estimates throughput as a function of the available metrics collected over recent time intervals. The algorithm needs to be validated through numerous experiments on a real vLLM system until the throughput estimate reaches reasonable accuracy. It then needs to be implemented in the new Throughput Analyzer and tested in e2e benchmarks.
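
To make the idea of a demand trajectory concrete, here is a minimal Python sketch. It is not the proposed algorithm, which still has to be developed and validated. It assumes that demand can be approximated as request rate times average tokens per request (ignoring that prompt and output tokens cost differently), that a per-replica capacity estimate `capacity_tps` is already available from some other source, and that a straight-line fit over recent samples is a good enough trend. The function names and parameters are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def demand_tokens_per_second(request_rate: float,
                             avg_prompt_tokens: float,
                             avg_output_tokens: float) -> float:
    """Rough token demand (assumption: prompt and output tokens weigh equally)."""
    return request_rate * (avg_prompt_tokens + avg_output_tokens)


def seconds_until_exhaustion(samples: Sequence[Tuple[float, float]],
                             capacity_tps: float) -> Optional[float]:
    """Extrapolate a least-squares line through (time_s, demand_tps) samples.

    Returns the seconds after the last sample at which the fitted trend
    reaches capacity_tps, 0.0 if the trend is already at or above it,
    or None if there is no usable rising trend.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_d = sum(d for _, d in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp; no trend to fit
    slope = sum((t - mean_t) * (d - mean_d) for t, d in samples) / var_t
    intercept = mean_d - slope * mean_t
    last_t = samples[-1][0]
    current = slope * last_t + intercept  # fitted demand at the last sample
    if current >= capacity_tps:
        return 0.0
    if slope <= 0:
        return None  # flat or falling demand never reaches capacity
    return (capacity_tps - current) / slope
```

For example, with demand samples of 1000, 1100 and 1200 tokens per second taken 10 seconds apart and an assumed capacity of 1500 tokens per second, the fitted trend reaches capacity 30 seconds after the last sample. A saturation-based check would report nothing until KV cache utilization or queue depth had already risen.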


Throughput Analyzer #834
