Throughput Analyzer #834

@biranofer

Description

A Throughput Analyzer could make scaling more proactive. Based on the available metrics, it would assess the maximum throughput of a vLLM replica instance, in tokens per second, for the current request characteristics (number of prompt tokens, number of output tokens, and request rate in requests per second). The existing Saturation Analyzer relies on current KV cache utilization and queue depth, so it can only detect overload after it has occurred. The Throughput Analyzer, by contrast, can detect when the current demand trajectory will exhaust capacity, before saturation occurs.
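To make the "demand trajectory will exhaust capacity" idea concrete, here is a minimal sketch, not a proposed implementation. It assumes the maximum throughput of the replica is already known, fits a straight line through recent token demand, and reports how long until that line reaches capacity. The `Sample` schema, the function names, and the choice to add prompt and output tokens into a single demand figure are all assumptions made for illustration; prefill and decode tokens do not cost the same in practice, and the real algorithm is what this issue asks to develop.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Sample:
    """One observation window for a single replica (hypothetical schema)."""
    t: float               # end of the window, in seconds
    requests_per_s: float  # request arrival rate in the window
    prompt_tokens: float   # mean prompt tokens per request
    output_tokens: float   # mean output tokens per request


def demand_tokens_per_s(s: Sample) -> float:
    """Token demand implied by the request mix.

    Assumption: prompt and output tokens are simply added together.
    """
    return s.requests_per_s * (s.prompt_tokens + s.output_tokens)


def seconds_until_exhaustion(
    samples: Sequence[Sample], max_tokens_per_s: float
) -> Optional[float]:
    """Extrapolate demand linearly and estimate time until capacity is reached.

    Returns 0.0 if the latest demand already meets or exceeds capacity,
    None if there are too few samples or demand is not rising, and otherwise
    the number of seconds after the last sample at which the fitted trend,
    starting from the latest observed demand, reaches max_tokens_per_s.
    """
    if len(samples) < 2:
        return None
    ts = [s.t for s in samples]
    ds = [demand_tokens_per_s(s) for s in samples]
    if ds[-1] >= max_tokens_per_s:
        return 0.0

    # Ordinary least-squares slope of demand over time.
    n = len(ts)
    t_mean = sum(ts) / n
    d_mean = sum(ds) / n
    var_t = sum((t - t_mean) ** 2 for t in ts)
    if var_t == 0:
        return None
    slope = sum((t - t_mean) * (d - d_mean) for t, d in zip(ts, ds)) / var_t
    if slope <= 0:
        return None
    return (max_tokens_per_s - ds[-1]) / slope


if __name__ == "__main__":
    window = [
        Sample(t=0, requests_per_s=1, prompt_tokens=100, output_tokens=100),
        Sample(t=10, requests_per_s=2, prompt_tokens=100, output_tokens=100),
        Sample(t=20, requests_per_s=3, prompt_tokens=100, output_tokens=100),
    ]
    print(seconds_until_exhaustion(window, max_tokens_per_s=1000))
```

A saturation-style check on the same data would report nothing until demand reaches capacity, whereas this kind of projection gives a lead time that a scaler could act on.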

To achieve this, we need to develop an algorithm that assesses throughput as a function of the available metrics collected over recent time intervals. The algorithm needs to be validated through numerous experiments on a real vLLM system until the throughput assessment reaches reasonable accuracy. It then needs to be implemented in the new Throughput Analyzer and tested in e2e benchmarks.
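One possible way to quantify "reasonable accuracy" during validation is to compare estimated throughput against throughput measured in each experiment. The sketch below uses mean absolute percentage error; the metric choice and the function name are assumptions for illustration only, and the issue does not prescribe an accuracy measure or a target value.

```python
from typing import Sequence


def mean_absolute_percentage_error(
    measured: Sequence[float], estimated: Sequence[float]
) -> float:
    """Average of |estimated - measured| / measured, as a percentage.

    Both sequences hold tokens-per-second values, one entry per experiment.
    """
    if not measured or len(measured) != len(estimated):
        raise ValueError("need two non-empty sequences of equal length")
    if any(m <= 0 for m in measured):
        raise ValueError("measured throughput must be positive")
    total = sum(abs(e - m) / m for m, e in zip(measured, estimated))
    return 100.0 * total / len(measured)


if __name__ == "__main__":
    # Illustrative numbers only, not results from a real vLLM run.
    measured_tps = [1000.0, 2000.0]
    estimated_tps = [900.0, 2200.0]
    print(mean_absolute_percentage_error(measured_tps, estimated_tps))
```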

Metadata

Labels

triage/accepted: Indicates an issue or PR is ready to be actively worked on.
