Better proactiveness can potentially be achieved with a Throughput Analyzer that, based on the available metrics, assesses the maximum throughput of a vLLM replica instance, in tokens per second, for the current request characteristics (number of prompt tokens, number of output tokens, and request rate in requests per second). The existing Saturation Analyzer relies on current KV cache utilization and queue depth, so it can only detect overload after it has occurred. The Throughput Analyzer, by contrast, can detect when the current demand trajectory will exhaust capacity, before saturation occurs.
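As a rough illustration of the proactive check, the sketch below compares the token demand implied by recent request characteristics against an assumed maximum throughput, and extrapolates the demand trend linearly. All names here (`WorkloadSample`, `demand_tokens_per_sec`, `seconds_until_exhaustion`) are hypothetical, and both the demand formula and the linear extrapolation are simplifying assumptions for illustration, not the algorithm the Throughput Analyzer will use.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class WorkloadSample:
    """Request characteristics over one recent interval (hypothetical shape)."""
    timestamp_s: float        # end of the interval, in seconds
    requests_per_sec: float   # observed request arrival rate
    avg_prompt_tokens: float  # mean prompt tokens per request
    avg_output_tokens: float  # mean output tokens per request


def demand_tokens_per_sec(sample: WorkloadSample) -> float:
    """Assumed demand model: request rate times total tokens per request."""
    return sample.requests_per_sec * (
        sample.avg_prompt_tokens + sample.avg_output_tokens
    )


def seconds_until_exhaustion(
    samples: Sequence[WorkloadSample], max_tokens_per_sec: float
) -> Optional[float]:
    """Estimate the time until demand reaches the assessed maximum throughput.

    Returns 0.0 if demand already meets or exceeds capacity, and None if
    there is too little data or demand is not growing.
    """
    if not samples:
        return None
    latest = demand_tokens_per_sec(samples[-1])
    if latest >= max_tokens_per_sec:
        return 0.0
    if len(samples) < 2:
        return None
    elapsed = samples[-1].timestamp_s - samples[0].timestamp_s
    if elapsed <= 0:
        return None
    # Linear trend between the oldest and the newest sample (an assumption).
    slope = (latest - demand_tokens_per_sec(samples[0])) / elapsed
    if slope <= 0:
        return None
    return (max_tokens_per_sec - latest) / slope
```

A result below some chosen lead time would indicate that capacity is about to be exhausted while KV cache utilization and queue depth may still look healthy.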
To achieve this, we need to develop an algorithm that assesses throughput as a function of the available metrics collected over recent time intervals. The algorithm needs to be validated through numerous experiments on a real vLLM system until the throughput assessment reaches reasonable accuracy. It then needs to be implemented in the new Throughput Analyzer and tested in e2e benchmarks.
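One possible way to quantify accuracy during validation is the mean relative error between the assessed throughput and the throughput measured in each experiment. The sketch below assumes this metric; the function name is hypothetical, and the choice of metric and any acceptance threshold remain open.

```python
from typing import Sequence


def mean_absolute_percentage_error(
    estimated: Sequence[float], measured: Sequence[float]
) -> float:
    """Mean of |estimated - measured| / measured across experiments.

    Both sequences hold throughput in tokens per second, one entry per
    experiment, in the same order.
    """
    if not estimated or len(estimated) != len(measured):
        raise ValueError("need two non-empty sequences of equal length")
    if any(m <= 0 for m in measured):
        raise ValueError("measured throughput must be positive")
    total = sum(abs(e - m) / m for e, m in zip(estimated, measured))
    return total / len(measured)
```

For example, estimates of 900 and 1100 tokens per second against measurements of 1000 in both experiments give a mean relative error of 10%.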