|
1 | 1 |
|
2 | 2 | # Nightly benchmark |
3 | 3 |
|
4 | | -The main goal of this benchmarking is two-fold: |
5 | | -- Performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and tgi) leads in performance in what workload. |
6 | | -- Reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions in [reproduce.md](). |
7 | | - |
8 | | - |
9 | | -## Docker images |
10 | | - |
11 | | -We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images: |
12 | | -- vllm/vllm-openai:v0.5.0.post1 |
13 | | -- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 |
14 | | -- openmmlab/lmdeploy:v0.5.0 |
15 | | -- ghcr.io/huggingface/text-generation-inference:2.1 |
16 | | - |
17 | | -<!-- Please check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/nightly-pipeline.yaml">nightly-pipeline.yaml</a> artifact for more details on how we deploy the docker images. --> |
18 | | - |
19 | | - |
20 | | -## Hardware |
21 | | - |
22 | | -One AWS node with 8x NVIDIA A100 GPUs. |
23 | | - |
24 | | - |
25 | | -## Workload description |
26 | | - |
27 | | -We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload: |
28 | | - |
29 | | -- Input length: randomly sample 500 prompts from ShareGPT dataset (with fixed random seed). |
30 | | -- Output length: the corresponding output length of these 500 prompts. |
31 | | -- Models: llama-3 8B, llama-3 70B, mixtral 8x7B. |
32 | | -- Average QPS (query per second): 4 for the small model (llama-3 8B) and 2 for other two models. For each QPS, the arrival time of each query is determined using a random Poisson process (with fixed random seed). |
33 | | -- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better). |
34 | | - |
35 | | -<!-- Check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/tests/nightly-tests.json">nightly-tests.json</a> artifact for more details. --> |
36 | | - |
37 | | -## Plots |
38 | | - |
39 | | -In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. Value 0 means that the corresponding benchmark crashed. |
40 | | - |
41 | | -<img src="artifact://nightly_results.png" alt="Benchmarking results" height=250 > |
42 | | - |
43 | | -## Results |
44 | | - |
45 | | -{nightly_results_benchmarking_table} |
| 4 | +This benchmark aims to: |
| 5 | +- Provide performance clarity: show which engine (vLLM, TensorRT-LLM, LMDeploy or SGLang) leads in performance under which workload.
| 6 | +- Be reproducible: anyone can run the exact same set of benchmarking commands inside the exact same Docker images by following the reproduction instructions.
| 7 | + |
| 8 | +Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html) (scroll to the end of the post).
| 9 | + |
| 10 | +Latest reproduction guide: [GitHub issue link](https://github.com/vllm-project/vllm/issues/8176)
| 11 | + |
| 12 | + |
| 13 | +## Setup |
| 14 | + |
| 15 | +- Docker images: |
| 16 | + - vLLM: `vllm/vllm-openai:v0.6.2` |
| 17 | + - SGLang: `lmsysorg/sglang:v0.3.2-cu121` |
| 18 | + - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12` |
| 19 | + - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3` |
| 20 | +    - *NOTE: we use r24.07 because the current implementation only works with this version. We plan to bump it up.*
| 21 | + - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark. |
| 22 | +- Hardware |
| 23 | + - 8x Nvidia A100 GPUs |
| 24 | +- Workload: |
| 25 | + - Dataset |
| 26 | + - ShareGPT dataset |
| 27 | +    - Prefill-heavy dataset (on average 462 input tokens, 16 output tokens)
| 28 | +    - Decode-heavy dataset (on average 462 input tokens, 256 output tokens)
| 29 | + - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use. |
| 30 | +  - Models: Llama 3 8B and Llama 3 70B.
| 31 | +    - We do not use Llama 3.1 because it is incompatible with TRT-LLM r24.07 ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
| 32 | +  - Average QPS (queries per second): 2, 4, 8, 16, 32, and inf.
| 33 | +    - Queries are randomly sampled and their arrival times are drawn from a Poisson process, all with a fixed random seed (see the sketch after this list).
| 34 | +  - Evaluation metrics: Throughput (the higher the better), TTFT (time to first token, the lower the better), ITL (inter-token latency, the lower the better). A sketch of how these metrics can be computed from per-token timestamps follows the known-issues list below.
| 35 | + |
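The following is a minimal, illustrative sketch (not the benchmark's actual code) of how a fixed-seed Poisson arrival schedule at a given average QPS can be generated; the function name, QPS value, and seed below are assumptions made for this example, and the real configuration lives in `nightly-tests.json`.

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Cumulative send times (seconds) for a Poisson arrival process at rate `qps`.

    Inter-arrival gaps of a Poisson process are i.i.d. exponential with mean 1/qps;
    fixing the RNG seed keeps the schedule identical across runs and engines.
    An infinite QPS corresponds to issuing every request at time zero.
    """
    if np.isinf(qps):
        return np.zeros(num_requests)
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

# Example: schedule 500 sampled prompts at an average of 4 QPS.
arrivals = poisson_arrival_times(num_requests=500, qps=4.0, seed=0)
print(arrivals[:5])  # first few send offsets, in seconds from the start of the run
```

Because the seed is fixed, every engine is tested against exactly the same request schedule, which is what makes the runs comparable.
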
| 36 | +## Known issues
| 37 | + |
| 38 | +- TRT-LLM crashes with Llama 3.1 8B ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
| 39 | +- TGI does not support the `ignore-eos` flag.
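
As a companion to the metric definitions in the workload section, here is a hedged sketch of how Throughput, TTFT, and ITL could be aggregated from per-request token timestamps collected by a streaming benchmark client; the `RequestTrace` schema and field names are assumptions made for this example, not the benchmark's actual data format.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RequestTrace:
    """Wall-clock timestamps (seconds) for one streamed request (illustrative schema)."""
    send_time: float          # when the request was issued
    token_times: list[float]  # arrival time of each output token, first token included

def summarize(traces: list[RequestTrace]) -> dict[str, float]:
    """Aggregate throughput, mean TTFT, and mean ITL over one benchmark run."""
    ttfts: list[float] = []
    itls: list[float] = []
    total_tokens = 0
    for t in traces:
        ttfts.append(t.token_times[0] - t.send_time)  # time to first token
        # Inter-token latency: gap between consecutive output tokens of the same request.
        itls.extend(b - a for a, b in zip(t.token_times, t.token_times[1:]))
        total_tokens += len(t.token_times)
    duration = max(t.token_times[-1] for t in traces) - min(t.send_time for t in traces)
    return {
        "output_throughput_tokens_per_s": total_tokens / duration,
        "mean_ttft_s": mean(ttfts),
        "mean_itl_s": mean(itls),
    }
```

Lower mean TTFT and ITL and higher throughput are better, matching the directions stated in the workload description.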