This repository was archived by the owner on Sep 4, 2025. It is now read-only.

Commit 7e5d11b

vaibhavjainwiz authored and Xaenalt committed
Merge pull request #134 from vaibhavjainwiz/sync_vllm
Sync vllm with upstream/v0.5.5 to odh/main for 2.13
2 parents b0e81ce + fcd968c commit 7e5d11b

File tree: 533 files changed, +37500 / -8075 lines changed


.buildkite/lm-eval-harness/configs/DeepSeek-V2-Lite-Chat.yaml

Lines changed: 1 addition & 0 deletions
@@ -9,3 +9,4 @@ tasks:
     value: 0.664
 limit: 1000
 num_fewshot: 5
+trust_remote_code: True

.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@ tasks:
 - name: "gsm8k"
   metrics:
   - name: "exact_match,strict-match"
-    value: 0.409
+    value: 0.419
   - name: "exact_match,flexible-extract"
-    value: 0.406
+    value: 0.416
 limit: 1000
 num_fewshot: 5
@@ -1,11 +1,11 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nvidia/Minitron-4B-Base -b auto -l 1000 -f 5 -t 1
-model_name: "nvidia/Minitron-4B-Base"
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
+model_name: "mgoin/Minitron-4B-Base-FP8"
 tasks:
 - name: "gsm8k"
   metrics:
   - name: "exact_match,strict-match"
-    value: 0.252
+    value: 0.233
   - name: "exact_match,flexible-extract"
-    value: 0.252
+    value: 0.236
 limit: 1000
 num_fewshot: 5

.buildkite/lm-eval-harness/configs/models-small.txt

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
-Minitron-4B-Base.yaml
+Minitron-4B-Base-FP8.yaml
 Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
 Qwen2-1.5B-Instruct-FP8W8.yaml
 Meta-Llama-3-8B-QQQ.yaml

.buildkite/lm-eval-harness/test_lm_eval_correctness.py

Lines changed: 5 additions & 2 deletions
@@ -14,7 +14,7 @@
 import numpy
 import yaml
 
-RTOL = 0.02
+RTOL = 0.05
 TEST_DATA_FILE = os.environ.get(
     "LM_EVAL_TEST_DATA_FILE",
     ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")
@@ -23,9 +23,12 @@
 
 
 def launch_lm_eval(eval_config):
+    trust_remote_code = eval_config.get('trust_remote_code', False)
+
     model_args = f"pretrained={eval_config['model_name']}," \
                  f"tensor_parallel_size={TP_SIZE}," \
-                 f"add_bos_token=true"
+                 f"add_bos_token=true," \
+                 f"trust_remote_code={trust_remote_code}"
 
     results = lm_eval.simple_evaluate(
         model="vllm",

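As a minimal sketch (not code from this commit), the snippet below shows how the new trust_remote_code key is intended to flow from an eval config YAML into the lm-eval model_args string. The example config contents and the fixed TP_SIZE value are illustrative assumptions, not values taken from the repository.

import yaml

# Hypothetical eval config, mirroring the trust_remote_code key added to
# DeepSeek-V2-Lite-Chat.yaml in this commit (the model_name is an assumption).
example_config = yaml.safe_load("""
model_name: "deepseek-ai/DeepSeek-V2-Lite-Chat"
trust_remote_code: True
""")

TP_SIZE = 1  # stand-in for the TP_SIZE global used by the test script


def build_model_args(eval_config):
    # Same pattern as the updated launch_lm_eval(): default to False
    # when the config does not set trust_remote_code.
    trust_remote_code = eval_config.get('trust_remote_code', False)
    return (f"pretrained={eval_config['model_name']},"
            f"tensor_parallel_size={TP_SIZE},"
            f"add_bos_token=true,"
            f"trust_remote_code={trust_remote_code}")


print(build_model_args(example_config))
# pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=1,add_bos_token=true,trust_remote_code=True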
.buildkite/nightly-benchmarks/README.md

Lines changed: 5 additions & 4 deletions
@@ -34,17 +34,18 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performan
 
 Performance benchmark will be triggered when:
 - A PR being merged into vllm.
-- Every commit for those PRs with `perf-benchmarks` label.
+- Every commit for those PRs with `perf-benchmarks` label AND `ready` label.
 
 Nightly benchmark will be triggered when:
-- Every commit for those PRs with `nightly-benchmarks` label.
+- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.
 
 
 
 
 ## Performance benchmark details
 
-See [descriptions.md](tests/descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
+
+See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
 
 
 #### Latency test
@@ -68,7 +69,7 @@ Here is an example of one test inside `latency-tests.json`:
 
 In this example:
 - The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
-- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-benchmarks-suite.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
+- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
 
 Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.
 

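To make the underscore-to-dash convention from the README concrete, here is a small illustrative Python sketch (not part of the commit). The test entry is a guess at the shape of an entry in tests/latency-tests.json, assembled from the benchmark_latency.py flags quoted in the README bullet above.

# Hypothetical latency-tests.json entry; keys use underscores, as the README asks.
example_test = {
    "test_name": "latency_llama8B_tp1",  # must start with "latency_"
    "parameters": {
        "model": "meta-llama/Meta-Llama-3-8B",
        "tensor_parallel_size": 1,
        "load_format": "dummy",
        "num_iters_warmup": 5,
        "num_iters": 15,
    },
}


def to_cli_args(parameters):
    # Underscores in the JSON keys become dashes on the command line, which is
    # the conversion run-performance-benchmarks.sh performs before invoking
    # benchmark_latency.py.
    return " ".join(f"--{key.replace('_', '-')} {value}"
                    for key, value in parameters.items())


print(to_cli_args(example_test["parameters"]))
# --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15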
.buildkite/nightly-benchmarks/benchmark-pipeline.yaml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ steps:
     containers:
       - image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
         command:
-          - bash .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
+          - bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
     resources:
       limits:
         nvidia.com/gpu: 8

.buildkite/nightly-benchmarks/tests/descriptions.md renamed to .buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md

Lines changed: 7 additions & 12 deletions
@@ -1,47 +1,42 @@
 
 ## Latency tests
 
-This test suite aims to test vllm's end-to-end latency under a controlled setup.
-
 - Input length: 32 tokens.
 - Output length: 128 tokens.
 - Batch size: fixed (8).
-- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
+- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: end-to-end latency (mean, median, p99).
 
-### Latency benchmarking results
 
 {latency_tests_markdown_table}
 
-## Throughput tests
 
-This test suite aims to test vllm's throughput.
+## Throughput tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
 - Output length: the corresponding output length of these 200 prompts.
 - Batch size: dynamically determined by vllm to achieve maximum throughput.
-- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
+- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: throughput.
 
-### Throughput benchmarking results
 
 {throughput_tests_markdown_table}
 
-## Serving tests
 
-This test suite aims to test vllm's real serving metrics.
+## Serving tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
 - Output length: the corresponding output length of these 200 prompts.
 - Batch size: dynamically determined by vllm and the arrival pattern of the requests.
 - **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
-- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
+- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
+- We also added a speculative decoding test for llama-3 70B, under QPS 2
 - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
 
-### Serving benchmarking results
 
 {serving_tests_markdown_table}
 
+
 ## json version of the benchmarking tables
 
 This section contains the data of the markdown tables above in JSON format.

.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py

Lines changed: 2 additions & 2 deletions
@@ -174,8 +174,8 @@ def results_to_json(latency, throughput, serving):
     # document the result
     with open(results_folder / "benchmark_results.md", "w") as f:
 
-        results = read_markdown(
-            "../.buildkite/nightly-benchmarks/tests/descriptions.md")
+        results = read_markdown("../.buildkite/nightly-benchmarks/" +
+                                "performance-benchmarks-descriptions.md")
         results = results.format(
             latency_tests_markdown_table=latency_md_table,
             throughput_tests_markdown_table=throughput_md_table,

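For context, the {latency_tests_markdown_table}-style placeholders in the renamed descriptions file are filled in via str.format by the convert script shown above. The following is an illustrative Python sketch with made-up table content, not code from the commit.

# Simplified view of how convert-results-json-to-markdown.py combines the
# descriptions template with the generated markdown tables.
template = (
    "## Latency tests\n"
    "\n"
    "{latency_tests_markdown_table}\n"
)

# Placeholder table; the real one is built from the benchmark result JSON files.
latency_md_table = (
    "| Test name | Mean latency (ms) |\n"
    "|-----------|-------------------|\n"
    "| latency_llama8B_tp1 | ... |"
)

print(template.format(latency_tests_markdown_table=latency_md_table))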
.buildkite/nightly-benchmarks/run-benchmarks-suite.sh renamed to .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh

Lines changed: 30 additions & 29 deletions
@@ -37,9 +37,9 @@ check_hf_token() {
 ensure_sharegpt_downloaded() {
   local FILE=ShareGPT_V3_unfiltered_cleaned_split.json
   if [ ! -f "$FILE" ]; then
-    wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/$FILE
+    wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/$FILE
   else
-    echo "$FILE already exists."
+    echo "$FILE already exists."
   fi
 }
 
@@ -68,35 +68,38 @@ wait_for_server() {
   done' && return 0 || return 1
 }
 
-kill_gpu_processes() {
-  # kill all processes on GPU.
-  pids=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)
-  if [ -z "$pids" ]; then
-    echo "No GPU processes found."
+kill_processes_launched_by_current_bash() {
+  # Kill all python processes launched from current bash script
+  current_shell_pid=$$
+  processes=$(ps -eo pid,ppid,command | awk -v ppid="$current_shell_pid" -v proc="$1" '$2 == ppid && $3 ~ proc {print $1}')
+  if [ -n "$processes" ]; then
+    echo "Killing the following processes matching '$1':"
+    echo "$processes"
+    echo "$processes" | xargs kill -9
   else
-    for pid in $pids; do
-      kill -9 "$pid"
-      echo "Killed process with PID: $pid"
-    done
-
-    echo "All GPU processes have been killed."
+    echo "No processes found matching '$1'."
   fi
+}
+
+kill_gpu_processes() {
 
-  # waiting for GPU processes to be fully killed
-  # loop while nvidia-smi returns any processes
-  while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
+  ps -aux
+  lsof -t -i:8000 | xargs -r kill -9
+  pkill -f pt_main_thread
+  # this line doesn't work now
+  # ps aux | grep python | grep openai | awk '{print $2}' | xargs -r kill -9
+  pkill -f python3
+  pkill -f /usr/bin/python3
+
+
+  # wait until GPU memory usage smaller than 1GB
+  while [ $(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1) -ge 1000 ]; do
     sleep 1
-    echo "Waiting for GPU processes to be killed"
   done
 
   # remove vllm config file
   rm -rf ~/.config/vllm
 
-  # Print the GPU memory usage
-  # so that we know if all GPU processes are killed.
-  gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
-  # The memory usage should be 0 MB.
-  echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
 }
 
@@ -114,7 +117,7 @@ upload_to_buildkite() {
   fi
 
   # Use the determined command to annotate and upload artifacts
-  $BUILDKITE_AGENT_COMMAND annotate --style "info" --context "$BUILDKITE_LABEL-benchmark-results" < $RESULTS_FOLDER/benchmark_results.md
+  $BUILDKITE_AGENT_COMMAND annotate --style "info" --context "$BUILDKITE_LABEL-benchmark-results" <$RESULTS_FOLDER/benchmark_results.md
   $BUILDKITE_AGENT_COMMAND artifact upload "$RESULTS_FOLDER/*"
 }
 
@@ -166,7 +169,7 @@ run_latency_tests() {
       latency_command: $latency,
       gpu_type: $gpu
     }')
-    echo "$jq_output" > "$RESULTS_FOLDER/$test_name.commands"
+    echo "$jq_output" >"$RESULTS_FOLDER/$test_name.commands"
 
     # run the benchmark
     eval "$latency_command"
@@ -176,7 +179,6 @@
   done
 }
 
-
 run_throughput_tests() {
   # run throughput tests using `benchmark_throughput.py`
   # $1: a json file specifying throughput test cases
@@ -224,7 +226,7 @@
       throughput_command: $command,
       gpu_type: $gpu
     }')
-    echo "$jq_output" > "$RESULTS_FOLDER/$test_name.commands"
+    echo "$jq_output" >"$RESULTS_FOLDER/$test_name.commands"
 
     # run the benchmark
     eval "$throughput_command"
@@ -256,7 +258,6 @@
       continue
     fi
 
-
     # get client and server arguments
     server_params=$(echo "$params" | jq -r '.server_parameters')
     client_params=$(echo "$params" | jq -r '.client_parameters')
@@ -334,7 +335,7 @@
       client_command: $client,
      gpu_type: $gpu
    }')
-    echo "$jq_output" > "$RESULTS_FOLDER/${new_test_name}.commands"
+    echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"
 
   done
 
@@ -351,6 +352,7 @@ main() {
   # dependencies
   (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
   (which jq) || (apt-get update && apt-get -y install jq)
+  (which lsof) || (apt-get update && apt-get install -y lsof)
 
   # get the current IP address, required by benchmark_serving.py
   export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
@@ -369,7 +371,6 @@
   run_latency_tests $QUICK_BENCHMARK_ROOT/tests/latency-tests.json
   run_throughput_tests $QUICK_BENCHMARK_ROOT/tests/throughput-tests.json
 
-
  # postprocess benchmarking results
  pip install tabulate pandas
  python3 $QUICK_BENCHMARK_ROOT/scripts/convert-results-json-to-markdown.py
