Conversation

@njhill (Member) commented Oct 15, 2025

Following a similar approach to #23391.

Throughput benchmarks using the same JSON schema as #23224:

```bash
vllm serve Qwen/Qwen3-1.7B --uvicorn-log-level=error --no-enable-prefix-caching
```

```bash
python3 benchmarks/benchmark_serving_structured_output.py --backend vllm --model Qwen/Qwen3-1.7B --structured-output-ratio $ratio --request-rate 200 --max-concurrency 800 --num-prompts 4000 --json-schema-path ./test3.json --output-len 128
```

Results by fraction of structured-output requests (pct struct reqs):

| Test | Executor | 0.0 | 0.2 | 0.8 | 1.0 |
|------|----------|-----|-----|-----|-----|
| main | uniproc | 103.16 | 92.57 | 70.68 | 69.36 |
| This PR | uniproc | 103.19 | 99.67 | 87.90 | 85.28 |
| This PR + `--async-scheduling` | uniproc | 132.72 | 106.08 | 93.59 | 90.34 |
| This PR + `--async-scheduling` | multiproc | 133.31 | 114.67 | 96.08 | 93.42 |

This is a breaking change for the model runner and scheduler interfaces.
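
To make the approach concrete, here is a minimal sketch of the overlap this PR enables; the functions and thread pool below are illustrative stand-ins, not vLLM's actual scheduler/model-runner interfaces. Model execution is submitted asynchronously, the structured-output grammar bitmask is built on the CPU while the forward pass is in flight, and the result is only awaited afterwards.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins only; the real code lives in the scheduler /
# model-runner interfaces touched by this PR.
def execute_model(scheduler_output):
    """Pretend forward pass + sampling (would run async on the GPU)."""
    time.sleep(0.02)
    return {"sampled_token_ids": [[1], [2], [3]]}

def get_grammar_bitmask(scheduler_output):
    """Pretend CPU-side structured-output bitmask construction."""
    time.sleep(0.01)
    return bytearray(b"\xff\x0f")

executor = ThreadPoolExecutor(max_workers=1)
scheduler_output = object()  # placeholder for the real SchedulerOutput

# Kick off model execution without waiting for it ...
exec_future = executor.submit(execute_model, scheduler_output)
# ... build the grammar bitmask on the CPU while the step is in flight ...
grammar_bitmask = get_grammar_bitmask(scheduler_output)
# ... and only block for the execution result once the bitmask is ready.
exec_result = exec_future.result()
print(grammar_bitmask, exec_result)
```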

@mergify (bot) commented Oct 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 17, 2025
@njhill njhill force-pushed the async-sched-struct-output branch from 829ef60 to 8cba549 on October 17, 2025 01:40
@mergify mergify bot removed the needs-rebase label Oct 17, 2025
…tput

Signed-off-by: Nick Hill <[email protected]>

# Conflicts:
#	vllm/v1/engine/core.py
#	vllm/v1/executor/abstract.py
#	vllm/v1/executor/ray_distributed_executor.py

@WoosukKwon (Collaborator) left a comment:

LGTM! Thanks for the effort.

@njhill njhill enabled auto-merge (squash) October 31, 2025 23:14
@njhill njhill merged commit 0cdbe7b into vllm-project:main Nov 1, 2025
54 checks passed
@njhill njhill deleted the async-sched-struct-output branch November 1, 2025 03:25
zhaozuy pushed a commit to zhaozuy/vllm that referenced this pull request Nov 4, 2025

@ys950902 (Contributor) commented Nov 7, 2025

Hi @njhill, I found a performance drop in pipeline-parallelism scenarios after your PR was merged. Do you have any ideas about what might be causing it? Thanks in advance for your great support.

Below is the command used to launch the server:

```bash
VLLM_USE_V1=1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 python3 -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --enforce-eager --port 8000 --host 0.0.0.0 -pp 2 --distributed_executor_backend=mp --trust-remote-code --gpu-memory-util=0.9 --no-enable-prefix-caching --max-num-batched-tokens=8192 --disable-log-requests --max-model-len=8192 --block-size 64 --quantization fp8 --dtype=float16 -tp=2
```

The command used to send the requests:

```bash
python3 -m vllm.entrypoints.cli.main bench serve --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --ready-check-timeout-sec 1 --dataset-name random --random-input-len=1024 --random-output-len=512 --ignore-eos --port=8000 --host 0.0.0.0 --num-prompt 30 --request-rate inf --backend vllm --trust-remote-code
```

Throughput drops from 617.26 tok/s to 384.40 tok/s.

@njhill (Member, Author) commented Nov 7, 2025

> Hi @njhill, I found a performance drop in pipeline-parallelism scenarios after your PR was merged. [...] Throughput drops from 617.26 tok/s to 384.40 tok/s.

Thanks @ys950902. Which commit exactly were you testing? There was a known perf regression from this PR which was subsequently fixed in #28012. Unfortunately, that PR was just reverted due to a compatibility bug, but the re-apply of it in #28319 should be merged to main soon.

It would be great if you could check whether the degraded performance still shows up when that PR is included (if it wasn't already in your test). If so, could you open a new issue with the above details so we can investigate further?

A review thread on the following code:

```python
grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
# Block-wait for execute to return (continues running async on the GPU).
with self.log_error_detail(scheduler_output):
    exec_result = exec_future.result()
```

Contributor:

Why do we block before the batch queue is full? Won't this break the batch-queue behavior?

@weireweire (Contributor) commented Nov 11, 2025:

Could you take a look? In PP mode execution will block here, so no parallelism happens. Even though the intent here is to wait for model execution, the previous sample_tokens task should also be in the queue.

Contributor:

@ys950902 is your PP perf issue solved? Is it also related to the blocking here?

Contributor:

@njhill Could you help answer this question? Thanks!

Contributor:

draft fix: #28286
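
To make the batch-queue concern raised in this thread concrete, here is a minimal sketch; `PP_SIZE`, `run_microbatch`, and the queue below are hypothetical stand-ins, not vLLM's batch-queue implementation. With pipeline parallelism the engine normally keeps up to pipeline-depth batches in flight and only waits once the queue is full; calling `.result()` on the newest future immediately collapses that to a single in-flight batch.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor

PP_SIZE = 2  # hypothetical pipeline depth
pool = ThreadPoolExecutor(max_workers=PP_SIZE)

def run_microbatch(i):
    """Pretend one pipeline-parallel model step."""
    time.sleep(0.05)
    return i

def run(block_immediately: bool, num_steps: int = 8) -> float:
    batch_queue = deque()
    start = time.perf_counter()
    for i in range(num_steps):
        fut = pool.submit(run_microbatch, i)
        if block_immediately:
            # The behavior questioned above: waiting right away means at
            # most one batch is in flight, so pipeline stages sit idle.
            fut.result()
        else:
            # Batch-queue behavior: only wait once PP_SIZE batches are in
            # flight, so consecutive microbatches overlap across stages.
            batch_queue.append(fut)
            if len(batch_queue) == PP_SIZE:
                batch_queue.popleft().result()
    while batch_queue:
        batch_queue.popleft().result()
    return time.perf_counter() - start

print(f"block immediately: {run(True):.2f}s")   # roughly num_steps * step time
print(f"batch queue:       {run(False):.2f}s")  # roughly half, with PP_SIZE=2
```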

wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Nov 26, 2025
Bump vLLM version to v0.11.2

What's broken and changed by vLLM:
1. structured_output is broken by
vllm-project/vllm#26866
2. get_mrope_input_positions is broken by
vllm-project/vllm#28399
3. graph mode is broken by
vllm-project/vllm#25110 we'll upgrade torch to
2.8 to fix the problem later
4. embedding is broken by
vllm-project/vllm#27583
5. `get_attn_backend_cls` and the attention backend are broken by
vllm-project/vllm#28534
6. spec decode is broken by
vllm-project/vllm#28771
7. sp feature is broken by
vllm-project/vllm#27126
8. mtp is broken by vllm-project/vllm#27922
9. lora is broken by vllm-project/vllm#21068
10. execute_model is broken by
vllm-project/vllm#26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by
vllm-project/vllm#28159
12. kv cache is broken by vllm-project/vllm#27753
13. dp is broken by vllm-project/vllm#25110

 
What's broken and changed by ourselves:
1. qwen vl is broken by vllm-project/vllm#28455
We'll remove model files in the future to avoid this kind of error
2. Engine core is broken by
vllm-project/vllm#23691 We'll remove the patch
file in the future.
3. Ascend scheduler is broken by
vllm-project/vllm#28733. We'll remove the Ascend
scheduler later.
4. qwen3-next is broken by
vllm-project/vllm#28083 We'll remove model files
in the future to avoid this kind of error
5. qwen vl is broken by vllm-project/vllm#27764.
We'll remove model files in the future

Known issue:
1. ray doesn't work 
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache + ascend scheduler + deepseek v2 lite is broken.

Co-authored-by: MengqingCao <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Co-authored-by: 22dimensions <[email protected]>
Co-authored-by: shen-shanshan <[email protected]>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: hfadzxy <[email protected]>
Signed-off-by: leo-pony <[email protected]>
Co-authored-by: MengqingCao <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Kurumi5210 pushed a commit to lidenghui1110/vllm-ascend that referenced this pull request Nov 26, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Nov 29, 2025

Labels: frontend, kv-connector, ready, structured-output, suppress-bc-linter, tpu, v1
