Conversation

@njhill (Member) commented Oct 15, 2025

Following a similar approach to #23391.

Throughput benchmarks using the same JSON schema as #23224:

```bash
vllm serve Qwen/Qwen3-1.7B --uvicorn-log-level=error --no-enable-prefix-caching
```

```bash
python3 benchmarks/benchmark_serving_structured_output.py --backend vllm --model Qwen/Qwen3-1.7B --structured-output-ratio $ratio --request-rate 200 --max-concurrency 800 --num-prompts 4000 --json-schema-path ./test3.json --output-len 128
```

Results by fraction of structured-output requests (pct struct reqs):

| Test | Executor | 0.0 | 0.2 | 0.8 | 1.0 |
|------|----------|-----|-----|-----|-----|
| main | uniproc | 103.16 | 92.57 | 70.68 | 69.36 |
| This PR | uniproc | 103.19 | 99.67 | 87.90 | 85.28 |
| This PR + `--async-scheduling` | uniproc | 132.72 | 106.08 | 93.59 | 90.34 |
| This PR + `--async-scheduling` | multiproc | 133.31 | 114.67 | 96.08 | 93.42 |

This is a breaking change for the model runner and scheduler interfaces.
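
To make the approach concrete, here is a minimal sketch of the overlap this PR enables; the functions and thread pool below are illustrative stand-ins, not vLLM's actual scheduler/model-runner interfaces. Model execution is submitted asynchronously, the structured-output grammar bitmask is built on the CPU while the forward pass is in flight, and the result is only awaited afterwards.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins only; the real code lives in the scheduler /
# model-runner interfaces touched by this PR.
def execute_model(scheduler_output):
    """Pretend forward pass + sampling (would run async on the GPU)."""
    time.sleep(0.02)
    return {"sampled_token_ids": [[1], [2], [3]]}

def get_grammar_bitmask(scheduler_output):
    """Pretend CPU-side structured-output bitmask construction."""
    time.sleep(0.01)
    return bytearray(b"\xff\x0f")

executor = ThreadPoolExecutor(max_workers=1)
scheduler_output = object()  # placeholder for the real SchedulerOutput

# Kick off model execution without waiting for it ...
exec_future = executor.submit(execute_model, scheduler_output)
# ... build the grammar bitmask on the CPU while the step is in flight ...
grammar_bitmask = get_grammar_bitmask(scheduler_output)
# ... and only block for the execution result once the bitmask is ready.
exec_result = exec_future.result()
print(grammar_bitmask, exec_result)
```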

@mergify (bot) commented Oct 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 17, 2025
@njhill njhill force-pushed the async-sched-struct-output branch from 829ef60 to 8cba549 on October 17, 2025 01:40
@mergify mergify bot removed the needs-rebase label Oct 17, 2025
…tput

Signed-off-by: Nick Hill <[email protected]>

# Conflicts:
#	vllm/v1/engine/core.py
#	vllm/v1/executor/abstract.py
#	vllm/v1/executor/ray_distributed_executor.py

@WoosukKwon (Collaborator) left a comment:

LGTM! Thanks for the effort.

@njhill njhill enabled auto-merge (squash) October 31, 2025 23:14
@njhill njhill merged commit 0cdbe7b into vllm-project:main Nov 1, 2025
54 checks passed
@njhill njhill deleted the async-sched-struct-output branch November 1, 2025 03:25
zhaozuy pushed a commit to zhaozuy/vllm that referenced this pull request Nov 4, 2025

@ys950902 (Contributor) commented Nov 7, 2025

Hi @njhill, I found a performance drop in pipeline-parallelism scenarios after your PR was merged. Do you have any ideas about what might be causing it? Thanks in advance for your great support.

Below is the command used to launch the server:

```bash
VLLM_USE_V1=1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 python3 -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --enforce-eager --port 8000 --host 0.0.0.0 -pp 2 --distributed_executor_backend=mp --trust-remote-code --gpu-memory-util=0.9 --no-enable-prefix-caching --max-num-batched-tokens=8192 --disable-log-requests --max-model-len=8192 --block-size 64 --quantization fp8 --dtype=float16 -tp=2
```

The command used to send the requests:

```bash
python3 -m vllm.entrypoints.cli.main bench serve --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --ready-check-timeout-sec 1 --dataset-name random --random-input-len=1024 --random-output-len=512 --ignore-eos --port=8000 --host 0.0.0.0 --num-prompt 30 --request-rate inf --backend vllm --trust-remote-code
```

Throughput drops from 617.26 tok/s to 384.40 tok/s.

@njhill (Member, Author) commented Nov 7, 2025

> Hi @njhill, I found a performance drop in pipeline-parallelism scenarios after your PR was merged. [...] Throughput drops from 617.26 tok/s to 384.40 tok/s.

Thanks @ys950902. Which commit exactly were you testing? There was a known perf regression from this PR which was subsequently fixed in #28012. Unfortunately, that PR was just reverted due to a compatibility bug, but the re-apply of it in #28319 should be merged to main soon.

It would be great if you could check whether the degraded performance still shows up when that PR is included (if it wasn't already in your test). If so, could you open a new issue with the above details so we can investigate further?

A review thread on the following code:

```python
grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
# Block-wait for execute to return (continues running async on the GPU).
with self.log_error_detail(scheduler_output):
    exec_result = exec_future.result()
```

Contributor:

Why do we block before the batch queue is full? Won't this break the batch-queue behavior?

@weireweire (Contributor) commented Nov 11, 2025:

Could you take a look? In PP mode execution will block here, so no parallelism happens. Even though the intent here is to wait for model execution, the previous sample_tokens task should also be in the queue.

Contributor:

@ys950902 is your PP perf issue solved? Is it also related to the blocking here?

Contributor:

@njhill Could you help answer this question? Thanks!

Contributor:

draft fix: #28286
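
To make the batch-queue concern raised in this thread concrete, here is a minimal sketch; `PP_SIZE`, `run_microbatch`, and the queue below are hypothetical stand-ins, not vLLM's batch-queue implementation. With pipeline parallelism the engine normally keeps up to pipeline-depth batches in flight and only waits once the queue is full; calling `.result()` on the newest future immediately collapses that to a single in-flight batch.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor

PP_SIZE = 2  # hypothetical pipeline depth
pool = ThreadPoolExecutor(max_workers=PP_SIZE)

def run_microbatch(i):
    """Pretend one pipeline-parallel model step."""
    time.sleep(0.05)
    return i

def run(block_immediately: bool, num_steps: int = 8) -> float:
    batch_queue = deque()
    start = time.perf_counter()
    for i in range(num_steps):
        fut = pool.submit(run_microbatch, i)
        if block_immediately:
            # The behavior questioned above: waiting right away means at
            # most one batch is in flight, so pipeline stages sit idle.
            fut.result()
        else:
            # Batch-queue behavior: only wait once PP_SIZE batches are in
            # flight, so consecutive microbatches overlap across stages.
            batch_queue.append(fut)
            if len(batch_queue) == PP_SIZE:
                batch_queue.popleft().result()
    while batch_queue:
        batch_queue.popleft().result()
    return time.perf_counter() - start

print(f"block immediately: {run(True):.2f}s")   # roughly num_steps * step time
print(f"batch queue:       {run(False):.2f}s")  # roughly half, with PP_SIZE=2
```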

wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Nov 26, 2025
Bump vLLM version to v0.11.2

What's broken and changed by vLLM:
1. structured_output is broken by
vllm-project/vllm#26866
2. get_mrope_input_positions is broken by
vllm-project/vllm#28399
3. graph mode is broken by
vllm-project/vllm#25110 we'll upgrade torch to
2.8 to fix the problem later
4. embedding is broken by
vllm-project/vllm#27583
5. `get_attn_backend_cls` and the attention backend are broken by
vllm-project/vllm#28534
6. spec decode is broken by
vllm-project/vllm#28771
7. sp feature is broken by
vllm-project/vllm#27126
8. mtp is broken by vllm-project/vllm#27922
9. lora is broken by vllm-project/vllm#21068
10. execute_model is broken by
vllm-project/vllm#26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by
vllm-project/vllm#28159
12. kv cache is broken by vllm-project/vllm#27753
13. dp is broken by vllm-project/vllm#25110

 
What's broken and changed by ourselves:
1. qwen vl is broken by vllm-project/vllm#28455
We'll remove model files in the future to avoid this kind of error
2. Engine core is broken by
vllm-project/vllm#23691 We'll remove the patch
file in the future.
3. Ascend scheduler is broken by
vllm-project/vllm#28733. We'll remove the Ascend
scheduler later.
4. qwen3-next is broken by
vllm-project/vllm#28083 We'll remove model files
in the future to avoid this kind of error
5. qwen vl is broken by vllm-project/vllm#27764.
We'll remove model files in the future

Known issue:
1. ray doesn't work 
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache + ascend scheduler + deepseek v2 lite is broken.

Co-authored-by: MengqingCao <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Co-authored-by: 22dimensions <[email protected]>
Co-authored-by: shen-shanshan <[email protected]>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: hfadzxy <[email protected]>
Signed-off-by: leo-pony <[email protected]>
Co-authored-by: MengqingCao <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Kurumi5210 pushed a commit to lidenghui1110/vllm-ascend that referenced this pull request Nov 26, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Nov 29, 2025

Labels: frontend, kv-connector, ready, structured-output, suppress-bc-linter, tpu, v1
