[Core] Async scheduling + structured outputs compatibility #26866
Conversation
WoosukKwon
left a comment
LGTM! Thanks for the effort.
Hi @njhill, I found a performance drop in pipeline-parallelism scenarios after your PR was merged. Do you have any ideas about it? Thanks in advance for your great support. Below is the command to launch the server: The command for sending the request: And the perf drops from 617.26 tok/s to 384.40 tok/s.
Thanks @ys950902. Which commit exactly were you testing? There was a known perf regression from this PR that was subsequently fixed in #28012. Unfortunately, that PR was just reverted due to a compatibility bug, but the re-apply of it, #28319, should be merged to main soon. It would be great if you could check whether the degraded performance still shows up when that PR is included (if it wasn't already in your test). If so, could you open a new issue with the above details so we can investigate further?
grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
# Block-wait for execute to return (continues running async on the GPU).
with self.log_error_detail(scheduler_output):
    exec_result = exec_future.result()
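For context, here is a minimal sketch (not vLLM's actual implementation; `executor`, `model`, and `scheduler` are stand-ins) of the overlap pattern the snippet above relies on: model execution is submitted asynchronously, the grammar bitmask is computed on the CPU while it runs, and only then is the result awaited.

```python
# Minimal sketch of the overlap pattern, not vLLM's actual code.
from concurrent.futures import ThreadPoolExecutor


def schedule_step(executor: ThreadPoolExecutor, model, scheduler, scheduler_output):
    # Kick off (hypothetical) model execution so it overlaps with the CPU work below.
    exec_future = executor.submit(model.execute, scheduler_output)

    # CPU-side grammar-bitmask computation runs while the model executes.
    grammar_output = scheduler.get_grammar_bitmask(scheduler_output)

    # Block-wait for the execution result; the GPU side may keep running
    # asynchronously depending on the backend.
    exec_result = exec_future.result()
    return grammar_output, exec_result
```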
Why do we block before the batch queue is full? Won't this break the batch-queue behavior?
Could you have a look? PP mode will block here, so no parallelism will happen. Even though the intent here is to wait for model execution, the previous sample_tokens task should also be in the queue.
@ys950902 Is your PP perf issue solved? Is it also related to the blocking here?
@njhill Could you help answer this question? Thanks!
draft fix: #28286
Bump vLLM version to v0.11.2

What's broken and changed by vLLM:
1. structured_output is broken by vllm-project/vllm#26866
2. get_mrope_input_positions is broken by vllm-project/vllm#28399
3. graph mode is broken by vllm-project/vllm#25110; we'll upgrade torch to 2.8 to fix the problem later
4. embedding is broken by vllm-project/vllm#27583
5. `get_attn_backend_cls` and the attention backend are broken by vllm-project/vllm#28534
6. spec decode is broken by vllm-project/vllm#28771
7. sp feature is broken by vllm-project/vllm#27126
8. mtp is broken by vllm-project/vllm#27922
9. lora is broken by vllm-project/vllm#21068
10. execute_model is broken by vllm-project/vllm#26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by vllm-project/vllm#28159
12. kv cache is broken by vllm-project/vllm#27753
13. dp is broken by vllm-project/vllm#25110

What's broken and changed by ourselves:
1. qwen vl is broken by vllm-project/vllm#28455. We'll remove model files in the future to avoid this kind of error.
2. Engine core is broken by vllm-project/vllm#23691. We'll remove the patch file in the future.
3. Ascend scheduler is broken by vllm-project/vllm#28733. We'll remove the Ascend scheduler later.
4. qwen3-next is broken by vllm-project/vllm#28083. We'll remove model files in the future to avoid this kind of error.
5. qwen vl is broken by vllm-project/vllm#27764. We'll remove model files in the future.

Known issues:
1. ray doesn't work
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache + ascend scheduler + deepseek v2 lite is broken

- vLLM version: v0.11.2

Co-authored-by: MengqingCao <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Co-authored-by: 22dimensions <[email protected]>
Co-authored-by: shen-shanshan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: hfadzxy <[email protected]>
Signed-off-by: leo-pony <[email protected]>
Following a similar approach to #23391.
Throughput benchmarks using the same JSON schema as #23224:
Used with `--async-scheduling`. This is a breaking change for the model runner and scheduler interfaces.
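As a rough illustration of how the combination might be exercised (the model name and the `guided_json` extra-body field are assumptions here, not taken from this PR), one could start the server with `vllm serve <model> --async-scheduling` and then send a structured-output request:

```python
# Hedged example client: assumes a vLLM OpenAI-compatible server started with
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --async-scheduling
# The guided_json extra-body field is vLLM's structured-output extension
# (assumed here, not defined by this PR).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Return a person as JSON."}],
    extra_body={"guided_json": schema},
)
print(resp.choices[0].message.content)
```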