
Conversation

@Csrayz Csrayz (Contributor) commented Aug 14, 2025

What this PR does / why we need it?

When processing a mix of large and small requests, this change significantly reduces response TTFT. Please refer to vllm-project/vllm#10235, which achieves the same effect by simply limiting the number of concurrent prefills of long requests. This solution can be applied to both the AscendScheduler (V0) and the vLLM scheduler (V1). Tests show that TTFT improves significantly when handling such mixed requests; however, this capability is currently missing when the Ascend Scheduler is enabled.
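To illustrate the idea, here is a minimal Python sketch of the limiting rule: a prompt above a length threshold only receives prefill budget while fewer than a configured number of long prompts are already being prefilled. The function name, request layout, and demo numbers are assumptions for this sketch, not the actual AscendScheduler code.

```python
# Illustrative sketch only: cap the number of concurrent partial prefills of
# long prompts so short requests are not starved. Not the real scheduler code.

def select_prefills(waiting, token_budget,
                    long_prefill_token_threshold, max_long_partial_prefills):
    """Pick requests to prefill this step, limiting concurrent long prompts."""
    scheduled = []
    long_prefills = 0
    for req in waiting:
        prompt_len = req["num_prompt_tokens"]
        is_long = prompt_len > long_prefill_token_threshold
        if is_long and long_prefills >= max_long_partial_prefills:
            continue  # defer this long prompt so shorter ones can be scheduled
        num_new_tokens = min(prompt_len, token_budget)  # partial (chunked) prefill
        if num_new_tokens == 0:
            break  # token budget exhausted
        scheduled.append((req["request_id"], num_new_tokens))
        token_budget -= num_new_tokens
        long_prefills += int(is_long)
    return scheduled


if __name__ == "__main__":
    waiting = [
        {"request_id": "medium-0", "num_prompt_tokens": 10240},
        {"request_id": "medium-1", "num_prompt_tokens": 10240},
        {"request_id": "short-0", "num_prompt_tokens": 50},
        {"request_id": "short-1", "num_prompt_tokens": 50},
    ]
    # With the cap at 1, only one medium prompt is scheduled this step,
    # leaving budget for the short requests instead of a second medium prompt.
    print(select_prefills(waiting, token_budget=22000,
                          long_prefill_token_threshold=2048,
                          max_long_partial_prefills=1))
```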

This benchmark used the Qwen3-8B model, with a context length of 128K, running on a single card.

For the dataset, the sharegpt_clean dataset is used, with its content concatenated and cropped. Small requests of 50 tokens and medium requests of 10240 tokens were constructed (there were also large requests of 102400 tokens, but these were ignored because, under a prefill-first scheduling strategy, max_num_batched_tokens would never be set that large). vLLM was loaded with max_num_batched_tokens=22000. This budget can accommodate two medium requests plus some short requests, reflecting an extreme scenario in which the budget is almost entirely occupied by longer requests.
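As a quick sanity check of that budget claim (a back-of-the-envelope sketch, not part of the benchmark code):

```python
# Rough budget arithmetic for max_num_batched_tokens=22000.
max_num_batched_tokens = 22000
medium_prompt, small_prompt = 10240, 50

leftover = max_num_batched_tokens - 2 * medium_prompt
print(leftover)                  # 1520 tokens remain after two medium prompts
print(leftover // small_prompt)  # roughly 30 small prompts still fit
```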

Next, we mix 990 small requests and 100 medium requests into one load scenario (hereafter referred to as 10%), and similarly generate load scenarios with 5% and 1% medium requests.
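A minimal sketch of how such a mix could be assembled (the 990/100 split comes from the description above; the shuffling and the exact counts for the 5% and 1% scenarios are not specified, so they are not filled in here):

```python
import random

def make_mix(num_small, num_medium, small_tokens=50, medium_tokens=10240, seed=0):
    """Return a shuffled list of (label, prompt_tokens) for one load scenario."""
    reqs = ([("small", small_tokens)] * num_small
            + [("medium", medium_tokens)] * num_medium)
    random.Random(seed).shuffle(reqs)
    return reqs

# The "10%" scenario described above; the 5% and 1% scenarios are built the
# same way with proportionally fewer medium requests.
mix_10_percent = make_mix(num_small=990, num_medium=100)
```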

Performance tests were conducted separately with the vLLM scheduler, the AscendScheduler, and the AscendScheduler with long-prompt concurrency set to 1. The benchmark results are as follows.

[Benchmark results screenshot: PixPin_2025-08-14_15-21-59]

```bash
python benchmarks/benchmark_serving.py \
  --host "xx" \
  --port 80 \
  --model /model/Qwen3-8B/ \
  --dataset-name "custom" \
  --dataset-path ${test_case} \
  --metric-percentiles 80,85,90,95,99 \
  --max-concurrency 40
```

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a mechanism to limit concurrent partial prefills for long prompts in the AscendScheduler, which is a great feature for improving Time To First Token (TTFT) in mixed-load scenarios. The implementation looks solid and correctly follows the logic described. I've found one high-severity issue regarding configuration validation that should be addressed.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions bot added the `documentation` (Improvements or additions to documentation) label Aug 14, 2025
@xueliangyang-oeuler

@wangxiyuan Please take a look when you have time; anything that needs adjusting can be fixed promptly.

@xueliangyang-oeuler left a comment

Nice work.

@frankie-ys left a comment

good!

@frankie-ys left a comment

good job

@wangxiyuan (Collaborator)

Thanks for the PR! Can you rebase to main to make CI pass?

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@codecov

codecov bot commented Aug 25, 2025

Codecov Report

❌ Patch coverage is 87.50000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.65%. Comparing base (0767d51) to head (2fe1204).
⚠️ Report is 112 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| vllm_ascend/core/schedule_config.py | 68.75% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2372      +/-   ##
==========================================
- Coverage   78.49%   72.65%   -5.84%     
==========================================
  Files         132      147      +15     
  Lines       17806    21845    +4039     
==========================================
+ Hits        13976    15871    +1895     
- Misses       3830     5974    +2144     
| Flag | Coverage Δ |
|---|---|
| unittests | 72.65% <87.50%> (-5.84%) ⬇️ |

Flags with carried forward coverage won't be shown.


@Csrayz Csrayz (Contributor, Author) commented Aug 25, 2025

pipeline [multicard e2e test (linux-aarch64-a2-2, v0.10.1.1) (pull_request)] Failing after 93m

Error: (VllmWorker TP0 pid=64935) ERROR 08-25 07:26:59 [multiproc_executor.py:559] [ERROR] 2025-08-25-07:26:58 (PID:64935, Device:0, RankID:-1) ERR02200 DIST call hccl api failed.

Run again?

@Csrayz Csrayz (Contributor, Author) commented Aug 27, 2025

> pipeline [multicard e2e test (linux-aarch64-a2-2, v0.10.1.1) (pull_request)] Failing after 93m
>
> Error: (VllmWorker TP0 pid=64935) ERROR 08-25 07:26:59 [multiproc_executor.py:559] [ERROR] 2025-08-25-07:26:58 (PID:64935, Device:0, RankID:-1) ERR02200 DIST call hccl api failed.
>
> Run again?

Can this pipeline be rerun? Based on the error, the pipeline failure seems unrelated to code changes. @wangxiyuan

@Csrayz Csrayz force-pushed the feat_conprefill branch 2 times, most recently from f1365e9 to 2fe1204 Compare August 29, 2025 04:53
@Csrayz Csrayz (Contributor, Author) commented Aug 29, 2025

This pipeline has various issues, and these non-code-related errors keep causing it to fail. The same code that previously failed the multi-card e2e test now fails because the single-card e2e test cannot start after rerunning the pipeline.

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@wangxiyuan (Collaborator)

CI is back now. You can just rebase to fix the merge conflict.

Signed-off-by: Csrayz <[email protected]>
Modify assert according to code review comments

Signed-off-by: Csrayz <[email protected]>
@Csrayz Csrayz (Contributor, Author) commented Sep 20, 2025

The issues with pipeline failures and conflicts have been resolved.

@Yikun Yikun added the `ready` (read for review) and `ready-for-test` (start test by label for PR) labels Sep 20, 2025
@wangxiyuan wangxiyuan added and removed the `ready-for-test` (start test by label for PR) label Sep 24, 2025
| `enable_pd_transfer` | bool | `False` | Whether to enable pd transfer. When enabled, decode starts only after prefill of all requests is done. This option only takes effect for offline inference. |
| `decode_max_num_seqs` | int | `0` | Whether to change max_num_seqs in the decode phase when pd transfer is enabled. This option only takes effect when enable_pd_transfer is True. |
| `max_long_partial_prefills` | Union[int, float] | `float('inf')` | The maximum number of prompts longer than long_prefill_token_threshold that will be prefilled concurrently. |
| `long_prefill_token_threshold` | Union[int, float] | `float('inf')` | A request is considered long if its prompt is longer than this number of tokens. |
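For illustration, a hedged sketch of how these two options could be passed alongside the Ascend scheduler in offline use; the `additional_config`/`ascend_scheduler_config` layout and the `LLM` keyword arguments are assumptions based on this table, not verified against the code:

```python
from vllm import LLM

# Sketch only: option names come from the table above; the exact config
# plumbing (additional_config -> ascend_scheduler_config) is an assumption.
llm = LLM(
    model="Qwen/Qwen3-8B",
    max_num_batched_tokens=22000,
    additional_config={
        "ascend_scheduler_config": {
            "enabled": True,
            "max_long_partial_prefills": 1,        # at most one long prompt prefilling at a time
            "long_prefill_token_threshold": 2048,  # prompts above this count as long
        },
    },
)
```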
Collaborator left a comment

These two configs are supported by vLLM by default, so we don't need to add them here. See L66.

@wangxiyuan wangxiyuan merged commit 80524f5 into vllm-project:main Sep 24, 2025
77 of 81 checks passed
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
wangxiyuan pushed a commit that referenced this pull request Oct 25, 2025
…3675)

This PR fixes a bug with running multi-modal models with the AscendScheduler. The bug was introduced by PR #2372, which used the same parameter names as vLLM but with different default values.

Currently I fix this bug by changing the default values of these two parameters to align with vLLM.

- vLLM version: v0.11.0rc3
- vLLM main:
vllm-project/vllm@17c540a

Signed-off-by: hw_whx <[email protected]>
Co-authored-by: hw_whx <[email protected]>
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025