Closed
Changes from all commits
40 commits
cf27c04
fix break by vllm commit: Support LoRA with speculative decoding #21068
leo-pony Nov 12, 2025
a86a6c1
[Hybrid] Pass kernel block size to builders #27753
leo-pony Nov 12, 2025
f1f20cc
fix the main-to-main break by:[Bug] Fix env string 0 same to True #28159
leo-pony Nov 12, 2025
2d83516
[Core] Async scheduling + structured outputs compatibility#26866
leo-pony Nov 12, 2025
ffc519f
fix structure output break bduring adapt to llm: Async scheduling + s…
leo-pony Nov 13, 2025
03cb9fb
fix structured outputs compatibility
22dimensions Nov 13, 2025
e1bbbd8
fix mtp breaks in modelrunner and format fix
leo-pony Nov 13, 2025
cb507ae
fix mypy issues
leo-pony Nov 13, 2025
3e1bbe8
model runner execute model support v0.11.0 branch
leo-pony Nov 13, 2025
399e165
fix format issue
leo-pony Nov 13, 2025
2dd522e
update to releases/v0.11.1
22dimensions Nov 13, 2025
5a50e2a
fix break by vllm:[BugFix][VL] Fix FA selection on Qwen2.5-VL #27790
leo-pony Nov 13, 2025
7f1cd59
fix scheduler
22dimensions Nov 14, 2025
8399298
skip ut, nightly, v0.11.0
leo-pony Nov 14, 2025
0c76f8e
Skip 1th e2e full
leo-pony Nov 14, 2025
f1f4161
skip has tested cases
leo-pony Nov 14, 2025
5bad0f3
Fix vllm break:Support LoRA with speculative decoding:#21068
leo-pony Nov 14, 2025
0764b27
remove skip of nightly a2
leo-pony Nov 14, 2025
7502030
fix the deepseek mtp break
leo-pony Nov 15, 2025
eb2927f
Add comments for deepseek torchair mtp break, vllm PR:27922
leo-pony Nov 16, 2025
47cc8de
Add comments fRestore full test cases
leo-pony Nov 16, 2025
11f6d4e
Enable torchair test in single card full test
leo-pony Nov 17, 2025
4197a11
just to trigger ci test
leo-pony Nov 18, 2025
97df642
fix vllm break by: Enable sequence parallelism matching w/o custom op…
leo-pony Nov 18, 2025
58303c8
vllm break of PR:https://github.com/vllm-project/vllm/pull/24794
leo-pony Nov 19, 2025
dabd793
adapt qwen3 next
22dimensions Nov 14, 2025
ccb837b
fix the undefine splitting_ops in test_sp_for_qwen3_moe
leo-pony Nov 19, 2025
9a95847
make light test take effect
leo-pony Nov 19, 2025
f9f9bf2
fix vllm break: Fix backend selection for encoder-only models (#28534)
leo-pony Nov 19, 2025
40905c8
fix vllm break:Refactor CUDA attention backend selection logic#24794
leo-pony Nov 19, 2025
0a81671
fix break of vllm:Rename clashing method names for vLLM model protoco…
leo-pony Nov 19, 2025
940fb90
fix lint error
leo-pony Nov 19, 2025
3cc8278
fix vllm break: Avoid bytecode hook and simplify TorchCompileWrapperW…
leo-pony Nov 20, 2025
9111fc7
skip A3-nightly test
leo-pony Nov 20, 2025
3a07d19
replace VisionPatchEmbed for better performance
shen-shanshan Nov 14, 2025
c987667
fix
shen-shanshan Nov 14, 2025
6f5cfe1
fix vllm break: [Model] Pass mm_features directly into get_mrope_inpu…
leo-pony Nov 20, 2025
c03dfef
fix vllm break:[Misc] Make SchedulerConfig.max_model_len init-only #2…
leo-pony Nov 20, 2025
d977d4f
fix vllm break:[V1] Support MP Executor for multi node distributed in…
leo-pony Nov 21, 2025
fcfb3a0
update version
wangxiyuan Nov 24, 2025
2 changes: 1 addition & 1 deletion .github/workflows/format_pr_body.yaml
@@ -36,7 +36,7 @@ jobs:

- name: Get vLLM version
run: |
-          VLLM_COMMIT=2918c1b49c88c29783c86f78d2c4221cb9622379
+          VLLM_COMMIT=v0.11.2
echo "VLLM_COMMIT=https://github.com/vllm-project/vllm/commit/$VLLM_COMMIT" >> $GITHUB_ENV

- name: Checkout repository
6 changes: 3 additions & 3 deletions .github/workflows/vllm_ascend_test.yaml
@@ -42,7 +42,7 @@ jobs:
lint:
uses: ./.github/workflows/pre-commit.yml
with:
-      vllm: 2918c1b49c88c29783c86f78d2c4221cb9622379
+      vllm: v0.11.2
changes:
runs-on: ubuntu-latest
outputs:
@@ -83,7 +83,7 @@ jobs:
VLLM_USE_MODELSCOPE: True
strategy:
matrix:
-        vllm_version: [2918c1b49c88c29783c86f78d2c4221cb9622379, v0.11.0]
+        vllm_version: [v0.11.2, v0.11.0]
steps:
- name: Install packages
run: |
@@ -138,7 +138,7 @@ jobs:
name: e2e-light
strategy:
matrix:
-        vllm_version: [2918c1b49c88c29783c86f78d2c4221cb9622379, v0.11.0]
+        vllm_version: [v0.11.2, v0.11.0]
# Note (yikun): If CI resource are limited we can split job into two chain jobs
needs: [lint, changes]
# only trigger e2e test after lint passed and the change is e2e related with pull request.
2 changes: 1 addition & 1 deletion .github/workflows/vllm_ascend_test_full.yaml
@@ -69,7 +69,7 @@ jobs:
name: e2e-full
strategy:
matrix:
-        vllm_version: [2918c1b49c88c29783c86f78d2c4221cb9622379, v0.11.0]
+        vllm_version: [v0.11.2, v0.11.0]
needs: [changes]
if: ${{ needs.changes.outputs.e2e_tracker == 'true' }}
uses: ./.github/workflows/_e2e_test.yaml
2 changes: 1 addition & 1 deletion docs/source/community/versioning_policy.md
@@ -43,7 +43,7 @@ The table below is the release compatibility matrix for vLLM Ascend release.
For main branch of vLLM Ascend, we usually make it compatible with the latest vLLM release and a newer commit hash of vLLM. Please note that this table is usually updated. Please check it regularly.
| vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu |
|-------------|--------------|------------------|-------------|--------------------|
-| main | v0.11.0/2918c1b49c88c29783c86f78d2c4221cb9622379 | >= 3.10, < 3.12 | 8.3.RC1 | 2.7.1 / 2.7.1 |
+| main | v0.11.0 or v0.11.2 | >= 3.10, < 3.12 | 8.3.RC1 | 2.7.1 / 2.7.1 |

## Release cadence

29 changes: 18 additions & 11 deletions tests/e2e/multicard/test_offline_inference_distributed.py
@@ -21,6 +21,7 @@
Run `pytest tests/test_offline_inference.py`.
"""
import os
+from typing import List
from unittest.mock import patch

import pytest
@@ -175,17 +176,23 @@ def test_sp_for_qwen3_moe() -> None:
top_k=50,
top_p=0.9)

-    with VllmRunner(snapshot_download("Qwen/Qwen3-30B-A3B"),
-                    dtype="auto",
-                    tensor_parallel_size=2,
-                    distributed_executor_backend="mp",
-                    compilation_config={
-                        "pass_config": {
-                            "enable_sequence_parallelism": True
-                        }
-                    },
-                    enable_expert_parallel=True,
-                    enforce_eager=True) as vllm_model:
+    splitting_ops: List[str] = []
+    with VllmRunner(
+            snapshot_download("Qwen/Qwen3-30B-A3B"),
+            dtype="auto",
+            tensor_parallel_size=4,
+            distributed_executor_backend="mp",
+            compilation_config={
+                "pass_config": {
+                    "enable_sequence_parallelism": True,
+                },
+                # FIXME: before checking whether splitting_ops is empty, first
+                # verify that it is not None (issue introduced by
+                # https://github.com/vllm-project/vllm/pull/27126; since fixed upstream).
+                "splitting_ops": splitting_ops
+            },
+            enable_expert_parallel=True,
+            enforce_eager=True) as vllm_model:
vllm_model.generate(example_prompts, sampling_params)


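The FIXME in the hunk above records an ordering rule: the emptiness check on splitting_ops must be preceded by a None check, since "unset" and "explicitly empty" mean different things. A minimal standalone sketch of that ordering; the helper name and the op string are hypothetical and not code from this PR or from vLLM:

from typing import List, Optional


def requested_empty_splitting_ops(splitting_ops: Optional[List[str]]) -> bool:
    """Return True only when an explicitly empty list was configured."""
    if splitting_ops is None:
        # Unset is not the same as "explicitly empty"; handle it first.
        return False
    return len(splitting_ops) == 0


assert requested_empty_splitting_ops([]) is True
assert requested_empty_splitting_ops(None) is False
assert requested_empty_splitting_ops(["some.attention.op"]) is False  # placeholder op name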
10 changes: 4 additions & 6 deletions tests/e2e/singlecard/test_ascend_scheduler.py
@@ -18,9 +18,8 @@ def test_concurrent_partial_prefill(enforce_eager):
},
},
max_num_seqs=3,
-max_num_batched_tokens=2048,
+max_num_batched_tokens=8192,
enforce_eager=enforce_eager,
-max_model_len=2048,
gpu_memory_utilization=0.7) as vllm_model:
outputs = vllm_model.model.generate(["Hello my name is Robert and I"] *
3)
@@ -38,9 +37,8 @@ def test_prefix_cache_stats_is_recorded(enforce_eager):
},
},
max_num_seqs=3,
-max_num_batched_tokens=2048,
+max_num_batched_tokens=8192,
enforce_eager=enforce_eager,
-max_model_len=2048,
gpu_memory_utilization=0.7) as vllm_model:
# 17 tokens will make sure first 16 tokens are cached in a block
input_tokens = {"prompt_token_ids": [101] * 129}
@@ -51,7 +49,7 @@ def test_prefix_cache_stats_is_recorded(enforce_eager):

@pytest.mark.parametrize("max_tokens",
[4]) # cannot align results when max_tokens > 4
@pytest.mark.parametrize("chunked_prefill_token_size", [16])
@pytest.mark.parametrize("chunked_prefill_token_size", [2048])
def test_chunked_prefill_with_ascend_scheduler(
max_tokens: int, chunked_prefill_token_size: int) -> None:
example_prompts = [
@@ -93,7 +91,7 @@ def test_chunked_prefill_with_ascend_scheduler(

@pytest.mark.parametrize("max_tokens",
[4]) # cannot align results when max_tokens > 4
@pytest.mark.parametrize("chunked_prefill_token_size", [16])
@pytest.mark.parametrize("chunked_prefill_token_size", [2048])
def test_chunked_prefill_with_scheduler_dynamic_batch(
max_tokens: int, chunked_prefill_token_size: int) -> None:
example_prompts = [
32 changes: 8 additions & 24 deletions tests/ut/core/test_scheduler.py
@@ -376,9 +376,7 @@ def test_stop_via_update_from_output(self):
},
num_common_prefix_blocks=0,
finished_req_ids=set(),
-free_encoder_mm_hashes=[],
-structured_output_request_ids={},
-grammar_bitmask=None)
+free_encoder_mm_hashes=[])
model_output = ModelRunnerOutput(
req_ids=[req.request_id for req in requests],
req_id_to_index={
@@ -429,9 +427,7 @@ def test_stop_via_update_from_output(self):
},
num_common_prefix_blocks=0,
finished_req_ids=set(),
-free_encoder_mm_hashes=[],
-structured_output_request_ids={},
-grammar_bitmask=None)
+free_encoder_mm_hashes=[])
model_output = ModelRunnerOutput(
req_ids=[req.request_id for req in requests],
req_id_to_index={
@@ -481,9 +477,7 @@ def test_stop_via_update_from_output(self):
},
num_common_prefix_blocks=0,
finished_req_ids=set(),
-free_encoder_mm_hashes=[],
-structured_output_request_ids={},
-grammar_bitmask=None)
+free_encoder_mm_hashes=[])
model_output = ModelRunnerOutput(
req_ids=[req.request_id for req in requests],
req_id_to_index={
@@ -526,9 +520,7 @@ def test_stop_via_update_from_output(self):
},
num_common_prefix_blocks=0,
finished_req_ids=set(),
-free_encoder_mm_hashes=[],
-structured_output_request_ids={},
-grammar_bitmask=None)
+free_encoder_mm_hashes=[])
model_output = ModelRunnerOutput(
req_ids=[requests[0].request_id],
req_id_to_index={requests[0].request_id: 0},
@@ -1069,9 +1061,7 @@ def test_stop_via_update_from_output(self):
},
num_common_prefix_blocks=0,
finished_req_ids=set(),
-free_encoder_mm_hashes=[],
-structured_output_request_ids={},
-grammar_bitmask=None)
+free_encoder_mm_hashes=[])
model_output = ModelRunnerOutput(
req_ids=[req.request_id for req in requests],
req_id_to_index={
@@ -1122,9 +1112,7 @@ def test_stop_via_update_from_output(self):
},
num_common_prefix_blocks=0,
finished_req_ids=set(),
-free_encoder_mm_hashes=[],
-structured_output_request_ids={},
-grammar_bitmask=None)
+free_encoder_mm_hashes=[])
model_output = ModelRunnerOutput(
req_ids=[req.request_id for req in requests],
req_id_to_index={
@@ -1174,9 +1162,7 @@ def test_stop_via_update_from_output(self):
},
num_common_prefix_blocks=0,
finished_req_ids=set(),
-free_encoder_mm_hashes=[],
-structured_output_request_ids={},
-grammar_bitmask=None)
+free_encoder_mm_hashes=[])
model_output = ModelRunnerOutput(
req_ids=[req.request_id for req in requests],
req_id_to_index={
@@ -1219,9 +1205,7 @@ def test_stop_via_update_from_output(self):
},
num_common_prefix_blocks=0,
finished_req_ids=set(),
-free_encoder_mm_hashes=[],
-structured_output_request_ids={},
-grammar_bitmask=None)
+free_encoder_mm_hashes=[])
model_output = ModelRunnerOutput(
req_ids=[requests[0].request_id],
req_id_to_index={requests[0].request_id: 0},
6 changes: 5 additions & 1 deletion vllm_ascend/attention/attention_v1.py
@@ -63,13 +63,17 @@

# isort: on

+from vllm.attention.backends.registry import (AttentionBackendEnum,
+                                               register_backend)


+@register_backend(AttentionBackendEnum.CUSTOM, "ASCEND")
class AscendAttentionBackend(AttentionBackend):
accept_output_buffer: bool = True

@staticmethod
def get_name() -> str:
return "ASCEND"
return "CUSTOM"

@staticmethod
def get_impl_cls() -> Type["AscendAttentionBackendImpl"]:
55 changes: 36 additions & 19 deletions vllm_ascend/core/scheduler.py
@@ -483,25 +483,42 @@ def skip_cur_request():
num_scheduled_tokens, scheduled_spec_decode_tokens,
req_to_new_blocks)
scheduled_cached_reqs = cached_reqs_data

-        scheduler_output = SchedulerOutput(
-            scheduled_new_reqs=new_reqs_data,
-            scheduled_cached_reqs=scheduled_cached_reqs,
-            num_scheduled_tokens=num_scheduled_tokens,
-            total_num_scheduled_tokens=total_num_scheduled_tokens,
-            scheduled_spec_decode_tokens=scheduled_spec_decode_tokens,
-            scheduled_encoder_inputs=scheduled_encoder_inputs,
-            num_common_prefix_blocks=num_common_prefix_blocks,
-            # finished_req_ids is an existing state in the scheduler,
-            # instead of being newly scheduled in this step.
-            # It contains the request IDs that are finished in between
-            # the previous and the current steps.
-            finished_req_ids=self.finished_req_ids,  # type: ignore
-            free_encoder_mm_hashes=self.encoder_cache_manager.
-            get_freed_mm_hashes(),
-            structured_output_request_ids={},
-            grammar_bitmask=None,
-        )
+        if vllm_version_is("0.11.0"):
+            scheduler_output = SchedulerOutput(
+                scheduled_new_reqs=new_reqs_data,
+                scheduled_cached_reqs=scheduled_cached_reqs,
+                num_scheduled_tokens=num_scheduled_tokens,
+                total_num_scheduled_tokens=total_num_scheduled_tokens,
+                scheduled_spec_decode_tokens=scheduled_spec_decode_tokens,
+                scheduled_encoder_inputs=scheduled_encoder_inputs,
+                num_common_prefix_blocks=num_common_prefix_blocks,
+                # finished_req_ids is an existing state in the scheduler,
+                # instead of being newly scheduled in this step.
+                # It contains the request IDs that are finished in between
+                # the previous and the current steps.
+                finished_req_ids=self.finished_req_ids,  # type: ignore
+                free_encoder_mm_hashes=self.encoder_cache_manager.
+                get_freed_mm_hashes(),
+                structured_output_request_ids={},
+                grammar_bitmask=None,
+            )
+        else:
+            scheduler_output = SchedulerOutput(
+                scheduled_new_reqs=new_reqs_data,
+                scheduled_cached_reqs=scheduled_cached_reqs,
+                num_scheduled_tokens=num_scheduled_tokens,
+                total_num_scheduled_tokens=total_num_scheduled_tokens,
+                scheduled_spec_decode_tokens=scheduled_spec_decode_tokens,
+                scheduled_encoder_inputs=scheduled_encoder_inputs,
+                num_common_prefix_blocks=num_common_prefix_blocks,
+                # finished_req_ids is an existing state in the scheduler,
+                # instead of being newly scheduled in this step.
+                # It contains the request IDs that are finished in between
+                # the previous and the current steps.
+                finished_req_ids=self.finished_req_ids,  # type: ignore
+                free_encoder_mm_hashes=self.encoder_cache_manager.
+                get_freed_mm_hashes(),
+            )
Reviewer comment on lines +486 to +521 (severity: high):

The introduction of the if vllm_version_is("0.11.0"): block has led to significant code duplication for instantiating SchedulerOutput. The two branches are nearly identical, differing only by two arguments (structured_output_request_ids and grammar_bitmask). This duplication makes the code harder to maintain and increases the risk of introducing bugs if changes are not applied to both branches.

        scheduler_output_kwargs = {
            "scheduled_new_reqs": new_reqs_data,
            "scheduled_cached_reqs": scheduled_cached_reqs,
            "num_scheduled_tokens": num_scheduled_tokens,
            "total_num_scheduled_tokens": total_num_scheduled_tokens,
            "scheduled_spec_decode_tokens": scheduled_spec_decode_tokens,
            "scheduled_encoder_inputs": scheduled_encoder_inputs,
            "num_common_prefix_blocks": num_common_prefix_blocks,
            # finished_req_ids is an existing state in the scheduler,
            # instead of being newly scheduled in this step.
            # It contains the request IDs that are finished in between
            # the previous and the current steps.
            "finished_req_ids": self.finished_req_ids,  # type: ignore
            "free_encoder_mm_hashes": self.encoder_cache_manager.
            get_freed_mm_hashes(),
        }
        if vllm_version_is("0.11.0"):
            scheduler_output_kwargs["structured_output_request_ids"] = {}
            scheduler_output_kwargs["grammar_bitmask"] = None

        scheduler_output = SchedulerOutput(**scheduler_output_kwargs)


# NOTE(Kuntai): this function is designed for multiple purposes:
# 1. Plan the KV cache store
64 changes: 41 additions & 23 deletions vllm_ascend/core/scheduler_dynamic_batch.py
@@ -561,29 +561,47 @@ def schedule(self) -> SchedulerOutput:
scheduled_spec_decode_tokens,
req_to_new_blocks,
)
-        scheduled_requests = (scheduled_new_reqs + scheduled_running_reqs +
-                              scheduled_resumed_reqs)
-        structured_output_request_ids, grammar_bitmask = (
-            self.get_grammar_bitmask(scheduled_requests,
-                                     scheduled_spec_decode_tokens))
-        scheduler_output = SchedulerOutput(
-            scheduled_new_reqs=new_reqs_data,
-            scheduled_cached_reqs=cached_reqs_data,
-            num_scheduled_tokens=num_scheduled_tokens,
-            total_num_scheduled_tokens=total_num_scheduled_tokens,
-            scheduled_spec_decode_tokens=scheduled_spec_decode_tokens,
-            scheduled_encoder_inputs=scheduled_encoder_inputs,
-            num_common_prefix_blocks=num_common_prefix_blocks,
-            # finished_req_ids is an existing state in the scheduler,
-            # instead of being newly scheduled in this step.
-            # It contains the request IDs that are finished in between
-            # the previous and the current steps.
-            finished_req_ids=self.finished_req_ids,
-            free_encoder_mm_hashes=self.encoder_cache_manager.
-            get_freed_mm_hashes(),
-            structured_output_request_ids=structured_output_request_ids,
-            grammar_bitmask=grammar_bitmask,
-        )
+        if vllm_version_is("0.11.0"):
+            scheduled_requests = (scheduled_new_reqs + scheduled_running_reqs +
+                                  scheduled_resumed_reqs)
+            structured_output_request_ids, grammar_bitmask = (
+                self.get_grammar_bitmask(scheduled_requests,
+                                         scheduled_spec_decode_tokens))
+            scheduler_output = SchedulerOutput(
+                scheduled_new_reqs=new_reqs_data,
+                scheduled_cached_reqs=cached_reqs_data,
+                num_scheduled_tokens=num_scheduled_tokens,
+                total_num_scheduled_tokens=total_num_scheduled_tokens,
+                scheduled_spec_decode_tokens=scheduled_spec_decode_tokens,
+                scheduled_encoder_inputs=scheduled_encoder_inputs,
+                num_common_prefix_blocks=num_common_prefix_blocks,
+                # finished_req_ids is an existing state in the scheduler,
+                # instead of being newly scheduled in this step.
+                # It contains the request IDs that are finished in between
+                # the previous and the current steps.
+                finished_req_ids=self.finished_req_ids,
+                free_encoder_mm_hashes=self.encoder_cache_manager.
+                get_freed_mm_hashes(),
+                structured_output_request_ids=structured_output_request_ids,
+                grammar_bitmask=grammar_bitmask,
+            )
+        else:
+            scheduler_output = SchedulerOutput(
+                scheduled_new_reqs=new_reqs_data,
+                scheduled_cached_reqs=cached_reqs_data,
+                num_scheduled_tokens=num_scheduled_tokens,
+                total_num_scheduled_tokens=total_num_scheduled_tokens,
+                scheduled_spec_decode_tokens=scheduled_spec_decode_tokens,
+                scheduled_encoder_inputs=scheduled_encoder_inputs,
+                num_common_prefix_blocks=num_common_prefix_blocks,
+                # finished_req_ids is an existing state in the scheduler,
+                # instead of being newly scheduled in this step.
+                # It contains the request IDs that are finished in between
+                # the previous and the current steps.
+                finished_req_ids=self.finished_req_ids,
+                free_encoder_mm_hashes=self.encoder_cache_manager.
+                get_freed_mm_hashes(),
+            )
Reviewer comment on lines +564 to +604 (severity: high):

Similar to another file in this PR, this change introduces significant code duplication for instantiating SchedulerOutput based on a version check. This makes the code harder to read and maintain. A bug fix in one branch might be missed in the other.

        scheduler_output_kwargs = {
            "scheduled_new_reqs": new_reqs_data,
            "scheduled_cached_reqs": cached_reqs_data,
            "num_scheduled_tokens": num_scheduled_tokens,
            "total_num_scheduled_tokens": total_num_scheduled_tokens,
            "scheduled_spec_decode_tokens": scheduled_spec_decode_tokens,
            "scheduled_encoder_inputs": scheduled_encoder_inputs,
            "num_common_prefix_blocks": num_common_prefix_blocks,
            # finished_req_ids is an existing state in the scheduler,
            # instead of being newly scheduled in this step.
            # It contains the request IDs that are finished in between
            # the previous and the current steps.
            "finished_req_ids": self.finished_req_ids,
            "free_encoder_mm_hashes": self.encoder_cache_manager.
            get_freed_mm_hashes(),
        }
        if vllm_version_is("0.11.0"):
            scheduled_requests = (scheduled_new_reqs + scheduled_running_reqs +
                                  scheduled_resumed_reqs)
            structured_output_request_ids, grammar_bitmask = (
                self.get_grammar_bitmask(scheduled_requests,
                                         scheduled_spec_decode_tokens))
            scheduler_output_kwargs["structured_output_request_ids"] = structured_output_request_ids
            scheduler_output_kwargs["grammar_bitmask"] = grammar_bitmask

        scheduler_output = SchedulerOutput(**scheduler_output_kwargs)
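Both reviewer suggestions gate the same two keyword arguments, so the version check could live in one shared helper used by scheduler.py and scheduler_dynamic_batch.py. A minimal sketch, not part of this PR, assuming vllm_version_is is importable from vllm_ascend.utils (the import path is an assumption) and that the extra arguments are only accepted on vLLM v0.11.0:

# Hypothetical shared helper: build the version-dependent SchedulerOutput
# kwargs once so both schedulers construct the output identically.
from typing import Any, Dict, Optional

from vllm_ascend.utils import vllm_version_is  # assumed import path


def structured_output_kwargs(
        structured_output_request_ids: Optional[Dict[str, Any]] = None,
        grammar_bitmask: Any = None) -> Dict[str, Any]:
    """Extra kwargs that SchedulerOutput accepts only on vLLM v0.11.0."""
    if vllm_version_is("0.11.0"):
        return {
            "structured_output_request_ids": structured_output_request_ids or {},
            "grammar_bitmask": grammar_bitmask,
        }
    return {}


# Usage sketch in either scheduler:
#   scheduler_output = SchedulerOutput(
#       scheduled_new_reqs=new_reqs_data,
#       ...,
#       **structured_output_kwargs())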


# NOTE(Kuntai): this function is designed for multiple purposes:
# 1. Plan the KV cache store