
Conversation

@shen-shanshan shen-shanshan (Collaborator) commented Mar 11, 2025

### What this PR does / why we need it?

Add support for V1 Engine.

Please note that this is just the initial version; there may be some places that need to be fixed or optimized in the future, so feel free to leave comments for us.

### Does this PR introduce any user-facing change?

To use the V1 Engine on an NPU device, you need to set the environment variables shown below:

```bash
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```

If you are using vllm for offline inference, you must add a `__main__` guard like:

```python
if __name__ == '__main__':

    llm = vllm.LLM(...)
```

Find more details [here](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing).
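For reference, a minimal offline-inference script with the required guard might look like the sketch below (the model name and sampling parameters are only illustrative, and the environment variables above are assumed to be exported already):

```python
import vllm
from vllm import SamplingParams

if __name__ == '__main__':
    # The __main__ guard is required because worker processes are spawned
    # and re-import this module.
    llm = vllm.LLM(model="Qwen/Qwen2.5-7B-Instruct")
    outputs = llm.generate(["The future of AI is"],
                           SamplingParams(temperature=0, max_tokens=7))
    for output in outputs:
        print(output.outputs[0].text)
```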

### How was this patch tested?

I have tested online serving with `Qwen2.5-7B-Instruct` using this command:

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```

Query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```

The test logs are shown below:

```
INFO 03-11 06:18:03 [__init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 03-11 06:18:03 [__init__.py:32] name=ascend, value=vllm_ascend:register
INFO 03-11 06:18:03 [__init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 03-11 06:18:03 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 03-11 06:18:03 [__init__.py:44] plugin ascend loaded.
INFO 03-11 06:18:03 [__init__.py:247] Platform plugin ascend is activated
INFO 03-11 06:18:06 [core.py:51] Initializing a V1 LLM engine (v0.7.4.dev360+gc91b64f7) with config: model='Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=26240, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}

...

INFO 03-11 06:18:30 [loader.py:429] Loading weights took 3.40 seconds
INFO 03-11 06:18:32 [kv_cache_utils.py:537] GPU KV cache size: 744,048 tokens
INFO 03-11 06:18:32 [kv_cache_utils.py:540] Maximum concurrency for 26,240 tokens per request: 28.36x
npu not support graph capture. current compilation level : CompilationLevel.NO_COMPILATION
INFO 03-11 06:18:32 [core.py:120] init engine (profile, create kv cache, warmup model) took 2.06 seconds

...

INFO 03-11 06:18:37 [api_server.py:958] Starting vLLM API server on http://0.0.0.0:8000
INFO:     127.0.0.1:46928 - "POST /v1/completions HTTP/1.1" 200 OK
```

@wangxiyuan wangxiyuan (Collaborator) left a comment

I notice you copied a lot of code from the vllm CUDA V1 implementation. Please check whether you can just import the code instead of rewriting it.

Another question:
Do we really need a v1 module? How about just moving the file to the right place and renaming it to something like model_runner_v1.py?
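For illustration only, the import-and-subclass approach suggested above could look roughly like the sketch below. The module path, class name, and overridden method are assumptions based on vLLM's V1 code layout, not code from this PR:

```python
# Hypothetical sketch: reuse the upstream V1 runner instead of copying it.
# The import path and overridden method are assumptions, not this PR's code.
from vllm.v1.worker.gpu_model_runner import GPUModelRunner


class NPUModelRunner(GPUModelRunner):
    """Inherit the generic V1 runner logic; override only NPU-specific parts."""

    def capture_model(self) -> None:
        # CUDA-graph capture is not available on NPU, so keep eager execution.
        pass
```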

```python
        return int(physical_device_id)
    else:
        return device_id
# def _device_id_to_physical_device_id(device_id: int) -> int:
```
Collaborator

Do not leave commented-out code: please remove it if it's useless; otherwise add a note here to explain why the code is commented out.

Collaborator Author

> How about just moving the file to the right place and renaming it to something like model_runner_v1.py?

I think it's a good idea.

```python
    def get_device_name(cls, device_id: int = 0) -> str:
        physical_device_id = _device_id_to_physical_device_id(device_id)
        return torch.npu.get_device_name(physical_device_id)
        # physical_device_id = _device_id_to_physical_device_id(device_id)
```
Collaborator

ditto

```python
    def check_and_update_config(cls, vllm_config: VllmConfig) -> None:
        compilation_config = vllm_config.compilation_config
        if compilation_config.level != CompilationLevel.NO_COMPILATION:
            logger.info("[NPU] Forcing NO_COMPILATION compilation level")
```
Collaborator

Print the current compilation_config.level to the log as well, and change this to warning level.
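For clarity, a minimal sketch of what the suggested change could look like (the exact message wording is illustrative):

```python
if compilation_config.level != CompilationLevel.NO_COMPILATION:
    logger.warning(
        "Compilation level %s is not supported on NPU now, forcing "
        "compilation level to NO_COMPILATION", compilation_config.level)
    compilation_config.level = CompilationLevel.NO_COMPILATION
```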

logger.warning("[V1][NPU] Disable prefix caching")
cache_config.enable_prefix_caching = False

assert not vllm_config.speculative_config, (
Collaborator

Speculative decoding works now for 0.7.3, so we don't need this assertion in main, IMO.


```python
import torch

try:
```
Collaborator

No need for a try/except here; a plain `import torch_npu` is fine.
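A sketch of the suggested simplification, assuming the Ascend environment always provides torch_npu so no fallback is needed:

```python
import torch
import torch_npu  # noqa: F401  # imported for its side effect of enabling the NPU device in torch
```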

```python
logger = init_logger(__name__)


class NPUModelRunner(LoRAModelRunnerMixin):
```
Collaborator

Why is this based on LoRAModelRunnerMixin?

@shen-shanshan shen-shanshan (Collaborator Author) commented Mar 11, 2025

In vLLM V1, LoRAModelRunnerMixin is a base class of GPUModelRunner, but I see that TPUModelRunner doesn't extend this base class, so I think NPUModelRunner may not need to extend it either. I will modify this soon.

```python
            vocab_size=model_config.get_vocab_size(),
        )

        self.use_cuda_graph = (self.vllm_config.compilation_config.level
```
Collaborator

remove cuda related code

```python
    def init_device(self):
        if self.device_config.device.type == "npu":
            # # This env var set by Ray causes exceptions with graph building.
            # os.environ.pop("NCCL_ASYNC_ERROR_HANDLING", None)
```
Contributor

please remove the NCCL related comments.

Collaborator Author

OK👌

```python
            logger.warning(
                "Compilation level %s is not supported on NPU now, forcing compilation level to NO_COMPILATION",
                compilation_config.level)
            compilation_config.level = CompilationLevel.NO_COMPILATION
```
Collaborator

Is torch.compile enabled by default in V1? Also, I think the graph feature is still WIP?

```python
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
# Adapted from vllm-project/vllm/vllm/worker/model_runner.py
```
Collaborator

Just make sure whether it should actually be vllm-project/vllm/vllm/v1/worker/gpu_model_runner.py.

Collaborator Author

Yes, I will modify it.

```python
                    self.model_memory_usage / float(2**30))

    def capture_model(self) -> None:
        logger.warning(
```
Collaborator

We can just remove this function, since it's only for CUDA graph capture.

Collaborator Author

ok.


```python
    def compile_or_warm_up_model(self) -> None:
        if not self.model_config.enforce_eager:
            self.model_runner.capture_model()
```
Collaborator

ditto

@shen-shanshan shen-shanshan force-pushed the v1 branch 2 times, most recently from bdccda1 to 44560c7 on March 18, 2025 02:16
@github-actions github-actions bot added documentation Improvements or additions to documentation module:tools labels Mar 18, 2025
shen-shanshan and others added 3 commits March 20, 2025 09:22
Co-authored-by: didongli182 <[email protected]>
Signed-off-by: shen-shanshan <[email protected]>
Co-authored-by: didongli182 <[email protected]>
Signed-off-by: shen-shanshan <[email protected]>
Co-authored-by: didongli182 <[email protected]>
Signed-off-by: shen-shanshan <[email protected]>
@wangxiyuan wangxiyuan merged commit c06af8b into vllm-project:main Mar 20, 2025
16 of 17 checks passed
wangxiyuan pushed a commit that referenced this pull request Mar 26, 2025
### What this PR does / why we need it?

Add support for V1 Engine on v0.7.3.

### Does this PR introduce _any_ user-facing change?

Find more details at
#295.

Plus, due to a bug in `vllm v0.7.3` when using modelscope, you should
use an older version of `modelscope`. Find more details at
vllm-project/vllm#13807.

This can work:

```bash
pip install modelscope==1.21.1
```

### How was this patch tested?

Find more details at
#295.

Signed-off-by: shen-shanshan <[email protected]>
Co-authored-by: didongli182 <[email protected]>
@shen-shanshan shen-shanshan mentioned this pull request Mar 27, 2025
ttanzhiqiang pushed a commit to ttanzhiqiang/vllm-ascend that referenced this pull request Apr 27, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025

Labels: documentation, module:core, module:tools