[V1][Core] Add support for V1 Engine #295
Conversation
wangxiyuan
left a comment
I notice you copied much code from the vllm CUDA v1 implementation. Please check if you can just import the code instead of rewriting it.
Another question:
Do we really need a v1 module? How about just moving the file to the right place and renaming it to something like model_runner_v1.py?
vllm_ascend/platform.py
Outdated
```python
        return int(physical_device_id)
    else:
        return device_id
# def _device_id_to_physical_device_id(device_id: int) -> int:
```
Do not comment out code. Please remove it if it's useless; otherwise, add a note here explaining why the code is commented out.
> How about just move the file to the right place and rename to something like model_runner_v1.py?

I think it's a good idea.
vllm_ascend/platform.py
Outdated
```python
    def get_device_name(cls, device_id: int = 0) -> str:
        physical_device_id = _device_id_to_physical_device_id(device_id)
        return torch.npu.get_device_name(physical_device_id)
        # physical_device_id = _device_id_to_physical_device_id(device_id)
```
ditto
vllm_ascend/platform.py
Outdated
```python
    def check_and_update_config(cls, vllm_config: VllmConfig) -> None:
        compilation_config = vllm_config.compilation_config
        if compilation_config.level != CompilationLevel.NO_COMPILATION:
            logger.info("[NPU] Forcing NO_COMPILATION compilation level")
```
Print the current compilation_config.level to the log as well, and change this to warning level.
vllm_ascend/platform.py
Outdated
```python
            logger.warning("[V1][NPU] Disable prefix caching")
            cache_config.enable_prefix_caching = False

        assert not vllm_config.speculative_config, (
```
Speculative decoding now works on v0.7.3, so we don't need this assertion on main, IMO.
vllm_ascend/v1/npu_attention.py
Outdated
```python
import torch

try:
```
No need for try/except; just `import torch_npu` is fine.
vllm_ascend/v1/npu_model_runner.py
Outdated
```python
logger = init_logger(__name__)


class NPUModelRunner(LoRAModelRunnerMixin):
```
Why is this based on the LoRA mixin?
In vLLM V1, LoRAModelRunnerMixin is a base class of GPUModelRunner, but I see that TPUModelRunner doesn't extend this base class, so I think NPUModelRunner may also not need to extend it. I will modify this soon.
vllm_ascend/v1/npu_model_runner.py
Outdated
```python
            vocab_size=model_config.get_vocab_size(),
        )

        self.use_cuda_graph = (self.vllm_config.compilation_config.level
```
Remove CUDA-related code.
vllm_ascend/v1/npu_worker.py
Outdated
```python
    def init_device(self):
        if self.device_config.device.type == "npu":
            # # This env var set by Ray causes exceptions with graph building.
            # os.environ.pop("NCCL_ASYNC_ERROR_HANDLING", None)
```
Please remove the NCCL-related comments.
OK👌
```python
            logger.warning(
                "Compilation level %s is not supported on NPU now, forcing compilation level to NO_COMPILATION",
                compilation_config.level)
            compilation_config.level = CompilationLevel.NO_COMPILATION
```
Is torch.compile enabled by default in V1? And I think the graph feature is WIP now?
```python
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
# Adapted from vllm-project/vllm/vllm/worker/model_runner.py
```
Just make sure whether this should be vllm-project/vllm/vllm/v1/worker/gpu_model_runner.py.
Yes, I will modify it.
```python
            self.model_memory_usage / float(2**30))

    def capture_model(self) -> None:
        logger.warning(
```
We can just remove this function, since it exists only for CUDA graph capture.
ok.
vllm_ascend/worker/worker_v1.py
Outdated
```python
    def compile_or_warm_up_model(self) -> None:
        if not self.model_config.enforce_eager:
            self.model_runner.capture_model()
```
ditto
### What this PR does / why we need it?
Add support for V1 Engine on v0.7.3.
### Does this PR introduce _any_ user-facing change?
Find more details at #295.
Plus, due to a bug in `vllm v0.7.3` when using modelscope, you should use a lower version of `modelscope`. Find more details at vllm-project/vllm#13807. This can work:
```bash
pip install modelscope==1.21.1
```
### How was this patch tested?
Find more details at #295.
Signed-off-by: shen-shanshan <[email protected]>
Co-authored-by: didongli182 <[email protected]>
### What this PR does / why we need it?
Add support for V1 Engine.
Please note that this is just the initial version; some places may need to be
fixed or optimized in the future. Feel free to leave comments for us.
### Does this PR introduce _any_ user-facing change?
To use the V1 Engine on an NPU device, you need to set the environment
variables shown below:
```bash
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```
If you are using vllm for offline inference, you must add a `__main__`
guard like:
```python
if __name__ == '__main__':
    llm = vllm.LLM(...)
```
Find more details
[here](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing).
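The guard is needed because the `spawn` start method re-imports the script in every worker process. A minimal sketch with plain `multiprocessing` (no vllm involved; the `square` function is just an illustration) shows the same pattern:

```python
# Sketch: with the "spawn" start method, child processes re-import this
# module, so any top-level code outside the guard would run again in
# every worker. Keeping work inside the guard prevents that.
import multiprocessing as mp


def square(x):
    # Must be defined at module level so spawned workers can import it.
    return x * x


if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    with mp.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```

The same applies to `vllm.LLM(...)`: constructing it at module top level would re-run in each spawned worker.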
### How was this patch tested?
I have tested the online serving with `Qwen2.5-7B-Instruct` using this
command:
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
Query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "The future of AI is",
"max_tokens": 7,
"temperature": 0
}'
```
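For reference, the same request can be sketched in Python with only the standard library; the endpoint, model name, and parameters are taken from the curl command above, while the helper name `build_completion_request` is our own:

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 7, temperature: float = 0.0):
    # Assemble the same JSON body the curl command sends.
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


req = build_completion_request("http://localhost:8000",
                               "Qwen/Qwen2.5-7B-Instruct",
                               "The future of AI is")
# Sending it requires a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```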
---------
Signed-off-by: shen-shanshan <[email protected]>
Co-authored-by: didongli182 <[email protected]>