Background
We have completed the initial Ascend NPU enablement in vllm-omni v0.11.0rc1 and v0.12.0rc1, with support for most mainstream models such as Qwen3-Omni and the Qwen-Image series.
Building on this foundation, the next phase focuses on systematically expanding model coverage and prioritizing performance optimization, with a roadmap to improve scalability, stability, and overall serving efficiency on Ascend NPUs.
Version match
Currently, vLLM-Omni’s NPU support depends on vLLM-Ascend, the Ascend support plugin for vLLM. The AR (auto-regressive) path is supported jointly by vLLM and vLLM-Ascend.
Meanwhile, MindIE-SD serves as a standalone Ascend-optimized diffusion operator library. It is currently integrated through the FlashAttentionBackend and a set of CustomOps, delivering Ascend-native operators that improve the performance of diffusion models.
We are also building a separate plugin platform in vLLM-Omni to better support scalable hardware in the future.
| vLLM | vLLM-Ascend | vLLM-Omni | MindIE-SD (Optional) | Status |
|---|---|---|---|---|
| v0.11.0 | v0.11.0rc2 | v0.11.0rc1 | NA | released |
| v0.12.0 | v0.12.0rc1 | v0.12.0rc1 | main | released |
| v0.14.0 | v0.14.0rc1 | v0.14.0 | main | released |
| v0.15.0 | v0.15.0rc1 | v0.15.0rc1 | main | skipped |
| v0.16.0 | e2175d9 | v0.16.0 | main | released |
| v0.16.0 | v0.16.0rc1 | v0.16.0 | main | pending |
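The released rows of the table above can be encoded as a small lookup helper. This is a minimal sketch: the `COMPAT_MATRIX` name and `recommended_stack` function are illustrative, not part of any vllm-omni API.

```python
# Version-compatibility lookup mirroring the released rows of the table above.
COMPAT_MATRIX = {
    # vLLM version -> (vLLM-Ascend, vLLM-Omni, MindIE-SD, status)
    "v0.11.0": ("v0.11.0rc2", "v0.11.0rc1", None, "released"),
    "v0.12.0": ("v0.12.0rc1", "v0.12.0rc1", "main", "released"),
    "v0.14.0": ("v0.14.0rc1", "v0.14.0", "main", "released"),
    "v0.16.0": ("e2175d9", "v0.16.0", "main", "released"),
}

def recommended_stack(vllm_version: str):
    """Return the tested (vLLM-Ascend, vLLM-Omni, MindIE-SD, status) row."""
    try:
        return COMPAT_MATRIX[vllm_version]
    except KeyError:
        raise ValueError(f"No tested combination recorded for vLLM {vllm_version}")
```

For example, `recommended_stack("v0.12.0")` returns the v0.12.0rc1 pairing; any version outside the matrix raises an error rather than guessing a combination.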
How to install MindIE-SD
Official Link: MindIE-SD
We are actively working to simplify the installation of MindIE-SD. Eventually, it will be installable via pip install mindie-sd. For now, however, some additional steps are required.
```bash
git clone https://gitcode.com/Ascend/MindIE-SD.git && cd MindIE-SD
# Comment out the line `source ${current_script_dir}/build_tik_ops.sh` in build/build_ops.sh
sed -i 's|^\(\s*\)source ${current_script_dir}/build_tik_ops.sh|\1# source ${current_script_dir}/build_tik_ops.sh|' build/build_ops.sh
python setup.py bdist_wheel
cd dist
pip install mindiesd-*.whl
```
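After installing the wheel, a quick sanity check is to confirm that Python can locate the package. This sketch only probes importability; it does not exercise any MindIE-SD functionality.

```python
import importlib.util

def mindiesd_available() -> bool:
    # True if the mindiesd package installed above can be found by Python.
    return importlib.util.find_spec("mindiesd") is not None

if __name__ == "__main__":
    print("mindiesd importable:", mindiesd_available())
```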
Feature Support
Omni(AR+Generator) Pipeline
- Async chunk: follow this RFC [RFC]: Support async computation and communication across stages by chunks #268
- TTS Performance Optimization: [RFC]: TTS Model Performance Optimization on NPU #1600
- Support code2wav multi-batch
- Streaming input and output
- Talker ACL graph Support
- More Ascend-friendly Ops
- Expert Parallelism (EP)
Diffusion Pipeline
- Support sparse attention backend by integrating the sparse attention interface from MindIE-SD
- Support LA from MindIE-SD
- Following the features in [RFC]: Diffusion Models Features Supports Plan #814, make sure they work on NPU
- Qwen-Image-Edit-2511 optimization
- Wan2.2 optimization: [RFC]: Wan2.2 Performance Optimization Roadmap on vLLM-Omni #1355
- Remove NPU hardcode: [Feat] support SP for FLUX.2-klein #1250 (comment)
- Refactor ring attention for hardware dispatch: [Bugfix] Fix ring attention for npu device #755
Others (UX & Hardware Scalability)
- Platform: [Hardware] Support platforms and plugin system #774
- Disable torch compile by default: [Platform] Add supports_torch_inductor interface #1108
- Dependencies router: [Hardware] [Feat] Setup platform dependent package installation #1046
Docs
Known Issues
- Memory usage: currently, Qwen2.5-Omni and Qwen3-Omni must place the talker on a different device from the thinker. We expect to co-locate them so that Qwen2.5-Omni and Qwen3-Omni would need only 2 and 4 cards, respectively.
- Qwen2.5-Omni: enabling ACL graph leads to accuracy problems. [Bug]: vllm-omni(v0.12.0) results of talker model of qwen2.5-omni are incorrect when running with enforce eager being False #912
- Qwen3-Omni: talker ACL graph breaks. [NPU] Align with GPUModelRunner #1114 (comment)
- [Feature][NPU]: Under the same request, the quality of the raw image on the NPU differs from that on the NVIDIA GPU. #1322: if images generated on NPU differ from those generated on GPU, this is normal and expected, as long as there is no obvious degradation (e.g., blurred faces or failure to follow the prompt). You can bring the outputs closer by fixing the same seed and using a CPU-based generator (the online server can use [Misc] Add per-request generator_device to online image gen and edit #1183). Even then, minor differences may remain. The root cause is that different hardware backends cannot perfectly align their PyTorch operator implementations, such as `conv3d`, `nn.Linear`, and others.
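For online serving, a reproducible request can pin the seed and ask for a CPU generator. The sketch below only builds the JSON payload; the `generator_device` field follows PR #1183, while the other field names and values are illustrative, not a confirmed vLLM-Omni request schema.

```python
import json

# Sketch of a reproducible image-generation request payload.
def build_image_request(prompt: str, seed: int = 42) -> str:
    payload = {
        "model": "Qwen/Qwen-Image",
        "prompt": prompt,
        "seed": seed,               # fix the seed for repeatable sampling
        "generator_device": "cpu",  # CPU generator narrows NPU/GPU gaps (PR #1183)
    }
    return json.dumps(payload)
```

Even with a fixed seed and CPU generator, expect small cross-hardware differences due to the operator-level divergences noted above.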
Model Support List
| Architecture | Models | Example HF Models | NPU support |
|---|---|---|---|
| Qwen3OmniMoeForConditionalGeneration | Qwen3-Omni | Qwen/Qwen3-Omni-30B-A3B-Instruct | ✅ |
| Qwen2_5OmniForConditionalGeneration | Qwen2.5-Omni | Qwen/Qwen2.5-Omni-7B, Qwen/Qwen2.5-Omni-3B | ✅ |
| BagelForConditionalGeneration | BAGEL (DiT-only) | ByteDance-Seed/BAGEL-7B-MoT | |
| QwenImagePipeline | Qwen-Image | Qwen/Qwen-Image | ✅ |
| QwenImagePipeline | Qwen-Image-2512 | Qwen/Qwen-Image-2512 | ✅ |
| QwenImageEditPipeline | Qwen-Image-Edit | Qwen/Qwen-Image-Edit | ✅ |
| QwenImageEditPlusPipeline | Qwen-Image-Edit-2509 | Qwen/Qwen-Image-Edit-2509 | ✅ |
| QwenImageLayeredPipeline | Qwen-Image-Layered | Qwen/Qwen-Image-Layered | ✅ |
| ZImagePipeline | Z-Image | Tongyi-MAI/Z-Image-Turbo | ✅ |
| WanPipeline | Wan2.2-T2V, Wan2.2-TI2V | Wan-AI/Wan2.2-T2V-A14B-Diffusers, Wan-AI/Wan2.2-TI2V-5B-Diffusers | ✅ |
| WanImageToVideoPipeline | Wan2.2-I2V | Wan-AI/Wan2.2-I2V-A14B-Diffusers | ✅ |
| OvisImagePipeline | Ovis-Image | OvisAI/Ovis-Image | |
| LongcatImagePipeline | LongCat-Image | meituan-longcat/LongCat-Image | ✅ |
| LongCatImageEditPipeline | LongCat-Image-Edit | meituan-longcat/LongCat-Image-Edit | |
| StableDiffusion3Pipeline | Stable-Diffusion-3 | stabilityai/stable-diffusion-3.5-medium | |
| Flux2KleinPipeline | FLUX.2-klein | black-forest-labs/FLUX.2-klein-4B, black-forest-labs/FLUX.2-klein-9B | |
| StableAudioPipeline | Stable-Audio-Open | stabilityai/stable-audio-open-1.0 | |
| Qwen3TTSForConditionalGeneration | Qwen3-TTS-12Hz-1.7B-CustomVoice | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | ✅ |
| Qwen3TTSForConditionalGeneration | Qwen3-TTS-12Hz-1.7B-VoiceDesign | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | ✅ |
| Qwen3TTSForConditionalGeneration | Qwen3-TTS-12Hz-1.7B-Base | Qwen/Qwen3-TTS-12Hz-0.6B-Base | ✅ |