Motivation.
WAN2.2 is a state‑of‑the‑art image‑to‑video (I2V) diffusion model that unlocks new possibilities in creative content generation, autonomous driving simulation, and interactive media. As one of the I2V models integrated into vLLM-OMNI, it represents a pioneering step toward accessible, high‑performance video AI.
NPU demand.
With NPU accelerators (Ascend, Cambricon, etc.) becoming increasingly prevalent in production environments, there is strong demand to run WAN2.2 efficiently on these platforms. Currently, for offline serving on 8× NPU cards with 480×832 resolution and 81 frames, the total inference time is 215 seconds. This baseline reveals clear room for optimization in operators, distributed strategies, and NPU‑specific execution models. At the same time, low‑latency online serving on NPU is also an emerging requirement that this project will address.
GPU demand.
GPUs remain the most widely adopted acceleration platform for generative AI today. vLLM-OMNI has built a solid foundation on GPU, achieving 173 seconds for the same workload on 8× GPU cards. However, as WAN2.2 evolves and deployment scales, further GPU optimizations are essential to lower costs and enable real‑time applications. The community’s expertise on GPU will also directly inform and accelerate our NPU efforts. vLLM-OMNI is committed to delivering first‑class performance on both GPU and NPU, ensuring that users can deploy WAN2.2 on the platform that best suits their needs without compromise.
This RFC outlines a focused optimization plan to dramatically reduce WAN2.2 inference latency and improve hardware usability, making it viable for real‑world applications and establishing vLLM-OMNI as the go‑to framework for I2V deployment.
Goals
Performance: Significantly reduce end‑to‑end latency from the current baselines of 173 s on GPU and 215 s on NPU (480×832, 81 frames).
Reusability: Deliver generalized optimizations that benefit other diffusion models in vLLM-OMNI.
Proposed Change.
LA operator fusion https://github.com/vllm-project/vllm-omni/pull/1342
FSDP distributed strategy tuning https://github.com/vllm-project/vllm-omni/pull/1339
VAE patching https://github.com/vllm-project/vllm-omni/pull/1350
Layerwise offload / NPU https://github.com/vllm-project/vllm-omni/pull/1356:
the current NPU backend lacks independent-stream / asynchronous execution support, so weight transfers cannot overlap with compute, causing severe performance degradation when offloading is enabled. Interested contributors are welcome to join the existing offload-related discussions.
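To make the offload challenge concrete: the goal of independent-stream support is double buffering, where layer i+1's weights are fetched while layer i computes. The actual PRs use device streams; the following is only a hypothetical stdlib sketch of the same scheduling pattern, using a worker thread in place of a transfer stream (all names here, such as `run_layers_with_prefetch`, are illustrative, not vLLM-OMNI APIs).

```python
from concurrent.futures import ThreadPoolExecutor

def run_layers_with_prefetch(layers, load, compute, x):
    """Double-buffered layerwise execution.

    While layer i runs, layer i+1's weights are fetched concurrently
    (here on a thread; on real hardware, on a separate device stream).
    `load(layer)` returns device-resident weights; `compute(weights, x)`
    runs one layer's forward pass.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load, layers[0])
        for i, layer in enumerate(layers):
            weights = pending.result()  # wait for this layer's weights
            if i + 1 < len(layers):
                # Kick off the next transfer before computing, so the
                # fetch overlaps with the current layer's compute.
                pending = pool.submit(load, layers[i + 1])
            x = compute(weights, x)
    return x
```

Without this overlap (the current NPU situation), each transfer is serialized with compute, which is exactly the degradation described above.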
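For readers unfamiliar with the VAE patching item: the usual technique is to decode the latent in overlapping spatial tiles so peak activation memory scales with the tile, not the full frame. A minimal sketch of the tile-coordinate computation is below; the parameters and function name are hypothetical and this is not the implementation in the linked PR.

```python
def tile_slices(length, tile, overlap):
    """Return (start, end) index pairs of overlapping 1-D tiles covering
    `length` positions. Adjacent tiles share `overlap` positions so the
    decoded patches can be blended without visible seams."""
    stride = tile - overlap
    slices = []
    for start in range(0, max(length - overlap, 1), stride):
        end = min(start + tile, length)
        if end == length:
            # Clamp the final tile so it ends exactly at the boundary.
            slices.append((max(0, length - tile), length))
            break
        slices.append((start, end))
    return slices
```

For example, a 104-wide latent with 32-wide tiles and 8 positions of overlap decomposes into four tiles: (0, 32), (24, 56), (48, 80), (72, 104). The same computation applies per spatial axis.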
Feedback Period.
First version of the PRs ready for feedback: Feb 14.
Other community suggestions and contributions are welcome at any time.
Call for Collaboration
I2V is the next frontier in generative AI, and vLLM-OMNI is uniquely positioned to lead its efficient deployment. We warmly welcome feedback, testing on different NPU hardware, and contributions, especially on the layerwise offload / stream-async challenge.
Join us in making WAN2.2 on NPU not just possible, but production-ready.
CC List.
@ApsarasX @hsliuustc0106 @gcanlin