[RFC]: Wan2.2 Performance Optimization Roadmap on vLLM-Omni  #1355

@FrosterHan

Description

Motivation.

WAN2.2 is a state‑of‑the‑art image‑to‑video (I2V) diffusion model that unlocks new possibilities in creative content generation, autonomous driving simulation, and interactive media. As one of the I2V models integrated into vLLM-Omni, it represents a pioneering step toward accessible, high‑performance video AI.

NPU demand.

With NPU accelerators (Ascend, Cambricon, etc.) becoming increasingly prevalent in production environments, there is strong demand to run WAN2.2 efficiently on these platforms. Currently, for offline serving on 8× NPU cards with 480×832 resolution and 81 frames, the total inference time is 215 seconds. This baseline reveals clear room for optimization in operators, distributed strategies, and NPU‑specific execution models. At the same time, low‑latency online serving on NPU is also an emerging requirement that this project will address.

GPU demand.

GPUs remain the most widely adopted acceleration platform for generative AI today. vLLM-Omni has built a solid foundation on GPU, achieving 173 seconds for the same workload on 8× GPU cards. However, as WAN2.2 evolves and deployment scales, further GPU optimizations are essential to lower costs and enable real‑time applications. The community’s expertise on GPU will also directly inform and accelerate our NPU efforts. vLLM-Omni is committed to delivering first‑class performance on both GPU and NPU, ensuring that users can deploy WAN2.2 on the platform that best suits their needs without compromise.

This RFC outlines a focused optimization plan to dramatically reduce WAN2.2 inference latency and improve hardware usability, making it viable for real‑world applications and establishing vLLM-Omni as the go‑to framework for I2V deployment.

Goals

Performance: Significantly reduce end‑to‑end latency from the current baselines of 173 s on GPU and 215 s on NPU (480×832, 81 frames).
Reusability: Deliver generalized optimizations that benefit other diffusion models in vLLM-Omni.

Proposed Change.

LA operator fusion https://github.com/vllm-project/vllm-omni/pull/1342

FSDP distributed strategy tuning https://github.com/vllm-project/vllm-omni/pull/1339

VAE patching https://github.com/vllm-project/vllm-omni/pull/1350
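VAE patching here presumably refers to decoding the video latent in overlapping spatial tiles to cap peak activation memory, as done in tiled VAE decoding in other diffusion stacks. A minimal sketch of the tile planning for one axis (the function name and parameters are illustrative, not taken from the PR):

```python
def plan_tiles(size, tile, overlap):
    """Split one spatial axis of a latent into overlapping tile spans.

    Each span is (start, end); consecutive spans overlap by `overlap`
    latent elements so the decoded patches can be blended seamlessly.
    """
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    stride = tile - overlap
    spans = []
    for start in range(0, size, stride):
        end = min(start + tile, size)
        spans.append((start, end))
        if end == size:  # last tile reaches the edge; stop
            break
    return spans

# Example: an 832-wide latent axis split into 256-wide tiles with 32 overlap.
# Row and column plans combine as a Cartesian product to give the 2D patches.
print(plan_tiles(832, 256, 32))
```

Decoding each patch independently trades a small amount of redundant compute in the overlap regions for a much lower peak memory footprint.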

Layerwise offload / NPU https://github.com/vllm-project/vllm-omni/pull/1356:
the current NPU backend lacks independent stream / asynchronous execution support, causing severe performance degradation when offloading is enabled. Interested contributors are welcome to engage in existing offload‑related discussions.
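To illustrate why independent streams matter for layerwise offload, here is a toy latency model (function name and the numbers below are illustrative, not measurements): without a separate copy stream, each layer's weight transfer serializes with its compute; with asynchronous prefetch, the transfer of layer i+1 hides behind the compute of layer i.

```python
def layerwise_offload_times(n_layers, load_s, compute_s):
    """Toy latency model for layerwise weight offloading.

    serial:     every layer waits for its own host-to-device copy
                (no independent copy stream, as on the current NPU backend).
    overlapped: layer i+1's copy runs on a second stream during layer i's
                compute, so only the first copy is exposed.
    """
    serial = n_layers * (load_s + compute_s)
    overlapped = load_s + (n_layers - 1) * max(load_s, compute_s) + compute_s
    return serial, overlapped

# 40 transformer blocks, 5 ms copy and 8 ms compute per block (made-up numbers).
serial, overlapped = layerwise_offload_times(40, 0.005, 0.008)
print(f"serial={serial:.3f}s overlapped={overlapped:.3f}s")
```

When compute dominates the copy time, the overlapped schedule approaches pure compute latency; closing that gap is exactly what independent stream / async execution support on NPU would enable.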

Feedback Period.

First version of the PRs ready for feedback: Feb 14.
Other community suggestions and contributions are welcome at any time.

Call for Collaboration

I2V is the next frontier in generative AI, and vLLM-Omni is uniquely positioned to lead its efficient deployment. We warmly welcome feedback, testing on different NPU hardware, and contributions – especially on the layerwise offload / stream async challenge.

Join us in making WAN2.2 on NPU not just possible, but production‑ready.

CC List.

@ApsarasX @hsliuustc0106 @gcanlin
