Description
Motivation.
FP8 quantization support was recently added for Z-Image DiT #b7604ae. This feature enables significant memory
reduction and potential speedups on supported hardware (Ada/Hopper GPUs).
Wan 2.2 is a video generation model that would greatly benefit from FP8 quantization due to the high memory requirements of its 3D video transformer. This issue tracks extending the same FP8 quantization infrastructure to Wan 2.2.
Current State
- FP8 quantization framework exists in vllm_omni/diffusion/quantization/
- Z-Image transformer fully supports FP8 via quant_config parameter
- Wan 2.2 transformer (WanTransformer3DModel) does not accept quantization config
Proposed Change.
1. Transformer Layer Modifications
File: vllm_omni/diffusion/models/wan2_2/wan2_2_transformer.py
| Class | Changes Required |
|---|---|
| WanSelfAttention | Add quant_config parameter; pass it to QKVParallelLinear and the output projection |
| WanCrossAttention | Add quant_config parameter; pass it to the Q/K/V and output linear layers |
| WanTransformerBlock | Add quant_config parameter; propagate it to the attention and FFN layers |
| WanTransformer3DModel | Add quant_config parameter; propagate it to all transformer blocks |
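The threading pattern above can be sketched in plain Python. The classes below are illustrative stand-ins (the real modules are torch.nn modules and use vLLM's QKVParallelLinear), but they show the intended propagation: one quant_config argument flows from the top-level model down to every linear layer.

```python
class StubLinear:
    """Stand-in for a parallel linear layer that accepts quant_config."""
    def __init__(self, name, quant_config=None):
        self.name = name
        self.quant_config = quant_config

class WanSelfAttention:
    def __init__(self, quant_config=None):
        # Pass the config to the QKV projection and output projection alike.
        self.qkv = StubLinear("qkv", quant_config=quant_config)
        self.out = StubLinear("out", quant_config=quant_config)

class WanTransformerBlock:
    def __init__(self, quant_config=None):
        self.self_attn = WanSelfAttention(quant_config=quant_config)
        self.ffn = StubLinear("ffn", quant_config=quant_config)

class WanTransformer3DModel:
    def __init__(self, num_layers=2, quant_config=None):
        # Propagate the same config to every block.
        self.blocks = [
            WanTransformerBlock(quant_config=quant_config)
            for _ in range(num_layers)
        ]
```

With `quant_config=None` (the default), every layer sees `None` and behaves exactly as before, which is the backward-compatibility requirement below.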
2. Pipeline Integration
File: vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py
| Step | Description |
|---|---|
| 1 | Extract quantization config from OmniDiffusionConfig using get_vllm_quant_config_for_layers() |
| 2 | Pass the extracted quant_config to WanTransformer3DModel initialization |
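The two pipeline steps can be sketched as follows. The stub for get_vllm_quant_config_for_layers() only mimics the role of the real helper in vllm_omni/diffusion/quantization/; its actual signature has not been verified here.

```python
def get_vllm_quant_config_for_layers(od_config):
    # Stub: the real helper lives in vllm_omni/diffusion/quantization/
    # and derives a per-layer quant config from OmniDiffusionConfig.
    return od_config.get("quantization")  # e.g. "fp8" or None

class WanTransformer3DModel:
    """Stand-in model whose constructor accepts quant_config."""
    def __init__(self, quant_config=None):
        self.quant_config = quant_config

def build_pipeline_transformer(od_config):
    # Step 1: extract the quantization config once, at pipeline setup.
    quant_config = get_vllm_quant_config_for_layers(od_config)
    # Step 2: pass it straight into model initialization.
    return WanTransformer3DModel(quant_config=quant_config)
```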
3. CLI Support
Directory: examples/offline_inference/video/ (create if it does not exist)
- Add a --quantization argument
- Support the fp8 option (e.g. --quantization fp8)
- Enable the FP8 quantization flow when the flag is provided
- Keep backward compatibility (no change in behavior when the argument is omitted)
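A minimal sketch of the proposed flag for an example script, using argparse. Defaulting to None keeps existing invocations unchanged, satisfying the backward-compatibility requirement.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--quantization",
    choices=["fp8"],
    default=None,  # omitted flag -> no quantization, original behavior
    help="Enable FP8 quantization for the Wan 2.2 transformer.",
)

args = parser.parse_args(["--quantization", "fp8"])
print(args.quantization)  # fp8
```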
4. Tests
Directory: tests/diffusion/models/wan2_2/ (add to existing or create new test file(s))
| Test Case | Description |
|---|---|
| Unit tests for Wan 2.2 with FP8 quantization | Run full forward pass with quant_config enabled (FP8) and verify no crashes / reasonable outputs |
| Config propagation verification | Check that quant_config reaches all linear layers in the transformer blocks (self-attn QKV/out, cross-attn QKV/out, FFN linears) |
| Null config fallback | Ensure existing Wan 2.2 behavior is unchanged when quant_config=None (no quantization applied) |
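The propagation and null-fallback test cases could look roughly like this. The classes are tiny stand-ins; the real tests would instead traverse the actual WanTransformer3DModel (e.g. via named_modules()) and check each self-attn QKV/out, cross-attn QKV/out, and FFN linear.

```python
class FakeLinear:
    def __init__(self, quant_config=None):
        self.quant_config = quant_config

class FakeBlock:
    def __init__(self, quant_config=None):
        # Stand-ins for qkv, out, cross-attn, and FFN linears.
        self.layers = [FakeLinear(quant_config) for _ in range(4)]

def test_quant_config_propagation():
    blocks = [FakeBlock(quant_config="fp8") for _ in range(3)]
    for block in blocks:
        for layer in block.layers:
            assert layer.quant_config == "fp8"

def test_null_config_fallback():
    block = FakeBlock()  # quant_config defaults to None
    assert all(l.quant_config is None for l in block.layers)

test_quant_config_propagation()
test_null_config_fallback()
```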
Acceptance Criteria
- WanTransformer3DModel constructor accepts a quant_config parameter
- All relevant linear layers inside the Wan 2.2 transformer receive the quantization config
- Pipeline correctly extracts the quantization config from OmniDiffusionConfig and passes it downstream
- When quant_config=None, the model runs in original (non-quantized) mode with unchanged functionality and outputs
- All added / modified unit tests pass
- (Optional but recommended) An example inference script demonstrates successful FP8 usage (e.g. in examples/)
Reference Implementation
Use the Z-Image FP8 implementation as the main reference:
| Component | File Path | Notes |
|---|---|---|
| Transformer | vllm_omni/diffusion/models/z_image/z_image_transformer.py | Shows how quant_config is threaded through attention / feed-forward layers |
| Pipeline | vllm_omni/diffusion/models/z_image/pipeline_z_image.py | Shows extraction from config and passing to model init |
| Commit | b7604ae | Full diff / context for the Z-Image FP8 PR |
Additional Context
Hardware Requirements for FP8:
| Quantization Mode | Supported GPUs | Compute Capability |
|---|---|---|
| Full W8A8 (weights + activations) | Ada Lovelace, Hopper | SM 89+ |
| Weight-only FP8 | Turing and newer | SM 75+ (falls back to W8A16 on Ampere via Marlin kernels) |
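The hardware table translates into a simple capability check. The helper below is an illustrative sketch of the dispatch logic (the real decision is made inside vLLM's quantization kernels); in practice the (major, minor) pair would come from torch.cuda.get_device_capability().

```python
def fp8_mode_for_capability(major, minor):
    """Map a CUDA compute capability to the available FP8 mode,
    per the table above (illustrative helper, not a real vLLM API)."""
    sm = major * 10 + minor
    if sm >= 89:
        return "w8a8"         # Ada Lovelace / Hopper: full W8A8
    if sm >= 75:
        return "weight-only"  # Turing+: weight-only (W8A16 via Marlin on Ampere)
    return None               # FP8 unsupported

print(fp8_mode_for_capability(9, 0))  # Hopper (SM 90) -> w8a8
```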
Related Links / References:
- PR: #b7604ae — FP8 quantization support for Z-Image
- Feature: [Feature] Support cache-dit for Wan 2.2 inference #1021 — Cache-DiT optimizations for Wan 2.2 inference
Feedback Period.
No response
CC List.
@ZJY0516 @hsliuustc0106 @SamitHuang @david6666666
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.