Description
Motivation.
FP8 quantization support was recently added for Z-Image DiT #b7604ae. This feature enables significant memory
reduction and potential speedups on supported hardware (Ada/Hopper GPUs).
Wan 2.2 is a video generation model that would greatly benefit from FP8 quantization due to the high memory requirements of its 3D video transformer. This issue tracks extending the same FP8 quantization infrastructure to Wan 2.2.
Current State
- FP8 quantization framework exists in vllm_omni/diffusion/quantization/
- Z-Image transformer fully supports FP8 via quant_config parameter
- Wan 2.2 transformer (WanTransformer3DModel) does not accept quantization config
Proposed Change.
1. Transformer Layer Modifications
File: vllm_omni/diffusion/models/wan2_2/wan2_2_transformer.py
| Class | Changes Required |
|---|---|
| WanSelfAttention | Add quant_config parameter; pass it to QKVParallelLinear and the output projection |
| WanCrossAttention | Add quant_config parameter; pass it to the Q/K/V and output linear layers |
| WanTransformerBlock | Add quant_config parameter; propagate it to the attention and FFN layers |
| WanTransformer3DModel | Add quant_config parameter; propagate it to all transformer blocks |
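The threading pattern above can be sketched in plain Python. The classes below are illustrative stand-ins (the real modules are torch.nn modules and use vLLM's QKVParallelLinear), but they show the intended propagation: one quant_config argument flows from the top-level model down to every linear layer.

```python
class StubLinear:
    """Stand-in for a parallel linear layer that accepts quant_config."""
    def __init__(self, name, quant_config=None):
        self.name = name
        self.quant_config = quant_config

class WanSelfAttention:
    def __init__(self, quant_config=None):
        # Pass the config to the QKV projection and output projection alike.
        self.qkv = StubLinear("qkv", quant_config=quant_config)
        self.out = StubLinear("out", quant_config=quant_config)

class WanTransformerBlock:
    def __init__(self, quant_config=None):
        self.self_attn = WanSelfAttention(quant_config=quant_config)
        self.ffn = StubLinear("ffn", quant_config=quant_config)

class WanTransformer3DModel:
    def __init__(self, num_layers=2, quant_config=None):
        # Propagate the same config to every block.
        self.blocks = [
            WanTransformerBlock(quant_config=quant_config)
            for _ in range(num_layers)
        ]
```

With `quant_config=None` (the default), every layer sees `None` and behaves exactly as before, which is the backward-compatibility requirement below.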
2. Pipeline Integration
File: vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py
| Step | Description |
|---|---|
| 1 | Extract quantization config from OmniDiffusionConfig using get_vllm_quant_config_for_layers() |
| 2 | Pass the extracted quant_config to WanTransformer3DModel initialization |
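The two pipeline steps can be sketched as follows. The stub for get_vllm_quant_config_for_layers() only mimics the role of the real helper in vllm_omni/diffusion/quantization/; its actual signature has not been verified here.

```python
def get_vllm_quant_config_for_layers(od_config):
    # Stub: the real helper lives in vllm_omni/diffusion/quantization/
    # and derives a per-layer quant config from OmniDiffusionConfig.
    return od_config.get("quantization")  # e.g. "fp8" or None

class WanTransformer3DModel:
    """Stand-in model whose constructor accepts quant_config."""
    def __init__(self, quant_config=None):
        self.quant_config = quant_config

def build_pipeline_transformer(od_config):
    # Step 1: extract the quantization config once, at pipeline setup.
    quant_config = get_vllm_quant_config_for_layers(od_config)
    # Step 2: pass it straight into model initialization.
    return WanTransformer3DModel(quant_config=quant_config)
```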
3. CLI Support
Directory: examples/offline_inference/video/ (create if it does not exist)
- Add a --quantization argument
- Support the fp8 option (e.g. --quantization fp8)
- Enable the FP8 quantization flow when the flag is provided
- Keep backward compatibility (no change in behavior when the argument is omitted)
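A minimal sketch of the proposed flag for an example script, using argparse. Defaulting to None keeps existing invocations unchanged, satisfying the backward-compatibility requirement.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--quantization",
    choices=["fp8"],
    default=None,  # omitted flag -> no quantization, original behavior
    help="Enable FP8 quantization for the Wan 2.2 transformer.",
)

args = parser.parse_args(["--quantization", "fp8"])
print(args.quantization)  # fp8
```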
4. Tests
Directory: tests/diffusion/models/wan2_2/ (add to existing or create new test file(s))
| Test Case | Description |
|---|---|
| Unit tests for Wan 2.2 with FP8 quantization | Run full forward pass with quant_config enabled (FP8) and verify no crashes / reasonable outputs |
| Config propagation verification | Check that quant_config reaches all linear layers in the transformer blocks (self-attn QKV/out, cross-attn QKV/out, FFN linears) |
| Null config fallback | Ensure existing Wan 2.2 behavior is unchanged when quant_config=None (no quantization applied) |
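The propagation and null-fallback test cases could look roughly like this. The classes are tiny stand-ins; the real tests would instead traverse the actual WanTransformer3DModel (e.g. via named_modules()) and check each self-attn QKV/out, cross-attn QKV/out, and FFN linear.

```python
class FakeLinear:
    def __init__(self, quant_config=None):
        self.quant_config = quant_config

class FakeBlock:
    def __init__(self, quant_config=None):
        # Stand-ins for qkv, out, cross-attn, and FFN linears.
        self.layers = [FakeLinear(quant_config) for _ in range(4)]

def test_quant_config_propagation():
    blocks = [FakeBlock(quant_config="fp8") for _ in range(3)]
    for block in blocks:
        for layer in block.layers:
            assert layer.quant_config == "fp8"

def test_null_config_fallback():
    block = FakeBlock()  # quant_config defaults to None
    assert all(l.quant_config is None for l in block.layers)

test_quant_config_propagation()
test_null_config_fallback()
```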
Acceptance Criteria
- WanTransformer3DModel constructor accepts a quant_config parameter
- All relevant linear layers inside the Wan 2.2 transformer receive the quantization config
- Pipeline correctly extracts the quantization config from OmniDiffusionConfig and passes it downstream
- When quant_config=None, the model runs in original (non-quantized) mode with unchanged functionality and outputs
- All added / modified unit tests pass
- (Optional but recommended) An example inference script demonstrates successful FP8 usage (e.g. in examples/)
Reference Implementation
Use the Z-Image FP8 implementation as the main reference:
| Component | File Path | Notes |
|---|---|---|
| Transformer | vllm_omni/diffusion/models/z_image/z_image_transformer.py | Shows how quant_config is threaded through attention / feed-forward layers |
| Pipeline | vllm_omni/diffusion/models/z_image/pipeline_z_image.py | Shows extraction from config and passing to model init |
| Commit | b7604ae | Full diff / context for the Z-Image FP8 PR |
Additional Context
Hardware Requirements for FP8:
| Quantization Mode | Supported GPUs | Compute Capability |
|---|---|---|
| Full W8A8 (weights + activations) | Ada Lovelace, Hopper | SM 89+ |
| Weight-only FP8 | Turing and newer | SM 75+ (falls back to W8A16 on Ampere via Marlin kernels) |
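The hardware table translates into a simple capability check. The helper below is an illustrative sketch of the dispatch logic (the real decision is made inside vLLM's quantization kernels); in practice the (major, minor) pair would come from torch.cuda.get_device_capability().

```python
def fp8_mode_for_capability(major, minor):
    """Map a CUDA compute capability to the available FP8 mode,
    per the table above (illustrative helper, not a real vLLM API)."""
    sm = major * 10 + minor
    if sm >= 89:
        return "w8a8"         # Ada Lovelace / Hopper: full W8A8
    if sm >= 75:
        return "weight-only"  # Turing+: weight-only (W8A16 via Marlin on Ampere)
    return None               # FP8 unsupported

print(fp8_mode_for_capability(9, 0))  # Hopper (SM 90) -> w8a8
```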
Related Links / References:
- PR: #b7604ae — FP8 quantization support for Z-Image
- Feature: [Feature] Support cache-dit for Wan 2.2 inference #1021 — Cache-DiT optimizations for Wan 2.2 inference
Feedback Period.
No response
CC List.
@ZJY0516 @hsliuustc0106 @SamitHuang @david6666666
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.