[Wan2.2] Optimize memory usage with conditional transformer loading#980

Merged
SamitHuang merged 18 commits into vllm-project:main from faaany:wan2.2-oom on Jan 29, 2026
Conversation

@faaany
Contributor

@faaany faaany commented Jan 27, 2026

Purpose

Wan2.2 uses a two-stage denoising process with two separate transformer models:

  • High-noise stage (transformer): Handles early denoising steps (t >= boundary_timestep)
  • Low-noise stage (transformer_2): Handles final refinement steps (t < boundary_timestep)
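
In the denoising loop this amounts to picking one model per step. A minimal sketch of that selection (illustrative standalone function; the real pipeline code carries more state and uses `self.transformer` / `self.transformer_2` attributes):

```python
def select_model(t, boundary_timestep, transformer, transformer_2):
    """Pick which transformer handles timestep t (a sketch, not the
    actual pipeline code).

    High-noise stage: t >= boundary_timestep -> transformer
    Low-noise stage:  t <  boundary_timestep -> transformer_2
    Falls back to the high-noise transformer when no boundary is set
    or transformer_2 was not loaded.
    """
    if boundary_timestep is not None and t < boundary_timestep and transformer_2 is not None:
        return transformer_2
    return transformer
```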

Loading both transformers simultaneously causes OOM issues on systems with limited GPU memory, as each transformer can consume ~27 GB of memory.

This PR implements conditional transformer loading based on the boundary_ratio parameter, following the approach in this PR from HF diffusers: huggingface/diffusers#12024 .

The pipeline now intelligently loads only the transformers that will actually be used:

boundary_ratio    Loaded models                     Memory savings   Use case
0.0               Only transformer (high-noise)     ~30%             Use only the high-noise stage
1.0               Only transformer_2 (low-noise)    ~30%             Use only the low-noise stage
0.0 < x < 1.0     Both transformers                 None             Standard two-stage pipeline
None              Both transformers                 None             Original behavior
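
The table above can be sketched as a loading rule (hypothetical function and `load_model` callback, illustrative only; the actual pipeline wiring differs): the high-noise transformer is needed whenever some timestep satisfies t >= boundary_timestep, i.e. whenever boundary_ratio < 1.0, and the low-noise transformer_2 whenever boundary_ratio > 0.0.

```python
def load_transformers(boundary_ratio, load_model):
    """Conditionally load the two Wan2.2 denoising transformers.

    A sketch of the idea in this PR; `load_model` stands in for the
    real model-loading call and the names are illustrative.
    """
    transformer = transformer_2 = None
    # High-noise stage is reached only if some t >= boundary_timestep.
    if boundary_ratio is None or boundary_ratio < 1.0:
        transformer = load_model("transformer")
    # Low-noise stage is reached only if some t < boundary_timestep.
    if boundary_ratio is None or boundary_ratio > 0.0:
        transformer_2 = load_model("transformer_2")
    return transformer, transformer_2
```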

Test Plan

python text_to_video.py \
  --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
  --negative_prompt "<optional quality filter>" \
  --height 480 \
  --width 640 \
  --num_frames 33 \
  --guidance_scale 4.0 \
  --guidance_scale_high 3.0 \
  --num_inference_steps 40 \
  --fps 16 \
  --output t2v_out.mp4 \
  --boundary_ratio 1

Test Result

Memory usage before:

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           15484      C   /usr/bin/python3                        426MiB |
|    0   N/A  N/A           15735      C   /usr/bin/python3                      50688MiB |
+-----------------------------------------------------------------------------------------+

Memory usage after:

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           24779      C   /usr/bin/python3                        426MiB |
|    0   N/A  N/A           25028      C   /usr/bin/python3                      78132MiB |
+-----------------------------------------------------------------------------------------+

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.

@faaany faaany requested a review from hsliuustc0106 as a code owner January 27, 2026 09:23

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56e9f566ee


@faaany
Contributor Author

faaany commented Jan 27, 2026

Please help review this PR, thanks a lot! @linyueqian @hsliuustc0106

@hsliuustc0106
Collaborator

@ZJY0516 @SamitHuang @wtomin

@wtomin
Contributor

wtomin commented Jan 28, 2026

There is an e2e test in tests/e2e/offline_inference/test_t2v_model.py. Can you test it with your new feature?

# Select model based on timestep and boundary_ratio
if boundary_timestep is not None and t < boundary_timestep and self.transformer_2 is not None:
    current_model = self.transformer_2
Collaborator


Sorry, I didn't understand this. When the first transformer isn't needed, how do we offload it?

Contributor Author

When boundary_ratio is set to 1.0, self.transformer will be None. In our current offload logic, None modules are skipped (see the linked offload code), leaving only self.transformer_2 in the dit_modules list. In this case, the memory-saving strategy still works, because the DiT modules (whether one or two) and the encoders are mutually exclusive in memory.
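
The skip-None behavior described here can be sketched as follows (hypothetical helper and attribute access; the actual offload logic lives elsewhere in the repo):

```python
def collect_dit_modules(pipeline):
    """Gather only the DiT modules that were actually loaded, so
    downstream offload logic never touches a transformer that was
    set to None by conditional loading (a sketch, not the real code).
    """
    candidates = [
        getattr(pipeline, "transformer", None),
        getattr(pipeline, "transformer_2", None),
    ]
    # None entries (unloaded transformers) are simply dropped.
    return [module for module in candidates if module is not None]
```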

@faaany
Contributor Author

faaany commented Jan 28, 2026

There is an e2e test in tests/e2e/offline_inference/test_t2v_model.py. Can you test it with your new feature?

Yes, the e2e test passes:

========================================== short test summary info ===========================================
PASSED tests/e2e/offline_inference/test_t2v_model.py::test_video_diffusion_model[Wan-AI/Wan2.2-T2V-A14B-Diffusers]
================================== 1 passed, 2 warnings in 75.59s (0:01:15) ==================================

@hsliuustc0106 hsliuustc0106 added the ready label (triggers buildkite CI) on Jan 28, 2026
@SamitHuang
Collaborator

I wonder whether we will get bad generation results when boundary_ratio is set to 0.0 or 1.0.
The default value 0.875 seems to be defined according to the training config.

@faaany
Contributor Author

faaany commented Jan 28, 2026

I wonder whether we will get bad generation results when boundary_ratio is set to 0.0 or 1.0. The default value 0.875 seems to be defined according to the training config.

Indeed, it doesn't make practical sense to set boundary_ratio to 0, because that would produce worse quality. But setting boundary_ratio to 1 actually produced a higher-quality video in my experiment. Also, many people report that the high-noise transformer is not useful (e.g. huggingface/diffusers#12019).

Doc updated to make it clearer for users.

@SamitHuang SamitHuang merged commit c7f89ef into vllm-project:main Jan 29, 2026
7 checks passed
dongbo910220 pushed a commit to dongbo910220/vllm-omni that referenced this pull request Feb 1, 2026