[Wan2.2] Optimize memory usage with conditional transformer loading #980
SamitHuang merged 18 commits into vllm-project:main
Conversation
Signed-off-by: Lin, Fanli <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56e9f566ee
Please help review this PR, thanks a lot! @linyueqian @hsliuustc0106
Signed-off-by: Lin, Fanli <[email protected]>
There is an e2e test in
```python
# Select model based on timestep and boundary_ratio
if boundary_timestep is not None and t < boundary_timestep and self.transformer_2 is not None:
    current_model = self.transformer_2
```
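For context, the selection logic around this hunk can be sketched as follows. This is illustrative only: the `boundary_timestep` derivation, the `num_train_timesteps=1000` default, and the function name are assumptions based on the discussion, not the exact pipeline code.

```python
# Sketch of per-timestep DiT selection in a Wan2.2-style two-stage pipeline.
# `transformer` handles high noise (t >= boundary), `transformer_2` low noise.

def select_model(t, boundary_ratio, transformer, transformer_2,
                 num_train_timesteps=1000):
    """Pick the DiT for timestep t (names are hypothetical)."""
    boundary_timestep = (
        boundary_ratio * num_train_timesteps
        if boundary_ratio is not None else None
    )
    # Select model based on timestep and boundary_ratio
    if boundary_timestep is not None and t < boundary_timestep \
            and transformer_2 is not None:
        return transformer_2
    return transformer

# boundary_ratio=1.0: every t < 1000, so transformer_2 is always chosen.
assert select_model(500, 1.0, "high_noise", "low_noise") == "low_noise"
# boundary_ratio=0.0: no t < 0, so the high-noise transformer is used,
# even when transformer_2 was never loaded (None).
assert select_model(500, 0.0, "high_noise", None) == "high_noise"
```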
Sorry, I didn't understand this. When you don't need the first transformer, how do you offload it?
When boundary_ratio is set to 1.0, self.transformer will be None. In our current offload logic, a None module is skipped (see vllm-omni/vllm_omni/diffusion/offload.py, line 151 at c4220f0), so only self.transformer_2 remains in the dit_modules list. In this case, the memory-saving strategy still works, because the DiT modules (whether 1 or 2 of them) and the encoders are mutually exclusive.
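The skip-None behavior described here can be sketched like this. The helper name and signature are hypothetical; the real logic lives in vllm_omni/diffusion/offload.py.

```python
def offload_modules(dit_modules, offload_fn):
    """Offload each DiT module, silently skipping None entries
    (e.g. self.transformer when boundary_ratio == 1.0)."""
    offloaded = []
    for module in dit_modules:
        if module is None:  # transformer was never loaded: nothing to move
            continue
        offload_fn(module)
        offloaded.append(module)
    return offloaded

# With boundary_ratio == 1.0, dit_modules looks like [None, transformer_2],
# and only the low-noise transformer is actually offloaded.
moved = offload_modules([None, "transformer_2"], offload_fn=lambda m: None)
assert moved == ["transformer_2"]
```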
Yes, the e2e test passes:

```
========================================== short test summary info ===========================================
PASSED tests/e2e/offline_inference/test_t2v_model.py::test_video_diffusion_model[Wan-AI/Wan2.2-T2V-A14B-Diffusers]
================================== 1 passed, 2 warnings in 75.59s (0:01:15) ==================================
```
Signed-off-by: Lin, Fanli <[email protected]>
I wonder whether we will get bad generation results when boundary_ratio is set to 0.0 or 1.0.
Signed-off-by: Lin, Fanli <[email protected]>
Indeed, it doesn't make practical sense to set boundary_ratio to 0, because it produces worse quality. But setting boundary_ratio to 1 actually yields a higher-quality video in my experiments. Many people also report that the high-noise transformer is not useful (e.g. huggingface/diffusers#12019). I have updated the docs to make this clearer for users.
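The implications of the two extreme boundary_ratio values discussed here can be summarized in a small sketch. The function is hypothetical shorthand for the PR's loading decision, not the actual pipeline API: `transformer` covers the high-noise stage (t >= boundary), `transformer_2` the low-noise stage (t < boundary).

```python
def transformers_to_load(boundary_ratio):
    """Which Wan2.2 transformers are needed for a given boundary_ratio
    (illustrative sketch of the conditional-loading rule)."""
    load = []
    if boundary_ratio < 1.0:
        load.append("transformer")    # high-noise stage, t >= boundary
    if boundary_ratio > 0.0:
        load.append("transformer_2")  # low-noise stage, t < boundary
    return load

assert transformers_to_load(0.0) == ["transformer"]                    # high-noise only
assert transformers_to_load(1.0) == ["transformer_2"]                  # low-noise only
assert transformers_to_load(0.875) == ["transformer", "transformer_2"]  # both stages
```

With either extreme value, one ~27 GB transformer is never instantiated, which is where the memory saving comes from.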
Signed-off-by: Samit <[email protected]>
Signed-off-by: Lin, Fanli <[email protected]>
…llm-project#980)
Signed-off-by: Lin, Fanli <[email protected]>
Signed-off-by: Samit <[email protected]>
Co-authored-by: Samit <[email protected]>
Purpose
Wan2.2 uses a two-stage denoising process with two separate transformer models:

- transformer: the high-noise stage (early timesteps)
- transformer_2: the low-noise stage (later timesteps)
Loading both transformers simultaneously causes OOM issues on systems with limited GPU memory, as each transformer can consume ~27 GB of memory.
This PR implements conditional transformer loading based on the boundary_ratio parameter, following the approach of this PR from HF diffusers: huggingface/diffusers#12024. The pipeline now loads only the transformers that will actually be used:

| boundary_ratio | Transformer(s) loaded | Transformer skipped |
|---|---|---|
| 0.0 | transformer (high-noise) | transformer_2 |
| 1.0 | transformer_2 (low-noise) | transformer |
| 0.0 < x < 1.0 | both | None |

Test Plan
Test Result
Memory usage before:
Memory usage after:
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.