[Perf]: CFG parallel abstraction #851
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 739b668791
Pull request overview
This PR introduces a shared classifier-free guidance (CFG) parallelization abstraction via CFGParallelMixin (and QwenImageCFGParallelMixin) and refactors multiple diffusion pipelines and examples to use it, enabling rank-split conditional/unconditional denoising across a dedicated CFG process group. It also wires CFG-parallel configuration into the offline video examples and updates the user documentation to describe and advertise CFG-Parallel support for the relevant models.
Changes:
- Add `CFGParallelMixin` and `QwenImageCFGParallelMixin`, implementing reusable `predict_noise_maybe_with_cfg` and `scheduler_step_maybe_with_cfg` helpers for both sequential and CFG-parallel execution.
- Refactor image and video diffusion pipelines (Qwen-Image*, LongCat-Image*, Ovis-Image, Flux2-Klein, Wan2.2 T2V/I2V/TI2V, Stable-Diffusion-3) to use the new mixins instead of ad-hoc CFG logic, preserving editing-specific slicing and normalization behaviors.
- Extend the offline text-to-video and image-to-video examples and the diffusion acceleration docs to expose `cfg_parallel_size`, describe CFG-Parallel usage, and mark supported models appropriately.
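Conceptually, CFG-parallel runs the conditional and unconditional forward passes on different ranks instead of back to back, then gathers and combines the two predictions. A minimal NumPy sketch of why this is numerically equivalent to sequential CFG (the two ranks are simulated in one process; function names here are illustrative, not the actual mixin API):

```python
import numpy as np

def cfg_combine(cond, uncond, scale):
    # Standard classifier-free guidance combination.
    return uncond + scale * (cond - uncond)

def predict_noise(latents, conditional):
    # Stand-in for a transformer forward pass.
    return latents * (2.0 if conditional else 0.5)

latents = np.ones(4)
scale = 3.0

# Sequential CFG: one process runs both forward passes.
seq = cfg_combine(predict_noise(latents, True), predict_noise(latents, False), scale)

# CFG-parallel: rank 0 computes the conditional pass, rank 1 the
# unconditional pass; an all-gather then makes both available everywhere.
rank_outputs = [predict_noise(latents, r == 0) for r in range(2)]  # simulated gather
par = cfg_combine(rank_outputs[0], rank_outputs[1], scale)

assert np.allclose(seq, par)
```

Each rank thus pays for only one forward pass per step, at the cost of one collective per combination.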
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vllm_omni/diffusion/distributed/cfg_parallel.py | Introduces CFGParallelMixin and QwenImageCFGParallelMixin, encapsulating CFG sequential/parallel noise prediction, combination, optional normalization, and synchronized scheduler stepping. |
| vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py | Switches QwenImagePipeline to inherit QwenImageCFGParallelMixin and delegate its diffusion loop to the shared CFG-aware diffuse implementation. |
| vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image_edit.py | Refactors the Qwen image edit pipeline to use QwenImageCFGParallelMixin.diffuse, passing image latents and enabling CFG normalization through the mixin. |
| vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image_edit_plus.py | Same as above for the "Edit Plus" variant, delegating CFG-parallel diffusion (with normalization) to the mixin and passing attention kwargs through. |
| vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image_layered.py | Adopts QwenImageCFGParallelMixin, removing custom CFG-parallel logic and routing layered-image diffusion (with image latents and extra transformer kwargs) through the shared mixin. |
| vllm_omni/diffusion/models/longcat_image/pipeline_longcat_image.py | Makes LongCatImagePipeline a CFGParallelMixin user, replacing inline CFG math with predict_noise_maybe_with_cfg/scheduler_step_maybe_with_cfg and adding an overridable cfg_normalize_function plus a scheduler_step wrapper. |
| vllm_omni/diffusion/models/longcat_image/pipeline_longcat_image_edit.py | Enables CFG parallelism for LongCat image editing via CFGParallelMixin, refactors the loop to call predict_noise_maybe_with_cfg (with output slicing) and scheduler_step_maybe_with_cfg, and adds a local scheduler_step. |
| vllm_omni/diffusion/models/ovis_image/pipeline_ovis_image.py | Refactors Ovis-Image denoising into a diffuse function using CFGParallelMixin helpers, plus a custom scheduler_step that preserves MPS dtype behavior. |
| vllm_omni/diffusion/models/flux2_klein/pipeline_flux2_klein.py | Updates Flux.2-Klein's loop to use CFGParallelMixin CFG handling and scheduler stepping, including slicing when image latents are concatenated (editing mode), and defines a scheduler-step wrapper. |
| vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py | Makes the Wan2.2 T2V pipeline inherit CFGParallelMixin and replaces inline CFG logic with predict_noise_maybe_with_cfg and scheduler_step_maybe_with_cfg, while still supporting dual-transformer guidance scales. |
| vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2_i2v.py | Same refactor for Wan2.2 I2V, building positive/negative kwargs (including image encoder embeds) and delegating CFG/no-CFG behavior to the mixin plus a pipeline-specific predict_noise. |
| vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2_ti2v.py | Same pattern for Wan2.2 TI2V, with patch-wise timesteps and a local predict_noise helper used by the mixin. |
| vllm_omni/diffusion/models/sd3/pipeline_sd3.py | Makes the SD3 pipeline a CFGParallelMixin user, introduces a dedicated diffuse method that calls predict_noise_maybe_with_cfg/scheduler_step_maybe_with_cfg, and wires forward through this method. |
| examples/offline_inference/text_to_video/text_to_video.py | Imports DiffusionParallelConfig, adds a --cfg_parallel_size CLI flag, includes it in DiffusionParallelConfig, and passes the parallel config plus enforce_eager into Omni. |
| examples/offline_inference/image_to_video/image_to_video.py | Adds DiffusionParallelConfig, --cfg_parallel_size, and --enforce_eager support; constructs parallel_config with the requested CFG parallel size and passes it into Omni. |
| docs/user_guide/diffusion_acceleration.md | Updates acceleration support tables to mark LongCat, Ovis, SD3, and Wan2.2 as CFG-Parallel capable and extends the VideoGen table with a CFG-Parallel column. |
| docs/user_guide/diffusion/parallelism_acceleration.md | Rewrites the CFG-Parallel section to use CFGParallelMixin/QwenImageCFGParallelMixin as the canonical examples, documenting predict_noise_maybe_with_cfg, scheduler_step_maybe_with_cfg, and customization points like predict_noise and cfg_normalize_function. |
| "--cfg_parallel_size", | ||
| type=int, | ||
| default=1, | ||
| choices=[1, 2], |
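For context, a self-contained argparse sketch of the flag shown in this diff (the help text is mine, not from the PR):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--cfg_parallel_size",
    type=int,
    default=1,
    choices=[1, 2],  # 1 = sequential CFG; 2 = cond/uncond split across two ranks
)

args = parser.parse_args(["--cfg_parallel_size", "2"])
print(args.cfg_parallel_size)  # → 2
```

With `choices=[1, 2]`, argparse itself rejects any other value at the CLI boundary.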
what is the meaning of setting CFG parallel size to 1?
Also, this CFG size check should be done in the CFG parallelism implementation, not just in the offline examples.
The CFG parallel default size is 1 because, in vllm_omni/diffusion/data.py, world_size is defined as a product of multiple parallel sizes:

```python
self.world_size = (
    self.pipeline_parallel_size
    * self.data_parallel_size
    * self.tensor_parallel_size
    * self.ulysses_degree
    * self.ring_degree
    * self.cfg_parallel_size
)
```

Besides, I revised data.py to check the CFG size during configuration.
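The product above, together with the configuration-time check, can be sketched as follows (field names mirror the snippet; the dataclass itself is illustrative, not the actual data.py):

```python
from dataclasses import dataclass

@dataclass
class ParallelConfig:
    pipeline_parallel_size: int = 1
    data_parallel_size: int = 1
    tensor_parallel_size: int = 1
    ulysses_degree: int = 1
    ring_degree: int = 1
    cfg_parallel_size: int = 1

    def __post_init__(self):
        # CFG-parallel splits cond/uncond across ranks, so only
        # 1 (disabled) or 2 (enabled) are meaningful values.
        if self.cfg_parallel_size not in (1, 2):
            raise ValueError("cfg_parallel_size must be 1 or 2")
        self.world_size = (
            self.pipeline_parallel_size
            * self.data_parallel_size
            * self.tensor_parallel_size
            * self.ulysses_degree
            * self.ring_degree
            * self.cfg_parallel_size
        )

# A size of 1 leaves the world size unchanged, i.e. CFG parallel is a no-op.
assert ParallelConfig().world_size == 1
assert ParallelConfig(cfg_parallel_size=2, tensor_parallel_size=2).world_size == 4
```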
@wtomin Should we also add an e2e test using riverclouds/qwen_image_random?
Maybe slow. I asked @Gaohan123, and the conclusion is that we should use unit tests for now.
```python
# Compute the previous noisy sample x_t -> x_t-1 with automatic CFG sync
latents = self.scheduler_step_maybe_with_cfg(noise_pred, t, latents, do_true_cfg)

if torch.cuda.is_available():
```
@ZJY0516 @SamitHuang I have moved the per-step torch.cuda.empty_cache() call into the pipelines of the Wan2.2 series models.
It now works fine at 480x832x33 resolution, and empty_cache is called only once.
I'm afraid it will break other platforms; for example, could non-CUDA platforms run into OOM errors? We'd also better add a comment explaining why we do this here.
Of course. I changed torch.cuda.empty_cache() to current_omni_platform.empty_cache() and added a comment explaining why it is called here.
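For illustration, a minimal sketch of the platform-dispatch idea behind that change (class and function names are hypothetical, not the actual current_omni_platform API):

```python
class CudaPlatform:
    def empty_cache(self):
        # On a real CUDA platform this would call torch.cuda.empty_cache().
        return "cuda"

class GenericPlatform:
    def empty_cache(self):
        # Non-CUDA backends either no-op or call their own allocator hook,
        # so pipelines never touch torch.cuda directly.
        return "noop"

def resolve_platform(device_type):
    # The pipeline calls one uniform method regardless of backend.
    return CudaPlatform() if device_type == "cuda" else GenericPlatform()

assert resolve_platform("cuda").empty_cache() == "cuda"
assert resolve_platform("cpu").empty_cache() == "noop"
```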
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Purpose
As the PR for RFC #850, this implements the CFG parallelization abstraction for diffusion pipelines in vLLM-Omni via `CFGParallelMixin`.
`CFGParallelMixin` is a shared abstraction that enables diffusion pipelines to perform classifier-free guidance (CFG) either sequentially (in a single process) or in parallel across a dedicated CFG process group (rank-split conditional/unconditional forward passes).
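Some pipelines additionally renormalize the combined prediction after CFG (the mixin's cfg_normalize_function hook). A NumPy sketch of a common renormalization variant, which rescales the combined prediction to the conditional prediction's norm (the exact formula used by the mixin may differ):

```python
import numpy as np

def cfg_with_norm(cond, uncond, scale):
    # Standard CFG combination, then rescale to the conditional
    # prediction's per-sample norm so high guidance scales do not
    # blow up the magnitude of the output.
    comb = uncond + scale * (cond - uncond)
    cond_norm = np.linalg.norm(cond, axis=-1, keepdims=True)
    comb_norm = np.linalg.norm(comb, axis=-1, keepdims=True)
    return comb * (cond_norm / comb_norm)

cond = np.array([[1.0, 2.0]])
uncond = np.array([[0.5, 1.0]])
out = cfg_with_norm(cond, uncond, 4.0)

# The renormalized output keeps the conditional prediction's magnitude.
assert np.allclose(np.linalg.norm(out), np.linalg.norm(cond))
```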
See `QwenImageCFGParallelMixin.diffuse` as an example.
Test Plan
Major tests:
- `test_predict_noise_maybe_with_cfg`: verifies that CFG-parallel execution produces numerically identical results to sequential CFG execution.
- `test_predict_noise_without_cfg`: tests the case where CFG is disabled (`do_true_cfg=False`).
Five models are tested: FLUX.2-KLEIN-4B, LONGCAT-IMAGE, OVIS-IMAGE, QWEN-IMAGE, STABLE-DIFFUSION-3
The bash script to run all t2i tasks
Four models are tested: Qwen-Image-Edit, Qwen-Image-Edit-2509, Qwen-Image-Layered, LongCat-Image-Edit
(Because of #1002, `Qwen-Image-Layered` failed with a shape error.)
The bash script to run all image edit tasks
The bash script to run all video generation tasks
Test Result
========================= 3 passed, 3 warnings in 458.11s (0:07:38) =========================
Passed all tests.
Setup
Text-To-Image
FLUX.2-KLEIN-4B, LongCat-Image, Ovis-Image, Qwen-Image, Stable-Diffusion-3
The speed acceleration performance of SD3 is not good.
Image Editing
Qwen-Image-Edit, Qwen-Image-Edit-2509, LongCat-Image-Edit
Video Generation
Wan-AI/Wan2.2-T2V-A14B-Diffusers, Wan-AI/Wan2.2-I2V-A14B-Diffusers, Wan-AI/Wan2.2-TI2V-5B-Diffusers
CC List
@ZJY0516 @SamitHuang @david6666666 @hsliuustc0106
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.