Merged

68 commits
cfb19c8
refactor base pipeline and qwen_image_base_pipeline
wtomin Jan 19, 2026
ae53dca
other pipelines
wtomin Jan 19, 2026
ecb9d34
flux2 pipeline
wtomin Jan 19, 2026
a564100
cfg normalize and longcat
wtomin Jan 19, 2026
5278e61
longcat_edit pipeline
wtomin Jan 19, 2026
01a03f3
ovis pipeline
wtomin Jan 19, 2026
607aef8
sd3 pipeline
wtomin Jan 19, 2026
788d302
updates
wtomin Jan 19, 2026
d5a6182
cfg mixin
wtomin Jan 19, 2026
5253b8d
correct latent step
wtomin Jan 19, 2026
783bce4
cfg broadcast correct
wtomin Jan 19, 2026
4158cdf
cfg parallel new name
wtomin Jan 19, 2026
2224e80
update cfg parallel logic
wtomin Jan 21, 2026
5b4a378
fix flux2
wtomin Jan 21, 2026
3330153
fix flux2
wtomin Jan 21, 2026
ef6122f
fix longcat image
wtomin Jan 21, 2026
222d5fa
doc
wtomin Jan 21, 2026
107f5ce
wan_2_2 pipelines
wtomin Jan 21, 2026
d72429e
ovis image
wtomin Jan 21, 2026
be3b8ba
sd3 image
wtomin Jan 21, 2026
700fb7e
pass neg kwargs if do_true_cfg
wtomin Jan 21, 2026
a5c3dfb
flux pipeline reset to main
wtomin Jan 21, 2026
2292ccf
flux pipeline updates
wtomin Jan 21, 2026
bd25fea
reset longcat pipeline
wtomin Jan 21, 2026
fbe2b6f
updatge longcat pipeline
wtomin Jan 21, 2026
900d37f
reset longcat edit pipeline
wtomin Jan 21, 2026
901eb11
update longcat edit pipeline
wtomin Jan 21, 2026
13ff9da
Zimage pipeline update
wtomin Jan 21, 2026
b9587cb
stable audio pipeline edits
wtomin Jan 21, 2026
c2ae71d
latents .contiguous()
wtomin Jan 21, 2026
d927698
t2v cfg_parallel
wtomin Jan 26, 2026
611eb2a
cache empty
wtomin Jan 26, 2026
51566e4
correct sd3 pipeline
wtomin Jan 27, 2026
b3f0936
video script with new kwargs
wtomin Jan 27, 2026
b20cfad
revert sd audio pipeline change
wtomin Jan 27, 2026
e3e1ba2
revert zimage pipeline
wtomin Jan 27, 2026
664d76d
support list update
wtomin Jan 27, 2026
50e49b1
update how-to-parallelize-a-pipeline
wtomin Jan 27, 2026
e1ed608
support list update wan2.2
wtomin Jan 27, 2026
fbf4837
revise document head
wtomin Jan 27, 2026
cb5f770
fix t2v args
wtomin Jan 27, 2026
aa3c337
fix parameter annotation
wtomin Jan 28, 2026
444e525
fix parameter annotation
wtomin Jan 28, 2026
63f64d3
empty_cache when cuda is available()
wtomin Jan 28, 2026
1767766
test unit
wtomin Jan 28, 2026
9b94886
fix parameter annotation
wtomin Jan 28, 2026
8662f8e
update unit test
wtomin Jan 28, 2026
0b0d71d
fix pre-commit error
wtomin Jan 28, 2026
cfbd49d
fix pre-commit error
wtomin Jan 29, 2026
9345694
check cfg_parallel size in data.py
wtomin Jan 29, 2026
c8bcf2e
update cfg_parallel_size arg doc
wtomin Jan 29, 2026
35dbdeb
doc refinement
wtomin Jan 29, 2026
f3a54fe
update doc with new arg
wtomin Jan 29, 2026
8dd8e61
offline script example in doc
wtomin Jan 29, 2026
18ce884
online serving args
wtomin Jan 29, 2026
f55beb0
serve args
wtomin Jan 29, 2026
3162aac
update doc
wtomin Jan 29, 2026
da2b307
fix error
wtomin Jan 29, 2026
5189be6
remove no_grad
wtomin Jan 30, 2026
c91a3a0
remove torch.save & torch.load
wtomin Jan 30, 2026
1fdde86
update hardward devices
wtomin Jan 30, 2026
117e0de
mv QwenImageCFGParallelMixin in qwen_image folder
wtomin Jan 30, 2026
027e717
check cfg_parallel validity in pipelines
wtomin Jan 30, 2026
3579470
fix unit test spawn process error
wtomin Jan 30, 2026
6a3070b
rm mps related code
wtomin Jan 30, 2026
a939170
mv empty_cache to wan pipelines after all diffusion steps
wtomin Jan 30, 2026
5a9af70
omni_platform and comment
wtomin Jan 30, 2026
afe8c7e
Merge branch 'main' into cfg-base-pipeline
hsliuustc0106 Jan 30, 2026
205 changes: 150 additions & 55 deletions docs/user_guide/diffusion/parallelism_acceleration.md

### CFG-Parallel

#### Offline Inference

CFG-Parallel is enabled through `DiffusionParallelConfig(cfg_parallel_size=2)`, which runs one rank for the positive branch and one rank for the negative branch.

An example of offline inference using CFG-Parallel (image-to-image) is shown below:

```python
from PIL import Image

from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

image_path = "path_to_image.png"
omni = Omni(
    model="Qwen/Qwen-Image-Edit",
    parallel_config=DiffusionParallelConfig(cfg_parallel_size=2),
)
input_image = Image.open(image_path).convert("RGB")

outputs = omni.generate(
    {
        "prompt": "turn this cat to a dog",
        "negative_prompt": "low quality, blurry",
        "multi_modal_data": {"image": input_image},
    },
    OmniDiffusionSamplingParams(
        true_cfg_scale=4.0,
        pil_image=input_image,
        num_inference_steps=50,
    ),
)
```

Notes:

- CFG-Parallel is only effective when **true CFG** is enabled, i.e. a `negative_prompt` is provided and the guidance scale (`true_cfg_scale`, or `--cfg_scale` in the example scripts) is greater than 1.

See `examples/offline_inference/image_to_image/image_edit.py` for a complete working example.
```bash
cd examples/offline_inference/image_to_image/
python image_edit.py \
--model "Qwen/Qwen-Image-Edit" \
--image "qwen_image_output.png" \
--prompt "turn this cat to a dog" \
--negative_prompt "low quality, blurry" \
--cfg_scale 4.0 \
--output "edited_image.png" \
--cfg_parallel_size 2
```

#### Online Serving

You can enable CFG-Parallel in online serving for diffusion models via `--cfg-parallel-size`:

```bash
vllm serve Qwen/Qwen-Image-Edit --omni --port 8091 --cfg-parallel-size 2
```

#### How to parallelize a pipeline

In `QwenImagePipeline`, each diffusion step runs two denoiser forward passes sequentially: one for the positive (conditional) branch and one for the negative (unconditional) branch.

CFG-Parallel assigns these two branches to different ranks in the **CFG group** and synchronizes the results.

vLLM-omni provides a `CFGParallelMixin` base class that encapsulates the CFG-Parallel logic. By inheriting from this mixin and calling its methods, pipelines can implement CFG-Parallel without duplicating boilerplate.

**Key Methods in CFGParallelMixin:**
- `predict_noise_maybe_with_cfg()`: Automatically handles CFG parallel noise prediction
- `scheduler_step_maybe_with_cfg()`: Scheduler step with automatic CFG rank synchronization

**Example Implementation:**

```python
class QwenImageCFGParallelMixin(CFGParallelMixin):
    """
    Base Mixin class for Qwen Image pipelines providing shared CFG methods.
    """

    def diffuse(
        self,
        prompt_embeds: torch.Tensor,
        prompt_embeds_mask: torch.Tensor,
        negative_prompt_embeds: torch.Tensor,
        negative_prompt_embeds_mask: torch.Tensor,
        latents: torch.Tensor,
        img_shapes: torch.Tensor,
        txt_seq_lens: torch.Tensor,
        negative_txt_seq_lens: torch.Tensor,
        timesteps: torch.Tensor,
        do_true_cfg: bool,
        guidance: torch.Tensor,
        true_cfg_scale: float,
        image_latents: torch.Tensor | None = None,
        cfg_normalize: bool = True,
        additional_transformer_kwargs: dict[str, Any] | None = None,
    ) -> torch.Tensor:
        self.transformer.do_true_cfg = do_true_cfg

        for i, t in enumerate(timesteps):
            timestep = t.expand(latents.shape[0]).to(device=latents.device, dtype=latents.dtype)

            # Prepare kwargs for the positive (conditional) prediction
            positive_kwargs = {
                "hidden_states": latents,
                "timestep": timestep / 1000,
                "guidance": guidance,
                "encoder_hidden_states_mask": prompt_embeds_mask,
                "encoder_hidden_states": prompt_embeds,
                "img_shapes": img_shapes,
                "txt_seq_lens": txt_seq_lens,
            }

            # Prepare kwargs for the negative (unconditional) prediction
            if do_true_cfg:
                negative_kwargs = {
                    "hidden_states": latents,
                    "timestep": timestep / 1000,
                    "guidance": guidance,
                    "encoder_hidden_states_mask": negative_prompt_embeds_mask,
                    "encoder_hidden_states": negative_prompt_embeds,
                    "img_shapes": img_shapes,
                    "txt_seq_lens": negative_txt_seq_lens,
                }
            else:
                negative_kwargs = None

            # Predict noise with automatic CFG parallel handling
            # - In CFG parallel mode: rank0 computes positive, rank1 computes negative
            # - Automatically gathers results and combines them on rank0
            noise_pred = self.predict_noise_maybe_with_cfg(
                do_true_cfg=do_true_cfg,
                true_cfg_scale=true_cfg_scale,
                positive_kwargs=positive_kwargs,
                negative_kwargs=negative_kwargs,
                cfg_normalize=cfg_normalize,
            )

            # Step the scheduler with automatic CFG synchronization
            # - Only rank0 computes the scheduler step
            # - Automatically broadcasts the updated latents to all ranks
            latents = self.scheduler_step_maybe_with_cfg(noise_pred, t, latents, do_true_cfg)

        return latents
```

**How it works:**
1. Prepare separate `positive_kwargs` and `negative_kwargs` for conditional and unconditional predictions
2. Call `predict_noise_maybe_with_cfg()` which:
- Detects if CFG parallel is enabled (`get_classifier_free_guidance_world_size() > 1`)
- Distributes computation: rank0 processes positive, rank1 processes negative
- Gathers predictions and combines them using `combine_cfg_noise()` on rank0
- Returns combined noise prediction (only valid on rank0)
3. Call `scheduler_step_maybe_with_cfg()` which:
- Only rank0 computes the scheduler step
- Broadcasts the updated latents to all ranks for synchronization
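
The combine-and-normalize math performed on rank0 can be sketched in isolation; the function below mirrors the document's formula (`comb = neg + scale * (pos - neg)` followed by norm rescaling), but the name `combine_cfg_noise` is used here as a standalone illustration rather than the exact library signature:

```python
import torch


def combine_cfg_noise(
    noise_pred: torch.Tensor,
    neg_noise_pred: torch.Tensor,
    true_cfg_scale: float,
    normalize: bool = True,
) -> torch.Tensor:
    # True CFG: extrapolate from the negative toward the positive prediction
    comb = neg_noise_pred + true_cfg_scale * (noise_pred - neg_noise_pred)
    if normalize:
        # Rescale so the combined prediction keeps the conditional norm
        cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
        noise_norm = torch.norm(comb, dim=-1, keepdim=True)
        comb = comb * (cond_norm / noise_norm)
    return comb
```

With `true_cfg_scale=1` the combination degenerates to the positive prediction, and with normalization enabled the output always has the same last-dim norm as the positive branch.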

**How to customize**

Some pipelines may need to customize the following methods of `CFGParallelMixin`:

1. Override `predict_noise` when the transformer forward pass needs custom behavior:
```python
def predict_noise(self, *args, **kwargs):
"""
Forward pass through transformer to predict noise.

Subclasses should override this if they need custom behavior,
but the default implementation calls self.transformer.
"""
return self.transformer(*args, **kwargs)[0]

```
2. Override `cfg_normalize_function` when a different post-combination normalization is needed. The default implementation, applied after the noise predictions from both branches are combined, is:
```python
def cfg_normalize_function(self, noise_pred, comb_pred):
"""
Normalize the combined noise prediction.

Args:
noise_pred: positive noise prediction
comb_pred: combined noise prediction after CFG

Returns:
Normalized noise prediction tensor
"""
cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
noise_pred = comb_pred * (cond_norm / noise_norm)
return noise_pred
```
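
For example, a pipeline that wants plain (un-normalized) CFG could override this hook as below. A local stub stands in for `CFGParallelMixin` so the sketch is self-contained; the real base class and its import path come from vllm-omni:

```python
import torch


class CFGParallelMixinStub:
    """Stand-in for vllm-omni's CFGParallelMixin (assumed API from this doc)."""

    def cfg_normalize_function(self, noise_pred: torch.Tensor, comb_pred: torch.Tensor) -> torch.Tensor:
        # Default: rescale the combined prediction to the conditional norm
        cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
        noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
        return comb_pred * (cond_norm / noise_norm)


class PlainCFGPipeline(CFGParallelMixinStub):
    def cfg_normalize_function(self, noise_pred: torch.Tensor, comb_pred: torch.Tensor) -> torch.Tensor:
        # Override: return the raw CFG combination without norm rescaling
        return comb_pred
```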
14 changes: 7 additions & 7 deletions docs/user_guide/diffusion_acceleration.md
The following table shows which models are currently supported by each acceleration method:

| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention | CFG-Parallel |
|-------|------------------|:----------:|:-----------:|:-----------:|:----------------:|:----------------:|
| **LongCat-Image** | `meituan-longcat/LongCat-Image` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **LongCat-Image-Edit** | `meituan-longcat/LongCat-Image-Edit` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **Ovis-Image** | `OvisAI/Ovis-Image` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-2512** | `Qwen/Qwen-Image-2512` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Edit-2509** | `Qwen/Qwen-Image-Edit-2509` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Layered** | `Qwen/Qwen-Image-Layered` | ❌ | ✅ | ✅ | ✅ | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ | ✅ | ❌ | ❌ | ❌ |
| **Stable-Diffusion3.5** | `stabilityai/stable-diffusion-3.5` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **Bagel** | `ByteDance-Seed/BAGEL-7B-MoT` | ✅ | ✅ | ❌ | ❌ | ❌ |

### VideoGen

| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention | CFG-Parallel |
|-------|------------------|:--------:|:---------:|:----------:|:--------------:|:------------:|
| **Wan2.2** | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | ❌ | ✅ | ✅ | ✅ | ✅ |


## Performance Benchmarks
Key arguments:
- `--image`: path(s) to the source image(s) (PNG/JPG, converted to RGB). Can specify multiple images.
- `--prompt` / `--negative_prompt`: text description (string).
- `--cfg_scale`: true classifier-free guidance scale (default: 4.0). Classifier-free guidance is enabled by setting cfg_scale > 1 and providing a negative_prompt. Higher guidance scale encourages images closely linked to the text prompt, usually at the expense of lower image quality.
- `--cfg_parallel_size`: the number of devices to run CFG Parallel. CFG Parallel is valid only if classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--guidance_scale`: guidance scale for guidance-distilled models (default: 1.0, disabled). Unlike classifier-free guidance (--cfg_scale), guidance-distilled models take the guidance scale directly as an input parameter. Enabled when guidance_scale > 1. Ignored when not using guidance-distilled models.
- `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
- `--output`: path to save the generated PNG.
Key arguments:
- `--num_frames`: Number of frames (default 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to low/high-noise stages for MoE).
- `--negative_prompt`: Optional list of artifacts to suppress.
- `--cfg_parallel_size`: the number of devices to run CFG Parallel. CFG Parallel is valid only if classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--boundary_ratio`: Boundary split ratio for two-stage MoE models.
- `--flow_shift`: Scheduler flow shift (5.0 for 720p, 12.0 for 480p).
- `--num_inference_steps`: Number of denoising steps (default 50).
Key arguments:
- `--prompt`: text description (string).
- `--seed`: integer seed for deterministic sampling.
- `--cfg_scale`: true CFG scale (model-specific guidance strength).
- `--cfg_parallel_size`: the number of devices to run CFG Parallel. CFG Parallel is valid only if classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--num_images_per_prompt`: number of images to generate per prompt (saves as `output`, `output_1`, ...).
- `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
- `--height/--width`: output resolution (defaults 1024x1024).
Key arguments:
- `--num_frames`: Number of frames (Wan default is 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to the low/high-noise stages).
- `--negative_prompt`: optional list of artifacts to suppress (the PR demo used a long Chinese string).
- `--cfg_parallel_size`: the number of devices to run CFG Parallel. CFG Parallel is valid only if classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--boundary_ratio`: Boundary split ratio for low/high DiT.
- `--fps`: frames per second for the saved MP4 (requires `diffusers` export_to_video).
- `--output`: path to save the generated video.
Key arguments:
- `--output`: path to save the generated PNG.
- `--vae_use_slicing`: enable VAE slicing for memory optimization.
- `--vae_use_tiling`: enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: enable CPU offloading for diffusion models.

> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
1 change: 1 addition & 0 deletions examples/offline_inference/image_to_video/README.md
Key arguments:
- `--output`: Path to save the generated video.
- `--vae_use_slicing`: Enable VAE slicing for memory optimization.
- `--vae_use_tiling`: Enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: enable CPU offloading for diffusion models.

> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
29 changes: 28 additions & 1 deletion examples/offline_inference/image_to_video/image_to_video.py
import PIL.Image
import torch

from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.outputs import OmniRequestOutput
def parse_args() -> argparse.Namespace:
    ...
    parser.add_argument(
        "--cfg_parallel_size",
        type=int,
        default=1,
        choices=[1, 2],
        help="Number of GPUs used for classifier-free guidance (CFG) parallelism.",
    )
    parser.add_argument(
        "--enforce_eager",
        action="store_true",
        help="Disable torch.compile and force eager execution.",
    )
    return parser.parse_args()


def main():
    ...
    # Check if profiling is requested via environment variable
    profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))

    parallel_config = DiffusionParallelConfig(
        cfg_parallel_size=args.cfg_parallel_size,
    )
    omni = Omni(
        model=args.model,
        enable_layerwise_offload=args.enable_layerwise_offload,
        ...
        boundary_ratio=args.boundary_ratio,
        flow_shift=args.flow_shift,
        enable_cpu_offload=args.enable_cpu_offload,
        parallel_config=parallel_config,
        enforce_eager=args.enforce_eager,
    )

    if profiler_enabled:
        print("[Profiler] Starting profiling...")
        omni.start_profile()

    # Print generation configuration
    print(f"\n{'=' * 60}")
    print("Generation Configuration:")
    print(f"  Model: {args.model}")
    print(f"  Inference steps: {args.num_inference_steps}")
    print(f"  Frames: {args.num_frames}")
    print(f"  Parallel configuration: cfg_parallel_size={args.cfg_parallel_size}")
    print(f"  Video size: {args.width}x{args.height}")
    print(f"{'=' * 60}\n")

    # omni.generate() returns Generator[OmniRequestOutput, None, None]
    frames = omni.generate(
        {
            ...
1 change: 1 addition & 0 deletions examples/offline_inference/text_to_image/README.md
Key arguments:
- `--output`: path to save the generated PNG.
- `--vae_use_slicing`: enable VAE slicing for memory optimization.
- `--vae_use_tiling`: enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: enable CPU offloading for diffusion models.

> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.