[Perf]: CFG parallel abstraction #851
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 739b668791
Pull request overview
This PR introduces a shared classifier-free guidance (CFG) parallelization abstraction via CFGParallelMixin (and QwenImageCFGParallelMixin) and refactors multiple diffusion pipelines and examples to use it, enabling rank-split conditional/unconditional denoising across a dedicated CFG process group. It also wires CFG-parallel configuration into the offline video examples and updates the user documentation to describe and advertise CFG-Parallel support for the relevant models.
Changes:
- Add `CFGParallelMixin` and `QwenImageCFGParallelMixin`, implementing reusable `predict_noise_maybe_with_cfg` and `scheduler_step_maybe_with_cfg` helpers for both sequential and CFG-parallel execution.
- Refactor image and video diffusion pipelines (Qwen-Image*, LongCat-Image*, Ovis-Image, Flux2-Klein, Wan2.2 T2V/I2V/TI2V, Stable-Diffusion-3) to use the new mixins instead of ad-hoc CFG logic, preserving editing-specific slicing and normalization behaviors.
- Extend the offline text-to-video and image-to-video examples and the diffusion acceleration docs to expose `cfg_parallel_size`, describe CFG-Parallel usage, and mark supported models appropriately.
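Conceptually, CFG-parallel runs the conditional and unconditional forward passes on different ranks instead of back to back, then gathers and combines the two predictions. A minimal NumPy sketch of why this is numerically equivalent to sequential CFG (the two ranks are simulated in one process; function names here are illustrative, not the actual mixin API):

```python
import numpy as np

def cfg_combine(cond, uncond, scale):
    # Standard classifier-free guidance combination.
    return uncond + scale * (cond - uncond)

def predict_noise(latents, conditional):
    # Stand-in for a transformer forward pass.
    return latents * (2.0 if conditional else 0.5)

latents = np.ones(4)
scale = 3.0

# Sequential CFG: one process runs both forward passes.
seq = cfg_combine(predict_noise(latents, True), predict_noise(latents, False), scale)

# CFG-parallel: rank 0 computes the conditional pass, rank 1 the
# unconditional pass; an all-gather then makes both available everywhere.
rank_outputs = [predict_noise(latents, r == 0) for r in range(2)]  # simulated gather
par = cfg_combine(rank_outputs[0], rank_outputs[1], scale)

assert np.allclose(seq, par)
```

Each rank thus pays for only one forward pass per step, at the cost of one collective per combination.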
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vllm_omni/diffusion/distributed/cfg_parallel.py | Introduces CFGParallelMixin and QwenImageCFGParallelMixin, encapsulating CFG sequential/parallel noise prediction, combination, optional normalization, and synchronized scheduler stepping. |
| vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py | Switches QwenImagePipeline to inherit QwenImageCFGParallelMixin and delegate its diffusion loop to the shared CFG-aware diffuse implementation. |
| vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image_edit.py | Refactors the Qwen image edit pipeline to use QwenImageCFGParallelMixin.diffuse, passing image latents and enabling CFG normalization through the mixin. |
| vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image_edit_plus.py | Same as above for the "Edit Plus" variant, delegating CFG-parallel diffusion (with normalization) to the mixin and passing attention kwargs through. |
| vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image_layered.py | Adopts QwenImageCFGParallelMixin, removing custom CFG-parallel logic and routing layered-image diffusion (with image latents and extra transformer kwargs) through the shared mixin. |
| vllm_omni/diffusion/models/longcat_image/pipeline_longcat_image.py | Makes LongCatImagePipeline a CFGParallelMixin user, replacing inline CFG math with predict_noise_maybe_with_cfg/scheduler_step_maybe_with_cfg and adding an overridable cfg_normalize_function plus a scheduler_step wrapper. |
| vllm_omni/diffusion/models/longcat_image/pipeline_longcat_image_edit.py | Enables CFG parallelism for LongCat image editing via CFGParallelMixin, refactors the loop to call predict_noise_maybe_with_cfg (with output slicing) and scheduler_step_maybe_with_cfg, and adds a local scheduler_step. |
| vllm_omni/diffusion/models/ovis_image/pipeline_ovis_image.py | Refactors Ovis-Image denoising into a diffuse function using CFGParallelMixin helpers, plus a custom scheduler_step that preserves MPS dtype behavior. |
| vllm_omni/diffusion/models/flux2_klein/pipeline_flux2_klein.py | Updates Flux.2-Klein's loop to use CFGParallelMixin CFG handling and scheduler stepping, including slicing when image latents are concatenated (editing mode), and defines a scheduler-step wrapper. |
| vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py | Makes the Wan2.2 T2V pipeline inherit CFGParallelMixin and replaces inline CFG logic with predict_noise_maybe_with_cfg and scheduler_step_maybe_with_cfg, while still supporting dual-transformer guidance scales. |
| vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2_i2v.py | Same refactor for Wan2.2 I2V, building positive/negative kwargs (including image encoder embeds) and delegating CFG/no-CFG behavior to the mixin plus a pipeline-specific predict_noise. |
| vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2_ti2v.py | Same pattern for Wan2.2 TI2V, with patch-wise timesteps and a local predict_noise helper used by the mixin. |
| vllm_omni/diffusion/models/sd3/pipeline_sd3.py | Makes the SD3 pipeline a CFGParallelMixin user, introduces a dedicated diffuse method that calls predict_noise_maybe_with_cfg/scheduler_step_maybe_with_cfg, and wires forward through this method. |
| examples/offline_inference/text_to_video/text_to_video.py | Imports DiffusionParallelConfig, adds a --cfg_parallel_size CLI flag, includes it in DiffusionParallelConfig, and passes the parallel config plus enforce_eager into Omni. |
| examples/offline_inference/image_to_video/image_to_video.py | Adds DiffusionParallelConfig, --cfg_parallel_size, and --enforce_eager support; constructs parallel_config with the requested CFG parallel size and passes it into Omni. |
| docs/user_guide/diffusion_acceleration.md | Updates acceleration support tables to mark LongCat, Ovis, SD3, and Wan2.2 as CFG-Parallel capable and extends the VideoGen table with a CFG-Parallel column. |
| docs/user_guide/diffusion/parallelism_acceleration.md | Rewrites the CFG-Parallel section to use CFGParallelMixin/QwenImageCFGParallelMixin as the canonical examples, documenting predict_noise_maybe_with_cfg, scheduler_step_maybe_with_cfg, and customization points like predict_noise and cfg_normalize_function. |
| "--cfg_parallel_size", | ||
| type=int, | ||
| default=1, | ||
| choices=[1, 2], |
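For context, a self-contained argparse sketch of the flag shown in this diff (the help text is mine, not from the PR):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--cfg_parallel_size",
    type=int,
    default=1,
    choices=[1, 2],  # 1 = sequential CFG; 2 = cond/uncond split across two ranks
)

args = parser.parse_args(["--cfg_parallel_size", "2"])
print(args.cfg_parallel_size)  # → 2
```

With `choices=[1, 2]`, argparse itself rejects any other value at the CLI boundary.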
what is the meaning of setting CFG parallel size to 1?
Also, this CFG size check should be done in the CFG parallelism implementation, not just in the offline examples.
The CFG parallel default size is 1 because, in vllm_omni/diffusion/data.py, world_size is defined as a product of multiple parallel sizes:

```python
self.world_size = (
    self.pipeline_parallel_size
    * self.data_parallel_size
    * self.tensor_parallel_size
    * self.ulysses_degree
    * self.ring_degree
    * self.cfg_parallel_size
)
```

Besides, I revised data.py to check the CFG size during configuration.
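The product above, together with the configuration-time check, can be sketched as follows (field names mirror the snippet; the dataclass itself is illustrative, not the actual data.py):

```python
from dataclasses import dataclass

@dataclass
class ParallelConfig:
    pipeline_parallel_size: int = 1
    data_parallel_size: int = 1
    tensor_parallel_size: int = 1
    ulysses_degree: int = 1
    ring_degree: int = 1
    cfg_parallel_size: int = 1

    def __post_init__(self):
        # CFG-parallel splits cond/uncond across ranks, so only
        # 1 (disabled) or 2 (enabled) are meaningful values.
        if self.cfg_parallel_size not in (1, 2):
            raise ValueError("cfg_parallel_size must be 1 or 2")
        self.world_size = (
            self.pipeline_parallel_size
            * self.data_parallel_size
            * self.tensor_parallel_size
            * self.ulysses_degree
            * self.ring_degree
            * self.cfg_parallel_size
        )

# A size of 1 leaves the world size unchanged, i.e. CFG parallel is a no-op.
assert ParallelConfig().world_size == 1
assert ParallelConfig(cfg_parallel_size=2, tensor_parallel_size=2).world_size == 4
```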
@wtomin Should we also add an e2e test using riverclouds/qwen_image_random?
Maybe slow. I asked @Gaohan123, and the conclusion is that we should use unit tests for now.
```python
# Compute the previous noisy sample x_t -> x_t-1 with automatic CFG sync
latents = self.scheduler_step_maybe_with_cfg(noise_pred, t, latents, do_true_cfg)

if torch.cuda.is_available():
```
@ZJY0516 @SamitHuang I have moved the per-step torch.cuda.empty_cache() call into the pipelines of the Wan2.2 series models.
It now works fine at 480x832x33 resolution, and empty_cache is called only once.
I'm afraid it will break other platforms; for example, could non-CUDA platforms run into OOM errors? We'd also better add a comment explaining why we do this here.
Of course. I changed torch.cuda.empty_cache() to current_omni_platform.empty_cache() and added a comment explaining why it is called here.
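For illustration, a minimal sketch of the platform-dispatch idea behind that change (class and function names are hypothetical, not the actual current_omni_platform API):

```python
class CudaPlatform:
    def empty_cache(self):
        # On a real CUDA platform this would call torch.cuda.empty_cache().
        return "cuda"

class GenericPlatform:
    def empty_cache(self):
        # Non-CUDA backends either no-op or call their own allocator hook,
        # so pipelines never touch torch.cuda directly.
        return "noop"

def resolve_platform(device_type):
    # The pipeline calls one uniform method regardless of backend.
    return CudaPlatform() if device_type == "cuda" else GenericPlatform()

assert resolve_platform("cuda").empty_cache() == "cuda"
assert resolve_platform("cpu").empty_cache() == "noop"
```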
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Purpose
As the PR for RFC #850, this implements the CFG parallelization abstraction for diffusion pipelines in vLLM-Omni via `CFGParallelMixin`.
`CFGParallelMixin` is a shared abstraction that enables diffusion pipelines to perform classifier-free guidance (CFG) either sequentially (in a single process) or in parallel across a dedicated CFG process group (rank-split conditional/unconditional forward passes).
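Some pipelines additionally renormalize the combined prediction after CFG (the mixin's cfg_normalize_function hook). A NumPy sketch of a common renormalization variant, which rescales the combined prediction to the conditional prediction's norm (the exact formula used by the mixin may differ):

```python
import numpy as np

def cfg_with_norm(cond, uncond, scale):
    # Standard CFG combination, then rescale to the conditional
    # prediction's per-sample norm so high guidance scales do not
    # blow up the magnitude of the output.
    comb = uncond + scale * (cond - uncond)
    cond_norm = np.linalg.norm(cond, axis=-1, keepdims=True)
    comb_norm = np.linalg.norm(comb, axis=-1, keepdims=True)
    return comb * (cond_norm / comb_norm)

cond = np.array([[1.0, 2.0]])
uncond = np.array([[0.5, 1.0]])
out = cfg_with_norm(cond, uncond, 4.0)

# The renormalized output keeps the conditional prediction's magnitude.
assert np.allclose(np.linalg.norm(out), np.linalg.norm(cond))
```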
See `QwenImageCFGParallelMixin.diffuse` as an example.
Test Plan
Major tests:
- `test_predict_noise_maybe_with_cfg`: verifies that CFG-parallel execution produces numerically identical results to sequential CFG execution.
- `test_predict_noise_without_cfg`: tests the case where CFG is disabled (`do_true_cfg=False`).
Five models are tested: FLUX.2-KLEIN-4B, LONGCAT-IMAGE, OVIS-IMAGE, QWEN-IMAGE, STABLE-DIFFUSION-3
The bash script to run all t2i tasks
Four models are tested: Qwen-Image-Edit, Qwen-Image-Edit-2509, Qwen-Image-Layered, LongCat-Image-Edit
(Because of #1002, `Qwen-Image-Layered` failed with a shape error.)
The bash script to run all image edit tasks
The bash script to run all video generation tasks
Test Result
========================= 3 passed, 3 warnings in 458.11s (0:07:38) =========================
Passed all tests.
Setup
Text-To-Image
FLUX.2-KLEIN-4B, LongCat-Image, Ovis-Image, Qwen-Image, Stable-Diffusion-3
The speed acceleration performance of SD3 is not good.
Image Editing
Qwen-Image-Edit, Qwen-Image-Edit-2509, LongCat-Image-Edit
Video Generation
Wan-AI/Wan2.2-T2V-A14B-Diffusers, Wan-AI/Wan2.2-I2V-A14B-Diffusers, Wan-AI/Wan2.2-TI2V-5B-Diffusers
CC List
@ZJY0516 @SamitHuang @david6666666 @hsliuustc0106
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.