
[Perf]: CFG parallel abstraction #851

Merged: hsliuustc0106 merged 68 commits into vllm-project:main from wtomin:cfg-base-pipeline on Jan 30, 2026
Conversation

wtomin (Contributor) commented Jan 19, 2026

Purpose

As the PR for RFC #850, this change implements the CFG parallelization abstraction for diffusion pipelines in vLLM-Omni via CFGParallelMixin.

CFGParallelMixin is a shared abstraction that enables diffusion pipelines to perform classifier-free guidance (CFG) either sequentially (single process) or in parallel across a dedicated CFG process group (rank-split conditional/unconditional forward passes).

See QwenImageCFGParallelMixin.diffuse as an example.
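The core idea can be sketched in plain Python. This is a minimal illustration only, not the real mixin: the actual implementation operates on torch tensors across a CFG process group, and the helper names below mirror but simplify the real API.

```python
# Minimal sketch of the CFG combination the mixin implements. Plain Python
# floats stand in for tensors; the real code lives in
# vllm_omni/diffusion/distributed/cfg_parallel.py (names here are simplified).

def combine_cfg(noise_cond, noise_uncond, guidance_scale):
    # Standard classifier-free guidance: uncond + scale * (cond - uncond)
    return [u + guidance_scale * (c - u) for c, u in zip(noise_cond, noise_uncond)]

def predict_noise_sequential(predict, latents, guidance_scale, do_true_cfg):
    # Sequential path: one process runs both forward passes back to back.
    # The parallel path instead runs one pass per rank in the CFG group
    # and gathers the two results before combining.
    noise_cond = predict(latents, conditional=True)
    if not do_true_cfg:
        return noise_cond
    noise_uncond = predict(latents, conditional=False)
    return combine_cfg(noise_cond, noise_uncond, guidance_scale)
```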

Test Plan

  • Unit test
pytest -s -v tests/diffusion/distributed/test_cfg_parallel.py
  1. Main test: test_predict_noise_maybe_with_cfg
    Purpose: Verifies that CFG parallel produces numerically identical results to sequential CFG execution.

  2. test_predict_noise_without_cfg
    Purpose: Tests the case when CFG is disabled (do_true_cfg=False).
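To make the equivalence property concrete, here is a hypothetical toy version of what the main test asserts; the toy model and function names are assumptions, not the real test code.

```python
# Illustration of what test_predict_noise_maybe_with_cfg checks: the
# rank-split path must reproduce the sequential path exactly.

def toy_predict(latents, conditional):
    # Stand-in for a transformer forward pass.
    scale = 1.5 if conditional else 0.5
    return [scale * x for x in latents]

def cfg_sequential(latents, g):
    cond = toy_predict(latents, True)
    uncond = toy_predict(latents, False)
    return [u + g * (c - u) for c, u in zip(cond, uncond)]

def cfg_rank_split(latents, g):
    # Rank 0 computes the conditional pass, rank 1 the unconditional pass;
    # here the "all-gather" across the CFG group is just collecting both locally.
    per_rank = {0: toy_predict(latents, True), 1: toy_predict(latents, False)}
    cond, uncond = per_rank[0], per_rank[1]
    return [u + g * (c - u) for c, u in zip(cond, uncond)]
```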

  • image generation

Five models are tested: FLUX.2-klein-4B, LongCat-Image, Ovis-Image, Qwen-Image, Stable-Diffusion-3.

The bash script to run all text-to-image (t2i) tasks:
#!/bin/bash

# Script to run text-to-image inference for all supported models
# Comparing with and without CFG parallel
# Logs are saved to individual txt files for each experiment
# If one task fails, other tasks will continue to run

PROMPT="a lovely bunny holding a sign that says 'vllm-omni'"
NEGATIVE_PROMPT="ugly, unclear, blurry, gray"

# Arrays to track success and failure
declare -a SUCCESS_TASKS
declare -a FAILED_TASKS

# Define models and their parameters
# Format: "model_name|model_path|scale_arg|scale_value"
declare -a MODELS=(
  "Qwen-Image|Qwen/Qwen-Image|cfg_scale|4.0"
  "FLUX.2-klein-4B|black-forest-labs/FLUX.2-klein-4B|guidance_scale|4.0"
  "LongCat-Image|meituan-longcat/LongCat-Image|guidance_scale|4.0"
  "Ovis-Image|AIDC-AI/Ovis-Image-7B|guidance_scale|4.0"
  "Stable-Diffusion-3|stabilityai/stable-diffusion-3.5-medium|guidance_scale|4.0"
)

# Eager mode configurations
declare -a EAGER_CONFIGS=(
  "no_eager|"
  "with_eager|--enforce_eager"
)

# CFG parallel configurations
declare -a CFG_CONFIGS=(
  "no_cfg_parallel|"
  "with_cfg_parallel|--cfg_parallel_size 2"
)

echo "=========================================="
echo "Starting text-to-image inference tests"
echo "Testing combinations of eager mode and CFG parallel"
echo "4 test cases per model:"
echo "  1. no_eager + no_cfg_parallel"
echo "  2. no_eager + with_cfg_parallel"
echo "  3. with_eager + no_cfg_parallel"
echo "  4. with_eager + with_cfg_parallel"
echo "Each model's outputs saved in its own directory"
echo "Note: If one task fails, others will continue"
echo "=========================================="
echo ""

TASK_NUM=0
TOTAL_TASKS=$((${#MODELS[@]} * ${#EAGER_CONFIGS[@]} * ${#CFG_CONFIGS[@]}))

# Run experiments for each model and configuration
for model_info in "${MODELS[@]}"; do
  IFS='|' read -r model_name model_path scale_arg scale_value <<< "$model_info"
  
  # Create directory for this model
  model_dir="${model_name// /_}"
  mkdir -p "$model_dir"
  
  for eager_info in "${EAGER_CONFIGS[@]}"; do
    IFS='|' read -r eager_label eager_args <<< "$eager_info"
    
    for cfg_info in "${CFG_CONFIGS[@]}"; do
      IFS='|' read -r cfg_label cfg_args <<< "$cfg_info"
      TASK_NUM=$((TASK_NUM + 1))
      
      # Generate filenames inside model directory
      base_name="${model_name,,}"
      base_name="${base_name// /_}"
      output_file="$model_dir/${base_name}_output_${eager_label}_${cfg_label}.png"
      log_file="$model_dir/${base_name}_${eager_label}_${cfg_label}.log"
      task_label="$model_name ($eager_label + $cfg_label)"
      
      echo "=========================================="
      echo "$TASK_NUM/$TOTAL_TASKS: Running $task_label..."
      echo "=========================================="
      
      # Build and execute command
      if python examples/offline_inference/text_to_image/text_to_image.py \
        --model "$model_path" \
        --${scale_arg} "$scale_value" \
        --prompt "$PROMPT" \
        --negative_prompt "$NEGATIVE_PROMPT" \
        --output "$output_file" \
        $eager_args \
        $cfg_args \
        2>&1 | tee "$log_file"; then
        echo "✓ $task_label completed."
        SUCCESS_TASKS+=("$task_label")
      else
        echo "✗ $task_label FAILED."
        FAILED_TASKS+=("$task_label")
      fi
      echo ""
    done
  done
done

echo "=========================================="
echo "All tasks completed!"
echo "=========================================="
echo "Summary: ${#SUCCESS_TASKS[@]}/$TOTAL_TASKS successful, ${#FAILED_TASKS[@]}/$TOTAL_TASKS failed"
echo ""

if [ ${#SUCCESS_TASKS[@]} -gt 0 ]; then
  echo "✓ Successful tasks:"
  for task in "${SUCCESS_TASKS[@]}"; do
    echo "  - $task"
  done
  echo ""
fi

if [ ${#FAILED_TASKS[@]} -gt 0 ]; then
  echo "✗ Failed tasks:"
  for task in "${FAILED_TASKS[@]}"; do
    echo "  - $task"
  done
  echo ""
  echo "Check model directories for error logs."
  echo ""
fi

echo "Output directories:"
for model_info in "${MODELS[@]}"; do
  IFS='|' read -r model_name _ _ _ <<< "$model_info"
  model_dir="${model_name// /_}"
  echo "  - $model_dir/ (images and logs for $model_name)"
done

# Exit with error code if any tasks failed
if [ ${#FAILED_TASKS[@]} -gt 0 ]; then
  exit 1
fi

  • image edit

Four models are tested: Qwen-Image-Edit, Qwen-Image-Edit-2509, Qwen-Image-Layered, LongCat-Image-Edit
(Because of #1002, Qwen-Image-Layered fails with a shape error.)

The bash script to run all image-edit tasks:
#!/bin/bash

# Script to run image-to-image (image edit) inference for all supported models
# Comparing with and without CFG parallel
# Logs are saved to individual txt files for each experiment
# If one task fails, other tasks will continue to run

PROMPT="turn this bunny to a cat"
NEGATIVE_PROMPT="ugly, unclear, blurry, gray"
NUM_INFERENCE_STEPS=50

# Arrays to track success and failure
declare -a SUCCESS_TASKS
declare -a FAILED_TASKS

# Define models and their parameters
# Format: "model_name|model_path|input_image|output_prefix|scale_arg|scale_value|extra_args"
declare -a MODELS=(
  "Qwen-Image-Edit|Qwen/Qwen-Image-Edit|./Qwen-Image/qwen-image_output_no_eager_no_cfg_parallel.png|output_image_edit|cfg_scale|4.0|"
  "Qwen-Image-Edit-2509|Qwen/Qwen-Image-Edit-2509|./Qwen-Image/qwen-image_output_no_eager_no_cfg_parallel.png|output_image_edit|cfg_scale|4.0|"
  "Qwen-Image-Layered|Qwen/Qwen-Image-Layered|./Qwen-Image/qwen-image_output_no_eager_no_cfg_parallel.png|layered|cfg_scale|4.0|--color-format RGBA --output layered --layers 2"
  "LongCat-Image-Edit|meituan-longcat/LongCat-Image-Edit|./LongCat-Image/longcat-image_output_no_eager_no_cfg_parallel.png|output_image_edit|guidance_scale|4.0|"
)

# Eager mode configurations
declare -a EAGER_CONFIGS=(
  "no_eager|"
  "with_eager|--enforce_eager"
)

# CFG parallel configurations
declare -a CFG_CONFIGS=(
  "no_cfg_parallel|"
  "with_cfg_parallel|--cfg_parallel_size 2"
)

echo "=========================================="
echo "Starting image-to-image (image edit) inference tests"
echo "Testing combinations of eager mode and CFG parallel"
echo "4 test cases per model:"
echo "  1. no_eager + no_cfg_parallel"
echo "  2. no_eager + with_cfg_parallel"
echo "  3. with_eager + no_cfg_parallel"
echo "  4. with_eager + with_cfg_parallel"
echo "Each model's outputs saved in its own directory"
echo "Note: If one task fails, others will continue"
echo "=========================================="
echo ""

# Check if required input images exist
echo "Checking for input images..."
if [ ! -f "./Qwen-Image/qwen-image_output_no_eager_no_cfg_parallel.png" ]; then
  echo "⚠ Warning: ./Qwen-Image/qwen-image_output_no_eager_no_cfg_parallel.png not found. Qwen models may fail."
fi
if [ ! -f "./LongCat-Image/longcat-image_output_no_eager_no_cfg_parallel.png" ]; then
  echo "⚠ Warning: ./LongCat-Image/longcat-image_output_no_eager_no_cfg_parallel.png not found. LongCat model may fail."
fi
echo ""

TASK_NUM=0
TOTAL_TASKS=$((${#MODELS[@]} * ${#EAGER_CONFIGS[@]} * ${#CFG_CONFIGS[@]}))

# Run experiments for each model and configuration
for model_info in "${MODELS[@]}"; do
  IFS='|' read -r model_name model_path input_image output_prefix scale_arg scale_value extra_args <<< "$model_info"
  
  # Create directory for this model
  model_dir="${model_name// /_}"
  mkdir -p "$model_dir"
  
  for eager_info in "${EAGER_CONFIGS[@]}"; do
    IFS='|' read -r eager_label eager_args <<< "$eager_info"
    
    for cfg_info in "${CFG_CONFIGS[@]}"; do
      IFS='|' read -r cfg_label cfg_args <<< "$cfg_info"
      TASK_NUM=$((TASK_NUM + 1))
      
      # Generate filenames inside model directory
      base_name="${model_name,,}"
      base_name="${base_name// /_}"
      output_file="$model_dir/${output_prefix}_${eager_label}_${cfg_label}.png"
      log_file="$model_dir/${base_name}_${eager_label}_${cfg_label}.log"
      task_label="$model_name ($eager_label + $cfg_label)"
      
      echo "=========================================="
      echo "$TASK_NUM/$TOTAL_TASKS: Running $task_label..."
      echo "=========================================="
      
      # Build and execute command
      cmd="python examples/offline_inference/image_to_image/image_edit.py \
        --model \"$model_path\" \
        --image \"$input_image\" \
        --prompt \"$PROMPT\" \
        --negative_prompt \"$NEGATIVE_PROMPT\" \
        --output \"$output_file\" \
        --num_inference_steps $NUM_INFERENCE_STEPS \
        --${scale_arg} $scale_value \
        $extra_args \
        $eager_args \
        $cfg_args"
      
      if eval "$cmd" 2>&1 | tee "$log_file"; then
        echo "✓ $task_label completed."
        SUCCESS_TASKS+=("$task_label")
      else
        echo "✗ $task_label FAILED."
        FAILED_TASKS+=("$task_label")
      fi
      echo ""
    done
  done
done

echo "=========================================="
echo "All tasks completed!"
echo "=========================================="
echo "Summary: ${#SUCCESS_TASKS[@]}/$TOTAL_TASKS successful, ${#FAILED_TASKS[@]}/$TOTAL_TASKS failed"
echo ""

if [ ${#SUCCESS_TASKS[@]} -gt 0 ]; then
  echo "✓ Successful tasks:"
  for task in "${SUCCESS_TASKS[@]}"; do
    echo "  - $task"
  done
  echo ""
fi

if [ ${#FAILED_TASKS[@]} -gt 0 ]; then
  echo "✗ Failed tasks:"
  for task in "${FAILED_TASKS[@]}"; do
    echo "  - $task"
  done
  echo ""
  echo "Check model directories for error logs."
  echo ""
fi

echo "Output directories:"
for model_info in "${MODELS[@]}"; do
  IFS='|' read -r model_name _ _ _ _ _ _ <<< "$model_info"
  model_dir="${model_name// /_}"
  echo "  - $model_dir/ (images and logs for $model_name)"
done

# Exit with error code if any tasks failed
if [ ${#FAILED_TASKS[@]} -gt 0 ]; then
  exit 1
fi

  • Video Generation
The bash script to run all video generation tasks:
#!/bin/bash

# Script to run text-to-video, image-to-video, and text+image-to-video inference for all supported models
# Testing combinations of eager mode and CFG parallel
# Logs are saved to individual txt files for each experiment
# If one task fails, other tasks will continue to run

PROMPT="a lovely bunny holding a sign that says 'vllm-omni', dancing from left to right"
NEGATIVE_PROMPT="色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

# Video generation parameters
HEIGHT=480
WIDTH=832
NUM_FRAMES=33
BOUNDARY_RATIO=0.875
NUM_INFERENCE_STEPS=40
FPS=16

# Input image for I2V and TI2V models
INPUT_IMAGE="./Qwen-Image/qwen-image_output_no_eager_no_cfg_parallel.png"

# Arrays to track success and failure
declare -a SUCCESS_TASKS
declare -a FAILED_TASKS

# Define models and their parameters
# Format: "model_name|model_path|flow_shift|guidance_scale|guidance_scale_high|needs_image"
declare -a MODELS=(
  "Wan2.2-T2V|Wan-AI/Wan2.2-T2V-A14B-Diffusers|12.0|4.0|4.0|no"
  "Wan2.2-I2V|Wan-AI/Wan2.2-I2V-A14B-Diffusers|12.0|5.0|6.0|yes"
  "Wan2.2-TI2V|Wan-AI/Wan2.2-TI2V-5B-Diffusers|12.0|4.0||yes"
)

# Eager mode configurations
declare -a EAGER_CONFIGS=(
  "no_eager|"
  "with_eager|--enforce_eager"
)

# CFG parallel configurations
declare -a CFG_CONFIGS=(
  "no_cfg_parallel|"
  "with_cfg_parallel|--cfg_parallel_size 2"
)

echo "=========================================="
echo "Starting text/image-to-video inference tests"
echo "Testing 3 models:"
echo "  - Wan2.2-T2V (text-to-video)"
echo "  - Wan2.2-I2V (image-to-video)"
echo "  - Wan2.2-TI2V (text+image-to-video)"
echo "Testing combinations of eager mode and CFG parallel"
echo "4 test cases per model:"
echo "  1. no_eager + no_cfg_parallel"
echo "  2. no_eager + with_cfg_parallel"
echo "  3. with_eager + no_cfg_parallel"
echo "  4. with_eager + with_cfg_parallel"
echo "Each model's outputs saved in its own directory"
echo "Note: If one task fails, others will continue"
echo "=========================================="
echo ""

# Check if input image exists for I2V and TI2V models
if [ ! -f "$INPUT_IMAGE" ]; then
  echo "⚠ Warning: Input image not found: $INPUT_IMAGE"
  echo "   I2V and TI2V models may fail. Please run text-to-image tests first."
  echo ""
fi

TASK_NUM=0
TOTAL_TASKS=$((${#MODELS[@]} * ${#EAGER_CONFIGS[@]} * ${#CFG_CONFIGS[@]}))

# Run experiments for each model and configuration
for model_info in "${MODELS[@]}"; do
  IFS='|' read -r model_name model_path flow_shift guidance_scale guidance_scale_high needs_image <<< "$model_info"
  
  # Create directory for this model
  model_dir="${model_name// /_}"
  mkdir -p "$model_dir"
  
  for eager_info in "${EAGER_CONFIGS[@]}"; do
    IFS='|' read -r eager_label eager_args <<< "$eager_info"
    
    for cfg_info in "${CFG_CONFIGS[@]}"; do
      IFS='|' read -r cfg_label cfg_args <<< "$cfg_info"
      TASK_NUM=$((TASK_NUM + 1))
      
      # Generate filenames inside model directory
      base_name="${model_name,,}"
      base_name="${base_name// /_}"
      base_name="${base_name//./_}"
      output_file="$model_dir/${base_name}_output_${eager_label}_${cfg_label}.mp4"
      log_file="$model_dir/${base_name}_${eager_label}_${cfg_label}.log"
      task_label="$model_name ($eager_label + $cfg_label)"
      
      echo "=========================================="
      echo "$TASK_NUM/$TOTAL_TASKS: Running $task_label..."
      echo "=========================================="
      
      # Build command based on model type
      # Use image_to_video.py for I2V/TI2V models, text_to_video.py for T2V
      if [ "$needs_image" = "yes" ]; then
        # Image-to-Video or Text+Image-to-Video
        cmd="python examples/offline_inference/image_to_video/image_to_video.py \
          --model \"$model_path\" \
          --image \"$INPUT_IMAGE\" \
          --prompt \"$PROMPT\" \
          --negative_prompt \"$NEGATIVE_PROMPT\" \
          --height $HEIGHT \
          --width $WIDTH \
          --num_frames $NUM_FRAMES \
          --guidance_scale $guidance_scale"
        
        # Add guidance_scale_high if specified
        if [ -n "$guidance_scale_high" ]; then
          cmd="$cmd --guidance_scale_high $guidance_scale_high"
        fi
        
        cmd="$cmd \
          --boundary_ratio $BOUNDARY_RATIO \
          --num_inference_steps $NUM_INFERENCE_STEPS \
          --flow_shift $flow_shift \
          --fps $FPS \
          --output \"$output_file\" \
          $eager_args \
          $cfg_args"
      else
        # Text-to-Video
        cmd="python examples/offline_inference/text_to_video/text_to_video.py \
          --model \"$model_path\" \
          --prompt \"$PROMPT\" \
          --negative_prompt \"$NEGATIVE_PROMPT\" \
          --height $HEIGHT \
          --width $WIDTH \
          --num_frames $NUM_FRAMES \
          --guidance_scale $guidance_scale"
        
        # Add guidance_scale_high if specified
        if [ -n "$guidance_scale_high" ]; then
          cmd="$cmd --guidance_scale_high $guidance_scale_high"
        fi
        
        cmd="$cmd \
          --boundary_ratio $BOUNDARY_RATIO \
          --num_inference_steps $NUM_INFERENCE_STEPS \
          --flow_shift $flow_shift \
          --fps $FPS \
          --output \"$output_file\" \
          $eager_args \
          $cfg_args"
      fi
      
      # Execute command
      if eval "$cmd" 2>&1 | tee "$log_file"; then
        echo "✓ $task_label completed."
        SUCCESS_TASKS+=("$task_label")
      else
        echo "✗ $task_label FAILED."
        FAILED_TASKS+=("$task_label")
      fi
      echo ""
    done
  done
done

echo "=========================================="
echo "All tasks completed!"
echo "=========================================="
echo "Summary: ${#SUCCESS_TASKS[@]}/$TOTAL_TASKS successful, ${#FAILED_TASKS[@]}/$TOTAL_TASKS failed"
echo ""

if [ ${#SUCCESS_TASKS[@]} -gt 0 ]; then
  echo "✓ Successful tasks:"
  for task in "${SUCCESS_TASKS[@]}"; do
    echo "  - $task"
  done
  echo ""
fi

if [ ${#FAILED_TASKS[@]} -gt 0 ]; then
  echo "✗ Failed tasks:"
  for task in "${FAILED_TASKS[@]}"; do
    echo "  - $task"
  done
  echo ""
  echo "Check model directories for error logs."
  echo ""
fi

echo "Output directories:"
for model_info in "${MODELS[@]}"; do
  IFS='|' read -r model_name _ _ <<< "$model_info"
  model_dir="${model_name// /_}"
  echo "  - $model_dir/ (videos and logs for $model_name)"
done

# Exit with error code if any tasks failed
if [ ${#FAILED_TASKS[@]} -gt 0 ]; then
  exit 1
fi

Test Result

  • Unit test
3 passed, 3 warnings in 458.11s (0:07:38)

All tests passed.

  • Setup

    • vllm: 0.14.0
    • python: 3.12
    • pytorch: 2.9.1
    • batch size: 1
    • platform: H800 server
  • Text-To-Image

| model | cfg_parallel_size | time (torch.compile) | time (eager) | generated image |
| --- | --- | --- | --- | --- |
| FLUX.2-klein-4B | 1 | 6.29s | 8.17s | flux.2-klein-4b_output_no_eager_no_cfg_parallel |
| FLUX.2-klein-4B | 2 | 5.15s | 6.20s | flux.2-klein-4b_output_no_eager_with_cfg_parallel |
| LongCat-Image | 1 | 20.36s | 20.38s | longcat-image_output_no_eager_no_cfg_parallel |
| LongCat-Image | 2 | 12.78s | 12.71s | longcat-image_output_no_eager_with_cfg_parallel |
| Ovis-Image | 1 | 9.18s | 11.85s | ovis-image_output_no_eager_no_cfg_parallel |
| Ovis-Image | 2 | 6.65s | 7.89s | ovis-image_output_no_eager_with_cfg_parallel |
| Qwen-Image | 1 | 13.89s | 17.80s | qwen-image_output_no_eager_no_cfg_parallel |
| Qwen-Image | 2 | 9.60s | 11.28s | qwen-image_output_no_eager_with_cfg_parallel |
| Stable-Diffusion-3 | 1 | 2.98s | 4.92s | stable-diffusion-3_output_no_eager_no_cfg_parallel |
| Stable-Diffusion-3 | 2 | 3.43s | 4.83s | stable-diffusion-3_output_no_eager_with_cfg_parallel |

SD3 sees no benefit from CFG parallel: with torch.compile it is actually slower at cfg_parallel_size=2, and the eager time is essentially unchanged.
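For reference, the speedups implied by the torch.compile column can be computed directly; the values below are copied from the table above.

```python
# Speedup of cfg_parallel_size=2 over cfg_parallel_size=1, computed from the
# torch.compile times reported in the text-to-image table.
times = {  # model: (time at size 1, time at size 2), in seconds
    "FLUX.2-klein-4B": (6.29, 5.15),
    "LongCat-Image": (20.36, 12.78),
    "Ovis-Image": (9.18, 6.65),
    "Qwen-Image": (13.89, 9.60),
    "Stable-Diffusion-3": (2.98, 3.43),
}
speedups = {model: t1 / t2 for model, (t1, t2) in times.items()}
# SD3 comes out below 1.0 (a slowdown): its per-step compute is small enough
# that the CFG communication overhead dominates.
```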

  • Image-Edit
| model | cfg_parallel_size | time (torch.compile) | time (eager) | generated image |
| --- | --- | --- | --- | --- |
| Qwen-Image-Edit | 1 | 35.55s | 43.22s | output_image_edit_no_eager_no_cfg_parallel |
| Qwen-Image-Edit | 2 | 20.29s | 24.21s | output_image_edit_no_eager_with_cfg_parallel |
| Qwen-Image-Edit-2509 | 1 | 29.96s | 37.17s | output_image_edit_no_eager_no_cfg_parallel |
| Qwen-Image-Edit-2509 | 2 | 17.21s | 21.76s | output_image_edit_no_eager_with_cfg_parallel |
| LongCat-Image-Edit | 1 | 41.11s | 41.10s | output_image_edit_no_eager_no_cfg_parallel |
| LongCat-Image-Edit | 2 | 24.27s | 23.07s | output_image_edit_no_eager_with_cfg_parallel |
  • Video Generation

Parameters: HEIGHT=480, WIDTH=832, NUM_FRAMES=33, NUM_INFERENCE_STEPS=40, FPS=16
| model | cfg_parallel_size | time (torch.compile) | video |
| --- | --- | --- | --- |
| Wan-AI/Wan2.2-T2V-A14B-Diffusers | 1 | 80.9s | https://github.com/user-attachments/assets/561d454b-8a37-4599-8c39-914bcde28085 |
| Wan-AI/Wan2.2-T2V-A14B-Diffusers | 2 | 42.2s | https://github.com/user-attachments/assets/5a34c7da-d957-4060-b695-3bdbc827f977 |
| Wan-AI/Wan2.2-I2V-A14B-Diffusers | 1 | 81.7s | https://github.com/user-attachments/assets/2e5b26ff-7533-4f98-ad14-f7bb66eb340a |
| Wan-AI/Wan2.2-I2V-A14B-Diffusers | 2 | 43.2s | https://github.com/user-attachments/assets/4af12971-1609-4565-827d-4374e1880e16 |
| Wan-AI/Wan2.2-TI2V-5B-Diffusers | 1 | 13.4s | https://github.com/user-attachments/assets/c06885fb-3d79-4ab4-8a5f-b1b2bd9919d8 |
| Wan-AI/Wan2.2-TI2V-5B-Diffusers | 2 | 10.5s | https://github.com/user-attachments/assets/6b49a06a-3e5b-407f-b330-97d3f02e0c06 |

CC List.

@ZJY0516 @SamitHuang @david6666666 @hsliuustc0106


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


@wtomin wtomin changed the title [Refactor]: CFG parallel abstraction [WIP][Refactor]: CFG parallel abstraction Jan 19, 2026
@wtomin wtomin force-pushed the cfg-base-pipeline branch 2 times, most recently from 75e8ef4 to 51891b6 Compare January 26, 2026 11:13
@david6666666 david6666666 added this to the v0.14.0 milestone Jan 27, 2026
@wtomin wtomin force-pushed the cfg-base-pipeline branch from 1f02307 to 739b668 Compare January 27, 2026 09:14
@wtomin wtomin marked this pull request as ready for review January 27, 2026 09:17
@wtomin wtomin requested a review from hsliuustc0106 as a code owner January 27, 2026 09:17
@wtomin wtomin changed the title [WIP][Refactor]: CFG parallel abstraction [Refactor]: CFG parallel abstraction Jan 27, 2026
@wtomin wtomin changed the title [Refactor]: CFG parallel abstraction [Perf]: CFG parallel abstraction Jan 27, 2026
chatgpt-codex-connector (bot) left a comment

💡 Codex Review: automated review suggestions for this pull request. Reviewed commit: 739b668791

Copilot AI (Contributor) left a comment
Pull request overview

This PR introduces a shared classifier-free guidance (CFG) parallelization abstraction via CFGParallelMixin (and QwenImageCFGParallelMixin) and refactors multiple diffusion pipelines and examples to use it, enabling rank-split conditional/unconditional denoising across a dedicated CFG process group. It also wires CFG-parallel configuration into the offline video examples and updates the user documentation to describe and advertise CFG-Parallel support for the relevant models.

Changes:

  • Add CFGParallelMixin and QwenImageCFGParallelMixin implementing reusable predict_noise_maybe_with_cfg and scheduler_step_maybe_with_cfg helpers for both sequential and CFG-parallel execution.
  • Refactor image and video diffusion pipelines (Qwen-Image*, LongCat-Image*, Ovis-Image, Flux2-Klein, Wan2.2 T2V/I2V/TI2V, Stable-Diffusion-3) to use the new mixins instead of ad-hoc CFG logic, preserving editing-specific slicing and normalization behaviors.
  • Extend offline text-to-video and image-to-video examples and the diffusion acceleration docs to expose cfg_parallel_size, describe CFG-Parallel usage, and mark supported models appropriately.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
vllm_omni/diffusion/distributed/cfg_parallel.py Introduces CFGParallelMixin and QwenImageCFGParallelMixin, encapsulating CFG sequential/parallel noise prediction, combination, optional normalization, and synchronized scheduler stepping.
vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py Switches QwenImagePipeline to inherit QwenImageCFGParallelMixin and delegate its diffusion loop to the shared CFG-aware diffuse implementation.
vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image_edit.py Refactors Qwen image edit pipeline to use QwenImageCFGParallelMixin.diffuse, passing image latents and enabling CFG normalization through the mixin.
vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image_edit_plus.py Same as above for the “Edit Plus” variant, delegating CFG-parallel diffusion (with normalization) to the mixin and passing attention kwargs through.
vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image_layered.py Adopts QwenImageCFGParallelMixin, removing custom CFG-parallel logic and routing layered-image diffusion (with image latents and extra transformer kwargs) through the shared mixin.
vllm_omni/diffusion/models/longcat_image/pipeline_longcat_image.py Makes LongCatImagePipeline a CFGParallelMixin user, replacing inline CFG math with predict_noise_maybe_with_cfg/scheduler_step_maybe_with_cfg and adding an overridable cfg_normalize_function plus scheduler_step wrapper.
vllm_omni/diffusion/models/longcat_image/pipeline_longcat_image_edit.py Enables CFG parallelism for LongCat image editing via CFGParallelMixin, refactors the loop to call predict_noise_maybe_with_cfg (with output slicing) and scheduler_step_maybe_with_cfg, and adds a local scheduler_step.
vllm_omni/diffusion/models/ovis_image/pipeline_ovis_image.py Refactors Ovis-Image denoising into a diffuse function using CFGParallelMixin helpers, plus a custom scheduler_step that preserves MPS dtype behavior.
vllm_omni/diffusion/models/flux2_klein/pipeline_flux2_klein.py Updates Flux.2-Klein’s loop to use CFGParallelMixin CFG handling and scheduler stepping, including slicing when image latents are concatenated (editing mode), and defines a scheduler-step wrapper.
vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py Makes Wan2.2 T2V pipeline inherit CFGParallelMixin and replace inline CFG logic with predict_noise_maybe_with_cfg and scheduler_step_maybe_with_cfg, while still supporting dual-transformer guidance scales.
vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2_i2v.py Same refactor for Wan2.2 I2V, building positive/negative kwargs (including image encoder embeds) and delegating CFG/no-CFG behavior to the mixin plus a pipeline-specific predict_noise.
vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2_ti2v.py Same pattern for Wan2.2 TI2V, with patch-wise timesteps and a local predict_noise helper used by the mixin.
vllm_omni/diffusion/models/sd3/pipeline_sd3.py Makes SD3 pipeline a CFGParallelMixin user, introduces a dedicated diffuse method that calls predict_noise_maybe_with_cfg/scheduler_step_maybe_with_cfg, and wires forward through this method.
examples/offline_inference/text_to_video/text_to_video.py Imports DiffusionParallelConfig, adds --cfg_parallel_size CLI flag, includes it in DiffusionParallelConfig, and passes the parallel config plus enforce_eager into Omni.
examples/offline_inference/image_to_video/image_to_video.py Adds DiffusionParallelConfig, --cfg_parallel_size, and --enforce_eager support; constructs parallel_config with the requested CFG parallel size and passes it into Omni.
docs/user_guide/diffusion_acceleration.md Updates acceleration support tables to mark LongCat, Ovis, SD3, and Wan2.2 as CFG-Parallel capable and extends the VideoGen table with a CFG-Parallel column.
docs/user_guide/diffusion/parallelism_acceleration.md Rewrites the CFG-Parallel section to use CFGParallelMixin/QwenImageCFGParallelMixin as the canonical examples, documenting predict_noise_maybe_with_cfg, scheduler_step_maybe_with_cfg, and customization points like predict_noise and cfg_normalize_function.


@wtomin wtomin force-pushed the cfg-base-pipeline branch from 06a074b to 921223e Compare January 28, 2026 09:14
@david6666666 david6666666 added the high priority high priority issue, needs to be done asap label Jan 28, 2026
@wtomin wtomin force-pushed the cfg-base-pipeline branch from fea1c6f to fe6728c Compare January 29, 2026 02:03
"--cfg_parallel_size",
type=int,
default=1,
choices=[1, 2],
Collaborator:
What is the meaning of setting the CFG parallel size to 1?

Collaborator:
Also, this CFG size check should be done in the CFG parallelism implementation itself, not just in the offline examples.

Contributor (author):
The CFG parallel size defaults to 1 because, in vllm_omni/diffusion/data.py, world_size is defined as the product of the parallel sizes:

        self.world_size = (
            self.pipeline_parallel_size
            * self.data_parallel_size
            * self.tensor_parallel_size
            * self.ulysses_degree
            * self.ring_degree
            * self.cfg_parallel_size
        )

I also revised data.py to validate the CFG parallel size in the configuration.
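The kind of validation described can be sketched as follows; this is a hypothetical illustration, and the actual data.py code and its field handling may differ.

```python
# Hypothetical sketch of configuration-time CFG size validation; the real
# check lives in vllm_omni/diffusion/data.py and may be structured differently.
from dataclasses import dataclass


@dataclass
class DiffusionParallelConfig:
    pipeline_parallel_size: int = 1
    data_parallel_size: int = 1
    tensor_parallel_size: int = 1
    ulysses_degree: int = 1
    ring_degree: int = 1
    cfg_parallel_size: int = 1

    def __post_init__(self):
        # CFG parallel splits the conditional/unconditional passes across
        # ranks, so only sizes 1 (sequential) and 2 (rank-split) make sense.
        if self.cfg_parallel_size not in (1, 2):
            raise ValueError(
                f"cfg_parallel_size must be 1 or 2, got {self.cfg_parallel_size}"
            )
        self.world_size = (
            self.pipeline_parallel_size
            * self.data_parallel_size
            * self.tensor_parallel_size
            * self.ulysses_degree
            * self.ring_degree
            * self.cfg_parallel_size
        )
```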

@wtomin wtomin force-pushed the cfg-base-pipeline branch from fc5ba2f to b5d7733 Compare January 29, 2026 08:55
@wtomin wtomin force-pushed the cfg-base-pipeline branch from 5ce0e19 to 250a4c1 Compare January 30, 2026 02:49
@SamitHuang SamitHuang added the ready label to trigger buildkite CI label Jan 30, 2026
ZJY0516 (Collaborator) commented Jan 30, 2026

@wtomin Should we also add an e2e test using riverclouds/qwen_image_random?

wtomin (Contributor, author) commented Jan 30, 2026

> @wtomin Should we also add an e2e test using riverclouds/qwen_image_random?

That may be slow. I asked @Gaohan123, and the conclusion is that we should use the unit test for now.

# Compute the previous noisy sample x_t -> x_t-1 with automatic CFG sync
latents = self.scheduler_step_maybe_with_cfg(noise_pred, t, latents, do_true_cfg)

if torch.cuda.is_available():
Contributor (author):
@ZJY0516 @SamitHuang I have moved the per-step torch.cuda.empty_cache() call to the pipeline of Wan2.2 series models.

Now it works fine at 480x832x33 resolution, and empty_cache() is called only once.

ZJY0516 (Collaborator) commented Jan 30, 2026:
I'm afraid this may break other platforms; for example, could non-CUDA platforms run into OOM errors? We'd also better add a comment explaining why we do this here.

Contributor (author):
Of course. I changed torch.cuda.empty_cache() to current_omni_platform.empty_cache() and added a comment explaining why it is called here.

wtomin added 21 commits January 30, 2026 17:46
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
@wtomin wtomin force-pushed the cfg-base-pipeline branch from d6a8fdf to 5a9af70 Compare January 30, 2026 09:46
@hsliuustc0106 hsliuustc0106 enabled auto-merge (squash) January 30, 2026 15:30
@hsliuustc0106 hsliuustc0106 merged commit 23cf48d into vllm-project:main Jan 30, 2026
7 checks passed
dongbo910220 pushed a commit to dongbo910220/vllm-omni that referenced this pull request Feb 1, 2026
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
@wtomin wtomin deleted the cfg-base-pipeline branch February 2, 2026 07:24

Labels

  • high priority — high priority issue, needs to be done asap
  • ready — label to trigger buildkite CI
