Merged

68 commits
cfb19c8
refactor base pipeline and qwen_image_base_pipeline
wtomin Jan 19, 2026
ae53dca
other pipelines
wtomin Jan 19, 2026
ecb9d34
flux2 pipeline
wtomin Jan 19, 2026
a564100
cfg normalize and longcat
wtomin Jan 19, 2026
5278e61
longcat_edit pipeline
wtomin Jan 19, 2026
01a03f3
ovis pipeline
wtomin Jan 19, 2026
607aef8
sd3 pipeline
wtomin Jan 19, 2026
788d302
updates
wtomin Jan 19, 2026
d5a6182
cfg mixin
wtomin Jan 19, 2026
5253b8d
correct latent step
wtomin Jan 19, 2026
783bce4
cfg broadcast correct
wtomin Jan 19, 2026
4158cdf
cfg parallel new name
wtomin Jan 19, 2026
2224e80
update cfg parallel logic
wtomin Jan 21, 2026
5b4a378
fix flux2
wtomin Jan 21, 2026
3330153
fix flux2
wtomin Jan 21, 2026
ef6122f
fix longcat image
wtomin Jan 21, 2026
222d5fa
doc
wtomin Jan 21, 2026
107f5ce
wan_2_2 pipelines
wtomin Jan 21, 2026
d72429e
ovis image
wtomin Jan 21, 2026
be3b8ba
sd3 image
wtomin Jan 21, 2026
700fb7e
pass neg kwargs if do_true_cfg
wtomin Jan 21, 2026
a5c3dfb
flux pipeline reset to main
wtomin Jan 21, 2026
2292ccf
flux pipeline updates
wtomin Jan 21, 2026
bd25fea
reset longcat pipeline
wtomin Jan 21, 2026
fbe2b6f
updatge longcat pipeline
wtomin Jan 21, 2026
900d37f
reset longcat edit pipeline
wtomin Jan 21, 2026
901eb11
update longcat edit pipeline
wtomin Jan 21, 2026
13ff9da
Zimage pipeline update
wtomin Jan 21, 2026
b9587cb
stable audio pipeline edits
wtomin Jan 21, 2026
c2ae71d
latents .contiguous()
wtomin Jan 21, 2026
d927698
t2v cfg_parallel
wtomin Jan 26, 2026
611eb2a
cache empty
wtomin Jan 26, 2026
51566e4
correct sd3 pipeline
wtomin Jan 27, 2026
b3f0936
video script with new kwargs
wtomin Jan 27, 2026
b20cfad
revert sd audio pipeline change
wtomin Jan 27, 2026
e3e1ba2
revert zimage pipeline
wtomin Jan 27, 2026
664d76d
support list update
wtomin Jan 27, 2026
50e49b1
update how-to-parallelize-a-pipeline
wtomin Jan 27, 2026
e1ed608
support list update wan2.2
wtomin Jan 27, 2026
fbf4837
revise document head
wtomin Jan 27, 2026
cb5f770
fix t2v args
wtomin Jan 27, 2026
aa3c337
fix parameter annotation
wtomin Jan 28, 2026
444e525
fix parameter annotation
wtomin Jan 28, 2026
63f64d3
empty_cache when cuda is available()
wtomin Jan 28, 2026
1767766
test unit
wtomin Jan 28, 2026
9b94886
fix parameter annotation
wtomin Jan 28, 2026
8662f8e
update unit test
wtomin Jan 28, 2026
0b0d71d
fix pre-commit error
wtomin Jan 28, 2026
cfbd49d
fix pre-commit error
wtomin Jan 29, 2026
9345694
check cfg_parallel size in data.py
wtomin Jan 29, 2026
c8bcf2e
update cfg_parallel_size arg doc
wtomin Jan 29, 2026
35dbdeb
doc refinement
wtomin Jan 29, 2026
f3a54fe
update doc with new arg
wtomin Jan 29, 2026
8dd8e61
offline script example in doc
wtomin Jan 29, 2026
18ce884
online serving args
wtomin Jan 29, 2026
f55beb0
serve args
wtomin Jan 29, 2026
3162aac
update doc
wtomin Jan 29, 2026
da2b307
fix error
wtomin Jan 29, 2026
5189be6
remove no_grad
wtomin Jan 30, 2026
c91a3a0
remove torch.save & torch.load
wtomin Jan 30, 2026
1fdde86
update hardward devices
wtomin Jan 30, 2026
117e0de
mv QwenImageCFGParallelMixin in qwen_image folder
wtomin Jan 30, 2026
027e717
check cfg_parallel validity in pipelines
wtomin Jan 30, 2026
3579470
fix unit test spawn process error
wtomin Jan 30, 2026
6a3070b
rm mps related code
wtomin Jan 30, 2026
a939170
mv empty_cache to wan pipelines after all diffusion steps
wtomin Jan 30, 2026
5a9af70
omni_platform and comment
wtomin Jan 30, 2026
afe8c7e
Merge branch 'main' into cfg-base-pipeline
hsliuustc0106 Jan 30, 2026
205 changes: 150 additions & 55 deletions docs/user_guide/diffusion/parallelism_acceleration.md

### CFG-Parallel

#### Offline Inference

CFG-Parallel is enabled through `DiffusionParallelConfig(cfg_parallel_size=2)`, which runs one rank for the positive branch and one rank for the negative branch.

An example of offline inference using CFG-Parallel (image-to-image) is shown below:

```python
from PIL import Image

from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

image_path = "path_to_image.png"
omni = Omni(
    model="Qwen/Qwen-Image-Edit",
    parallel_config=DiffusionParallelConfig(cfg_parallel_size=2),
)
input_image = Image.open(image_path).convert("RGB")

outputs = omni.generate(
    {
        "prompt": "turn this cat to a dog",
        "negative_prompt": "low quality, blurry",
        "multi_modal_data": {"image": input_image},
    },
    OmniDiffusionSamplingParams(
        true_cfg_scale=4.0,
        pil_image=input_image,
        num_inference_steps=50,
    ),
)
```

Notes:

- CFG-Parallel is only effective when **true CFG** is enabled, i.e. a `negative_prompt` is provided and the guidance scale (`true_cfg_scale`, or `--cfg_scale` in the example scripts) is greater than 1.

See `examples/offline_inference/image_to_image/image_edit.py` for a complete working example.
```bash
cd examples/offline_inference/image_to_image/
python image_edit.py \
--model "Qwen/Qwen-Image-Edit" \
--image "qwen_image_output.png" \
--prompt "turn this cat to a dog" \
--negative_prompt "low quality, blurry" \
--cfg_scale 4.0 \
--output "edited_image.png" \
--cfg_parallel_size 2
```

#### Online Serving

You can enable CFG-Parallel in online serving for diffusion models via `--cfg-parallel-size`:

```bash
vllm serve Qwen/Qwen-Image-Edit --omni --port 8091 --cfg-parallel-size 2
```

#### How to parallelize a pipeline

In `QwenImagePipeline`, each diffusion step runs two denoiser forward passes sequentially: one for the positive (conditional) branch and one for the negative (unconditional) branch.

CFG-Parallel assigns these two branches to different ranks in the **CFG group** and synchronizes the results.

vLLM-omni provides a `CFGParallelMixin` base class that encapsulates the CFG-Parallel logic. By inheriting from this mixin and calling its methods, pipelines can implement CFG-Parallel without duplicating boilerplate.

**Key Methods in CFGParallelMixin:**
- `predict_noise_maybe_with_cfg()`: Automatically handles CFG parallel noise prediction
- `scheduler_step_maybe_with_cfg()`: Scheduler step with automatic CFG rank synchronization

**Example Implementation:**

```python
class QwenImageCFGParallelMixin(CFGParallelMixin):
    """
    Base Mixin class for Qwen Image pipelines providing shared CFG methods.
    """

    def diffuse(
        self,
        prompt_embeds: torch.Tensor,
        prompt_embeds_mask: torch.Tensor,
        negative_prompt_embeds: torch.Tensor,
        negative_prompt_embeds_mask: torch.Tensor,
        latents: torch.Tensor,
        img_shapes: torch.Tensor,
        txt_seq_lens: torch.Tensor,
        negative_txt_seq_lens: torch.Tensor,
        timesteps: torch.Tensor,
        do_true_cfg: bool,
        guidance: torch.Tensor,
        true_cfg_scale: float,
        image_latents: torch.Tensor | None = None,
        cfg_normalize: bool = True,
        additional_transformer_kwargs: dict[str, Any] | None = None,
    ) -> torch.Tensor:
        self.transformer.do_true_cfg = do_true_cfg

        for i, t in enumerate(timesteps):
            timestep = t.expand(latents.shape[0]).to(device=latents.device, dtype=latents.dtype)

            # Prepare kwargs for the positive (conditional) prediction
            positive_kwargs = {
                "hidden_states": latents,
                "timestep": timestep / 1000,
                "guidance": guidance,
                "encoder_hidden_states_mask": prompt_embeds_mask,
                "encoder_hidden_states": prompt_embeds,
                "img_shapes": img_shapes,
                "txt_seq_lens": txt_seq_lens,
            }

            # Prepare kwargs for the negative (unconditional) prediction
            if do_true_cfg:
                negative_kwargs = {
                    "hidden_states": latents,
                    "timestep": timestep / 1000,
                    "guidance": guidance,
                    "encoder_hidden_states_mask": negative_prompt_embeds_mask,
                    "encoder_hidden_states": negative_prompt_embeds,
                    "img_shapes": img_shapes,
                    "txt_seq_lens": negative_txt_seq_lens,
                }
            else:
                negative_kwargs = None

            # Predict noise with automatic CFG parallel handling
            # - In CFG parallel mode: rank0 computes positive, rank1 computes negative
            # - Automatically gathers results and combines them on rank0
            noise_pred = self.predict_noise_maybe_with_cfg(
                do_true_cfg=do_true_cfg,
                true_cfg_scale=true_cfg_scale,
                positive_kwargs=positive_kwargs,
                negative_kwargs=negative_kwargs,
                cfg_normalize=cfg_normalize,
            )

            # Step the scheduler with automatic CFG synchronization
            # - Only rank0 computes the scheduler step
            # - Automatically broadcasts the updated latents to all ranks
            latents = self.scheduler_step_maybe_with_cfg(noise_pred, t, latents, do_true_cfg)

        return latents
```

**How it works:**
1. Prepare separate `positive_kwargs` and `negative_kwargs` for conditional and unconditional predictions
2. Call `predict_noise_maybe_with_cfg()` which:
- Detects if CFG parallel is enabled (`get_classifier_free_guidance_world_size() > 1`)
- Distributes computation: rank0 processes positive, rank1 processes negative
- Gathers predictions and combines them using `combine_cfg_noise()` on rank0
- Returns combined noise prediction (only valid on rank0)
3. Call `scheduler_step_maybe_with_cfg()` which:
- Only rank0 computes the scheduler step
- Broadcasts the updated latents to all ranks for synchronization
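
The combine-and-normalize math performed on rank0 can be sketched in isolation; the function below mirrors the document's formula (`comb = neg + scale * (pos - neg)` followed by norm rescaling), but the name `combine_cfg_noise` is used here as a standalone illustration rather than the exact library signature:

```python
import torch


def combine_cfg_noise(
    noise_pred: torch.Tensor,
    neg_noise_pred: torch.Tensor,
    true_cfg_scale: float,
    normalize: bool = True,
) -> torch.Tensor:
    # True CFG: extrapolate from the negative toward the positive prediction
    comb = neg_noise_pred + true_cfg_scale * (noise_pred - neg_noise_pred)
    if normalize:
        # Rescale so the combined prediction keeps the conditional norm
        cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
        noise_norm = torch.norm(comb, dim=-1, keepdim=True)
        comb = comb * (cond_norm / noise_norm)
    return comb
```

With `true_cfg_scale=1` the combination degenerates to the positive prediction, and with normalization enabled the output always has the same last-dim norm as the positive branch.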

**How to customize**

Some pipelines may need to customize the following methods of `CFGParallelMixin`:

1. Override `predict_noise` when the transformer forward pass needs custom behavior:
```python
def predict_noise(self, *args, **kwargs):
"""
Forward pass through transformer to predict noise.

Subclasses should override this if they need custom behavior,
but the default implementation calls self.transformer.
"""
return self.transformer(*args, **kwargs)[0]

```
2. Override `cfg_normalize_function` when a different post-combination normalization is needed. The default implementation, applied after the noise predictions from both branches are combined, is:
```python
def cfg_normalize_function(self, noise_pred, comb_pred):
"""
Normalize the combined noise prediction.

Args:
noise_pred: positive noise prediction
comb_pred: combined noise prediction after CFG

Returns:
Normalized noise prediction tensor
"""
cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
noise_pred = comb_pred * (cond_norm / noise_norm)
return noise_pred
```
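
For example, a pipeline that wants plain (un-normalized) CFG could override this hook as below. A local stub stands in for `CFGParallelMixin` so the sketch is self-contained; the real base class and its import path come from vllm-omni:

```python
import torch


class CFGParallelMixinStub:
    """Stand-in for vllm-omni's CFGParallelMixin (assumed API from this doc)."""

    def cfg_normalize_function(self, noise_pred: torch.Tensor, comb_pred: torch.Tensor) -> torch.Tensor:
        # Default: rescale the combined prediction to the conditional norm
        cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
        noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
        return comb_pred * (cond_norm / noise_norm)


class PlainCFGPipeline(CFGParallelMixinStub):
    def cfg_normalize_function(self, noise_pred: torch.Tensor, comb_pred: torch.Tensor) -> torch.Tensor:
        # Override: return the raw CFG combination without norm rescaling
        return comb_pred
```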
14 changes: 7 additions & 7 deletions docs/user_guide/diffusion_acceleration.md
The following table shows which models are currently supported by each acceleration method:

| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention | CFG-Parallel |
|-------|------------------|:----------:|:-----------:|:-----------:|:----------------:|:----------------:|
| **LongCat-Image** | `meituan-longcat/LongCat-Image` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **LongCat-Image-Edit** | `meituan-longcat/LongCat-Image-Edit` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **Ovis-Image** | `OvisAI/Ovis-Image` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-2512** | `Qwen/Qwen-Image-2512` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Edit-2509** | `Qwen/Qwen-Image-Edit-2509` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Layered** | `Qwen/Qwen-Image-Layered` | ❌ | ✅ | ✅ | ✅ | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ | ✅ | ❌ | ❌ | ❌ |
| **Stable-Diffusion3.5** | `stabilityai/stable-diffusion-3.5` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **Bagel** | `ByteDance-Seed/BAGEL-7B-MoT` | ✅ | ✅ | ❌ | ❌ | ❌ |

### VideoGen

| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention | CFG-Parallel |
|-------|------------------|:--------:|:---------:|:----------:|:--------------:|:------------:|
| **Wan2.2** | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | ❌ | ✅ | ✅ | ✅ | ✅ |


## Performance Benchmarks
Key arguments:
- `--image`: path(s) to the source image(s) (PNG/JPG, converted to RGB). Can specify multiple images.
- `--prompt` / `--negative_prompt`: text description (string).
- `--cfg_scale`: true classifier-free guidance scale (default: 4.0). Classifier-free guidance is enabled by setting cfg_scale > 1 and providing a negative_prompt. Higher guidance scale encourages images closely linked to the text prompt, usually at the expense of lower image quality.
- `--cfg_parallel_size`: the number of devices to run CFG Parallel. CFG Parallel is valid only if classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--guidance_scale`: guidance scale for guidance-distilled models (default: 1.0, disabled). Unlike classifier-free guidance (--cfg_scale), guidance-distilled models take the guidance scale directly as an input parameter. Enabled when guidance_scale > 1. Ignored when not using guidance-distilled models.
- `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
- `--output`: path to save the generated PNG.
Key arguments:
- `--num_frames`: Number of frames (default 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to low/high-noise stages for MoE).
- `--negative_prompt`: Optional list of artifacts to suppress.
- `--cfg_parallel_size`: the number of devices to run CFG Parallel. CFG Parallel is valid only if classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--boundary_ratio`: Boundary split ratio for two-stage MoE models.
- `--flow_shift`: Scheduler flow shift (5.0 for 720p, 12.0 for 480p).
- `--num_inference_steps`: Number of denoising steps (default 50).
Key arguments:
- `--prompt`: text description (string).
- `--seed`: integer seed for deterministic sampling.
- `--cfg_scale`: true CFG scale (model-specific guidance strength).
- `--cfg_parallel_size`: the number of devices to run CFG Parallel. CFG Parallel is valid only if classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--num_images_per_prompt`: number of images to generate per prompt (saves as `output`, `output_1`, ...).
- `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
- `--height/--width`: output resolution (defaults 1024x1024).
Key arguments:
- `--num_frames`: Number of frames (Wan default is 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to the low/high-noise stages).
- `--negative_prompt`: optional list of artifacts to suppress (the PR demo used a long Chinese string).
- `--cfg_parallel_size`: the number of devices to run CFG Parallel. CFG Parallel is valid only if classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--boundary_ratio`: Boundary split ratio for low/high DiT.
- `--fps`: frames per second for the saved MP4 (requires `diffusers` export_to_video).
- `--output`: path to save the generated video.
Key arguments:
- `--output`: path to save the generated PNG.
- `--vae_use_slicing`: enable VAE slicing for memory optimization.
- `--vae_use_tiling`: enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: enable CPU offloading for diffusion models.

> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
1 change: 1 addition & 0 deletions examples/offline_inference/image_to_video/README.md
Key arguments:
- `--output`: Path to save the generated video.
- `--vae_use_slicing`: Enable VAE slicing for memory optimization.
- `--vae_use_tiling`: Enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: enable CPU offloading for diffusion models.

> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
29 changes: 28 additions & 1 deletion examples/offline_inference/image_to_video/image_to_video.py
import PIL.Image
import torch

from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.outputs import OmniRequestOutput
def parse_args() -> argparse.Namespace:
    ...
    parser.add_argument(
        "--cfg_parallel_size",
        type=int,
        default=1,
        choices=[1, 2],
        help="Number of GPUs used for classifier-free guidance (CFG) parallelism.",
    )
    parser.add_argument(
        "--enforce_eager",
        action="store_true",
        help="Disable torch.compile and force eager execution.",
    )
    return parser.parse_args()


def main():
    ...
    # Check if profiling is requested via environment variable
    profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))

    parallel_config = DiffusionParallelConfig(
        cfg_parallel_size=args.cfg_parallel_size,
    )
    omni = Omni(
        model=args.model,
        enable_layerwise_offload=args.enable_layerwise_offload,
        ...
        boundary_ratio=args.boundary_ratio,
        flow_shift=args.flow_shift,
        enable_cpu_offload=args.enable_cpu_offload,
        parallel_config=parallel_config,
        enforce_eager=args.enforce_eager,
    )

    if profiler_enabled:
        print("[Profiler] Starting profiling...")
        omni.start_profile()

    # Print generation configuration
    print(f"\n{'=' * 60}")
    print("Generation Configuration:")
    print(f"  Model: {args.model}")
    print(f"  Inference steps: {args.num_inference_steps}")
    print(f"  Frames: {args.num_frames}")
    print(f"  Parallel configuration: cfg_parallel_size={args.cfg_parallel_size}")
    print(f"  Video size: {args.width}x{args.height}")
    print(f"{'=' * 60}\n")

    # omni.generate() returns Generator[OmniRequestOutput, None, None]
    frames = omni.generate(
        {
            ...
1 change: 1 addition & 0 deletions examples/offline_inference/text_to_image/README.md
Key arguments:
- `--output`: path to save the generated PNG.
- `--vae_use_slicing`: enable VAE slicing for memory optimization.
- `--vae_use_tiling`: enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: enable CPU offloading for diffusion models.

> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.