Merged
13 changes: 6 additions & 7 deletions docs/.nav.yml
@@ -26,17 +26,16 @@ nav:
- Configuration:
- configuration/README.md
- configuration/*
- Diffusion Acceleration:
- Overview: user_guide/diffusion_acceleration.md
- Acceleration Methods:
- TeaCache: user_guide/acceleration/teacache.md
- Cache-DiT: user_guide/acceleration/cache_dit_acceleration.md
- Parallelism Acceleration: user_guide/acceleration/parallelism_acceleration.md
- Models:
- models/supported_models.md
- Features:
- Sleep Mode: features/sleep_mode.md
- CPU Offloading for Diffusion Model: features/cpu_offload_diffusion.md
- Diffusion Features:
- Overview: user_guide/diffusion_acceleration.md
- TeaCache: user_guide/diffusion/teacache.md
- Cache-DiT: user_guide/diffusion/cache_dit_acceleration.md
- Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
- CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
- Developer Guide:
- General:
- contributing/README.md
1 change: 1 addition & 0 deletions docs/api/README.md
@@ -82,6 +82,7 @@ Model execution components.
- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_code_predictor_mtp.Qwen3OmniMoeTalkerCodePredictor][]
- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_talker.Qwen3OmniMoeModel][]
- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_talker.Qwen3OmniMoeTalkerForConditionalGeneration][]
- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_talker.Qwen3OmniMoeTalkerSharedExpertWrapper][]
- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_thinker.Qwen3MoeLLMForCausalLM][]
- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_thinker.Qwen3MoeLLMModel][]
- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_thinker.Qwen3OmniMoeConditionalGenerationMixin][]
6 changes: 3 additions & 3 deletions docs/configuration/README.md
@@ -16,6 +16,6 @@ For introduction, please check [Introduction for stage config](./stage_configs.m

## Optimization Features

- **[TeaCache Configuration](../user_guide/acceleration/teacache.md)** - Enable TeaCache adaptive caching for DiT models to achieve 1.5x-2.0x speedup with minimal quality loss
- **[Cache-DiT Configuration](../user_guide/acceleration/cache_dit_acceleration.md)** - Enable Cache-DiT as cache acceleration backends for DiT models
- **[Parallelism Configuration](../user_guide/acceleration/parallelism_acceleration.md)** - Enable parallelism (e.g., sequence parallelism) for for DiT models
- **[TeaCache Configuration](../user_guide/diffusion/teacache.md)** - Enable TeaCache adaptive caching for DiT models to achieve 1.5x-2.0x speedup with minimal quality loss
- **[Cache-DiT Configuration](../user_guide/diffusion/cache_dit_acceleration.md)** - Enable Cache-DiT as cache acceleration backends for DiT models
- **[Parallelism Configuration](../user_guide/diffusion/parallelism_acceleration.md)** - Enable parallelism (e.g., sequence parallelism) for DiT models
@@ -23,7 +23,7 @@ if __name__ == "__main__":
m = Omni(model="Qwen/Qwen-Image", enable_cpu_offload=True)
```

- **CLI**: pass `--dit-cpu-offload` to the diffusion service entrypoint.
- **CLI**: pass `--enable-cpu-offload` to the diffusion service entrypoint.

## Known Limitations
- Cold start latency increases by over one minute for some models (e.g., Qwen-Image)
18 changes: 9 additions & 9 deletions docs/user_guide/diffusion_acceleration.md
@@ -6,8 +6,8 @@ vLLM-Omni supports various cache acceleration methods to speed up diffusion mode

vLLM-Omni currently supports two main cache acceleration backends:

1. **[TeaCache](acceleration/teacache.md)** - Hook-based adaptive caching that caches transformer computations when consecutive timesteps are similar
2. **[Cache-DiT](acceleration/cache_dit_acceleration.md)** - Library-based acceleration using multiple techniques:
1. **[TeaCache](diffusion/teacache.md)** - Hook-based adaptive caching that caches transformer computations when consecutive timesteps are similar
2. **[Cache-DiT](diffusion/cache_dit_acceleration.md)** - Library-based acceleration using multiple techniques:
- **DBCache** (Dual Block Cache): Caches intermediate transformer block outputs based on residual differences
- **TaylorSeer**: Uses Taylor expansion-based forecasting for faster inference
- **SCM** (Step Computation Masking): Selectively computes steps based on adaptive masking
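The caching decision these methods share can be sketched in isolation. The snippet below is an illustrative toy, not the vLLM-Omni API (all names are invented for illustration): a block's cached output is reused whenever the relative change in its input between consecutive timesteps falls below a threshold, which is the residual-difference idea behind TeaCache and DBCache.

```python
# Toy sketch of residual-difference caching (TeaCache/DBCache style).
# Illustrative only -- not the vLLM-Omni API; all names are invented.

def should_reuse_cache(prev_input, cur_input, threshold=0.05):
    """Reuse the cached output when the relative L1 change between
    consecutive timestep inputs falls below `threshold`."""
    num = sum(abs(a - b) for a, b in zip(prev_input, cur_input))
    den = sum(abs(a) for a in prev_input) or 1.0
    return num / den < threshold

class CachedBlock:
    def __init__(self, block, threshold=0.05):
        self.block = block              # the expensive transformer block
        self.threshold = threshold
        self.prev_input = None
        self.cached_output = None
        self.skipped = 0                # how many computations were skipped

    def __call__(self, x):
        if self.prev_input is not None and should_reuse_cache(
            self.prev_input, x, self.threshold
        ):
            self.skipped += 1           # similar timestep: reuse the cache
        else:
            self.cached_output = self.block(x)   # recompute and cache
        self.prev_input = list(x)
        return self.cached_output

# Usage: an "expensive" block called over three denoising timesteps;
# the second input barely changes, so its computation is skipped.
block = CachedBlock(lambda x: [v * 2.0 for v in x])
outputs = [block([1.0, 1.0]), block([1.001, 1.0]), block([5.0, 5.0])]
```

In a real pipeline the skipped work is an entire transformer block (or stack of blocks) per timestep, which is where the 1.5x-2.0x speedups come from.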
@@ -16,11 +16,11 @@ Both methods can provide significant speedups (typically **1.5x-2.0x**) while ma

vLLM-Omni also supports parallelism methods for diffusion models, including:

1. [Ulysses-SP](acceleration/parallelism_acceleration.md#ulysses-sp) - splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.
1. [Ulysses-SP](diffusion/parallelism_acceleration.md#ulysses-sp) - splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.

2. [Ring-Attention](acceleration/parallelism_acceleration.md#ring-attention) - splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results, keeping the sequence dimension sharded.
2. [Ring-Attention](diffusion/parallelism_acceleration.md#ring-attention) - splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results, keeping the sequence dimension sharded.

3. [CFG-Parallel](acceleration/parallelism_acceleration.md#cfg-parallel) - runs the positive/negative prompts of classifier-free guidance (CFG) on different devices, then merges on a single device to perform the scheduler step.
3. [CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel) - runs the positive/negative prompts of classifier-free guidance (CFG) on different devices, then merges on a single device to perform the scheduler step.
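The data movement behind Ulysses-SP can be illustrated standalone. The snippet below is a local NumPy simulation, not the vLLM-Omni implementation (shapes and names are illustrative): each of `P` ranks starts with a sequence shard of all `H` heads, and the all-to-all leaves each rank with the full sequence for `H/P` heads, so attention can be computed locally per head group.

```python
import numpy as np

# Toy simulation of the Ulysses-SP all-to-all (not the vLLM-Omni code).
# P ranks each hold a sequence shard [S/P, H, D] covering all H heads;
# after the all-to-all each rank holds the full sequence for H/P heads,
# i.e. a tensor of shape [S, H/P, D].

P, S, H, D = 2, 4, 4, 3
x = np.arange(S * H * D, dtype=np.float32).reshape(S, H, D)

# Before: rank r holds the r-th sequence shard of shape [S/P, H, D].
seq_shards = np.split(x, P, axis=0)

# All-to-all: every rank sends head-group r of its shard to rank r.
# Simulated locally: rank r gathers head-group r from every seq shard
# and concatenates along the sequence dimension.
head_shards = [
    np.concatenate(
        [shard[:, r * H // P:(r + 1) * H // P] for shard in seq_shards],
        axis=0,
    )
    for r in range(P)
]

# Each rank now sees the full sequence for its subset of heads, so
# attention is exact per head group. Concatenating the head groups
# back along the head dimension recovers the original tensor.
reassembled = np.concatenate(head_shards, axis=1)
```

Ring-Attention keeps the sequence dimension sharded instead, trading the all-to-all for ring-based P2P accumulation of partial attention results.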

## Quick Comparison

@@ -197,7 +197,7 @@ outputs = omni.generate(prompt="turn this cat to a dog",

For detailed information on each acceleration method:

- **[TeaCache Guide](acceleration/teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
- **[Cache-DiT Acceleration Guide](acceleration/cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
- **[Sequence Parallelism](acceleration/parallelism_acceleration.md#sequence-parallelism)** - Guidance on how to set sequence parallelism with configuration.
- **[CFG-Parallel](acceleration/parallelism_acceleration.md#cfg-parallel)** - Guidance on how to set CFG-Parallel to run positive/negative branches across ranks.
- **[TeaCache Guide](diffusion/teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
- **[Cache-DiT Acceleration Guide](diffusion/cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
- **[Sequence Parallelism](diffusion/parallelism_acceleration.md#sequence-parallelism)** - Guidance on how to set sequence parallelism with configuration.
- **[CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel)** - Guidance on how to set CFG-Parallel to run positive/negative branches across ranks.