[feature] cpu offloading support for diffusion #497
ZJY0516 merged 19 commits into vllm-project:main
Conversation
Could you please test the cache functionality (cache-dit and tea-cache) when offloading is enabled?
Tested it @ZJY0516
I was wondering: what if we want to overlap computation and offloading?
vllm_omni/diffusion/offload.py
Outdated
```python
def _pre_forward_hook(self, module: nn.Module, args: tuple) -> None:
    """Move module to GPU before forward."""
    module.to(self.execution_device)
```
It seems this is a synchronous, blocking transfer from CPU to GPU. Would it be more efficient to use async transfers with CUDA streams?
Please check whether this blog post will help: https://zhuanlan.zhihu.com/p/1986157623695922659
Given the high demand for offloading, I think we can leave efficient overlap to a follow-up PR. @LawJarp-A @SamitHuang @hsliuustc0106
I didn't see any evidence that the text encoder is offloaded when I run it locally:
```python
from vllm_omni import Omni

if __name__ == "__main__":
    m = Omni(model="Qwen/Qwen-Image", text_encoder_cpu_offload=True)
    outputs = m.generate(
        "a photo of a cat sitting on a laptop keyboard",
        height=1024,
        width=1024,
        num_inference_steps=50,
        num_outputs_per_prompt=1,
    )
```
That makes sense. I've previously worked on overlapping GPU compute with CPU communication using a dual-thread, multi-stream, thread-switching approach. A similar micro-batch + async model could help overlap offloading with compute instead of serializing the steps.
It should be a separate task. @bwyangseek
Co-authored-by: Prajwal A <[email protected]> Co-authored-by: zjy0516 <[email protected]> Signed-off-by: zjy0516 <[email protected]>
The main problem is that model loading is very slow when offloading is enabled.
SamitHuang left a comment
Thanks for the nice work. What kinds of models are tested? Are t2i, i2i, and t2v all covered?
- **CLI**: pass `--dit-cpu-offload` to the diffusion service entrypoint.

## Known Limitations
- Cold start latency increases by over one minute for some models (e.g., Qwen-Image)
What is the reason for the latency? Is it proportional to the model size?
I suspect the root cause is CPU-side model initialization time, which seems proportional to model size (not sure). While Z-Image incurs only a small delay, Qwen-Image's much larger size results in over a minute of initialization latency.
```python
for dit_mod in dit_modules:
    dit_mod.to("cpu")
if torch.cuda.is_available():
    torch.cuda.empty_cache()
if pin and torch.cuda.is_available():
    for dit_mod in dit_modules:
        for p in dit_mod.parameters():
            if p.data.device.type == "cpu" and not p.data.is_pinned():
                p.data = p.data.pin_memory()
```
A note for this part:
If DiT modules are initialized on GPU first (not the current case), two parameter copies happen on the CPU side: GPU params -> CPU params -> CPU pinned params. Creating an empty tensor with pin_memory=True and then copying the existing param data into it in place would guarantee a single-copy path, for the case where the DiT modules are not residing on CPU at some moment.
For now, it makes no difference, since the usage in the GPU worker ensures the model loader loads DiT modules on CPU directly when dit_cpu_offload is enabled:
```python
class GPUWorker:
    def init_device_and_model(self) -> None:
        ...
        load_device = "cpu" if self.od_config.dit_cpu_offload else str(self.device)
        ...
        self.pipeline = model_loader.load_model(
            od_config=self.od_config,
            load_device=load_device,
        )
```

cc @ZJY0516
Thanks for the reminder. Initializing on the CPU is currently a necessary workaround due to our architecture, which requires all components to be initialized simultaneously. This can lead to significant slowness in some cases. To address this, we plan to redesign the architecture to allow per-component GPU initialization, followed by selective offloading to the CPU.
Please attach your design doc using this template in your RFC :)
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dd7c1c1b4c
```python
m = Omni(model="Qwen/Qwen-Image", enable_cpu_offload=True)
```

- **CLI**: pass `--dit-cpu-offload` to the diffusion service entrypoint.
Change this to `enable-cpu-offload`.
Thank you for your kind reply! I have proposed PR #798, based on the previous discussions, to implement overlapping, with improvements and optimizations. The focus is on the DiT computation: instead of loading the entire DiT model onto the GPU as in #497, the DiT is split into block-level execution. During the computation of one block, the weights for the next block are asynchronously prefetched on independent CUDA streams to achieve compute/copy overlap, improving GPU utilization rather than just reducing peak memory usage. Additionally, I plan to inherit part of the ideas from #497 and modify the SequentialOffloader so that after the encoders finish computation, the DiT is not fully loaded onto the GPU but performs an overlapped forward. The code currently has a preliminary framework, submitted in PR #798, but I am still refining the logic and code and conducting tests. I would appreciate your feedback on this approach. Thank you! @ZJY0516
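The block-level overlap described here could be outlined roughly as follows (an illustrative sketch only, not the actual #798 implementation; `forward_with_block_prefetch` and its synchronization points are assumptions, and weights are assumed to live in pinned CPU memory so the non-blocking copies can overlap):

```python
import torch
import torch.nn as nn


def forward_with_block_prefetch(
    blocks: nn.ModuleList,
    x: torch.Tensor,
    device: torch.device,
) -> torch.Tensor:
    """Run transformer blocks one at a time, prefetching block i+1's
    weights on a side stream while block i computes on the default
    stream."""
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.current_stream()

    blocks[0].to(device, non_blocking=True)  # warm up the first block
    for i, block in enumerate(blocks):
        if i + 1 < len(blocks):
            # Start copying the next block's weights on the side stream.
            with torch.cuda.stream(copy_stream):
                blocks[i + 1].to(device, non_blocking=True)
        x = block(x)  # compute on the default stream
        compute_stream.synchronize()  # block i's kernels must finish
        block.to("cpu")               # before its weights are evicted
        compute_stream.wait_stream(copy_stream)  # next block must have arrived
    return x
```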
Purpose
FIX #412
Implementation
Uses a mutual-exclusion swap pattern via PyTorch forward pre-hooks:
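The pattern might look roughly like this (an illustrative sketch; the class and method names are hypothetical, though `_pre_forward_hook` mirrors the hook discussed in the review):

```python
from typing import Optional

import torch
import torch.nn as nn


class SequentialCpuOffloader:
    """Mutual-exclusion swap via forward pre-hooks: the hook moves
    the module about to run onto the execution device and evicts
    whichever module currently occupies it, so only one large
    component is resident at a time."""

    def __init__(self, execution_device: torch.device) -> None:
        self.execution_device = execution_device
        self._resident: Optional[nn.Module] = None

    def _pre_forward_hook(self, module: nn.Module, args: tuple) -> None:
        if self._resident is module:
            return  # already on the execution device
        if self._resident is not None:
            self._resident.to("cpu")  # evict the previous occupant
        module.to(self.execution_device)
        self._resident = module

    def register(self, *modules: nn.Module) -> None:
        for m in modules:
            m.register_forward_pre_hook(self._pre_forward_hook)
```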
Key features:
Files changed:
TODO
Test Plan
Test Result