
[feature] cpu offloading support for diffusion#497

Merged
ZJY0516 merged 19 commits into vllm-project:main from LawJarp-A:feature/cpu_offloading_support
Jan 12, 2026
Conversation


@LawJarp-A LawJarp-A commented Dec 27, 2025


Purpose

Fixes #412

Implementation

Uses a mutual-exclusion swap pattern via PyTorch forward pre-hooks:

  • DiT (transformer) and encoders alternate GPU access
  • Before encoder forward: DiT → CPU, encoder → GPU
  • Before DiT forward: encoders → CPU, DiT → GPU
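
The swap pattern above can be sketched with PyTorch forward pre-hooks. This is a minimal illustration of the idea, not the PR's actual `SequentialOffloader`; the `SwapOffloader` class and its group names are hypothetical:

```python
import torch
import torch.nn as nn


class SwapOffloader:
    """Keep at most one module group on the execution device at a time."""

    def __init__(self, groups, execution_device):
        self.groups = groups
        self.execution_device = execution_device
        for name, modules in groups.items():
            for m in modules:
                m.register_forward_pre_hook(self._make_hook(name))

    def _make_hook(self, active):
        def hook(module, args):
            # Evict every other group to CPU, then bring this group in.
            for name, modules in self.groups.items():
                if name != active:
                    for m in modules:
                        m.to("cpu")
            for m in self.groups[active]:
                m.to(self.execution_device)
        return hook


device = "cuda" if torch.cuda.is_available() else "cpu"
encoder, dit = nn.Linear(4, 4), nn.Linear(4, 4)
SwapOffloader({"encoder": [encoder], "dit": [dit]}, device)
x = torch.randn(2, 4).to(device)
h = encoder(x)  # pre-hook: dit -> cpu, encoder -> device
y = dit(h)      # pre-hook: encoder -> cpu, dit -> device
```

Because the hooks fire before each forward, the pipeline never has to place modules explicitly.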

Key features:

  • Single config flag: dit_cpu_offload=True
  • Zero pipeline code changes required; hooks handle device placement automatically
  • Pinned CPU memory for faster PCIe transfers (pin_cpu_memory=True by default)
  • Compatible with TeaCache and Cache-DiT
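
Pinned (page-locked) CPU memory is what makes the PCIe transfers faster, since pinned pages permit DMA and `non_blocking=True` host-to-device copies. A minimal sketch, using a hypothetical `pin_module_params` helper (pinning is skipped when no CUDA context is available, since page-locking requires one):

```python
import torch
import torch.nn as nn


def pin_module_params(module: nn.Module) -> None:
    """Pin a module's CPU parameters so later H2D copies can use DMA."""
    if not torch.cuda.is_available():
        return  # page-locked memory requires a CUDA context
    for p in module.parameters():
        if p.data.device.type == "cpu" and not p.data.is_pinned():
            p.data = p.data.pin_memory()


lin = nn.Linear(4, 4)
pin_module_params(lin)
```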

Files changed:

  • vllm_omni/diffusion/offload.py - New file with SequentialOffloader class
  • vllm_omni/diffusion/data.py - Simplified config to single dit_cpu_offload flag
  • vllm_omni/diffusion/worker/gpu_worker.py - Hook application after model load

TODO

  • Fix slow model loading when offloading is enabled
  • Add tests
  • Add documentation

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


@LawJarp-A LawJarp-A changed the title Feature/cpu offloading support Feature/cpu offloading support (wip) Dec 27, 2025
@hsliuustc0106 hsliuustc0106 changed the title Feature/cpu offloading support (wip) [WIP] Feature/cpu offloading support Dec 27, 2025
@LawJarp-A LawJarp-A mentioned this pull request Dec 27, 2025
1 task
@LawJarp-A LawJarp-A force-pushed the feature/cpu_offloading_support branch from 3464254 to aed5d72 Compare December 27, 2025 09:33

ZJY0516 commented Dec 27, 2025

Could you please test the cache functionality (Cache-DiT and TeaCache) with offloading enabled?

@LawJarp-A (Contributor, Author) replied:

> Could you please test the cache functionality (Cache-DiT and TeaCache) with offloading enabled?

Tested it, @ZJY0516. Is this an agreeable approach for you, rather than adding it as a hook like TeaCache?

| Category | Config | Time (s) | Speedup vs Baseline |
| --- | --- | --- | --- |
| No Cache | no_cache_no_offload | 4.06 | baseline |
| No Cache | no_cache_with_offload | 14.20 | 0.29× |
| Cache-DiT | cache_dit_no_offload | 2.51 | 1.62× |
| Cache-DiT | cache_dit_with_offload | 10.70 | 1.33× vs offload |
| TeaCache | teacache_no_offload | 3.22 | 1.26× |
| TeaCache | teacache_with_offload | 11.63 | 1.22× vs offload |


ZJY0516 commented Dec 30, 2025

I was wondering: what if we want to overlap computation and offloading?


```python
def _pre_forward_hook(self, module: nn.Module, args: tuple) -> None:
    """Move module to GPU before forward."""
    module.to(self.execution_device)
```

It seems this is a synchronous, blocking transfer from CPU to GPU. Would it be more efficient to use async transfers with CUDA streams?
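
For reference, an asynchronous transfer along the lines suggested here could look like the sketch below. `async_prefetch` is a hypothetical helper, not code from this PR, and the overlap only materializes on CUDA with pinned source memory; without CUDA it degrades to a plain blocking move:

```python
import torch
import torch.nn as nn


def async_prefetch(module: nn.Module, device: torch.device) -> None:
    """Move a module to `device` on a side stream so copies overlap compute."""
    if device.type != "cuda" or not torch.cuda.is_available():
        module.to(device)  # fallback: ordinary synchronous move
        return
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        # non_blocking copies only overlap when source tensors are pinned
        module.to(device, non_blocking=True)
    # Make the default stream wait before it touches the weights.
    torch.cuda.current_stream().wait_stream(stream)


lin = nn.Linear(8, 8)
async_prefetch(lin, torch.device("cpu"))
```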

@LawJarp-A LawJarp-A force-pushed the feature/cpu_offloading_support branch 2 times, most recently from 920e8b5 to ba54241 Compare January 6, 2026 05:46
@SamitHuang SamitHuang mentioned this pull request Jan 7, 2026
41 tasks
@hsliuustc0106 (Collaborator):

Please check whether this blog post helps: https://zhuanlan.zhihu.com/p/1986157623695922659


ZJY0516 commented Jan 9, 2026

Given the high demand for offloading, I think we can leave efficient overlap to a follow-up PR. @LawJarp-A @SamitHuang @hsliuustc0106

@ZJY0516 left a comment:

I didn't see any evidence that the text encoder is offloaded when I ran it locally:

```python
from vllm_omni import Omni

if __name__ == "__main__":
    m = Omni(model="Qwen/Qwen-Image", text_encoder_cpu_offload=True)

    outputs = m.generate(
        "a photo of a cat sitting on a laptop keyboard",
        height=1024,
        width=1024,
        num_inference_steps=50,
        num_outputs_per_prompt=1,
    )
```

@bwyangseek

That makes sense. I’ve previously worked on overlapping GPU compute with CPU communication using a dual-thread, multi-stream, and thread-switching approach. A similar micro-batch + async model could help overlap offloading with compute instead of serializing steps.
Happy to explore this in a follow-up PR once the current one lands.
Could you please clarify if handling this overlap falls under this PR, the roadmap, or should be a separate task? @LawJarp-A @SamitHuang @hsliuustc0106


ZJY0516 commented Jan 10, 2026

It should be a separate task. @bwyangseek

@ZJY0516 ZJY0516 force-pushed the feature/cpu_offloading_support branch 3 times, most recently from eeeafc6 to ca34996 Compare January 10, 2026 12:30
@ZJY0516 ZJY0516 force-pushed the feature/cpu_offloading_support branch from ca34996 to a388b03 Compare January 10, 2026 13:04

ZJY0516 commented Jan 10, 2026

The main problem is that model loading is very slow when offloading is enabled.

@SamitHuang left a comment:

Thanks for the nice work. What kinds of models were tested? Are t2i, i2i, and t2v all covered?

- **CLI**: pass `--dit-cpu-offload` to the diffusion service entrypoint.

## Known Limitations
- Cold start latency increases by over a minute for some models (e.g., Qwen-Image)

What is the reason for the latency? Is it proportional to model size?


I suspect the root cause is CPU-side model initialization time, which seems proportional to model size (not sure). While Z-Image incurs only a small delay, Qwen-Image's much larger size results in over a minute of initialization latency.

@ZJY0516 ZJY0516 requested a review from SamitHuang January 11, 2026 13:13
Comment on lines +190 to +198
```python
for dit_mod in dit_modules:
    dit_mod.to("cpu")
if torch.cuda.is_available():
    torch.cuda.empty_cache()
if pin and torch.cuda.is_available():
    for dit_mod in dit_modules:
        for p in dit_mod.parameters():
            if p.data.device.type == "cpu" and not p.data.is_pinned():
                p.data = p.data.pin_memory()
```
@yuanheng-zhao (Contributor) commented Jan 11, 2026:

A note on this part:

If DiT modules are initialized on the GPU first (not the current case), two copies of the parameters happen on the CPU: GPU params -> CPU params -> CPU pinned params. Creating an empty tensor with `pin_memory=True` and then copying the existing param data in place would ensure a single-copy path, for the case where the DiT modules are not residing on the CPU at some moment.

For now, it does not make a difference, since the usage in the GPU worker ensures the model loader loads DiT modules on the CPU directly when dit_cpu_offload is enabled:

```python
class GPUWorker:
    def init_device_and_model(self) -> None:
        ...
        load_device = "cpu" if self.od_config.dit_cpu_offload else str(self.device)
        ...
        self.pipeline = model_loader.load_model(
            od_config=self.od_config,
            load_device=load_device,
        )
```

cc @ZJY0516
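
The single-copy path suggested above (allocate pinned, then copy in place) could be sketched as follows. `to_pinned_cpu_single_copy` is a hypothetical helper, with a plain copy fallback when CUDA is unavailable:

```python
import torch


def to_pinned_cpu_single_copy(t: torch.Tensor) -> torch.Tensor:
    """One copy total: source tensor (possibly on GPU) -> freshly pinned CPU tensor."""
    if not torch.cuda.is_available():
        return t.detach().clone().cpu()  # pinning needs CUDA; plain copy fallback
    # Allocate directly in pinned memory, then copy once into it.
    pinned = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
    pinned.copy_(t)
    return pinned


src = torch.arange(6, dtype=torch.float32).reshape(2, 3)
dst = to_pinned_cpu_single_copy(src)
```

This avoids the intermediate unpinned CPU tensor that `t.to("cpu")` followed by `pin_memory()` would create.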


Thanks for the reminder. Initializing on the CPU is currently a necessary workaround due to our architecture, which requires all components to be initialized simultaneously. This can cause significant slowness in some cases. To address this, we plan to redesign the architecture to allow per-component GPU initialization, followed by selective offloading to the CPU.

@hsliuustc0106 (Collaborator):

Please attach your design doc using this template in your RFC :)


ZJY0516 commented Jan 12, 2026

@codex review

@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd7c1c1b4c


@ZJY0516 ZJY0516 added the ready label (triggers buildkite CI) Jan 12, 2026
@ZJY0516 ZJY0516 enabled auto-merge (squash) January 12, 2026 10:39
@ZJY0516 ZJY0516 merged commit 779f598 into vllm-project:main Jan 12, 2026
6 of 7 checks passed
```python
m = Omni(model="Qwen/Qwen-Image", enable_cpu_offload=True)
```

- **CLI**: pass `--dit-cpu-offload` to the diffusion service entrypoint.

Change to `enable-cpu-offload`.

qibaoyuan pushed a commit to qibaoyuan/vllm-omni that referenced this pull request Jan 12, 2026
sniper35 pushed a commit to sniper35/vllm-omni that referenced this pull request Jan 14, 2026

bwyangseek commented Jan 15, 2026

> It should be a separate task. @bwyangseek

Thank you for your kind reply! I have proposed PR #798, based on the previous discussions, to implement overlapping, with improvements and optimizations. The focus is on the DiT computation: instead of loading the entire DiT model onto the GPU as in #497, the DiT is split into block-level execution. During the computation of one block, the weights for the next block are asynchronously prefetched on independent CUDA streams to achieve compute/copy overlap, thus improving GPU utilization rather than just reducing peak memory usage. Additionally, I plan to inherit part of the ideas from #497 and modify the SequentialOffloader so that, after the encoders finish computation, the DiT is not fully loaded onto the GPU but performs an overlapped forward pass.

Currently the code has a preliminary framework, which has been submitted in PR #798, but I am still refining the logic and code and conducting tests. I would appreciate your feedback on this approach. Thank you! @ZJY0516
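
The block-level compute/prefetch overlap described above could be sketched as follows. This is illustrative only, not the actual code in PR #798; `forward_with_prefetch` is a hypothetical name, and without CUDA it degrades to a sequential load-compute-evict loop:

```python
import torch
import torch.nn as nn


def forward_with_prefetch(blocks, x, device):
    """Run blocks sequentially; prefetch block i+1's weights while block i computes."""
    use_cuda = device.type == "cuda" and torch.cuda.is_available()
    copy_stream = torch.cuda.Stream() if use_cuda else None
    blocks[0].to(device)
    for i, blk in enumerate(blocks):
        if i + 1 < len(blocks):
            if use_cuda:
                with torch.cuda.stream(copy_stream):
                    # Overlaps with blk's compute when weights are pinned.
                    blocks[i + 1].to(device, non_blocking=True)
            else:
                blocks[i + 1].to(device)
        x = blk(x)
        if use_cuda:
            # Don't start the next block until its weights have arrived.
            torch.cuda.current_stream().wait_stream(copy_stream)
        blk.to("cpu")  # evict the finished block to keep peak memory low
    return x


blocks = nn.ModuleList(nn.Linear(4, 4) for _ in range(3))
out = forward_with_prefetch(blocks, torch.randn(2, 4), torch.device("cpu"))
```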

with1015 pushed a commit to with1015/vllm-omni that referenced this pull request Jan 20, 2026

Labels

ready label to trigger buildkite CI


Development

Successfully merging this pull request may close these issues.

[RFC]: CPU offloading support

7 participants