[feature] cpu offloading support for diffusion #497
ZJY0516 merged 19 commits into vllm-project:main
Conversation
Could you please test the cache functionality (cache-dit and tea-cache) when offloading is enabled?
Tested it @ZJY0516
I was wondering: what if we want to overlap computation and offloading?
vllm_omni/diffusion/offload.py
Outdated
```python
def _pre_forward_hook(self, module: nn.Module, args: tuple) -> None:
    """Move module to GPU before forward."""
    module.to(self.execution_device)
```
It seems this is a synchronous, blocking transfer from CPU to GPU. Would it be more efficient to use async transfers with CUDA streams?
Please check whether this blog post will help: https://zhuanlan.zhihu.com/p/1986157623695922659
Given the high demand for offloading, I think we can leave efficient overlap to a follow-up PR. @LawJarp-A @SamitHuang @hsliuustc0106
I didn't see any evidence that the text encoder is offloaded when I run it locally:
```python
from vllm_omni import Omni

if __name__ == "__main__":
    m = Omni(model="Qwen/Qwen-Image", text_encoder_cpu_offload=True)
    outputs = m.generate(
        "a photo of a cat sitting on a laptop keyboard",
        height=1024,
        width=1024,
        num_inference_steps=50,
        num_outputs_per_prompt=1,
    )
```
That makes sense. I've previously worked on overlapping GPU compute with CPU communication using a dual-thread, multi-stream, thread-switching approach. A similar micro-batch + async model could help overlap offloading with compute instead of serializing the steps.
It should be a separate task. @bwyangseek
Co-authored-by: Prajwal A <[email protected]> Co-authored-by: zjy0516 <[email protected]> Signed-off-by: zjy0516 <[email protected]>
The main problem is that model loading is very slow when offloading is enabled.
SamitHuang left a comment
Thanks for the nice work. What kinds of models are tested? Are t2i, i2i, and t2v all covered?
- **CLI**: pass `--dit-cpu-offload` to the diffusion service entrypoint.

## Known Limitations
- Cold start latency increases by over one minute for some models (e.g., Qwen-Image)
What is the reason for the latency? Is it proportional to the model size?
I suspect the root cause is CPU-side model initialization time, which seems proportional to model size (not sure). While Z-Image incurs only a small delay, Qwen-Image's much larger size results in over a minute of initialization latency.
```python
for dit_mod in dit_modules:
    dit_mod.to("cpu")
if torch.cuda.is_available():
    torch.cuda.empty_cache()
if pin and torch.cuda.is_available():
    for dit_mod in dit_modules:
        for p in dit_mod.parameters():
            if p.data.device.type == "cpu" and not p.data.is_pinned():
                p.data = p.data.pin_memory()
```
A note for this part:
If DiT modules are initialized on GPU first (not the current case), two parameter copies happen on the CPU side: GPU params -> CPU params -> CPU pinned params. Creating an empty tensor with pin_memory=True and then copying the existing param data into it in place would guarantee a single-copy path, for the case where the DiT modules are not residing on CPU at some moment.
For now, it makes no difference, since the usage in the GPU worker ensures the model loader loads DiT modules on CPU directly when dit_cpu_offload is enabled:
```python
class GPUWorker:
    def init_device_and_model(self) -> None:
        ...
        load_device = "cpu" if self.od_config.dit_cpu_offload else str(self.device)
        ...
        self.pipeline = model_loader.load_model(
            od_config=self.od_config,
            load_device=load_device,
        )
```

cc @ZJY0516
Thanks for the reminder. Initializing on the CPU is currently a necessary workaround due to our architecture, which requires all components to be initialized simultaneously. This can lead to significant slowness in some cases. To address this, we plan to redesign the architecture to allow per-component GPU initialization, followed by selective offloading to the CPU.
Please attach your design doc using this template in your RFC :)
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dd7c1c1b4c
```python
m = Omni(model="Qwen/Qwen-Image", enable_cpu_offload=True)
```

- **CLI**: pass `--dit-cpu-offload` to the diffusion service entrypoint.
Change this to `enable-cpu-offload`.
Thank you for your kind reply! I have proposed PR #798, based on the previous discussions, to implement overlapping, with improvements and optimizations. The focus is on the DiT computation: instead of loading the entire DiT model onto the GPU as in #497, the DiT is split into block-level execution. During the computation of one block, the weights for the next block are asynchronously prefetched on independent CUDA streams to achieve compute/copy overlap, improving GPU utilization rather than just reducing peak memory usage. Additionally, I plan to inherit part of the ideas from #497 and modify the SequentialOffloader so that after the encoders finish computation, the DiT is not fully loaded onto the GPU but performs an overlapped forward. The code currently has a preliminary framework, submitted in PR #798, but I am still refining the logic and code and conducting tests. I would appreciate your feedback on this approach. Thank you! @ZJY0516
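The block-level overlap described here could be outlined roughly as follows (an illustrative sketch only, not the actual #798 implementation; `forward_with_block_prefetch` and its synchronization points are assumptions, and weights are assumed to live in pinned CPU memory so the non-blocking copies can overlap):

```python
import torch
import torch.nn as nn


def forward_with_block_prefetch(
    blocks: nn.ModuleList,
    x: torch.Tensor,
    device: torch.device,
) -> torch.Tensor:
    """Run transformer blocks one at a time, prefetching block i+1's
    weights on a side stream while block i computes on the default
    stream."""
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.current_stream()

    blocks[0].to(device, non_blocking=True)  # warm up the first block
    for i, block in enumerate(blocks):
        if i + 1 < len(blocks):
            # Start copying the next block's weights on the side stream.
            with torch.cuda.stream(copy_stream):
                blocks[i + 1].to(device, non_blocking=True)
        x = block(x)  # compute on the default stream
        compute_stream.synchronize()  # block i's kernels must finish
        block.to("cpu")               # before its weights are evicted
        compute_stream.wait_stream(copy_stream)  # next block must have arrived
    return x
```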
Purpose
FIX #412
Implementation
Uses a mutual-exclusion swap pattern via PyTorch forward pre-hooks:
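The pattern might look roughly like this (an illustrative sketch; the class and method names are hypothetical, though `_pre_forward_hook` mirrors the hook discussed in the review):

```python
from typing import Optional

import torch
import torch.nn as nn


class SequentialCpuOffloader:
    """Mutual-exclusion swap via forward pre-hooks: the hook moves
    the module about to run onto the execution device and evicts
    whichever module currently occupies it, so only one large
    component is resident at a time."""

    def __init__(self, execution_device: torch.device) -> None:
        self.execution_device = execution_device
        self._resident: Optional[nn.Module] = None

    def _pre_forward_hook(self, module: nn.Module, args: tuple) -> None:
        if self._resident is module:
            return  # already on the execution device
        if self._resident is not None:
            self._resident.to("cpu")  # evict the previous occupant
        module.to(self.execution_device)
        self._resident = module

    def register(self, *modules: nn.Module) -> None:
        for m in modules:
            m.register_forward_pre_hook(self._pre_forward_hook)
```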
Key features:
Files changed:
TODO
Test Plan
Test Result