Support sleep, wake_up and load_weights for Omni Diffusion #376
ZJY0516 merged 11 commits into vllm-project:main
Conversation
Signed-off-by: knlnguyen1802 <[email protected]>
Force-pushed 87a2ce0 to db68abf
Add unit test
```python
def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
    return self.pipeline.load_weights(weights)

def sleep(self, level: int = 1) -> bool:
```
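For context, the methods in the diff could delegate to the pipeline roughly as in this minimal sketch. The `DummyPipeline` and `OmniDiffusionRunner` classes and their internals are hypothetical stand-ins for illustration, not the actual vLLM-Omni implementation:

```python
from collections.abc import Iterable

class DummyPipeline:
    """Hypothetical stand-in for the real diffusion pipeline, which would
    load weights into its components (text encoder, VAE, etc.)."""

    def __init__(self):
        self.weights = {}

    def load_weights(self, weights):
        loaded = set()
        for name, tensor in weights:
            self.weights[name] = tensor
            loaded.add(name)
        return loaded

class OmniDiffusionRunner:
    """Sketch of a runner that forwards weight loading to its pipeline."""

    def __init__(self, pipeline):
        self.pipeline = pipeline
        self._asleep = False

    def load_weights(self, weights: Iterable[tuple[str, object]]) -> set[str]:
        # Delegates to the pipeline and returns the set of loaded names.
        return self.pipeline.load_weights(weights)

    def sleep(self, level: int = 1) -> bool:
        # A real implementation would offload GPU memory here.
        self._asleep = True
        return True

    def wake_up(self) -> bool:
        # A real implementation would move memory back to the GPU here.
        self._asleep = False
        return True

runner = OmniDiffusionRunner(DummyPipeline())
loaded = runner.load_weights([("unet.conv_in.weight", None)])
print(sorted(loaded))  # ['unet.conv_in.weight']
```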
Do we need to add a function in the engine to call this and expose an interface to the user?
I'd like to discuss one more point. Since vLLM-Omni uses libraries like Transformers and Diffusers to load components (e.g., the text encoder and VAE), does the current sleep method also handle memory allocated by these external libraries? @knlnguyen1802
The answer is no: sleep only handles memory that is allocated within the context of _maybe_get_memory_pool_context.
Could we just offload the model to CPU, like `model.to('cpu')`?
For the model, that already works in this PR because I wrap the model loader in the context of _maybe_get_memory_pool_context.
Being able to call model.to('cpu') directly would make the current PR, with its new _maybe_get_memory_pool_context, seem unnecessarily complex. |
Yes, but CuMemAllocator is a well-defined class that makes it easier to track how many GB of memory are offloaded and moved back to the GPU on wake-up, and it also tracks the time these operations take. It can also enable further optimizations.
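The bookkeeping argument above can be illustrated with a toy allocator-style context manager. The `MemoryPoolTracker` class and its counters are hypothetical and do not reflect vLLM's actual CuMemAllocator API; the sketch only shows the pattern: allocations registered inside the context are tracked, so sleep/wake_up can report how much memory moved and how long it took, while allocations made outside the context (e.g., by external libraries) are invisible to it:

```python
import time
from contextlib import contextmanager

class MemoryPoolTracker:
    """Toy stand-in for an allocator like CuMemAllocator: it only knows
    about allocations registered inside its context, mirroring how sleep
    only handles memory wrapped by _maybe_get_memory_pool_context."""

    def __init__(self):
        self.tracked_bytes = 0
        self.on_gpu = True
        self.last_sleep_seconds = 0.0

    @contextmanager
    def use_memory_pool(self):
        # Anything registered via track() inside this context is managed.
        yield self

    def track(self, num_bytes: int):
        self.tracked_bytes += num_bytes

    def sleep(self) -> int:
        # "Offload" all tracked memory and record how long it took.
        start = time.perf_counter()
        self.on_gpu = False
        self.last_sleep_seconds = time.perf_counter() - start
        return self.tracked_bytes

    def wake_up(self):
        self.on_gpu = True

tracker = MemoryPoolTracker()
with tracker.use_memory_pool():
    tracker.track(2 * 1024**3)  # e.g., a 2 GiB weight tensor loaded in-context

untracked = 512 * 1024**2  # allocated outside the context: invisible to sleep()
offloaded = tracker.sleep()
print(offloaded // 1024**3)  # 2
```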
…vllm-omni into diffusion_support
Head branch was pushed to by a user without write access
Could you please add some docs for sleep mode? @knlnguyen1802
Got it, I'll add them in a new PR.
…ect#376) Signed-off-by: knlnguyen1802 <[email protected]>
Purpose
Fix #316
This adds support for loading and offloading weights for the Diffusion model.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.
BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)