Add diffusion LoRA request path and worker cache #657
dongbo910220 wants to merge 4 commits into vllm-project:main from
Conversation
Conversation
Signed-off-by: dongbo910220 <1275604947@qq.com>
vllm_omni/diffusion/data.py
Outdated
max_lora_cache_vram: float = 4.0  # GiB per worker
max_lora_cache_cpu: float = 8.0  # GiB per worker (placeholder for future CPU caching)
why and how to tune max_lora_cache_vram and max_lora_cache_cpu?
same question. Why we need this
@dongbo910220 please take a look. vLLM appears to use simple count-based eviction, so I think these are not really needed. The same goes for lora_evict_interval below.
Done. Switched to count-based LRU to align with vLLM. PTAL.
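Count-based LRU eviction of the kind vLLM uses can be sketched roughly as below. This is an illustrative sketch, not the PR's actual cache code: the class name, `max_loras` parameter, and the use of a generic weights object are all assumptions for the example.

```python
from collections import OrderedDict


class LRULoRACache:
    """Evict by adapter count rather than VRAM bytes: keep at most
    max_loras adapters, dropping the least recently used one."""

    def __init__(self, max_loras: int = 4):
        self.max_loras = max_loras
        self._cache: "OrderedDict[str, object]" = OrderedDict()

    def get(self, name: str):
        if name not in self._cache:
            return None
        self._cache.move_to_end(name)  # mark as most recently used
        return self._cache[name]

    def put(self, name: str, weights: object) -> None:
        if name in self._cache:
            self._cache.move_to_end(name)
        self._cache[name] = weights
        while len(self._cache) > self.max_loras:
            self._cache.popitem(last=False)  # evict least recently used
```

With a count limit, the only knob to tune is the number of resident adapters, which sidesteps the question of sizing a VRAM budget per worker.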
Is it better to re-use (or inherit) the LoRAModelManager, LRUCacheLoRAModelManager, and WorkerLoRAManager in vLLM?
Based on the current vLLM implementation, defining a separate set of managers is more appropriate:
- WorkerLoRAManager is closely coupled with LLM-specific initialization (embedding/vocab_size), which makes direct reuse less suitable in the diffusion context.
- If we inherit from the vLLM managers, we will need to override / rewrite the add_adapter-related LoRA handling logic: vLLM returns a boolean, while in vLLM-Omni's gpu_worker.py a dict-format response {"status": "error", "error": str(e)} is expected for RPC.
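The boolean-vs-dict mismatch described above could be bridged with a thin wrapper on the worker side. This is a hedged sketch of the idea, not the PR's actual gpu_worker.py code; `rpc_add_lora` and its signature are hypothetical names for illustration.

```python
from typing import Callable


def rpc_add_lora(add_lora_fn: Callable[[str], bool], lora_name: str) -> dict:
    """Convert a vLLM-style boolean add-adapter result into the
    dict-format RPC response the Omni worker expects."""
    try:
        ok = add_lora_fn(lora_name)
        if ok:
            return {"status": "ok"}
        return {"status": "error", "error": f"failed to load LoRA '{lora_name}'"}
    except Exception as e:
        return {"status": "error", "error": str(e)}
```

A wrapper like this would let the inherited managers keep their boolean contract while the RPC layer stays unchanged.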
The current implementation from @dongbo910220 works for linear-layer LoRA. PEFT may need to be incorporated to stay consistent with vLLM and to enable greater flexibility; the PEFT-related logic may also differ from base vLLM.
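For linear layers, applying a LoRA at a given scale amounts to merging the low-rank update into the weight: W' = W + scale * (B @ A). A minimal NumPy sketch, with shapes and names chosen arbitrarily for illustration (not taken from the PR):

```python
import numpy as np


def apply_linear_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
                      scale: float = 0.8) -> np.ndarray:
    """Merge a low-rank LoRA update into a linear weight:
    W' = W + scale * (B @ A), where rank r << min(out, in)."""
    assert B.shape[1] == A.shape[0], "rank dimensions must match"
    return W + scale * (B @ A)


rng = np.random.default_rng(0)
out_f, in_f, r = 16, 32, 4
W = rng.standard_normal((out_f, in_f))
A = rng.standard_normal((r, in_f))   # down-projection
B = np.zeros((out_f, r))             # B starts at zero, so the update is a no-op
merged = apply_linear_lora(W, A, B, scale=0.8)
```

The `scale` parameter here plays the same role as the `"scale": 0.8` field in the request payload shown in the test plan below.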
ZJY0516 left a comment:
What blocks supporting custom kernels like QKVParallelLinear?
I am working on an implementation integrating PEFT and the vLLM LoRA custom modules (which add QKVParallelLinear support). It will be merged into this work once ready.
Thanks. If supporting the linear layer within vLLM isn't feasible, then implementing it on the Omni diffusion side would be a significant burden.
Signed-off-by: dongbo910220 <1275604947@qq.com>
Thanks for the suggestion! Will add:
Signed-off-by: dongbo910220 <1275604947@qq.com> Co-authored-by: AndyZhou952 <jzhoubc@connect.usk.hk>
Please refer to #758 for the PEFT design, given the large amount of refactoring required for PEFT adaptation. Tentatively removed whitelist support to be consistent with vLLM behavior.
Following the discussion with @AndyZhou952, this PR implements the initial LoRA support framework.
Purpose
Fixes #281
Add request-level dynamic LoRA support for diffusion models (SD3.5/SDXL). This enables:
Test Plan
python -m vllm_omni.entrypoints.openai.api_server \
  --model stabilityai/stable-diffusion-3.5-large \
  --lora-dirs /path/to/lora-test

curl -X POST http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a boy", "lora": {"name": "rafadan", "local_path": "/path/to/lora.safetensors", "scale": 0.8}}'
Test Result
Limitations (Future Work)
Co-authored-by: AndyZhou952
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.