Add diffusion LoRA request path and worker cache#657

Closed
dongbo910220 wants to merge 4 commits into vllm-project:main from dongbo910220:lora_support

Conversation

@dongbo910220
Contributor

Following the discussion with @AndyZhou952, this PR implements the initial LoRA support framework.

Purpose

Fixes #281

Add request-level dynamic LoRA support for diffusion models (SD3.5/SDXL). This enables:

  • Load/unload adapters at runtime without service restart
  • Per-worker LRU cache with configurable VRAM budget
  • Path whitelist for security
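
For illustration, the path-whitelist idea above amounts to a resolve-then-prefix check. The function name and exact error handling below are a sketch, not the PR's actual code:

```python
from pathlib import Path

def validate_lora_path(requested: str, lora_dirs: list[str]) -> Path:
    """Resolve the requested path and require it to live under a whitelisted dir."""
    resolved = Path(requested).resolve()  # collapses symlinks and ".." components
    for allowed in lora_dirs:
        if resolved.is_relative_to(Path(allowed).resolve()):
            return resolved
    raise ValueError(f"LoRA path {requested!r} is outside the configured --lora-dirs")
```

Resolving before comparing prevents `..`-traversal out of the allowed directories.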

Test Plan

  1. Start server with SD3.5 and LoRA whitelist:
     python -m vllm_omni.entrypoints.openai.api_server \
       --model stabilityai/stable-diffusion-3.5-large \
       --lora-dirs /path/to/lora-test
  2. Send request with LoRA:
     curl -X POST http://localhost:8000/v1/images/generations \
       -H "Content-Type: application/json" \
       -d '{"prompt": "a boy", "lora": {"name": "rafadan", "local_path": "/path/to/lora.safetensors", "scale": 0.8}}'

Test Result

  • Server started successfully with LoRA whitelist configured
  • Request returned 200, image saved
  • Logs confirm LoRA path validation passed and adapter loaded
  • No OOM or exceptions observed

Limitations (Future Work)

  • Only supports nn.Linear layers (custom kernels like QKVParallelLinear not patched)
  • Single LoRA per request (multi-LoRA composition not yet supported)
  • No background eviction scheduler (on-demand only)
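
The nn.Linear-only patching mentioned above can be sketched generically as wrapping an existing linear layer with a low-rank residual; this is a minimal illustration of the technique, not the PR's implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap an existing nn.Linear with a low-rank update: y = Wx + scale * B(A(x))."""
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 0.8):
        super().__init__()
        self.base = base
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # zero-init B so the wrapper starts as a no-op
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Fused or sharded layers such as QKVParallelLinear need custom handling because their weights are not a single plain in_features-by-out_features matrix, which is why they are out of scope here.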

Co-authored-by: AndyZhou952


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


Signed-off-by: dongbo910220 <1275604947@qq.com>
Comment on lines 287 to 288
max_lora_cache_vram: float = 4.0 # GiB per worker
max_lora_cache_cpu: float = 8.0 # GiB per worker (placeholder for future CPU caching)
Collaborator

@SamitHuang Jan 6, 2026

why and how to tune max_lora_cache_vram and max_lora_cache_cpu?

Collaborator

Same question: why do we need this?

Contributor

@dongbo910220 please take a look. vllm appears to use a simple count-based eviction so I think these are not really needed. Same for lora_evict_interval below.

Contributor Author

Done. Switched to count-based LRU to align with vLLM. PTAL.
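
A count-based LRU of the kind described can be sketched with an OrderedDict; the class name and default capacity below are illustrative, not the PR's code:

```python
from collections import OrderedDict

class LoRALRUCache:
    """Count-based LRU for loaded adapters, in the spirit of vLLM's capacity-based eviction."""
    def __init__(self, max_loras: int = 4):
        self.max_loras = max_loras
        self._cache: "OrderedDict[str, object]" = OrderedDict()

    def get(self, name: str):
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as most recently used
            return self._cache[name]
        return None

    def put(self, name: str, adapter: object) -> None:
        self._cache[name] = adapter
        self._cache.move_to_end(name)
        while len(self._cache) > self.max_loras:
            self._cache.popitem(last=False)  # evict the least recently used adapter
```

Evicting by count sidesteps the tuning question raised above, since there is no VRAM budget to estimate.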

Collaborator

Is it better to re-use (or inherit) the LoRAModelManager, LRUCacheLoRAModelManager, and WorkerLoRAManager in vLLM?

Contributor

Based on the current vLLM implementation, defining a separate set of managers is more appropriate, since:

  1. WorkerLoRAManager is closely coupled with LLM-specific initialization (embedding/vocab_size), which makes direct reuse less suitable in the diffusion context.
  2. If we inherit from the vLLM managers, we would need to override or rewrite the add_adapter-related LoRA handling logic: current vLLM returns a boolean, while vLLM-Omni's gpu_worker.py expects a dict-format response {"status": "error", "error": str(e)} for RPC.

The current implementation from @dongbo910220 works on linear LoRA layers. PEFT may need to be incorporated to stay consistent with vLLM and to enable greater flexibility; the PEFT-related logic may also differ from base vLLM.
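
The dict-format RPC contract described in point 2 can be illustrated with a small wrapper; the manager API here is hypothetical:

```python
def add_adapter_rpc(manager, request) -> dict:
    """Wrap adapter loading so the worker RPC always returns a status dict,
    unlike vLLM's boolean add_adapter return value."""
    try:
        manager.add_adapter(request)  # hypothetical manager call, stands in for the real API
        return {"status": "ok"}
    except Exception as e:
        return {"status": "error", "error": str(e)}
```

Returning a dict lets the RPC layer carry the error message back to the caller instead of collapsing failures into False.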

@SamitHuang
Collaborator

SamitHuang commented Jan 6, 2026

  1. This PR seems to mainly consider dynamic LoRA loading, where the whitelist lora_dirs on the server and the LoRA path in the request must be configured. I think static LoRA loading is also commonly used and should be supported in the first version, where we just need to set lora_path on the server.

  2. Please add an example script or the related usage doc.

@SamitHuang mentioned this pull request Jan 7, 2026
@hsliuustc0106 requested a review from ZJY0516 January 7, 2026
Collaborator

@ZJY0516 left a comment

what blocks supporting custom kernels like QKVParallelLinear?


@AndyZhou952
Contributor

what blocks supporting custom kernels like QKVParallelLinear?

I am working on an implementation integrating peft & vllm lora custom modules (which add QKVParallelLinear support). Will be merged into this work once ready.

@ZJY0516
Collaborator

ZJY0516 commented Jan 8, 2026

what blocks supporting custom kernels like QKVParallelLinear?

I am working on an implementation integrating peft & vllm lora custom modules (which add QKVParallelLinear support). Will be merged into this work once ready.

Thanks. If supporting the linear layers via vLLM isn't feasible, then implementing them on the Omni diffusion side would be a significant burden.

Signed-off-by: dongbo910220 <1275604947@qq.com>
@dongbo910220
Contributor Author

dongbo910220 commented Jan 8, 2026

  1. This PR seems to mainly consider dynamic LoRA loading, where the whitelist lora_dirs on the server and the LoRA path in the request must be configured. I think static LoRA loading is also commonly used and should be supported in the first version, where we just need to set lora_path on the server.
  2. Please add an example script or the related usage doc.

Thanks for the suggestion! Will add:

  1. Static LoRA loading: Add --lora-path option for single LoRA at server startup
  2. Example script: Add usage examples for both static and dynamic loading modes

Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: AndyZhou952 <jzhoubc@connect.usk.hk>
@zhtmike

zhtmike commented Jan 9, 2026

I think a peft-compatible integration should be considered, including support for loading PEFT configs, weights, and related components. So far, the peft module has been the most popular way to integrate with LoRA on the Hugging Face Hub, and we should not overlook it.

@AndyZhou952
Contributor

Please refer to #758 for the PEFT design, given the large amount of refactoring required for PEFT adaptation. Whitelist support has been tentatively removed to stay consistent with vLLM behavior.


Development

Successfully merging this pull request may close these issues.

[Feature]: LoRA adapter support for vLLM alignment

5 participants