Add diffusion LoRA request path and worker cache #657
dongbo910220 wants to merge 4 commits into vllm-project:main from
Conversation
Conversation
Signed-off-by: dongbo910220 <1275604947@qq.com>
vllm_omni/diffusion/data.py
Outdated
max_lora_cache_vram: float = 4.0  # GiB per worker
max_lora_cache_cpu: float = 8.0  # GiB per worker (placeholder for future CPU caching)
why and how to tune max_lora_cache_vram and max_lora_cache_cpu?
same question. Why we need this
@dongbo910220 please take a look. vLLM appears to use simple count-based eviction, so I think these are not really needed. The same goes for lora_evict_interval below.
Done. Switched to count-based LRU to align with vLLM. PTAL.
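Count-based LRU eviction of the kind vLLM uses can be sketched roughly as below. This is an illustrative sketch, not the PR's actual cache code: the class name, `max_loras` parameter, and the use of a generic weights object are all assumptions for the example.

```python
from collections import OrderedDict


class LRULoRACache:
    """Evict by adapter count rather than VRAM bytes: keep at most
    max_loras adapters, dropping the least recently used one."""

    def __init__(self, max_loras: int = 4):
        self.max_loras = max_loras
        self._cache: "OrderedDict[str, object]" = OrderedDict()

    def get(self, name: str):
        if name not in self._cache:
            return None
        self._cache.move_to_end(name)  # mark as most recently used
        return self._cache[name]

    def put(self, name: str, weights: object) -> None:
        if name in self._cache:
            self._cache.move_to_end(name)
        self._cache[name] = weights
        while len(self._cache) > self.max_loras:
            self._cache.popitem(last=False)  # evict least recently used
```

With a count limit, the only knob to tune is the number of resident adapters, which sidesteps the question of sizing a VRAM budget per worker.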
Is it better to re-use (or inherit) the LoRAModelManager, LRUCacheLoRAModelManager, and WorkerLoRAManager in vLLM?
Based on the current vLLM implementation, defining a separate set of managers is more appropriate:
- WorkerLoRAManager is closely coupled with LLM-specific initialization (embedding/vocab_size), which makes direct reuse less suitable in the diffusion context.
- If we inherit from the vLLM managers, we will need to override / rewrite the add_adapter-related LoRA handling logic: vLLM returns a boolean, while in vLLM-Omni's gpu_worker.py a dict-format response {"status": "error", "error": str(e)} is expected for RPC.
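The boolean-vs-dict mismatch described above could be bridged with a thin wrapper on the worker side. This is a hedged sketch of the idea, not the PR's actual gpu_worker.py code; `rpc_add_lora` and its signature are hypothetical names for illustration.

```python
from typing import Callable


def rpc_add_lora(add_lora_fn: Callable[[str], bool], lora_name: str) -> dict:
    """Convert a vLLM-style boolean add-adapter result into the
    dict-format RPC response the Omni worker expects."""
    try:
        ok = add_lora_fn(lora_name)
        if ok:
            return {"status": "ok"}
        return {"status": "error", "error": f"failed to load LoRA '{lora_name}'"}
    except Exception as e:
        return {"status": "error", "error": str(e)}
```

A wrapper like this would let the inherited managers keep their boolean contract while the RPC layer stays unchanged.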
The current implementation from @dongbo910220 works for linear-layer LoRA. PEFT may need to be incorporated to stay consistent with vLLM and to enable greater flexibility; the PEFT-related logic may also differ from base vLLM.
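For linear layers, applying a LoRA at a given scale amounts to merging the low-rank update into the weight: W' = W + scale * (B @ A). A minimal NumPy sketch, with shapes and names chosen arbitrarily for illustration (not taken from the PR):

```python
import numpy as np


def apply_linear_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
                      scale: float = 0.8) -> np.ndarray:
    """Merge a low-rank LoRA update into a linear weight:
    W' = W + scale * (B @ A), where rank r << min(out, in)."""
    assert B.shape[1] == A.shape[0], "rank dimensions must match"
    return W + scale * (B @ A)


rng = np.random.default_rng(0)
out_f, in_f, r = 16, 32, 4
W = rng.standard_normal((out_f, in_f))
A = rng.standard_normal((r, in_f))   # down-projection
B = np.zeros((out_f, r))             # B starts at zero, so the update is a no-op
merged = apply_linear_lora(W, A, B, scale=0.8)
```

The `scale` parameter here plays the same role as the `"scale": 0.8` field in the request payload shown in the test plan below.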
ZJY0516 left a comment:
What blocks supporting custom kernels like QKVParallelLinear?
I am working on an implementation integrating PEFT and the vLLM LoRA custom modules (which add QKVParallelLinear support). It will be merged into this work once ready.
Thanks. If supporting the linear layer within vLLM isn't feasible, then implementing it on the Omni diffusion side would be a significant burden.
Signed-off-by: dongbo910220 <1275604947@qq.com>
Thanks for the suggestion! Will add:
Signed-off-by: dongbo910220 <1275604947@qq.com> Co-authored-by: AndyZhou952 <jzhoubc@connect.usk.hk>
Please refer to #758 for the PEFT design, given the large amount of refactoring required for PEFT adaptation. Tentatively removed whitelist support to be consistent with vLLM behavior.
Following the discussion with @AndyZhou952, this PR implements the initial LoRA support framework.
Purpose
Fixes #281
Add request-level dynamic LoRA support for diffusion models (SD3.5/SDXL). This enables:
Test Plan
python -m vllm_omni.entrypoints.openai.api_server \
  --model stabilityai/stable-diffusion-3.5-large \
  --lora-dirs /path/to/lora-test

curl -X POST http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a boy", "lora": {"name": "rafadan", "local_path": "/path/to/lora.safetensors", "scale": 0.8}}'
Test Result
Limitations (Future Work)
Co-authored-by: AndyZhou952
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.