[Feature] Diffusion LoRA Adapter Support (PEFT compatible) for vLLM alignment#758
david6666666 merged 65 commits into vllm-project:main
Conversation
@AndyZhou952 Thank you for the work.
Thanks for your interest! For this part I think we can keep it consistent with the base vLLM design.

Also I think it makes sense to unite the static/dynamic support. Essentially, static support means loading the LoRA weights at the very start, while everything else remains the same as dynamic LoRA support. Will update the design workflow and code base shortly to reflect the changes.

update 01/13: design & code updated to reflect the changes above. PTAL.
        return self.pipeline.load_weights(weights)

    def remove_lora(self, adapter_id: int) -> bool:
        return self.lora_manager.remove_adapter(adapter_id) if self.lora_manager else False
Do we need to guarantee self.lora_manager is not None?
Not needed here & removed condition checks. Thanks
vllm_omni/diffusion/lora/manager.py
Outdated
        max_num_batched_tokens=max_num_batched_tokens,
        max_batches=1,  # single request
        device=self.device,
        max_loras=1,  # single lora
QQ:
- Do SD models generally only have one concurrent LoRA?
- Do we need to develop a new punica wrapper for SD?
- For diffusion, in most scenarios it suffices to use only one LoRA. We may leave multi-LoRA to future work.
- Since the punica wrapper is mostly used for multiple-LoRA management, I think we can leave it out of this PR. One caveat is that `punica_wrapper` is too closely tied to `BaseLinearLayerWithLoRA` (it handles the LoRA calculation, and is the reason we init `punica_wrapper` in the first place), but `punica_wrapper` has quite a few LLM-specific components.
Please check this implementation (commit 955a2cf) and see if it makes sense. TL;DR: we still inherit from `BaseLinearLayerWithLoRA` in `DiffusionBaseLayerWithLoRA` but rewrite the `apply` function to eliminate the need for `punica_wrapper`. The current diffusion linear layer design inherits from both `DiffusionBaseLayerWithLoRA` and the self-defined layers in vLLM.
For a better design, we could probably decouple `punica_wrapper` and `BaseLinearLayerWithLoRA`, but this can work as a temporary solution for now. Let me know what you think. Thanks!
SD users do sometimes stack multiple LoRAs. For this initial PR, we intentionally support a single active LoRA per diffusion execution (max_loras=1). We do allow multiple adapters to be cached on CPU and swapped per request. Also note that our current diffusion runner effectively executes one request per model execution (no cross-request batching), so max_loras=1 matches the existing execution semantics; when multiple requests are passed in one call, we require the same LoRARequest and lora_scale across the batch to avoid silently applying the wrong adapter. Multi‑LoRA composition (weighted mixing, per-sample different adapters, etc.) would require an explicit API for multiple adapters + weights and a more complex kernel/memory-management path, so we’d prefer to follow up in a separate PR if/when needed.
Re punica_wrapper: we don’t introduce a diffusion-specific punica wrapper here. We still inherit from vLLM’s BaseLinearLayerWithLoRA for weight/buffer management, but in DiffusionBaseLinearLayerWithLoRA we override apply() to compute the single‑LoRA delta via direct matmuls (same shrink+expand semantics as Punica) and handle packed projections per-slice (e.g. fused QKV), avoiding the LLM-specific dependencies in punica_wrapper.
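The overridden `apply()` described here can be sketched roughly as follows. This is a hedged illustration, not the actual `DiffusionBaseLinearLayerWithLoRA` code: the function name, argument layout, and slice handling are simplified stand-ins, while the real implementation manages LoRA weight buffers through vLLM's layer machinery.

```python
import torch

def lora_apply_single(
    x: torch.Tensor,                 # (num_tokens, in_features)
    base_out: torch.Tensor,          # (num_tokens, out_features) from the base linear
    lora_a_slices: list,             # per-slice A matrices, each (rank, in_features), or None
    lora_b_slices: list,             # per-slice B matrices, each (slice_out, rank), or None
    slice_offsets: list,             # output-column offset of each slice in base_out
    scale: float,
) -> torch.Tensor:
    """Single-LoRA delta via direct matmuls (shrink + expand), per output slice.

    For a packed projection (e.g. fused QKV), each slice has its own A/B pair;
    a slice with no active adapter passes None and is skipped entirely.
    """
    for lora_a, lora_b, offset in zip(lora_a_slices, lora_b_slices, slice_offsets):
        if lora_a is None or lora_b is None:
            continue  # inactive slice: no matmuls at all
        shrink = x @ lora_a.T                     # (num_tokens, rank)
        expand = shrink @ lora_b.T                # (num_tokens, slice_out)
        base_out[:, offset:offset + lora_b.shape[0]] += scale * expand
    return base_out
```

The same shrink+expand semantics as Punica, but as two plain matmuls per active slice, which is why no punica wrapper is needed for the single-LoRA case.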
vllm_omni/diffusion/lora/manager.py
Outdated
    if static_lora_path is not None:
        logger.info("Loading static LoRA from %s with scale %.2f", static_lora_path, static_lora_scale)
        static_request = LoRARequest(
This is a follow-up from @SamitHuang's comment under #657 to support both static/dynamic LoRA. Static here means to load the LoRA during the init stage when providing the path in od_config.
I suppose now that this PR unites the processing flow of static/dynamic LoRA (all via LoRARequest), can probably update the variable naming here as well to avoid confusion.
update 01/16: updated variable naming for clarity.
vllm_omni/diffusion/lora/manager.py
Outdated
        lora_request: LoRARequest,
    ) -> tuple[LoRAModel, PEFTHelper]:

        supported_lora_modules = set(get_supported_lora_modules(self.pipeline))
Does SD have any special layers that need to support LoRA? Can get_supported_lora_modules in vLLM be used directly?
See `_expand_expected_modules_for_merged_projections` at L40 (called at L163), which handles additional cases like `add_kv_proj` and `to_qkv`.
This PR has only been tested on SD. We may still need to investigate whether it needs further expansion to support other models.
Good question. We do reuse vLLM’s get_supported_lora_modules() as the baseline, but we snapshot it before injecting LoRA wrappers — after replacement the original LinearBase modules live under .base_layer, which would make the helper return base_layer and break adapter matching across reloads.
For diffusion/SD we also need to cover merged/packed projections (e.g. to_qkv, add_kv_proj), so we expand the expected module set via _expand_expected_modules_for_merged_projections() and treat packed projections as multi-slice when replacing/activating. This has been validated on SD; we can extend the expansion map as we add more pipelines.
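The ordering concern (snapshot the supported-module names before injecting the wrappers) can be shown with a toy example. `LoRAWrapper` and `supported_module_names` below are stand-ins for illustration, not vLLM's actual helpers; the point is that after replacement a name-based scan sees `base_layer` instead of the original module name.

```python
import torch.nn as nn

class LoRAWrapper(nn.Module):
    """Toy wrapper: the original Linear now lives one level deeper, under .base_layer."""
    def __init__(self, base_layer: nn.Module):
        super().__init__()
        self.base_layer = base_layer

    def forward(self, x):
        return self.base_layer(x)

def supported_module_names(model: nn.Module) -> set[str]:
    # stand-in for get_supported_lora_modules(): last path component of each Linear
    return {name.split(".")[-1] for name, m in model.named_modules()
            if isinstance(m, nn.Linear)}

model = nn.ModuleDict({"to_q": nn.Linear(8, 8)})
before = supported_module_names(model)        # {"to_q"}
model["to_q"] = LoRAWrapper(model["to_q"])    # inject the LoRA wrapper
after = supported_module_names(model)         # {"base_layer"} -- original name lost
```

Snapshotting `before` preserves `to_q` for adapter matching across reloads.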
    # SPDX-License-Identifier: Apache-2.0
    # SPDX-FileCopyrightText: Copyright contributors to the vLLM project

    from vllm_omni.diffusion.lora.manager import DiffusionLoRAManager
Perhaps add `LoRARequest` in `lora/request.py` for external package import.
Done, thanks for the suggestion. This makes sense since `LoRARequest` is a user-facing class. One can now use `from vllm_omni.lora.request import LoRARequest`.
There are other scripts using the `LoRARequest` from vLLM, such as `input_processor.py`, `async_omni.py` and `serving_chat.py`; may unify them as well.
Done, thanks for the observation.
Also, per discussion, added an in-house `LoRAConfig` within vllm-omni as well.
In `config/__init__.py`, better to add `from vllm_omni.config.lora import LoRAConfig`.
done
PR is ready @SamitHuang, thanks :-)
    import pytest
    import torch

    pytest.importorskip("flash_attn", reason="flash_attn is not installed")
Remove this. I'll enable this test later.
    @@ -0,0 +1,376 @@
    # SPDX-License-Identifier: Apache-2.0
    # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
If this test only needs cpu, please add this to https://github.com/vllm-project/vllm-omni/blob/main/.buildkite/scripts/simple_test.sh
Good suggestion. Added
    @@ -0,0 +1,152 @@
    # SPDX-License-Identifier: Apache-2.0
Could you add a test for this file?
Added unit tests in tests/diffusion/lora/test_base_linear.py (multi-slice apply, reset fast-path to skip matmuls when inactive, and inactive-slice behavior).
vllm_omni/diffusion/lora/utils.py
Outdated
    # Known packed projections: accept their separate counterparts.
    packed_expansions: dict[str, list[str]] = {
        # diffusion: fused QKV
        "to_qkv": ["to_q", "to_k", "to_v"],
Why do we need this hard-coding here?
The hard-coded packed→submodule expansion is needed to support diffusion models with fused projections (e.g. to_qkv, w13) while many PEFT/diffusers LoRA checkpoints are saved against the logical sub-projections (e.g. to_q/to_k/to_v, w1/w3).
We pass expected_lora_modules into LoRAModel.from_local_checkpoint to filter loaded weights; without expanding these names, those submodule keys would be dropped at load time and the LoRA would never be applied. The mapping is intentionally small and only takes effect when the packed module exists in the model, so the impact is contained.
My concern is that this solution won't scale well or remain transparent when we encounter a new packed layer in a future model.
I refactored this to follow vLLM’s packed_modules_mapping pattern: the packed→sublayer mapping now lives with each diffusion transformer implementation (e.g. to_qkv -> [to_q, to_k, to_v], add_kv_proj -> [...], w13 -> [w1, w3]), instead of being hard-coded in the LoRA framework. DiffusionLoRAManager collects packed_modules_mapping from the pipeline modules at init and uses it to:
- expand expected_lora_modules so LoRA keys saved against sub-projections are not dropped at load time, and
- map per-sublayer LoRA weights onto packed LoRA layers during target-module matching.
This makes new packed layers explicit and transparent: adding support is done next to the model code (similar to how we already maintain `stacked_params_mapping` in `load_weights()`), without changing the LoRA core logic.
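The refactored pattern can be sketched as follows. The class name and helper are illustrative only (the real mapping lives on each diffusion transformer implementation), and `add_kv_proj` is omitted since its sub-projection names aren't spelled out here.

```python
class QwenImageTransformerSketch:
    # packed projection -> the sub-projections that PEFT/diffusers
    # LoRA checkpoints are typically saved against
    packed_modules_mapping = {
        "to_qkv": ["to_q", "to_k", "to_v"],
        "w13": ["w1", "w3"],
    }

def expand_expected_modules(modules: set[str], mapping: dict[str, list[str]]) -> set[str]:
    """Accept sub-projection LoRA keys for every packed module present in the model,
    so those keys are not dropped when filtering a loaded checkpoint."""
    expanded = set(modules)
    for packed, subs in mapping.items():
        if packed in modules:  # only expand mappings the model actually uses
            expanded.update(subs)
    return expanded

expected = expand_expected_modules(
    {"to_qkv", "to_out"}, QwenImageTransformerSketch.packed_modules_mapping)
# {"to_qkv", "to_out", "to_q", "to_k", "to_v"}
```

Because the mapping is a plain class attribute, adding a new packed layer means editing one dict next to the model definition rather than the LoRA framework.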
@AndyZhou952 Final question: can we also add test results on QwenImageLightning? It's a typical timestep-distilled LoRA model and is important in the Q1 roadmap. https://huggingface.co/lightx2v/Qwen-Image-Lightning
Currently, the LoRA support in vLLM-Omni only supports PEFT loading. I think we can consider making this available in a separate PR if such support is needed (quite a bit of refactoring may be needed). We keep the behavior consistent with base vLLM for this PR for now.
Okay, let's support it in the next PR.
LGTM, thanks for the contribution!
    # SPDX-License-Identifier: Apache-2.0
    # SPDX-FileCopyrightText: Copyright contributors to the vLLM project

    """
The intention is to delete only the newly added line, as opposed to removing the entire file.
    # -- typically a transformer layer
    # used for torch compile optimizations
    _repeated_blocks = ["QwenImageTransformerBlock"]
    packed_modules_mapping = {
I don't think it's a good idea to put it here, because we also have something like this in the `load_weights` function for every model.
I'll open a new PR to do it.
This is a joint work by @AndyZhou952 and @dongbo910220.
Design doc here.
Purpose:
Following Issue #281 and PR #657, this PR adds diffusion LoRA Adapter Support (PEFT compatible) for vLLM alignment.
This PR reuses the LoRA RPC logic from #657 (thanks to @dongbo910220's implementation), while utilizing the vLLM self-defined layer for LoRA support and PEFT format to be incorporated in the reinforcement training pipeline.
How vLLM adds LoRA support:
Three steps: (1) initialization; (2) per-request; (3) inference (via vLLM self-defined LoRA layers, with calculations in `forward`).

vLLM-Omni PEFT LoRA integration design:
Besides `add_lora` and `remove_lora`, we also support `pin_lora` and `list_lora` as public APIs to be consistent with the vLLM base behavior.

Design principles:
Design choices:
- We reuse `PEFTHelper` from vLLM, which looks for the file `adapter_config.json` when loading LoRAs.
- `DiffusionLoRAManager` does not inherit from `LoRAManager`, since (1) `LoRAManager` is LLM-centric, with redundant variables in `__init__` for diffusion models; (2) the component-based nature of diffusion requires separate treatment.
- LoRA management is kept within `DiffusionLoRAManager` to keep things compact; there is also no need for a separate `WorkerLoRAManager`, as vllm-omni has a `gpu_worker` that does the job. LRU cache management is likewise kept within `DiffusionLoRAManager`.
- In vLLM's `BaseLinearLayerWithLoRA`, the calculation is done in `self.punica_wrapper.add_linear_layer()`. Note that `punica_wrapper` is used for multiple-LoRA management; in most diffusion use cases, having one LoRA is sufficient. One issue with vLLM's `BaseLinearLayerWithLoRA` is that it is too closely tied to `punica_wrapper`, while the current `punica_wrapper` implementation is not really suitable for the diffusion use case. As a temporary workaround, we define `class DiffusionBaseLinearWithLoRA(BaseLinearLayerWithLoRA)` and rewrite `apply` (where the LoRA calculation happens) to eliminate the dependency on `punica_wrapper`.

Functions/variables/classes reused from vLLM:
- `LoRARequest` request structure.
- `get_supported_lora_modules`, `get_adapter_absolute_path`, `PEFTHelper.from_local_dir`, `LoRAModel.from_local_checkpoint`, and `LoRALayerWeights.optimize` for scaling, used in `_load_adapter`.
- `LoRAConfig`, `from_layer`, `replace_submodule`, and `BaseLayerWithLoRA.set_mapping` in `_replace_layers_with_lora` to substitute with vLLM self-defined LoRA layers.

Current limitations:
Test Plan:
Start server with SD3.5 and LoRA:
python -m vllm_omni.entrypoints.openai.api_server \
    --model stabilityai/stable-diffusion-3.5-medium \
    --lora-dirs /path/to/lora-test
Send request with LoRA:
curl -X POST http://localhost:8000/v1/images/generations \
    -H "Content-Type: application/json" \
    -d '{"prompt": "A whimsical hand-drawn animation still of a small countryside train station at sunset, warm golden light, lush greenery, soft watercolor textures, highly detailed, sharp focus", "lora": {"name": "rafadan", "local_path": "/path/to/lora.safetensors", "scale": 1.0}}'
Test Result:
1024x1024, steps=30, seed=42
1024x1024, steps=30, seed=42, lora scale = 1.0
Co-authored-by: dongbo910220
Essential Elements of an Effective PR Description Checklist
- `supported_models.md` and `examples` for a new model.