[Feature] Diffusion LoRA Adapter Support (PEFT compatible) for vLLM alignment #758
Merged
Commits (65 total; changes shown from 63)
- 2572e82 peft lora support (AndyZhou952)
- 2387d27 add logging (AndyZhou952)
- 4a9a0b1 fix add_kv_proj, static load (AndyZhou952)
- cf2890a Merge branch 'vllm-project:main' into peft_lora (AndyZhou952)
- 4f60ab0 fix evict_if_needed (AndyZhou952)
- 146cca4 fix (AndyZhou952)
- 168e507 add_lora/remove_lora apis, unite static/dynamic loading (AndyZhou952)
- 1658fe6 Fix diffusion weight index path for subfolders (dongbo910220)
- 4dc5db7 Merge pull request #1 from dongbo910220/peft_lora (AndyZhou952)
- ea49d01 Add LoRA list/pin APIs for diffusion (dongbo910220)
- 898018e add_adapter renaming (AndyZhou952)
- 05b7743 fix typo (AndyZhou952)
- 62732e8 offline example (AndyZhou952)
- aea5376 simplify logic, vllm_omni lora; README (AndyZhou952)
- d0abb9e fix - single lora attempt w/ punica_wrapper (AndyZhou952)
- e2c6db1 fix naming (AndyZhou952)
- 05e1e52 fix dim (AndyZhou952)
- 6c01e51 diffusion self-defined layers (AndyZhou952)
- f701e27 rearrange utils (AndyZhou952)
- ba1bb2d Merge pull request #2 from AndyZhou952/peft_lora_wrapper (AndyZhou952)
- 5a13fa4 in house LoRAConfig in vllm-omni (AndyZhou952)
- 898a5a5 LoRARequest unifying substitution (AndyZhou952)
- 3f696b9 LoRAConfig in init (AndyZhou952)
- d69ae54 update variable naming for clarity (AndyZhou952)
- 989e04f Diffusion LoRA: fix packed layers without punica (dongbo910220)
- f07d957 Examples: add online diffusion LoRA inference (dongbo910220)
- c5804e7 diffusion/lora: stabilize target modules for LoRA reload (dongbo910220)
- f6788cc openai: support diffusion LoRA for AsyncOmni (dongbo910220)
- e19412a openai: fix /v1/models in pure diffusion mode (dongbo910220)
- a118e9e Merge remote-tracking branch 'origin/main' into peft_lora (dongbo910220)
- 6d600eb diffusion/lora: fix config alias, stable ids, and perf (dongbo910220)
- a235c13 pre-commit: pin actionlint and fix tests (dongbo910220)
- 36218ae add examples (AndyZhou952)
- 22cf15d fix lora scale (AndyZhou952)
- 6dd0573 lora CI (AndyZhou952)
- 40ab4f8 Merge branch 'main' into peft_lora (AndyZhou952)
- 159e1d9 Merge branch 'main' into peft_lora (SamitHuang)
- 31901a4 0.14.0 rebase (AndyZhou952)
- 822f792 linting (AndyZhou952)
- 5a0bc02 configurable cpu loras (AndyZhou952)
- 90671c5 tests: use bfloat16 for diffusion LoRA manager (dongbo910220)
- 8ebf401 tests: add diffusion LoRA e2e coverage (dongbo910220)
- 5f0e5d1 tests: move diffusion LoRA e2e under offline_inference (dongbo910220)
- 03e61cb tests: make diffusion LoRA test real (dongbo910220)
- b90c172 openai: support per-request LoRA for images API (dongbo910220)
- 7b4f183 tests: allow running diffusion LoRA e2e with local SD models (dongbo910220)
- 61a3e6a tests: use bfloat16 in diffusion lora manager test (dongbo910220)
- 8b279ad diffusion/lora: fix diffusers weights index and forward base attrs (dongbo910220)
- 17d3711 separate lora testing to a separate py (AndyZhou952)
- 931650e linting (AndyZhou952)
- 474fd98 ci/e2e: stabilize diffusion images LoRA tests (dongbo910220)
- 8d61328 tests/e2e: remove LoRA test env overrides (dongbo910220)
- c800870 tests/e2e: drop --enforce-eager from images LoRA test (dongbo910220)
- 8641f15 cleanup: drop vLLM version-compat imports (dongbo910220)
- fa8c2f3 cleanup: align vLLM imports with origin/main (dongbo910220)
- e668f33 diffusion/lora: import LoRAModel from vllm 0.14 (dongbo910220)
- 1c4a9da tests/e2e: don't blanket-skip diffusion LoRA on ROCm (dongbo910220)
- 950e388 LoRARequest import consistency from vllm_omni (AndyZhou952)
- ffed231 tests: add diffusion LoRA unit coverage (dongbo910220)
- 4b02161 tests: move diffusion LoRA tests under diffusion/lora (dongbo910220)
- e883f97 tests: reorganize diffusion LoRA unit tests (dongbo910220)
- de80a98 tests: fix pre-commit formatting (dongbo910220)
- 2411d54 Merge branch 'vllm-project:main' into peft_lora (AndyZhou952)
- 2e4a153 diffusion/lora: source packed mapping from models (dongbo910220)
- 0646f51 tests: reduce flakiness in images LoRA e2e (dongbo910220)
docs/user_guide/examples/offline_inference/lora_inference.md (107 additions, 0 deletions)

# LoRA-Inference

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/lora_inference>.

This page contains examples for using LoRA (Low-Rank Adaptation) adapters with vLLM-omni diffusion models for offline inference.
The examples use `stabilityai/stable-diffusion-3.5-medium` as the default model, but you can replace it with other models supported by vLLM-omni.
## Overview

Similar to vLLM, vLLM-omni uses a unified LoRA handling mechanism:

- **Pre-loaded LoRA**: loaded at initialization via `--lora-path` (pre-loaded into the cache)
- **Per-request LoRA**: loaded on demand; in the example, the LoRA is supplied via `--lora-request-path` in each request

Both approaches use the same underlying mechanism: all LoRA adapters are handled uniformly through `set_active_adapter()`. If no LoRA request is provided in a request, all adapters are deactivated.
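The unified activation path described above can be sketched as follows. This is an illustrative model, not the actual vLLM-omni implementation; apart from the `set_active_adapter()` name, all class and field names here are assumptions:

```python
# Illustrative sketch of unified LoRA activation. Only set_active_adapter()
# is named in the docs; the registry structure is hypothetical.
class AdapterRegistry:
    def __init__(self):
        self.cache = {}      # adapter_id -> loaded adapter weights
        self.active = None   # currently active adapter_id, or None

    def set_active_adapter(self, adapter_id):
        """Activate one cached adapter, or deactivate all when None."""
        if adapter_id is None:
            self.active = None   # no LoRA request -> base model only
            return
        if adapter_id not in self.cache:
            raise KeyError(f"adapter {adapter_id} not loaded")
        self.active = adapter_id

registry = AdapterRegistry()
registry.cache[42] = "dummy-weights"   # stands in for real tensors
registry.set_active_adapter(42)        # per-request activation
registry.set_active_adapter(None)      # request without LoRA -> deactivated
```

Both the pre-loaded and per-request paths end up calling the same activation entry point, which is why requests without a LoRA simply deactivate everything.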
## Usage

### Pre-loaded LoRA (via --lora-path)

Load a LoRA adapter at initialization. The adapter is pre-loaded into the cache and can be activated by requests:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --lora-path /path/to/lora/ \
    --lora-scale 1.0 \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_preloaded.png
```

**Note**: When using `--lora-path`, the adapter is loaded at init time with a stable ID derived from the adapter path. This example activates it automatically for the request.
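The note above mentions "a stable ID derived from the adapter path". One plausible way to derive such an ID is to hash the normalized path; this is a sketch of the idea, not necessarily how vLLM-omni computes it:

```python
import hashlib
import os

def stable_lora_id(adapter_path: str) -> int:
    """Derive a deterministic positive int ID from an adapter path.

    Hypothetical helper: vLLM-omni's actual derivation may differ. The
    key property is that the same path always maps to the same ID, so
    repeated loads of one adapter do not create duplicate cache entries.
    """
    norm = os.path.normpath(adapter_path)        # "/a/b/" and "/a/b" match
    digest = hashlib.sha256(norm.encode("utf-8")).digest()
    # Keep the value in a positive 31-bit range so it fits int ID fields.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF

print(stable_lora_id("/path/to/lora/"))  # same path -> same ID across runs
```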
### Per-request LoRA (via --lora-request-path)

Load a LoRA adapter on demand for each request:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --lora-request-path /path/to/lora/ \
    --lora-scale 1.0 \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_per_request.png
```
### No LoRA

If no LoRA request is provided, the base model is used without any LoRA adapters:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_no_lora.png
```
## Parameters

### LoRA Parameters

- `--lora-path`: Path to a LoRA adapter folder to pre-load at initialization (loaded into the cache with a stable ID derived from the path)
- `--lora-request-path`: Path to a LoRA adapter folder for per-request loading
- `--lora-request-id`: Integer ID for the LoRA adapter (optional). If not provided and `--lora-request-path` is set, a stable ID is derived from the path.
- `--lora-scale`: Scale factor for the LoRA weights (default: 1.0). Higher values increase the influence of the LoRA adapter.

### Standard Parameters

- `--prompt`: Text prompt for image generation (required)
- `--seed`: Random seed for reproducibility (default: 42)
- `--height`: Image height in pixels (default: 1024)
- `--width`: Image width in pixels (default: 1024)
- `--num_inference_steps`: Number of denoising steps (default: 50)
- `--output`: Output file path (default: `lora_output.png`)
## How LoRA Works

All LoRA adapters are handled uniformly:

1. **Initialization**: If `--lora-path` is provided, the adapter is loaded into the cache with a stable ID derived from the adapter path
2. **Per-request**: If `--lora-request-path` is provided, the adapter is loaded and activated for that request
3. **No LoRA**: If no LoRA request is provided (`req.lora_request` is None), all adapters are deactivated

The system uses LRU cache management: adapters are cached and evicted when the cache is full (unless pinned).
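The LRU-with-pinning behavior described above can be sketched with an `OrderedDict`. This is a toy model of the policy, assuming the semantics stated in the docs (evict least-recently-used, never evict pinned adapters), not vLLM-omni's actual cache code:

```python
from collections import OrderedDict

class LoRACache:
    """Toy LRU cache for adapters: pinned entries are never evicted."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()   # adapter_id -> weights, LRU order
        self.pinned = set()

    def get(self, adapter_id):
        self.entries.move_to_end(adapter_id)   # mark as recently used
        return self.entries[adapter_id]

    def add(self, adapter_id, weights):
        if adapter_id in self.entries:
            self.entries.move_to_end(adapter_id)
            return
        while len(self.entries) >= self.capacity:
            # Evict the least-recently-used unpinned adapter.
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:
                raise RuntimeError("cache full and all adapters pinned")
            self.entries.pop(victim)
        self.entries[adapter_id] = weights

    def pin(self, adapter_id):
        self.pinned.add(adapter_id)

cache = LoRACache(capacity=2)
cache.add(1, "w1")
cache.pin(1)
cache.add(2, "w2")
cache.add(3, "w3")   # evicts adapter 2; pinned adapter 1 survives
```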
## LoRA Adapter Format

LoRA adapters must be in PEFT (Parameter-Efficient Fine-Tuning) format. A typical LoRA adapter directory structure:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```
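Since adapters are plain PEFT directories, you can inspect `adapter_config.json` before loading to check the rank and target modules. A small sketch (the `r`, `lora_alpha`, and `target_modules` keys are standard PEFT config fields; the demo config values below are made up):

```python
import json
import os
import tempfile

def describe_adapter(adapter_dir: str) -> dict:
    """Read a PEFT adapter_config.json and return key hyperparameters."""
    with open(os.path.join(adapter_dir, "adapter_config.json")) as f:
        cfg = json.load(f)
    return {
        "rank": cfg.get("r"),
        "alpha": cfg.get("lora_alpha"),
        "target_modules": cfg.get("target_modules"),
    }

# Demo with a throwaway adapter directory and a made-up config.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "adapter_config.json"), "w") as f:
        json.dump({"r": 16, "lora_alpha": 32,
                   "target_modules": ["to_q", "to_k", "to_v"]}, f)
    print(describe_adapter(d))
```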
## Example materials

??? abstract "lora_inference.py"
    ``````py
    --8<-- "examples/offline_inference/lora_inference/lora_inference.py"
    ``````
---

(new file, 69 additions, 0 deletions)

# LoRA-Inference

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/lora_inference>.

This example shows how to use **per-request LoRA** with vLLM-Omni diffusion models via the OpenAI-compatible Chat Completions API.

> Note: The LoRA adapter path must be readable on the **server** machine (usually a local path or a mounted directory).
> Note: This example uses `/v1/chat/completions`; LoRA payloads for other OpenAI endpoints are not implemented here.
## Start Server

```bash
# Pick a diffusion model (examples)
# export MODEL=stabilityai/stable-diffusion-3.5-medium
# export MODEL=Qwen/Qwen-Image

bash run_server.sh
```
## Call API (curl)

```bash
# Required: local LoRA folder on the server
export LORA_PATH=/path/to/lora_adapter

# Optional
export SERVER=http://localhost:8091
export PROMPT="A piece of cheesecake"
export LORA_NAME=my_lora
export LORA_SCALE=1.0
# Optional: if omitted, the server derives a stable id from LORA_PATH.
# export LORA_INT_ID=123

bash run_curl_lora_inference.sh
```
## Call API (Python)

```bash
python openai_chat_client.py \
    --prompt "A piece of cheesecake" \
    --lora-path /path/to/lora_adapter \
    --lora-name my_lora \
    --lora-scale 1.0 \
    --output output.png
```
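For orientation, a bare-bones client along the lines of `openai_chat_client.py` might look like the sketch below. The endpoint and the `SERVER`/`LORA_*` values come from the docs above, but the exact LoRA payload keys (`lora_name`, `lora_path`, `lora_scale` nested under `lora_request`) are assumptions here; consult `openai_chat_client.py` for the real request shape:

```python
import json
import urllib.request

def build_lora_chat_request(prompt: str, lora_path: str,
                            lora_name: str = "my_lora",
                            lora_scale: float = 1.0) -> dict:
    """Build a chat-completions payload carrying per-request LoRA info.

    The LoRA field names below are illustrative assumptions, not a
    documented vLLM-omni schema.
    """
    return {
        "messages": [{"role": "user", "content": prompt}],
        "lora_request": {
            "lora_name": lora_name,
            "lora_path": lora_path,   # must be readable on the server
            "lora_scale": lora_scale,
        },
    }

payload = build_lora_chat_request("A piece of cheesecake",
                                  "/path/to/lora_adapter")
body = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8091/v1/chat/completions",   # SERVER from the docs
    data=body, headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment with a running server
```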
## LoRA Format

LoRA adapters should be in PEFT format, for example:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```

??? abstract "openai_chat_client.py"
    ``````py
    --8<-- "examples/online_serving/lora_inference/openai_chat_client.py"
    ``````
??? abstract "run_curl_lora_inference.sh"
    ``````sh
    --8<-- "examples/online_serving/lora_inference/run_curl_lora_inference.sh"
    ``````
??? abstract "run_server.sh"
    ``````sh
    --8<-- "examples/online_serving/lora_inference/run_server.sh"
    ``````
---

(new file, 98 additions, 0 deletions)

# LoRA Inference Examples

This directory contains examples for using LoRA (Low-Rank Adaptation) adapters with vLLM-omni diffusion models for offline inference.
The examples use `stabilityai/stable-diffusion-3.5-medium` as the default model, but you can replace it with other models supported by vLLM-omni.
## Overview

Similar to vLLM, vLLM-omni uses a unified LoRA handling mechanism:

- **Pre-loaded LoRA**: loaded at initialization via `--lora-path` (pre-loaded into the cache)
- **Per-request LoRA**: loaded on demand; in the example, the LoRA is supplied via `--lora-request-path` in each request

Both approaches use the same underlying mechanism: all LoRA adapters are handled uniformly through `set_active_adapter()`. If no LoRA request is provided in a request, all adapters are deactivated.
## Usage

### Pre-loaded LoRA (via --lora-path)

Load a LoRA adapter at initialization. The adapter is pre-loaded into the cache and can be activated by requests:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --lora-path /path/to/lora/ \
    --lora-scale 1.0 \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_preloaded.png
```

**Note**: When using `--lora-path`, the adapter is loaded at init time with a stable ID derived from the adapter path. This example activates it automatically for the request.
### Per-request LoRA (via --lora-request-path)

Load a LoRA adapter on demand for each request:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --lora-request-path /path/to/lora/ \
    --lora-scale 1.0 \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_per_request.png
```
### No LoRA

If no LoRA request is provided, the base model is used without any LoRA adapters:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_no_lora.png
```
## Parameters

### LoRA Parameters

- `--lora-path`: Path to a LoRA adapter folder to pre-load at initialization (loaded into the cache with a stable ID derived from the path)
- `--lora-request-path`: Path to a LoRA adapter folder for per-request loading
- `--lora-request-id`: Integer ID for the LoRA adapter (optional). If not provided and `--lora-request-path` is set, a stable ID is derived from the path.
- `--lora-scale`: Scale factor for the LoRA weights (default: 1.0). Higher values increase the influence of the LoRA adapter.

### Standard Parameters

- `--prompt`: Text prompt for image generation (required)
- `--seed`: Random seed for reproducibility (default: 42)
- `--height`: Image height in pixels (default: 1024)
- `--width`: Image width in pixels (default: 1024)
- `--num_inference_steps`: Number of denoising steps (default: 50)
- `--output`: Output file path (default: `lora_output.png`)
## How LoRA Works

All LoRA adapters are handled uniformly:

1. **Initialization**: If `--lora-path` is provided, the adapter is loaded into the cache with a stable ID derived from the adapter path
2. **Per-request**: If `--lora-request-path` is provided, the adapter is loaded and activated for that request
3. **No LoRA**: If no LoRA request is provided (`req.lora_request` is None), all adapters are deactivated

The system uses LRU cache management: adapters are cached and evicted when the cache is full (unless pinned).
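The three cases enumerated above amount to a small dispatch. A hypothetical sketch (only `req.lora_request` mirrors the docs; the other names are illustrative):

```python
def resolve_lora(req, registry):
    """Decide which adapter (if any) to activate for a request.

    registry maps adapter_id -> adapter path; `req.lora_request` mirrors
    the docs above, everything else here is a made-up illustration.
    """
    if req.lora_request is None:
        return None                      # case 3: deactivate all adapters
    lora_id = req.lora_request["id"]
    if lora_id in registry:
        return lora_id                   # case 1: pre-loaded, just activate
    registry[lora_id] = req.lora_request["path"]   # case 2: load on demand
    return lora_id

class Req:
    def __init__(self, lora_request=None):
        self.lora_request = lora_request

registry = {7: "/preloaded/lora"}        # adapter pre-loaded via --lora-path
print(resolve_lora(Req(), registry))                                  # None
print(resolve_lora(Req({"id": 7, "path": "/preloaded/lora"}), registry))
print(resolve_lora(Req({"id": 9, "path": "/new/lora"}), registry))
```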
## LoRA Adapter Format

LoRA adapters must be in PEFT (Parameter-Efficient Fine-Tuning) format. A typical LoRA adapter directory structure:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```
An offline test may be more suitable here for consistency
I agree that an offline test is generally more "diffusion-consistent", so I added an offline LoRA E2E test (tests/e2e/offline_inference/test_diffusion_lora.py) to cover the core engine path. That said, this PR also adds per-request LoRA parsing and switching in the Images API, and an offline test can't fully cover the end-to-end server → API → request → engine path.
So I'm keeping tests/e2e/online_serving/test_images_generations_lora.py as an API-level E2E test for that part.