[Feature] Diffusion LoRA Adapter Support (PEFT compatible) for vLLM alignment #758
Changes from 48 commits
@@ -42,6 +42,23 @@ steps:

```yaml
      volumes:
        - "/fsx/hf_cache:/fsx/hf_cache"

  - label: "Diffusion Images API LoRA E2E"
    timeout_in_minutes: 30
    depends_on: image-build
    commands:
      - pytest -s -v tests/e2e/online_serving/test_images_generations_lora.py
```
**Contributor (Author):** An offline test may be more suitable here for consistency.

**Contributor:** I agree that an offline test is generally more "diffusion-consistent", so I added an offline LoRA E2E (`tests/e2e/offline_inference/test_diffusion_lora.py`) to cover the core engine path. That said, this PR also adds per-request LoRA parsing and switching in the Images API, and an offline test can't fully cover the end-to-end server → API → request → engine path.
```yaml
    agents:
      queue: "gpu_1_queue"  # g6.4xlarge instance on AWS, has 1 L4 GPU
    plugins:
      - docker#v5.2.0:
          image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
          always-pull: true
          propagate-environment: true
          environment:
            - "HF_HOME=/fsx/hf_cache"
          volumes:
            - "/fsx/hf_cache:/fsx/hf_cache"

  - label: "Diffusion Model CPU offloading Test"
    timeout_in_minutes: 20
    depends_on: image-build
```
@@ -0,0 +1,107 @@
# LoRA-Inference

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/lora_inference>.

This page contains examples for using LoRA (Low-Rank Adaptation) adapters with vLLM-omni diffusion models for offline inference. The examples use `stabilityai/stable-diffusion-3.5-medium` as the default model, but you can replace it with other models supported by vLLM-omni.
## Overview

Similar to vLLM, vLLM-omni uses a unified LoRA handling mechanism:

- **Pre-loaded LoRA**: loaded at initialization via `--lora-path` and pre-loaded into the cache.
- **Per-request LoRA**: loaded on demand; in the example, the LoRA is supplied via `--lora-request-path` on each request.

Both approaches share the same underlying mechanism: all LoRA adapters are handled uniformly through `set_active_adapter()`. If a request carries no LoRA request, all adapters are deactivated.
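The unified handling described above can be sketched as a toy manager. Note that `AdapterManager`, `LoRARequest`, and the string-valued "weights" cache here are illustrative stand-ins under stated assumptions, not vLLM-omni's actual classes:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoRARequest:
    lora_int_id: int
    lora_path: str
    lora_scale: float = 1.0


class AdapterManager:
    """Toy manager: one activation path for both pre-loaded and per-request LoRA."""

    def __init__(self) -> None:
        self._cache: dict[int, str] = {}  # lora_int_id -> adapter weights (stub)
        self.active_id: Optional[int] = None

    def load(self, req: LoRARequest) -> None:
        # Pre-loading at init and on-demand loading land in the same cache.
        self._cache.setdefault(req.lora_int_id, f"weights@{req.lora_path}")

    def set_active_adapter(self, req: Optional[LoRARequest]) -> None:
        if req is None:
            # No LoRA request on this request: deactivate all adapters.
            self.active_id = None
            return
        self.load(req)  # on-demand load if not already cached
        self.active_id = req.lora_int_id


mgr = AdapterManager()
mgr.set_active_adapter(LoRARequest(7, "/path/to/lora"))
assert mgr.active_id == 7
mgr.set_active_adapter(None)  # base model: adapters deactivated
assert mgr.active_id is None
```

The point of the single `set_active_adapter()` entry point is that callers never need to know whether an adapter arrived at init time or with the request.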
## Usage

### Pre-loaded LoRA (via `--lora-path`)

Load a LoRA adapter at initialization. The adapter is pre-loaded into the cache and can be activated by requests:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --lora-path /path/to/lora/ \
    --lora-scale 1.0 \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_preloaded.png
```

**Note**: When using `--lora-path`, the adapter is loaded at init time with a stable ID derived from the adapter path. This example activates it automatically for the request.
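A common way to derive a stable integer ID from a path is to hash it. The sketch below shows the idea; the function name and the truncation width are assumptions, not necessarily the project's actual derivation scheme:

```python
import hashlib


def stable_lora_id(path: str) -> int:
    """Derive a deterministic, positive integer ID from an adapter path."""
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    # Truncate to 32 bits and keep it non-zero (0 often means "no adapter").
    return int(digest[:8], 16) or 1


# Same path always maps to the same ID, so repeated requests reuse the cache.
assert stable_lora_id("/path/to/lora/") == stable_lora_id("/path/to/lora/")
assert stable_lora_id("/path/a") != stable_lora_id("/path/b")
```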
### Per-request LoRA (via `--lora-request-path`)

Load a LoRA adapter on demand for each request:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --lora-request-path /path/to/lora/ \
    --lora-scale 1.0 \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_per_request.png
```
### No LoRA

If no LoRA request is provided, the base model is used without any LoRA adapters:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_no_lora.png
```
## Parameters

### LoRA Parameters

- `--lora-path`: Path to a LoRA adapter folder to pre-load at initialization (loaded into the cache with a stable ID derived from the path).
- `--lora-request-path`: Path to a LoRA adapter folder for per-request loading.
- `--lora-request-id`: Integer ID for the LoRA adapter (optional). If not provided and `--lora-request-path` is set, a stable ID is derived from the path.
- `--lora-scale`: Scale factor for LoRA weights (default: 1.0). Higher values increase the influence of the LoRA adapter.

### Standard Parameters

- `--prompt`: Text prompt for image generation (required)
- `--seed`: Random seed for reproducibility (default: 42)
- `--height`: Image height in pixels (default: 1024)
- `--width`: Image width in pixels (default: 1024)
- `--num_inference_steps`: Number of denoising steps (default: 50)
- `--output`: Output file path (default: `lora_output.png`)
## How LoRA Works

All LoRA adapters are handled uniformly:

1. **Initialization**: If `--lora-path` is provided, the adapter is loaded into the cache with a stable ID derived from the adapter path.
2. **Per-request**: If `--lora-request-path` is provided, the adapter is loaded and activated for that request.
3. **No LoRA**: If no LoRA request is provided (`req.lora_request` is `None`), all adapters are deactivated.

The system uses LRU cache management: adapters are cached and evicted when the cache is full (unless pinned).
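The LRU-with-pinning behavior can be illustrated with a small sketch. `LoRACache` and its methods here are hypothetical, not the engine's real cache class:

```python
from collections import OrderedDict


class LoRACache:
    """Toy LRU cache: evicts the least-recently-used adapter unless it is pinned."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._store: OrderedDict = OrderedDict()  # lora_id -> weights
        self._pinned: set = set()

    def put(self, lora_id, weights) -> None:
        if lora_id in self._store:
            self._store.move_to_end(lora_id)  # refresh recency
            return
        while len(self._store) >= self.capacity:
            # Evict the oldest entry that is not pinned.
            victim = next((k for k in self._store if k not in self._pinned), None)
            if victim is None:
                raise RuntimeError("cache is full of pinned adapters")
            self._store.pop(victim)
        self._store[lora_id] = weights

    def get(self, lora_id):
        self._store.move_to_end(lora_id)  # touching an entry makes it most-recent
        return self._store[lora_id]

    def pin(self, lora_id) -> None:
        self._pinned.add(lora_id)


cache = LoRACache(capacity=2)
cache.put(1, "w1")
cache.pin(1)          # pinned adapters survive eviction
cache.put(2, "w2")
cache.put(3, "w3")    # cache full: evicts 2, the LRU unpinned entry
assert 1 in cache._store and 3 in cache._store and 2 not in cache._store
```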
## LoRA Adapter Format

LoRA adapters must be in PEFT (Parameter-Efficient Fine-Tuning) format. A typical LoRA adapter directory structure:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```
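Before passing a directory to `--lora-path` or `--lora-request-path`, it can help to sanity-check that it has the expected PEFT layout. `check_peft_adapter` below is a hypothetical helper written for this page, not part of the example code:

```python
import json
from pathlib import Path


def check_peft_adapter(adapter_dir: str) -> dict:
    """Verify a PEFT-format LoRA directory and return its parsed config."""
    root = Path(adapter_dir)
    config_path = root / "adapter_config.json"
    weights_path = root / "adapter_model.safetensors"
    if not config_path.is_file():
        raise FileNotFoundError(f"missing {config_path}")
    if not weights_path.is_file():
        raise FileNotFoundError(f"missing {weights_path}")
    return json.loads(config_path.read_text())
```

Calling it on a valid adapter directory returns the adapter config (rank, alpha, target modules, and so on); a missing file fails fast with a clear error instead of a confusing load failure deep in the engine.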
## Example materials

??? abstract "lora_inference.py"
    ``````py
    --8<-- "examples/offline_inference/lora_inference/lora_inference.py"
    ``````
@@ -0,0 +1,69 @@
# LoRA-Inference

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/lora_inference>.

This example shows how to use **per-request LoRA** with vLLM-Omni diffusion models via the OpenAI-compatible Chat Completions API.

> Note: The LoRA adapter path must be readable on the **server** machine (usually a local path or a mounted directory).
> Note: This example uses `/v1/chat/completions`. LoRA payloads for other OpenAI endpoints are not implemented here.
## Start Server

```bash
# Pick a diffusion model (examples)
# export MODEL=stabilityai/stable-diffusion-3.5-medium
# export MODEL=Qwen/Qwen-Image

bash run_server.sh
```
## Call API (curl)

```bash
# Required: local LoRA folder on the server
export LORA_PATH=/path/to/lora_adapter

# Optional
export SERVER=http://localhost:8091
export PROMPT="A piece of cheesecake"
export LORA_NAME=my_lora
export LORA_SCALE=1.0
# Optional: if omitted, the server derives a stable id from LORA_PATH.
# export LORA_INT_ID=123

bash run_curl_lora_inference.sh
```
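The authoritative request schema lives in `run_curl_lora_inference.sh`. Purely as an illustration of how the environment variables above map into a request body, a payload might be assembled like this; the `lora_request` field names here are assumptions for the sketch, not the actual API contract:

```python
import os


def build_lora_chat_payload(prompt: str) -> dict:
    """Assemble an illustrative chat-completions payload with per-request LoRA.

    The key names under "lora_request" are assumed for this sketch; consult
    run_curl_lora_inference.sh for the real schema.
    """
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "lora_request": {
            "lora_name": os.environ.get("LORA_NAME", "my_lora"),
            "lora_path": os.environ["LORA_PATH"],  # required, server-local path
            "lora_scale": float(os.environ.get("LORA_SCALE", "1.0")),
        },
    }
    # LORA_INT_ID is optional; if absent the server derives a stable id.
    if "LORA_INT_ID" in os.environ:
        payload["lora_request"]["lora_int_id"] = int(os.environ["LORA_INT_ID"])
    return payload
```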
## Call API (Python)

```bash
python openai_chat_client.py \
    --prompt "A piece of cheesecake" \
    --lora-path /path/to/lora_adapter \
    --lora-name my_lora \
    --lora-scale 1.0 \
    --output output.png
```
## LoRA Format

LoRA adapters should be in PEFT format, for example:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```
??? abstract "openai_chat_client.py"
    ``````py
    --8<-- "examples/online_serving/lora_inference/openai_chat_client.py"
    ``````
??? abstract "run_curl_lora_inference.sh"
    ``````sh
    --8<-- "examples/online_serving/lora_inference/run_curl_lora_inference.sh"
    ``````
??? abstract "run_server.sh"
    ``````sh
    --8<-- "examples/online_serving/lora_inference/run_server.sh"
    ``````
@@ -0,0 +1,98 @@
# LoRA Inference Examples

This directory contains examples for using LoRA (Low-Rank Adaptation) adapters with vLLM-omni diffusion models for offline inference. The examples use `stabilityai/stable-diffusion-3.5-medium` as the default model, but you can replace it with other models supported by vLLM-omni.
## Overview

Similar to vLLM, vLLM-omni uses a unified LoRA handling mechanism:

- **Pre-loaded LoRA**: loaded at initialization via `--lora-path` and pre-loaded into the cache.
- **Per-request LoRA**: loaded on demand; in the example, the LoRA is supplied via `--lora-request-path` on each request.

Both approaches share the same underlying mechanism: all LoRA adapters are handled uniformly through `set_active_adapter()`. If a request carries no LoRA request, all adapters are deactivated.
## Usage

### Pre-loaded LoRA (via `--lora-path`)

Load a LoRA adapter at initialization. The adapter is pre-loaded into the cache and can be activated by requests:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --lora-path /path/to/lora/ \
    --lora-scale 1.0 \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_preloaded.png
```

**Note**: When using `--lora-path`, the adapter is loaded at init time with a stable ID derived from the adapter path. This example activates it automatically for the request.
### Per-request LoRA (via `--lora-request-path`)

Load a LoRA adapter on demand for each request:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --lora-request-path /path/to/lora/ \
    --lora-scale 1.0 \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_per_request.png
```
### No LoRA

If no LoRA request is provided, the base model is used without any LoRA adapters:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_no_lora.png
```
## Parameters

### LoRA Parameters

- `--lora-path`: Path to a LoRA adapter folder to pre-load at initialization (loaded into the cache with a stable ID derived from the path).
- `--lora-request-path`: Path to a LoRA adapter folder for per-request loading.
- `--lora-request-id`: Integer ID for the LoRA adapter (optional). If not provided and `--lora-request-path` is set, a stable ID is derived from the path.
- `--lora-scale`: Scale factor for LoRA weights (default: 1.0). Higher values increase the influence of the LoRA adapter.

### Standard Parameters

- `--prompt`: Text prompt for image generation (required)
- `--seed`: Random seed for reproducibility (default: 42)
- `--height`: Image height in pixels (default: 1024)
- `--width`: Image width in pixels (default: 1024)
- `--num_inference_steps`: Number of denoising steps (default: 50)
- `--output`: Output file path (default: `lora_output.png`)
## How LoRA Works

All LoRA adapters are handled uniformly:

1. **Initialization**: If `--lora-path` is provided, the adapter is loaded into the cache with a stable ID derived from the adapter path.
2. **Per-request**: If `--lora-request-path` is provided, the adapter is loaded and activated for that request.
3. **No LoRA**: If no LoRA request is provided (`req.lora_request` is `None`), all adapters are deactivated.

The system uses LRU cache management: adapters are cached and evicted when the cache is full (unless pinned).
## LoRA Adapter Format

LoRA adapters must be in PEFT (Parameter-Efficient Fine-Tuning) format. A typical LoRA adapter directory structure:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```
**Reviewer:** Does this test need 30 minutes? Can you optimize the test design?

**Author:** Will optimize the test later. In practice the test usually finishes in a few minutes; I will reduce the timeout and also optimize the runtime.

**Author:** I've shortened this step's `timeout_in_minutes` to 20 min.