Merged

Changes from 63 commits (65 commits total)
2572e82
peft lora support
AndyZhou952 Jan 12, 2026
2387d27
add logging
AndyZhou952 Jan 12, 2026
4a9a0b1
fix add_kv_proj, static load
AndyZhou952 Jan 12, 2026
cf2890a
Merge branch 'vllm-project:main' into peft_lora
AndyZhou952 Jan 13, 2026
4f60ab0
fix evict_if_needed
AndyZhou952 Jan 13, 2026
146cca4
fix
AndyZhou952 Jan 13, 2026
168e507
add_lora/remove_lora apis, unite static/dynamic loading
AndyZhou952 Jan 13, 2026
1658fe6
Fix diffusion weight index path for subfolders
dongbo910220 Jan 13, 2026
4dc5db7
Merge pull request #1 from dongbo910220/peft_lora
AndyZhou952 Jan 13, 2026
ea49d01
Add LoRA list/pin APIs for diffusion
dongbo910220 Jan 13, 2026
898018e
add_adapter renaming
AndyZhou952 Jan 14, 2026
05b7743
fix typo
AndyZhou952 Jan 14, 2026
62732e8
offline example
AndyZhou952 Jan 14, 2026
aea5376
simplify logic, vllm_omni lora; README
AndyZhou952 Jan 15, 2026
d0abb9e
fix - single lora attempt w/ punica_wrapper
AndyZhou952 Jan 15, 2026
e2c6db1
fix naming
AndyZhou952 Jan 15, 2026
05e1e52
fix dim
AndyZhou952 Jan 16, 2026
6c01e51
diffusion self-defined layers
AndyZhou952 Jan 16, 2026
f701e27
rearrange utils'
AndyZhou952 Jan 16, 2026
ba1bb2d
Merge pull request #2 from AndyZhou952/peft_lora_wrapper
AndyZhou952 Jan 16, 2026
5a13fa4
in house LoRAConfig in vllm-omni
AndyZhou952 Jan 16, 2026
898a5a5
LoRARequest unifying substitution
AndyZhou952 Jan 16, 2026
3f696b9
LoRAConfig in init
AndyZhou952 Jan 16, 2026
d69ae54
update variable naming for clarity
AndyZhou952 Jan 16, 2026
989e04f
Diffusion LoRA: fix packed layers without punica
dongbo910220 Jan 17, 2026
f07d957
Examples: add online diffusion LoRA inference
dongbo910220 Jan 17, 2026
c5804e7
diffusion/lora: stabilize target modules for LoRA reload
dongbo910220 Jan 17, 2026
f6788cc
openai: support diffusion LoRA for AsyncOmni
dongbo910220 Jan 17, 2026
e19412a
openai: fix /v1/models in pure diffusion mode
dongbo910220 Jan 17, 2026
a118e9e
Merge remote-tracking branch 'origin/main' into peft_lora
dongbo910220 Jan 18, 2026
6d600eb
diffusion/lora: fix config alias, stable ids, and perf
dongbo910220 Jan 18, 2026
a235c13
pre-commit: pin actionlint and fix tests
dongbo910220 Jan 18, 2026
36218ae
add examples
AndyZhou952 Jan 19, 2026
22cf15d
fix lora scale
AndyZhou952 Jan 19, 2026
6dd0573
lora CI
AndyZhou952 Jan 19, 2026
40ab4f8
Merge branch 'main' into peft_lora
AndyZhou952 Jan 20, 2026
159e1d9
Merge branch 'main' into peft_lora
SamitHuang Jan 21, 2026
31901a4
0.14.0 rebase
AndyZhou952 Jan 21, 2026
822f792
linting
AndyZhou952 Jan 21, 2026
5a0bc02
configurable cpu loras
AndyZhou952 Jan 21, 2026
90671c5
tests: use bfloat16 for diffusion LoRA manager
dongbo910220 Jan 21, 2026
8ebf401
tests: add diffusion LoRA e2e coverage
dongbo910220 Jan 21, 2026
5f0e5d1
tests: move diffusion LoRA e2e under offline_inference
dongbo910220 Jan 21, 2026
03e61cb
tests: make diffusion LoRA test real
dongbo910220 Jan 21, 2026
b90c172
openai: support per-request LoRA for images API
dongbo910220 Jan 21, 2026
7b4f183
tests: allow running diffusion LoRA e2e with local SD models
dongbo910220 Jan 21, 2026
61a3e6a
tests: use bfloat16 in diffusion lora manager test
dongbo910220 Jan 21, 2026
8b279ad
diffusion/lora: fix diffusers weights index and forward base attrs
dongbo910220 Jan 21, 2026
17d3711
separate lora testing to a separate py
AndyZhou952 Jan 22, 2026
931650e
linting
AndyZhou952 Jan 22, 2026
474fd98
ci/e2e: stabilize diffusion images LoRA tests
dongbo910220 Jan 22, 2026
8d61328
tests/e2e: remove LoRA test env overrides
dongbo910220 Jan 22, 2026
c800870
tests/e2e: drop --enforce-eager from images LoRA test
dongbo910220 Jan 22, 2026
8641f15
cleanup: drop vLLM version-compat imports
dongbo910220 Jan 22, 2026
fa8c2f3
cleanup: align vLLM imports with origin/main
dongbo910220 Jan 22, 2026
e668f33
diffusion/lora: import LoRAModel from vllm 0.14
dongbo910220 Jan 22, 2026
1c4a9da
tests/e2e: don't blanket-skip diffusion LoRA on ROCm
dongbo910220 Jan 22, 2026
950e388
LoRARequest import consistency from vllm_omni
AndyZhou952 Jan 23, 2026
ffed231
tests: add diffusion LoRA unit coverage
dongbo910220 Jan 25, 2026
4b02161
tests: move diffusion LoRA tests under diffusion/lora
dongbo910220 Jan 25, 2026
e883f97
tests: reorganize diffusion LoRA unit tests
dongbo910220 Jan 25, 2026
de80a98
tests: fix pre-commit formatting
dongbo910220 Jan 25, 2026
2411d54
Merge branch 'vllm-project:main' into peft_lora
AndyZhou952 Jan 26, 2026
2e4a153
diffusion/lora: source packed mapping from models
dongbo910220 Jan 26, 2026
0646f51
tests: reduce flakiness in images LoRA e2e
dongbo910220 Jan 26, 2026
17 changes: 17 additions & 0 deletions .buildkite/pipeline.yml
@@ -41,6 +41,23 @@ steps:
volumes:
- "/fsx/hf_cache:/fsx/hf_cache"

- label: "Diffusion Images API LoRA E2E"
timeout_in_minutes: 20
depends_on: image-build
commands:
- pytest -s -v tests/e2e/online_serving/test_images_generations_lora.py
Comment (Contributor Author):
An offline test may be more suitable here for consistency

Reply from @dongbo910220 (Contributor), Jan 22, 2026:
I agree that an offline test is generally more “diffusion-consistent”, so I added an offline LoRA E2E (tests/e2e/offline_inference/test_diffusion_lora.py) to cover the core engine path. That said, this PR also adds per-request lora parsing and switching in the Images API, and an offline test can’t fully cover the end-to-end server → API → request → engine path.
So I’m keeping tests/e2e/online_serving/test_images_generations_lora.py as an API-level E2E for that part.

agents:
queue: "gpu_1_queue" # g6.4xlarge instance on AWS, has 1 L4 GPU
plugins:
- docker#v5.2.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
always-pull: true
propagate-environment: true
environment:
- "HF_HOME=/fsx/hf_cache"
volumes:
- "/fsx/hf_cache:/fsx/hf_cache"

- label: "Diffusion Model CPU offloading Test"
timeout_in_minutes: 20
depends_on: image-build
1 change: 1 addition & 0 deletions .buildkite/scripts/simple_test.sh
@@ -52,6 +52,7 @@ VENV_PYTHON="${VENV_DIR}/bin/python"

"${VENV_PYTHON}" -m pytest -v -s tests/entrypoints/
"${VENV_PYTHON}" -m pytest -v -s tests/diffusion/cache/
"${VENV_PYTHON}" -m pytest -v -s tests/diffusion/lora/
"${VENV_PYTHON}" -m pytest -v -s tests/model_executor/models/qwen2_5_omni/test_audio_length.py
"${VENV_PYTHON}" -m pytest -v -s tests/worker/
"${VENV_PYTHON}" -m pytest -v -s tests/distributed/omni_connectors/test_kv_flow.py
5 changes: 4 additions & 1 deletion .pre-commit-config.yaml
@@ -29,7 +29,10 @@ repos:
# only for staged files

- repo: https://github.com/rhysd/actionlint
rev: v1.7.9
# v1.7.8+ sets `go 1.24.0` in go.mod, which older Go toolchains (and most
# current CI images) cannot parse. Pin to v1.7.7 until actionlint fixes the
# go.mod directive.
rev: v1.7.7
hooks:
- id: actionlint
files: ^\.github/workflows/.*\.ya?ml$
107 changes: 107 additions & 0 deletions docs/user_guide/examples/offline_inference/lora_inference.md
@@ -0,0 +1,107 @@
# LoRA-Inference

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/lora_inference>.

This page contains examples of using LoRA (Low-Rank Adaptation) adapters with vLLM-omni diffusion models for offline inference.
The examples use `stabilityai/stable-diffusion-3.5-medium` as the default model, but you can substitute any other diffusion model supported by vLLM-omni.

## Overview

Similar to vLLM, vLLM-omni uses a unified LoRA handling mechanism:

- **Pre-loaded LoRA**: loaded at initialization via `--lora-path` and placed in the adapter cache
- **Per-request LoRA**: loaded on demand; in this example, via `--lora-request-path` on each request

Both approaches share the same underlying mechanism: all LoRA adapters are handled uniformly through `set_active_adapter()`. If a request carries no LoRA request, all adapters are deactivated.

## Usage

### Pre-loaded LoRA (via --lora-path)

Load a LoRA adapter at initialization. This adapter is pre-loaded into the cache and can be activated by requests:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
--prompt "A piece of cheesecake" \
--lora-path /path/to/lora/ \
--lora-scale 1.0 \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output output_preloaded.png
```

**Note**: When using `--lora-path`, the adapter is loaded at init time with a stable ID derived from the adapter path. This example activates it automatically for the request.

### Per-request LoRA (via --lora-request-path)

Load a LoRA adapter on-demand for each request:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
--prompt "A piece of cheesecake" \
--lora-request-path /path/to/lora/ \
--lora-scale 1.0 \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output output_per_request.png
```

### No LoRA

If no LoRA request is provided, the base model is used without any LoRA adapters:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
--prompt "A piece of cheesecake" \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output output_no_lora.png
```

## Parameters

### LoRA Parameters

- `--lora-path`: Path to LoRA adapter folder to pre-load at initialization (loads into cache with a stable ID derived from the path)
- `--lora-request-path`: Path to LoRA adapter folder for per-request loading
- `--lora-request-id`: Optional integer ID for the LoRA adapter. If omitted while `--lora-request-path` is set, a stable ID is derived from the path.
- `--lora-scale`: Scale factor for LoRA weights (default: 1.0). Higher values increase the influence of the LoRA adapter.
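
For intuition on `--lora-scale`: a LoRA adapter adds a low-rank update to each targeted weight matrix, scaled by this factor. A generic NumPy sketch of the math (not the vLLM-omni kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 2

W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((rank, d_in))    # LoRA down-projection
B = rng.standard_normal((d_out, rank))   # LoRA up-projection


def apply_lora(W, A, B, scale):
    # Effective weight: W + scale * (B @ A); higher scale -> stronger adapter.
    return W + scale * (B @ A)


W0 = apply_lora(W, A, B, 0.0)  # scale 0 recovers the base model exactly
W1 = apply_lora(W, A, B, 1.0)
```

Scale 0 reproduces the base model, and the update grows linearly with the scale, which is why values above 1.0 amplify the adapter's influence.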

### Standard Parameters

- `--prompt`: Text prompt for image generation (required)
- `--seed`: Random seed for reproducibility (default: 42)
- `--height`: Image height in pixels (default: 1024)
- `--width`: Image width in pixels (default: 1024)
- `--num_inference_steps`: Number of denoising steps (default: 50)
- `--output`: Output file path (default: `lora_output.png`)

## How LoRA Works

All LoRA adapters are handled uniformly:

1. **Initialization**: If `--lora-path` is provided, the adapter is loaded into cache with a stable ID derived from the adapter path
2. **Per-request**: If `--lora-request-path` is provided, the adapter is loaded/activated for that request
3. **No LoRA**: If no LoRA request is provided (`req.lora_request` is None), all adapters are deactivated

The system uses LRU cache management: adapters are cached and evicted when the cache is full, unless pinned.
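
The LRU-with-pinning behavior can be illustrated with a small cache. This is a simplified sketch under stated assumptions; the real manager tracks device slots and far more state:

```python
from collections import OrderedDict


class LRUAdapterCache:
    """Toy LRU cache: pinned adapters are never evicted."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: OrderedDict[int, str] = OrderedDict()
        self.pinned: set[int] = set()

    def add(self, lora_id: int, path: str) -> None:
        if lora_id in self.entries:
            self.entries.move_to_end(lora_id)  # refresh recency
            return
        while len(self.entries) >= self.capacity:
            # Evict the least recently used *unpinned* adapter.
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:
                raise RuntimeError("cache full and all adapters are pinned")
            del self.entries[victim]
        self.entries[lora_id] = path

    def pin(self, lora_id: int) -> None:
        self.pinned.add(lora_id)
```

For example, with capacity 2, pinning adapter 1 and then adding a third adapter evicts adapter 2 rather than the pinned one.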

## LoRA Adapter Format

LoRA adapters must be in PEFT (Parameter-Efficient Fine-Tuning) format. A typical LoRA adapter directory structure:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```
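
A quick structural check for a PEFT-format adapter directory can be written as follows. This is an illustrative helper, not part of vLLM-omni:

```python
import json
from pathlib import Path


def looks_like_peft_adapter(adapter_dir: str) -> bool:
    """Check for the minimal PEFT layout: a valid JSON config plus weights."""
    root = Path(adapter_dir)
    config = root / "adapter_config.json"
    weights = root / "adapter_model.safetensors"
    if not (config.is_file() and weights.is_file()):
        return False
    try:
        json.loads(config.read_text())  # config must be valid JSON
    except json.JSONDecodeError:
        return False
    return True
```

Running this on an adapter folder before passing it to `--lora-path` can catch a misplaced or corrupted download early.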

## Example materials

??? abstract "lora_inference.py"
``````py
--8<-- "examples/offline_inference/lora_inference/lora_inference.py"
``````
69 changes: 69 additions & 0 deletions docs/user_guide/examples/online_serving/lora_inference.md
@@ -0,0 +1,69 @@
# LoRA-Inference

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/lora_inference>.

This example shows how to use **per-request LoRA** with vLLM-Omni diffusion models via the OpenAI-compatible Chat Completions API.

> Note: The LoRA adapter path must be readable on the **server** machine (usually a local path or a mounted directory).
> Note: This example uses `/v1/chat/completions`. LoRA payloads for other OpenAI endpoints are not implemented here.

## Start Server

```bash
# Pick a diffusion model (examples)
# export MODEL=stabilityai/stable-diffusion-3.5-medium
# export MODEL=Qwen/Qwen-Image

bash run_server.sh
```

## Call API (curl)

```bash
# Required: local LoRA folder on the server
export LORA_PATH=/path/to/lora_adapter

# Optional
export SERVER=http://localhost:8091
export PROMPT="A piece of cheesecake"
export LORA_NAME=my_lora
export LORA_SCALE=1.0
# Optional: if omitted, the server derives a stable id from LORA_PATH.
# export LORA_INT_ID=123

bash run_curl_lora_inference.sh
```

## Call API (Python)

```bash
python openai_chat_client.py \
--prompt "A piece of cheesecake" \
--lora-path /path/to/lora_adapter \
--lora-name my_lora \
--lora-scale 1.0 \
--output output.png
```

## LoRA Format

LoRA adapters should be in PEFT format, for example:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```

??? abstract "openai_chat_client.py"
``````py
--8<-- "examples/online_serving/lora_inference/openai_chat_client.py"
``````
??? abstract "run_curl_lora_inference.sh"
``````py
--8<-- "examples/online_serving/lora_inference/run_curl_lora_inference.sh"
``````
??? abstract "run_server.sh"
``````py
--8<-- "examples/online_serving/lora_inference/run_server.sh"
``````
98 changes: 98 additions & 0 deletions examples/offline_inference/lora_inference/README.md
@@ -0,0 +1,98 @@
# LoRA Inference Examples

This directory contains examples of using LoRA (Low-Rank Adaptation) adapters with vLLM-omni diffusion models for offline inference.
The examples use `stabilityai/stable-diffusion-3.5-medium` as the default model, but you can substitute any other diffusion model supported by vLLM-omni.

## Overview

Similar to vLLM, vLLM-omni uses a unified LoRA handling mechanism:

- **Pre-loaded LoRA**: loaded at initialization via `--lora-path` and placed in the adapter cache
- **Per-request LoRA**: loaded on demand; in this example, via `--lora-request-path` on each request

Both approaches share the same underlying mechanism: all LoRA adapters are handled uniformly through `set_active_adapter()`. If a request carries no LoRA request, all adapters are deactivated.

## Usage

### Pre-loaded LoRA (via --lora-path)

Load a LoRA adapter at initialization. This adapter is pre-loaded into the cache and can be activated by requests:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
--prompt "A piece of cheesecake" \
--lora-path /path/to/lora/ \
--lora-scale 1.0 \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output output_preloaded.png
```

**Note**: When using `--lora-path`, the adapter is loaded at init time with a stable ID derived from the adapter path. This example activates it automatically for the request.

### Per-request LoRA (via --lora-request-path)

Load a LoRA adapter on-demand for each request:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
--prompt "A piece of cheesecake" \
--lora-request-path /path/to/lora/ \
--lora-scale 1.0 \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output output_per_request.png
```

### No LoRA

If no LoRA request is provided, the base model is used without any LoRA adapters:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
--prompt "A piece of cheesecake" \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output output_no_lora.png
```

## Parameters

### LoRA Parameters

- `--lora-path`: Path to LoRA adapter folder to pre-load at initialization (loads into cache with a stable ID derived from the path)
- `--lora-request-path`: Path to LoRA adapter folder for per-request loading
- `--lora-request-id`: Optional integer ID for the LoRA adapter. If omitted while `--lora-request-path` is set, a stable ID is derived from the path.
- `--lora-scale`: Scale factor for LoRA weights (default: 1.0). Higher values increase the influence of the LoRA adapter.

### Standard Parameters

- `--prompt`: Text prompt for image generation (required)
- `--seed`: Random seed for reproducibility (default: 42)
- `--height`: Image height in pixels (default: 1024)
- `--width`: Image width in pixels (default: 1024)
- `--num_inference_steps`: Number of denoising steps (default: 50)
- `--output`: Output file path (default: `lora_output.png`)

## How LoRA Works

All LoRA adapters are handled uniformly:

1. **Initialization**: If `--lora-path` is provided, the adapter is loaded into cache with a stable ID derived from the adapter path
2. **Per-request**: If `--lora-request-path` is provided, the adapter is loaded/activated for that request
3. **No LoRA**: If no LoRA request is provided (`req.lora_request` is None), all adapters are deactivated

The system uses LRU cache management: adapters are cached and evicted when the cache is full, unless pinned.

## LoRA Adapter Format

LoRA adapters must be in PEFT (Parameter-Efficient Fine-Tuning) format. A typical LoRA adapter directory structure:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```