[Feature] Diffusion LoRA Adapter Support (PEFT compatible) for vLLM alignment #758
Merged
Commits (65 total; changes shown from 63)
- 2572e82 peft lora support (AndyZhou952)
- 2387d27 add logging (AndyZhou952)
- 4a9a0b1 fix add_kv_proj, static load (AndyZhou952)
- cf2890a Merge branch 'vllm-project:main' into peft_lora (AndyZhou952)
- 4f60ab0 fix evict_if_needed (AndyZhou952)
- 146cca4 fix (AndyZhou952)
- 168e507 add_lora/remove_lora apis, unite static/dynamic loading (AndyZhou952)
- 1658fe6 Fix diffusion weight index path for subfolders (dongbo910220)
- 4dc5db7 Merge pull request #1 from dongbo910220/peft_lora (AndyZhou952)
- ea49d01 Add LoRA list/pin APIs for diffusion (dongbo910220)
- 898018e add_adapter renaming (AndyZhou952)
- 05b7743 fix typo (AndyZhou952)
- 62732e8 offline example (AndyZhou952)
- aea5376 simplify logic, vllm_omni lora; README (AndyZhou952)
- d0abb9e fix - single lora attempt w/ punica_wrapper (AndyZhou952)
- e2c6db1 fix naming (AndyZhou952)
- 05e1e52 fix dim (AndyZhou952)
- 6c01e51 diffusion self-defined layers (AndyZhou952)
- f701e27 rearrange utils (AndyZhou952)
- ba1bb2d Merge pull request #2 from AndyZhou952/peft_lora_wrapper (AndyZhou952)
- 5a13fa4 in house LoRAConfig in vllm-omni (AndyZhou952)
- 898a5a5 LoRARequest unifying substitution (AndyZhou952)
- 3f696b9 LoRAConfig in init (AndyZhou952)
- d69ae54 update variable naming for clarity (AndyZhou952)
- 989e04f Diffusion LoRA: fix packed layers without punica (dongbo910220)
- f07d957 Examples: add online diffusion LoRA inference (dongbo910220)
- c5804e7 diffusion/lora: stabilize target modules for LoRA reload (dongbo910220)
- f6788cc openai: support diffusion LoRA for AsyncOmni (dongbo910220)
- e19412a openai: fix /v1/models in pure diffusion mode (dongbo910220)
- a118e9e Merge remote-tracking branch 'origin/main' into peft_lora (dongbo910220)
- 6d600eb diffusion/lora: fix config alias, stable ids, and perf (dongbo910220)
- a235c13 pre-commit: pin actionlint and fix tests (dongbo910220)
- 36218ae add examples (AndyZhou952)
- 22cf15d fix lora scale (AndyZhou952)
- 6dd0573 lora CI (AndyZhou952)
- 40ab4f8 Merge branch 'main' into peft_lora (AndyZhou952)
- 159e1d9 Merge branch 'main' into peft_lora (SamitHuang)
- 31901a4 0.14.0 rebase (AndyZhou952)
- 822f792 linting (AndyZhou952)
- 5a0bc02 configurable cpu loras (AndyZhou952)
- 90671c5 tests: use bfloat16 for diffusion LoRA manager (dongbo910220)
- 8ebf401 tests: add diffusion LoRA e2e coverage (dongbo910220)
- 5f0e5d1 tests: move diffusion LoRA e2e under offline_inference (dongbo910220)
- 03e61cb tests: make diffusion LoRA test real (dongbo910220)
- b90c172 openai: support per-request LoRA for images API (dongbo910220)
- 7b4f183 tests: allow running diffusion LoRA e2e with local SD models (dongbo910220)
- 61a3e6a tests: use bfloat16 in diffusion lora manager test (dongbo910220)
- 8b279ad diffusion/lora: fix diffusers weights index and forward base attrs (dongbo910220)
- 17d3711 separate lora testing to a separate py (AndyZhou952)
- 931650e linting (AndyZhou952)
- 474fd98 ci/e2e: stabilize diffusion images LoRA tests (dongbo910220)
- 8d61328 tests/e2e: remove LoRA test env overrides (dongbo910220)
- c800870 tests/e2e: drop --enforce-eager from images LoRA test (dongbo910220)
- 8641f15 cleanup: drop vLLM version-compat imports (dongbo910220)
- fa8c2f3 cleanup: align vLLM imports with origin/main (dongbo910220)
- e668f33 diffusion/lora: import LoRAModel from vllm 0.14 (dongbo910220)
- 1c4a9da tests/e2e: don't blanket-skip diffusion LoRA on ROCm (dongbo910220)
- 950e388 LoRARequest import consistency from vllm_omni (AndyZhou952)
- ffed231 tests: add diffusion LoRA unit coverage (dongbo910220)
- 4b02161 tests: move diffusion LoRA tests under diffusion/lora (dongbo910220)
- e883f97 tests: reorganize diffusion LoRA unit tests (dongbo910220)
- de80a98 tests: fix pre-commit formatting (dongbo910220)
- 2411d54 Merge branch 'vllm-project:main' into peft_lora (AndyZhou952)
- 2e4a153 diffusion/lora: source packed mapping from models (dongbo910220)
- 0646f51 tests: reduce flakiness in images LoRA e2e (dongbo910220)
docs/user_guide/examples/offline_inference/lora_inference.md (107 additions, 0 deletions)

# LoRA-Inference

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/lora_inference>.

This page contains examples for using LoRA (Low-Rank Adaptation) adapters with vLLM-omni diffusion models for offline inference.
The examples use `stabilityai/stable-diffusion-3.5-medium` as the default model, but you can replace it with other models supported by vLLM-omni.
## Overview

Similar to vLLM, vLLM-omni uses a unified LoRA handling mechanism:

- **Pre-loaded LoRA**: loaded at initialization via `--lora-path` (pre-loaded into the cache)
- **Per-request LoRA**: loaded on demand; in the example, the LoRA is supplied via `--lora-request-path` in each request

Both approaches use the same underlying mechanism: all LoRA adapters are handled uniformly through `set_active_adapter()`. If no LoRA request is provided in a request, all adapters are deactivated.
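The unified activation path described above can be sketched as follows. This is an illustrative model, not the actual vLLM-omni implementation; apart from the `set_active_adapter()` name, all class and field names here are assumptions:

```python
# Illustrative sketch of unified LoRA activation. Only set_active_adapter()
# is named in the docs; the registry structure is hypothetical.
class AdapterRegistry:
    def __init__(self):
        self.cache = {}      # adapter_id -> loaded adapter weights
        self.active = None   # currently active adapter_id, or None

    def set_active_adapter(self, adapter_id):
        """Activate one cached adapter, or deactivate all when None."""
        if adapter_id is None:
            self.active = None   # no LoRA request -> base model only
            return
        if adapter_id not in self.cache:
            raise KeyError(f"adapter {adapter_id} not loaded")
        self.active = adapter_id

registry = AdapterRegistry()
registry.cache[42] = "dummy-weights"   # stands in for real tensors
registry.set_active_adapter(42)        # per-request activation
registry.set_active_adapter(None)      # request without LoRA -> deactivated
```

Both the pre-loaded and per-request paths end up calling the same activation entry point, which is why requests without a LoRA simply deactivate everything.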
## Usage

### Pre-loaded LoRA (via --lora-path)

Load a LoRA adapter at initialization. The adapter is pre-loaded into the cache and can be activated by requests:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --lora-path /path/to/lora/ \
    --lora-scale 1.0 \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_preloaded.png
```

**Note**: When using `--lora-path`, the adapter is loaded at init time with a stable ID derived from the adapter path. This example activates it automatically for the request.
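The note above mentions "a stable ID derived from the adapter path". One plausible way to derive such an ID is to hash the normalized path; this is a sketch of the idea, not necessarily how vLLM-omni computes it:

```python
import hashlib
import os

def stable_lora_id(adapter_path: str) -> int:
    """Derive a deterministic positive int ID from an adapter path.

    Hypothetical helper: vLLM-omni's actual derivation may differ. The
    key property is that the same path always maps to the same ID, so
    repeated loads of one adapter do not create duplicate cache entries.
    """
    norm = os.path.normpath(adapter_path)        # "/a/b/" and "/a/b" match
    digest = hashlib.sha256(norm.encode("utf-8")).digest()
    # Keep the value in a positive 31-bit range so it fits int ID fields.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF

print(stable_lora_id("/path/to/lora/"))  # same path -> same ID across runs
```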
### Per-request LoRA (via --lora-request-path)

Load a LoRA adapter on demand for each request:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --lora-request-path /path/to/lora/ \
    --lora-scale 1.0 \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_per_request.png
```
### No LoRA

If no LoRA request is provided, the base model is used without any LoRA adapters:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_no_lora.png
```
## Parameters

### LoRA Parameters

- `--lora-path`: Path to a LoRA adapter folder to pre-load at initialization (loaded into the cache with a stable ID derived from the path)
- `--lora-request-path`: Path to a LoRA adapter folder for per-request loading
- `--lora-request-id`: Integer ID for the LoRA adapter (optional). If not provided and `--lora-request-path` is set, a stable ID is derived from the path.
- `--lora-scale`: Scale factor for the LoRA weights (default: 1.0). Higher values increase the influence of the LoRA adapter.

### Standard Parameters

- `--prompt`: Text prompt for image generation (required)
- `--seed`: Random seed for reproducibility (default: 42)
- `--height`: Image height in pixels (default: 1024)
- `--width`: Image width in pixels (default: 1024)
- `--num_inference_steps`: Number of denoising steps (default: 50)
- `--output`: Output file path (default: `lora_output.png`)
## How LoRA Works

All LoRA adapters are handled uniformly:

1. **Initialization**: If `--lora-path` is provided, the adapter is loaded into the cache with a stable ID derived from the adapter path
2. **Per-request**: If `--lora-request-path` is provided, the adapter is loaded and activated for that request
3. **No LoRA**: If no LoRA request is provided (`req.lora_request` is None), all adapters are deactivated

The system uses LRU cache management: adapters are cached and evicted when the cache is full (unless pinned).
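The LRU-with-pinning behavior described above can be sketched with an `OrderedDict`. This is a toy model of the policy, assuming the semantics stated in the docs (evict least-recently-used, never evict pinned adapters), not vLLM-omni's actual cache code:

```python
from collections import OrderedDict

class LoRACache:
    """Toy LRU cache for adapters: pinned entries are never evicted."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()   # adapter_id -> weights, LRU order
        self.pinned = set()

    def get(self, adapter_id):
        self.entries.move_to_end(adapter_id)   # mark as recently used
        return self.entries[adapter_id]

    def add(self, adapter_id, weights):
        if adapter_id in self.entries:
            self.entries.move_to_end(adapter_id)
            return
        while len(self.entries) >= self.capacity:
            # Evict the least-recently-used unpinned adapter.
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:
                raise RuntimeError("cache full and all adapters pinned")
            self.entries.pop(victim)
        self.entries[adapter_id] = weights

    def pin(self, adapter_id):
        self.pinned.add(adapter_id)

cache = LoRACache(capacity=2)
cache.add(1, "w1")
cache.pin(1)
cache.add(2, "w2")
cache.add(3, "w3")   # evicts adapter 2; pinned adapter 1 survives
```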
## LoRA Adapter Format

LoRA adapters must be in PEFT (Parameter-Efficient Fine-Tuning) format. A typical LoRA adapter directory structure:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```
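Since adapters are plain PEFT directories, you can inspect `adapter_config.json` before loading to check the rank and target modules. A small sketch (the `r`, `lora_alpha`, and `target_modules` keys are standard PEFT config fields; the demo config values below are made up):

```python
import json
import os
import tempfile

def describe_adapter(adapter_dir: str) -> dict:
    """Read a PEFT adapter_config.json and return key hyperparameters."""
    with open(os.path.join(adapter_dir, "adapter_config.json")) as f:
        cfg = json.load(f)
    return {
        "rank": cfg.get("r"),
        "alpha": cfg.get("lora_alpha"),
        "target_modules": cfg.get("target_modules"),
    }

# Demo with a throwaway adapter directory and a made-up config.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "adapter_config.json"), "w") as f:
        json.dump({"r": 16, "lora_alpha": 32,
                   "target_modules": ["to_q", "to_k", "to_v"]}, f)
    print(describe_adapter(d))
```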
## Example materials

??? abstract "lora_inference.py"
    ``````py
    --8<-- "examples/offline_inference/lora_inference/lora_inference.py"
    ``````
---

(new file, 69 additions, 0 deletions)

# LoRA-Inference

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/lora_inference>.

This example shows how to use **per-request LoRA** with vLLM-Omni diffusion models via the OpenAI-compatible Chat Completions API.

> Note: The LoRA adapter path must be readable on the **server** machine (usually a local path or a mounted directory).
> Note: This example uses `/v1/chat/completions`; LoRA payloads for other OpenAI endpoints are not implemented here.
## Start Server

```bash
# Pick a diffusion model (examples)
# export MODEL=stabilityai/stable-diffusion-3.5-medium
# export MODEL=Qwen/Qwen-Image

bash run_server.sh
```
## Call API (curl)

```bash
# Required: local LoRA folder on the server
export LORA_PATH=/path/to/lora_adapter

# Optional
export SERVER=http://localhost:8091
export PROMPT="A piece of cheesecake"
export LORA_NAME=my_lora
export LORA_SCALE=1.0
# Optional: if omitted, the server derives a stable id from LORA_PATH.
# export LORA_INT_ID=123

bash run_curl_lora_inference.sh
```
## Call API (Python)

```bash
python openai_chat_client.py \
    --prompt "A piece of cheesecake" \
    --lora-path /path/to/lora_adapter \
    --lora-name my_lora \
    --lora-scale 1.0 \
    --output output.png
```
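For orientation, a bare-bones client along the lines of `openai_chat_client.py` might look like the sketch below. The endpoint and the `SERVER`/`LORA_*` values come from the docs above, but the exact LoRA payload keys (`lora_name`, `lora_path`, `lora_scale` nested under `lora_request`) are assumptions here; consult `openai_chat_client.py` for the real request shape:

```python
import json
import urllib.request

def build_lora_chat_request(prompt: str, lora_path: str,
                            lora_name: str = "my_lora",
                            lora_scale: float = 1.0) -> dict:
    """Build a chat-completions payload carrying per-request LoRA info.

    The LoRA field names below are illustrative assumptions, not a
    documented vLLM-omni schema.
    """
    return {
        "messages": [{"role": "user", "content": prompt}],
        "lora_request": {
            "lora_name": lora_name,
            "lora_path": lora_path,   # must be readable on the server
            "lora_scale": lora_scale,
        },
    }

payload = build_lora_chat_request("A piece of cheesecake",
                                  "/path/to/lora_adapter")
body = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8091/v1/chat/completions",   # SERVER from the docs
    data=body, headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment with a running server
```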
## LoRA Format

LoRA adapters should be in PEFT format, for example:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```

??? abstract "openai_chat_client.py"
    ``````py
    --8<-- "examples/online_serving/lora_inference/openai_chat_client.py"
    ``````
??? abstract "run_curl_lora_inference.sh"
    ``````sh
    --8<-- "examples/online_serving/lora_inference/run_curl_lora_inference.sh"
    ``````
??? abstract "run_server.sh"
    ``````sh
    --8<-- "examples/online_serving/lora_inference/run_server.sh"
    ``````
---

(new file, 98 additions, 0 deletions)

# LoRA Inference Examples

This directory contains examples for using LoRA (Low-Rank Adaptation) adapters with vLLM-omni diffusion models for offline inference.
The examples use `stabilityai/stable-diffusion-3.5-medium` as the default model, but you can replace it with other models supported by vLLM-omni.
## Overview

Similar to vLLM, vLLM-omni uses a unified LoRA handling mechanism:

- **Pre-loaded LoRA**: loaded at initialization via `--lora-path` (pre-loaded into the cache)
- **Per-request LoRA**: loaded on demand; in the example, the LoRA is supplied via `--lora-request-path` in each request

Both approaches use the same underlying mechanism: all LoRA adapters are handled uniformly through `set_active_adapter()`. If no LoRA request is provided in a request, all adapters are deactivated.
## Usage

### Pre-loaded LoRA (via --lora-path)

Load a LoRA adapter at initialization. The adapter is pre-loaded into the cache and can be activated by requests:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --lora-path /path/to/lora/ \
    --lora-scale 1.0 \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_preloaded.png
```

**Note**: When using `--lora-path`, the adapter is loaded at init time with a stable ID derived from the adapter path. This example activates it automatically for the request.
### Per-request LoRA (via --lora-request-path)

Load a LoRA adapter on demand for each request:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --lora-request-path /path/to/lora/ \
    --lora-scale 1.0 \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_per_request.png
```
### No LoRA

If no LoRA request is provided, the base model is used without any LoRA adapters:

```bash
python -m examples.offline_inference.lora_inference.lora_inference \
    --prompt "A piece of cheesecake" \
    --num_inference_steps 50 \
    --height 1024 \
    --width 1024 \
    --output output_no_lora.png
```
## Parameters

### LoRA Parameters

- `--lora-path`: Path to a LoRA adapter folder to pre-load at initialization (loaded into the cache with a stable ID derived from the path)
- `--lora-request-path`: Path to a LoRA adapter folder for per-request loading
- `--lora-request-id`: Integer ID for the LoRA adapter (optional). If not provided and `--lora-request-path` is set, a stable ID is derived from the path.
- `--lora-scale`: Scale factor for the LoRA weights (default: 1.0). Higher values increase the influence of the LoRA adapter.

### Standard Parameters

- `--prompt`: Text prompt for image generation (required)
- `--seed`: Random seed for reproducibility (default: 42)
- `--height`: Image height in pixels (default: 1024)
- `--width`: Image width in pixels (default: 1024)
- `--num_inference_steps`: Number of denoising steps (default: 50)
- `--output`: Output file path (default: `lora_output.png`)
## How LoRA Works

All LoRA adapters are handled uniformly:

1. **Initialization**: If `--lora-path` is provided, the adapter is loaded into the cache with a stable ID derived from the adapter path
2. **Per-request**: If `--lora-request-path` is provided, the adapter is loaded and activated for that request
3. **No LoRA**: If no LoRA request is provided (`req.lora_request` is None), all adapters are deactivated

The system uses LRU cache management: adapters are cached and evicted when the cache is full (unless pinned).
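The three cases enumerated above amount to a small dispatch. A hypothetical sketch (only `req.lora_request` mirrors the docs; the other names are illustrative):

```python
def resolve_lora(req, registry):
    """Decide which adapter (if any) to activate for a request.

    registry maps adapter_id -> adapter path; `req.lora_request` mirrors
    the docs above, everything else here is a made-up illustration.
    """
    if req.lora_request is None:
        return None                      # case 3: deactivate all adapters
    lora_id = req.lora_request["id"]
    if lora_id in registry:
        return lora_id                   # case 1: pre-loaded, just activate
    registry[lora_id] = req.lora_request["path"]   # case 2: load on demand
    return lora_id

class Req:
    def __init__(self, lora_request=None):
        self.lora_request = lora_request

registry = {7: "/preloaded/lora"}        # adapter pre-loaded via --lora-path
print(resolve_lora(Req(), registry))                                  # None
print(resolve_lora(Req({"id": 7, "path": "/preloaded/lora"}), registry))
print(resolve_lora(Req({"id": 9, "path": "/new/lora"}), registry))
```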
## LoRA Adapter Format

LoRA adapters must be in PEFT (Parameter-Efficient Fine-Tuning) format. A typical LoRA adapter directory structure:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```
An offline test may be more suitable here for consistency
I agree that an offline test is generally more "diffusion-consistent", so I added an offline LoRA E2E test (tests/e2e/offline_inference/test_diffusion_lora.py) to cover the core engine path. That said, this PR also adds per-request LoRA parsing and switching in the Images API, and an offline test can't fully cover the end-to-end server → API → request → engine path.
So I'm keeping tests/e2e/online_serving/test_images_generations_lora.py as an API-level E2E test for that part.