
Conversation

@andylolu2 (Contributor) commented Sep 30, 2025

Purpose

Currently, enabling LoRA (with CUDA graphs, which are necessary for reasonable speed) adds overhead to the normal inference path, even if there are no active LoRA adapters. This is because we currently only capture CUDA graphs with the LoRA operations included.

In this PR, I make some small changes to capture a different set of CUDA graphs for when there are no active LoRA adapters, so we get exactly the same speed as normal inference when there are no active LoRA requests.

Implementation

  • Added a new has_lora attribute to BatchDescriptor.
  • Capture two sets of CUDA graphs during graph capture: one with LoRA ops and one without.
  • At runtime, dispatch to the graphs with or without LoRA ops based on len(self.input_batch.lora_id_to_lora_request) > 0 (see the sketch after this list).
  • Move the .zero() of the intermediate LoRA buffer inside lora_shrink, so it is skipped when there are no active LoRAs.
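
A minimal sketch of the capture/dispatch idea described above, with simplified names and illustrative capture sizes (the real changes touch BatchDescriptor, the cudagraph dispatcher, and the GPU model runner, and differ in detail):

    from dataclasses import dataclass
    from itertools import product

    @dataclass(frozen=True)
    class BatchDescriptor:
        num_tokens: int
        uniform_decode: bool = False
        has_lora: bool = False  # new field: selects the LoRA vs. non-LoRA graph set

    # Capture time: one graph per (batch size, has_lora) combination.
    capture_keys = [
        BatchDescriptor(num_tokens=bs, has_lora=has_lora)
        for bs, has_lora in product([1, 2, 4, 8], [True, False])
    ]

    # Runtime: dispatch to the LoRA-specialized graphs only when adapters are active.
    def descriptor_for(num_tokens: int, lora_id_to_lora_request: dict) -> BatchDescriptor:
        return BatchDescriptor(
            num_tokens=num_tokens,
            has_lora=len(lora_id_to_lora_request) > 0,
        )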

Test Plan

Show that LoRA still works functionally, but has zero overhead when there are no active LoRAs.

Test Result

--enable-lora overhead reduced from 10.5% to 1.4%. I compared the kernels launched and they are identical to the --no-enable-lora case when there are no active LoRAs, so I suspect the remaining 1.4% overhead is just from additional CPU-side logic (see the quick calculation after the numbers below).

Baseline

$ vllm bench latency --model meta-llama/Llama-2-7b-hf
Avg latency: 0.8444958463311195 seconds
10% percentile latency: 0.8407110057771205 seconds
25% percentile latency: 0.8423566690180451 seconds
50% percentile latency: 0.8437296429183334 seconds
75% percentile latency: 0.8471306945430115 seconds
90% percentile latency: 0.8484105261275545 seconds
99% percentile latency: 0.851516973322723 seconds

Before PR

$ vllm bench latency --model meta-llama/Llama-2-7b-hf --enable-lora
Avg latency: 0.9335442642603691 seconds
10% percentile latency: 0.9302553807385265 seconds
25% percentile latency: 0.9311426649801433 seconds
50% percentile latency: 0.9327256239484996 seconds
75% percentile latency: 0.9367578914971091 seconds
90% percentile latency: 0.9377115628449246 seconds
99% percentile latency: 0.9390150624723174 seconds

After PR

$ vllm bench latency --model meta-llama/Llama-2-7b-hf --enable-lora
Avg latency: 0.856378769610698 seconds
10% percentile latency: 0.8511904217069969 seconds
25% percentile latency: 0.8531599651905708 seconds
50% percentile latency: 0.8564562875544652 seconds
75% percentile latency: 0.8592897556954995 seconds
90% percentile latency: 0.8603642771951854 seconds
99% percentile latency: 0.8627826083195396 seconds
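
For reference, the overhead figures quoted in the test result follow directly from the average latencies above (a quick check, not part of the PR):

    baseline = 0.8444958463311195   # --no-enable-lora
    before_pr = 0.9335442642603691  # --enable-lora, before this PR
    after_pr = 0.856378769610698    # --enable-lora, after this PR

    print(f"overhead before PR: {(before_pr / baseline - 1) * 100:.1f}%")  # ~10.5%
    print(f"overhead after PR:  {(after_pr / baseline - 1) * 100:.1f}%")   # ~1.4%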

Functionality test

pytest -svx tests/lora/test_llama_tp.py

still passes (it uses CUDA graphs).



@andylolu2 andylolu2 changed the title Specialized LoRA cuda graph [Feature] Specialized LoRA cuda graph Sep 30, 2025
@mergify mergify bot added the v1 label Sep 30, 2025
mergify bot commented Sep 30, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @andylolu2.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 30, 2025
@andylolu2 andylolu2 force-pushed the andy/lora-cuda-graphs-specialization branch from a0ed199 to 00d1328 on October 1, 2025 02:36
@mergify mergify bot removed the needs-rebase label Oct 1, 2025
@andylolu2 andylolu2 changed the title [Feature] Specialized LoRA cuda graph [Feature] LoRA cuda graph specialization Oct 1, 2025
@andylolu2 andylolu2 marked this pull request as ready for review October 1, 2025 02:41
@andylolu2 (Contributor, Author) commented:

/gemini review

@andylolu2 (Contributor, Author) commented:

@gemini-code-assist

gemini-code-assist bot left a comment:

Code Review

This pull request introduces a specialization for CUDA graphs with LoRA to optimize performance. The key idea is to capture two distinct sets of CUDA graphs: one for batches with active LoRA adapters and another for batches without. This avoids the overhead associated with LoRA operations when no LoRA adapters are in use. The changes include adding a has_lora attribute to BatchDescriptor for dispatching, modifying the CUDA graph capture logic to handle both scenarios, and moving the zeroing of the intermediate LoRA buffer to be conditional on LoRA activation. The implementation appears solid and the provided test results demonstrate a significant reduction in overhead, from 10.5% down to 1.4%, when LoRA is enabled but not active. The code changes are consistent and correctly implement the intended optimization. I have not found any critical or high-severity issues.

gemini-code-assist bot commented:

Thank you for submitting this pull request, @andylolu2! I've reviewed the changes and appreciate the clear problem statement and detailed implementation notes.

This feature significantly improves efficiency by specializing CUDA graphs for LoRA and non-LoRA scenarios. The reported reduction in overhead from 10.5% to 1.4% is a substantial gain, directly addressing the stated purpose of minimizing overhead when LoRA is enabled but not actively used.

Key observations from the review:

  • Correctness: The logic for introducing has_lora to BatchDescriptor and using it for CUDA graph dispatching is sound. The conditional zeroing of the LoRA buffer within lora_shrink_op.py is a correct and efficient optimization (a sketch of this idea follows this comment).
  • Efficiency: This PR is a direct improvement in efficiency. By capturing separate CUDA graphs and optimizing buffer initialization, it effectively eliminates most of the overhead associated with LoRA when no adapters are active.
  • Maintainability: The changes are well-encapsulated within the relevant modules (forward_context.py, lora_shrink_op.py, punica_gpu.py, cudagraph_dispatcher.py, gpu_model_runner.py). The use of itertools.product for generating compilation cases is a clean and Pythonic approach. The inline comment explaining the torch.empty change is also helpful for future maintainers.
  • Security: No security concerns were identified with these changes.

Overall, this is a valuable enhancement that improves the performance characteristics of LoRA integration. Great work!
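
A hedged sketch of the conditional-zeroing point above; the actual lora_shrink wrapper in vLLM differs, and the signature and buffer names here are simplified assumptions:

    import torch

    def lora_shrink(x: torch.Tensor, lora_a_stacked: torch.Tensor,
                    buffer: torch.Tensor) -> None:
        # The intermediate buffer is allocated with torch.empty by the caller
        # and zeroed here, immediately before it is written to. Because the
        # zeroing lives inside lora_shrink, the non-LoRA graphs contain
        # neither this zeroing nor the shrink kernel itself.
        buffer.zero_()
        # ... the actual shrink (x @ lora_a) kernel launch would go here ...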

@andylolu2 andylolu2 changed the title [Feature] LoRA cuda graph specialization [LoRA] LoRA cuda graph specialization Oct 2, 2025
@varun-sundar-rabindranath (Contributor) commented:

Thanks @andylolu2. Left some comments regarding some refactors.

cc @jeejeelee
cc @ProExpertProg @LucasWilkinson for cudagraph dispatching changes

self.add_cudagraph_key(
    cudagraph_mode.mixed_mode(),
    BatchDescriptor(num_tokens=bs, uniform_decode=False))
for has_lora in [True, False]:
A collaborator commented:

QQ: Will this increase the memory consumption of the CUDA graph?

@andylolu2 (Contributor, Author) replied Oct 7, 2025:

Yes it does, but not by too much.

Before PR: (with --enable-lora)

Free memory ... 1.62 GiB for CUDAGraph memory.

After PR: (with --enable-lora)

Free memory ... 2.38 GiB for CUDAGraph memory.

A contributor replied:
@andylolu2 can you also try this with just the base model (i.e. not enabling LoRA) to see that it doesn't affect the CUDA graph memory? Thanks.

@andylolu2 (Contributor, Author) replied Oct 9, 2025:

Baseline is: (without --enable-lora)

Free memory ... 1.14 GiB for CUDAGraph memory.

@andylolu2 andylolu2 force-pushed the andy/lora-cuda-graphs-specialization branch from 00d1328 to b027fd2 on October 5, 2025 18:12
@ProExpertProg (Collaborator) commented:

Cc @fhl2000 can you take a look?

@fhl2000 (Contributor) left a comment:

LGTM for the cudagraph dispatching stuff. Only some small thoughts on the LoRA path.

@jeejeelee (Collaborator) left a comment:

LGTM, @ProExpertProg please take another look

@jeejeelee jeejeelee added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 16, 2025
@jeejeelee jeejeelee enabled auto-merge (squash) October 16, 2025 14:49
@dcmaddix dcmaddix mentioned this pull request Oct 18, 2025
@andylolu2 (Contributor, Author) commented:

@ProExpertProg @jeejeelee I see CI failures, but they seem unrelated. I scanned through them and all failures appear to be caused by:

 RuntimeError: _moe_C::topk_softmax() is missing value for argument 'renormalize'. Declaration: _moe_C::topk_softmax(Tensor($0! -> ) topk_weights, Tensor($1! -> ) topk_indices, Tensor($2! -> ) token_expert_indices, Tensor gating_output, bool renormalize) -> ()

@jeejeelee (Collaborator) commented:

Let me sync with main and test again

@jeejeelee jeejeelee merged commit b63f214 into vllm-project:main Oct 20, 2025
51 checks passed
Ther-LF pushed a commit to Ther-LF/vllm that referenced this pull request Oct 20, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Oct 20, 2025
Merged 8 commits from origin/main including:
- PR vllm-project#26586: Eagle rejection sampler fix (previously cherry-picked)
- LoRA CUDA graph specialization (vllm-project#25914)
- Bee-8B VLM model support (vllm-project#27012)
- Utilities reorganization (network_utils, async_utils, etc.)
- Multiple bug fixes and improvements

In-Tree Modifications:
- Removed Eagle rejection sampler cherry-pick (now in upstream)
- Kept Qwen3 tool parser fix (still needed, line 523)
- Only 1 active in-tree modification remaining

Plugin Compatibility:
- All 10 plugin patches load successfully
- No target class changes required
- Clean merge with no conflicts

Documentation Updates:
- Updated IN_TREE_MODIFICATIONS.md (moved Eagle fix to Removed/Obsolete)
- Updated CLAUDE.md merge history
- Verified clean diff with origin/main (3 files, all documented)

Signed-off-by: Pradyun Ramadorai <[email protected]>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
adabeyta pushed a commit to adabeyta/vllm that referenced this pull request Oct 20, 2025
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 23, 2025
Signed-off-by: Andy Lo <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Signed-off-by: Alberto Perdomo <[email protected]>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
Signed-off-by: Andy Lo <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Signed-off-by: 0xrushi <[email protected]>
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

llama (Related to Llama models), ready (ONLY add when PR is ready to merge/full CI is needed), v1
