[bugfix] support eagle with lora cudagraph specialization #28318
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/v1/worker/gpu_model_runner.py (outdated)

```diff
         cudagraph_runtime_mode == CUDAGraphMode.PIECEWISE
         and not self.speculative_config.enforce_eager
     )
-    self.drafter.dummy_run(num_tokens, use_cudagraphs=use_cudagraphs)
+    self.drafter.dummy_run(
+        num_tokens,
+        use_cudagraphs=use_cudagraphs,
+        batch_descriptor=batch_descriptor,
```
Drafter inference still uses non‑LoRA cudagraph key
Passing batch_descriptor into EagleProposer.dummy_run ensures LoRA specialization is considered during warm‑up, but the actual drafter forward (EagleProposer.propose) still calls set_forward_context without a descriptor, so its cudagraph lookups always use a BatchDescriptor with has_lora=False. After switching from a no‑LoRA run to a LoRA‑enabled run the dummy run will capture a new graph under a has_lora=True key, while inference continues to fetch the pre‑existing has_lora=False entry and replays the stale graph, recreating the illegal memory error this change attempts to fix. The runtime path needs to receive the same batch_descriptor (or at least the LoRA flag) so drafter cudagraphs are keyed consistently with the dummy run.
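To make the keying mismatch concrete, here is a small self-contained toy (plain Python, not vLLM code; the field names just mirror the review comment): the graph store is keyed by a descriptor that includes a has_lora flag, so a runtime lookup that always passes has_lora=False keeps replaying the graph captured before LoRA was enabled.

```python
from typing import NamedTuple

# Toy stand-ins for a descriptor-keyed cudagraph store; purely illustrative,
# not vLLM's actual dispatcher.
class BatchDescriptor(NamedTuple):
    num_tokens: int
    has_lora: bool

captured_graphs: dict[BatchDescriptor, str] = {}

def dummy_run(num_tokens: int, has_lora: bool) -> None:
    # Warm-up captures a graph under the LoRA-aware key.
    key = BatchDescriptor(num_tokens, has_lora)
    captured_graphs.setdefault(key, f"graph(num_tokens={num_tokens}, lora={has_lora})")

def drafter_forward(num_tokens: int, has_lora: bool, pass_descriptor: bool) -> str:
    # If the runtime path drops the descriptor, it always looks up has_lora=False
    # and replays whatever graph lives under that stale key.
    key = BatchDescriptor(num_tokens, has_lora if pass_descriptor else False)
    return captured_graphs[key]

dummy_run(256, has_lora=False)   # initial no-LoRA capture
dummy_run(256, has_lora=True)    # LoRA-enabled capture adds a second key
print(drafter_forward(256, True, pass_descriptor=False))  # stale has_lora=False graph
print(drafter_forward(256, True, pass_descriptor=True))   # consistently keyed LoRA graph
```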
@gnovack does this make sense to you?
Yeah, the bot's comment makes sense, but we don't actually see the illegal memory error when running inference, even though the LoRA-specialized batch descriptor is not passed into the draft model.
I have a suspicion that the root issue is something deeper in the custom allreduce kernel (which is where the illegal memory access occurs), and that it only surfaces during cudagraph capture.
For context, we don't see this error when using eager mode, or when using cudagraphs with TP=1.
Alternatively, we could add something like this before the call to self.drafter.dummy_run:

```python
if self.compilation_config.cudagraph_specialize_lora and activate_lora:
    use_cudagraphs = False
```

to ensure that we only capture one draft model cudagraph for each unique num_tokens. I confirmed that this is working locally as well.
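To spell out where that guard would sit relative to the use_cudagraphs computation shown in the diff above, here is a self-contained restatement of the boolean logic (CUDAGraphMode is stubbed locally for illustration; this is a sketch, not the actual gpu_model_runner code):

```python
from enum import Enum, auto

# Local stub so the sketch runs standalone; vLLM defines its own CUDAGraphMode.
class CUDAGraphMode(Enum):
    NONE = auto()
    PIECEWISE = auto()
    FULL = auto()

def drafter_use_cudagraphs(
    cudagraph_runtime_mode: CUDAGraphMode,
    enforce_eager: bool,
    cudagraph_specialize_lora: bool,
    activate_lora: bool,
) -> bool:
    """Logic from the diff above plus the suggested guard."""
    use_cudagraphs = (
        cudagraph_runtime_mode == CUDAGraphMode.PIECEWISE and not enforce_eager
    )
    # Suggested guard: skip capture on the LoRA-specialized warm-up pass so only
    # one draft-model graph is captured per unique num_tokens.
    if cudagraph_specialize_lora and activate_lora:
        use_cudagraphs = False
    return use_cudagraphs

assert drafter_use_cudagraphs(CUDAGraphMode.PIECEWISE, False, True, True) is False
assert drafter_use_cudagraphs(CUDAGraphMode.PIECEWISE, False, True, False) is True
```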
Code Review
This pull request addresses a bug causing an illegal memory error when using LoRA with Eagle speculative decoding and CUDA graph specialization. The fix involves propagating the batch_descriptor, which contains LoRA specialization information, to the drafter.dummy_run method. This ensures that when a new CUDA graph is captured for the base model, the draft model's graph is also correctly specialized for LoRA, resolving the memory issue. The changes are minimal, targeted, and leverage the existing batch_descriptor mechanism effectively. This is a clean and correct approach to fixing the described problem.
Purpose
Currently, enabling LoRA with EAGLE spec decoding fails with an illegal memory error unless `cudagraph_specialize_lora` is turned off. There seems to be an issue when capturing a new base model cudagraph while reusing an existing draft model cudagraph, since no LoRA specialization information is provided to the draft model `dummy_run`.

This PR passes the `batch_descriptor` containing LoRA specialization info to `drafter.dummy_run`. I confirmed locally that this resolves the illegal memory error, but I would appreciate any thoughts on whether there might be a better/cleaner way to fix this. Thanks!
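For context, a hedged sketch of the kind of offline-engine setup where this combination comes into play (model names are placeholders, and the exact speculative/compilation config surface may vary across vLLM versions); before this fix, the workaround was to pass `cudagraph_specialize_lora: False`:

```python
from vllm import LLM

# Hedged sketch only: model names are placeholders and the config surface may
# differ across vLLM versions.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    enable_lora=True,                          # LoRA enabled
    tensor_parallel_size=2,                    # the error was seen with cudagraphs and TP > 1
    speculative_config={                       # EAGLE draft model (placeholder)
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 3,
    },
    # Pre-fix workaround mentioned in the description: disable LoRA cudagraph
    # specialization. With this PR, the default specialization should work too.
    # compilation_config={"cudagraph_specialize_lora": False},
)
```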