
Conversation

@gnovack (Contributor) commented Nov 7, 2025

Purpose

Currently, enabling LoRA with Eagle speculative decoding fails with an illegal memory error unless cudagraph_specialize_lora is turned off. There seems to be an issue when a new base model cudagraph is captured but an existing draft model cudagraph is reused (since no LoRA specialization information is provided to the draft model dummy_run).

This PR passes the batch_descriptor containing LoRA specialization info to drafter.dummy_run. I confirmed locally that this resolved the illegal memory error, but would appreciate any thoughts on whether there might be a better/cleaner way to fix this. Thanks!
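For readers less familiar with the mechanism, here is a minimal, self-contained sketch of the failure mode described above (plain Python, not vLLM code; BatchDescriptor and GraphCache here are simplified stand-ins for the real dispatch machinery):

from dataclasses import dataclass


@dataclass(frozen=True)
class BatchDescriptor:
    # Simplified stand-in for the real dispatch key.
    num_tokens: int
    has_lora: bool = False


class GraphCache:
    # Toy cudagraph cache: graphs are captured and replayed per descriptor.
    def __init__(self) -> None:
        self._graphs: dict[BatchDescriptor, str] = {}

    def warm_up(self, desc: BatchDescriptor) -> None:
        # Capture a new graph only if this exact key has not been seen before.
        self._graphs.setdefault(desc, f"graph captured for {desc}")

    def replay(self, desc: BatchDescriptor) -> str:
        return self._graphs[desc]


cache = GraphCache()

# An earlier non-LoRA warm-up captured a graph under the plain key.
cache.warm_up(BatchDescriptor(num_tokens=128))

# LoRA-enabled warm-up: the base model captures a new specialized graph, but a
# drafter warm-up that never receives the LoRA flag reuses the existing entry,
# so the graph it later replays was captured for a different runtime state.
cache.warm_up(BatchDescriptor(num_tokens=128))
print(cache.replay(BatchDescriptor(num_tokens=128)))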



@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 3585 to 3591
     cudagraph_runtime_mode == CUDAGraphMode.PIECEWISE
     and not self.speculative_config.enforce_eager
 )
-self.drafter.dummy_run(num_tokens, use_cudagraphs=use_cudagraphs)
+self.drafter.dummy_run(
+    num_tokens,
+    use_cudagraphs=use_cudagraphs,
+    batch_descriptor=batch_descriptor,
+)

P1: Drafter inference still uses non-LoRA cudagraph key

Passing batch_descriptor into EagleProposer.dummy_run ensures LoRA specialization is considered during warm‑up, but the actual drafter forward (EagleProposer.propose) still calls set_forward_context without a descriptor, so its cudagraph lookups always use a BatchDescriptor with has_lora=False. After switching from a no‑LoRA run to a LoRA‑enabled run the dummy run will capture a new graph under a has_lora=True key, while inference continues to fetch the pre‑existing has_lora=False entry and replays the stale graph, recreating the illegal memory error this change attempts to fix. The runtime path needs to receive the same batch_descriptor (or at least the LoRA flag) so drafter cudagraphs are keyed consistently with the dummy run.
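A short illustration of that scenario (again plain Python, not vLLM code; the (num_tokens, has_lora) tuples stand in for BatchDescriptor keys):

# Illustration only: cudagraphs keyed by (num_tokens, has_lora).
graphs = {
    (128, False): "graph captured without LoRA",  # pre-existing entry
}

# The LoRA-enabled dummy run now captures a second, specialized graph.
graphs[(128, True)] = "graph captured with LoRA"

# If the drafter's runtime path still builds its key with has_lora=False,
# LoRA-enabled inference replays the stale non-LoRA graph:
runtime_key = (128, False)
print(graphs[runtime_key])  # -> graph captured without LoRA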


Collaborator


@gnovack does this make sense to you?

Contributor Author


Yeah, the bot's comment makes sense, but we don't actually see the illegal memory error when running inference, even though the LoRA-specialized batch descriptor is not passed into the draft model.

I suspect the root issue is something deeper in the custom allreduce kernel (which is where the illegal memory access occurs), and that it only surfaces during cudagraph capture.

For context, we don't see this error when using eager mode, or when using cudagraphs with TP=1.

Contributor Author


Alternatively, we could add something like this before the call to self.drafter.dummy_run:

if self.compilation_config.cudagraph_specialize_lora and activate_lora:
    use_cudagraphs = False

to ensure that we only capture one draft model cudagraph for each unique num_tokens. I confirmed that this is working locally as well.
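For concreteness, the same guard expressed as a standalone helper (a sketch only; the flag and variable names are taken from this thread and the diff above, not from the actual runner code):

def drafter_use_cudagraphs(
    piecewise: bool,                  # cudagraph_runtime_mode == CUDAGraphMode.PIECEWISE
    enforce_eager: bool,              # speculative_config.enforce_eager
    cudagraph_specialize_lora: bool,  # compilation_config.cudagraph_specialize_lora
    activate_lora: bool,
) -> bool:
    use_cudagraphs = piecewise and not enforce_eager
    # Skip capture for the LoRA-specialized warm-up so that only one
    # draft-model cudagraph exists per unique num_tokens.
    if cudagraph_specialize_lora and activate_lora:
        use_cudagraphs = False
    return use_cudagraphs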

@gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug causing an illegal memory error when using LoRA with Eagle speculative decoding and CUDA graph specialization. The fix involves propagating the batch_descriptor, which contains LoRA specialization information, to the drafter.dummy_run method. This ensures that when a new CUDA graph is captured for the base model, the draft model's graph is also correctly specialized for LoRA, resolving the memory issue. The changes are minimal, targeted, and leverage the existing batch_descriptor mechanism effectively. This is a clean and correct approach to fixing the described problem.

@simon-mo enabled auto-merge (squash) November 7, 2025 20:47
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 7, 2025
auto-merge was automatically disabled November 7, 2025 23:58

Head branch was pushed to by a user without write access

@simon-mo enabled auto-merge (squash) November 8, 2025 00:12
auto-merge was automatically disabled November 8, 2025 01:09

Head branch was pushed to by a user without write access

@simon-mo enabled auto-merge (squash) November 8, 2025 02:29
@simon-mo merged commit 70af44f into vllm-project:main Nov 8, 2025
47 checks passed
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Nov 13, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding, v1
