[bugfix] support eagle with lora cudagraph specialization #28318
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/v1/worker/gpu_model_runner.py (outdated)

```diff
         cudagraph_runtime_mode == CUDAGraphMode.PIECEWISE
         and not self.speculative_config.enforce_eager
     )
-    self.drafter.dummy_run(num_tokens, use_cudagraphs=use_cudagraphs)
+    self.drafter.dummy_run(
+        num_tokens,
+        use_cudagraphs=use_cudagraphs,
+        batch_descriptor=batch_descriptor,
```
Drafter inference still uses non‑LoRA cudagraph key
Passing batch_descriptor into EagleProposer.dummy_run ensures LoRA specialization is considered during warm‑up, but the actual drafter forward (EagleProposer.propose) still calls set_forward_context without a descriptor, so its cudagraph lookups always use a BatchDescriptor with has_lora=False. After switching from a no‑LoRA run to a LoRA‑enabled run the dummy run will capture a new graph under a has_lora=True key, while inference continues to fetch the pre‑existing has_lora=False entry and replays the stale graph, recreating the illegal memory error this change attempts to fix. The runtime path needs to receive the same batch_descriptor (or at least the LoRA flag) so drafter cudagraphs are keyed consistently with the dummy run.
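To make the keying mismatch concrete, here is a small self-contained toy (plain Python, not vLLM code; the field names just mirror the review comment): the graph store is keyed by a descriptor that includes a has_lora flag, so a runtime lookup that always passes has_lora=False keeps replaying the graph captured before LoRA was enabled.

```python
from typing import NamedTuple

# Toy stand-ins for a descriptor-keyed cudagraph store; purely illustrative,
# not vLLM's actual dispatcher.
class BatchDescriptor(NamedTuple):
    num_tokens: int
    has_lora: bool

captured_graphs: dict[BatchDescriptor, str] = {}

def dummy_run(num_tokens: int, has_lora: bool) -> None:
    # Warm-up captures a graph under the LoRA-aware key.
    key = BatchDescriptor(num_tokens, has_lora)
    captured_graphs.setdefault(key, f"graph(num_tokens={num_tokens}, lora={has_lora})")

def drafter_forward(num_tokens: int, has_lora: bool, pass_descriptor: bool) -> str:
    # If the runtime path drops the descriptor, it always looks up has_lora=False
    # and replays whatever graph lives under that stale key.
    key = BatchDescriptor(num_tokens, has_lora if pass_descriptor else False)
    return captured_graphs[key]

dummy_run(256, has_lora=False)   # initial no-LoRA capture
dummy_run(256, has_lora=True)    # LoRA-enabled capture adds a second key
print(drafter_forward(256, True, pass_descriptor=False))  # stale has_lora=False graph
print(drafter_forward(256, True, pass_descriptor=True))   # consistently keyed LoRA graph
```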
@gnovack does this make sense to you?
Yeah, the bot's comment makes sense, but we don't actually see the illegal memory error when running inference, even though the LoRA-specialized batch descriptor is not passed into the draft model.
I have a suspicion that the root issue is something deeper in the custom allreduce kernel (which is where the illegal memory access occurs), and that it only surfaces during cudagraph capture.
For context, we don't see this error when using eager mode, or when using cudagraphs with TP=1.
Alternatively, we could add something like this before the call to self.drafter.dummy_run:

```python
if self.compilation_config.cudagraph_specialize_lora and activate_lora:
    use_cudagraphs = False
```

to ensure that we only capture one draft model cudagraph for each unique num_tokens. I confirmed that this is working locally as well.
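To spell out where that guard would sit relative to the use_cudagraphs computation shown in the diff above, here is a self-contained restatement of the boolean logic (CUDAGraphMode is stubbed locally for illustration; this is a sketch, not the actual gpu_model_runner code):

```python
from enum import Enum, auto

# Local stub so the sketch runs standalone; vLLM defines its own CUDAGraphMode.
class CUDAGraphMode(Enum):
    NONE = auto()
    PIECEWISE = auto()
    FULL = auto()

def drafter_use_cudagraphs(
    cudagraph_runtime_mode: CUDAGraphMode,
    enforce_eager: bool,
    cudagraph_specialize_lora: bool,
    activate_lora: bool,
) -> bool:
    """Logic from the diff above plus the suggested guard."""
    use_cudagraphs = (
        cudagraph_runtime_mode == CUDAGraphMode.PIECEWISE and not enforce_eager
    )
    # Suggested guard: skip capture on the LoRA-specialized warm-up pass so only
    # one draft-model graph is captured per unique num_tokens.
    if cudagraph_specialize_lora and activate_lora:
        use_cudagraphs = False
    return use_cudagraphs

assert drafter_use_cudagraphs(CUDAGraphMode.PIECEWISE, False, True, True) is False
assert drafter_use_cudagraphs(CUDAGraphMode.PIECEWISE, False, True, False) is True
```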
Code Review
This pull request addresses a bug causing an illegal memory error when using LoRA with Eagle speculative decoding and CUDA graph specialization. The fix involves propagating the batch_descriptor, which contains LoRA specialization information, to the drafter.dummy_run method. This ensures that when a new CUDA graph is captured for the base model, the draft model's graph is also correctly specialized for LoRA, resolving the memory issue. The changes are minimal, targeted, and leverage the existing batch_descriptor mechanism effectively. This is a clean and correct approach to fixing the described problem.
Purpose
Currently, enabling LoRA with EAGLE spec decoding fails with an illegal memory error unless `cudagraph_specialize_lora` is turned off. There seems to be an issue when capturing a new base model cudagraph while reusing an existing draft model cudagraph, since no LoRA specialization information is provided to the draft model `dummy_run`.

This PR passes the `batch_descriptor` containing LoRA specialization info to `drafter.dummy_run`. I confirmed locally that this resolves the illegal memory error, but I would appreciate any thoughts on whether there might be a better/cleaner way to fix this. Thanks!
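For context, a hedged sketch of the kind of offline-engine setup where this combination comes into play (model names are placeholders, and the exact speculative/compilation config surface may vary across vLLM versions); before this fix, the workaround was to pass `cudagraph_specialize_lora: False`:

```python
from vllm import LLM

# Hedged sketch only: model names are placeholders and the config surface may
# differ across vLLM versions.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    enable_lora=True,                          # LoRA enabled
    tensor_parallel_size=2,                    # the error was seen with cudagraphs and TP > 1
    speculative_config={                       # EAGLE draft model (placeholder)
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 3,
    },
    # Pre-fix workaround mentioned in the description: disable LoRA cudagraph
    # specialization. With this PR, the default specialization should work too.
    # compilation_config={"cudagraph_specialize_lora": False},
)
```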