[Bug][LoRA]: Custom AR IMA during CG Capture with LoRA




### 🐛 Describe the bug

The custom AllReduce kernel fails with an illegal memory access (IMA) error if it is called within a captured cudagraph before the [`CustomAllReduce.capture` context manager](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/device_communicators/custom_all_reduce.py#L200) exits.

This was previously not a problem because no cudagraphs were replayed until after `CustomAllReduce.capture` exited. However, after https://github.com/vllm-project/vllm/pull/25914 enabled LoRA cudagraph specialization, the dummy run is executed twice for each `num_tokens`: once with `activate_lora=True` and once with `activate_lora=False`.

If spec decoding is enabled, the second dummy run triggers a replay of the draft model cudagraph (since that graph does not depend on the value of `activate_lora`), which causes the illegal memory access error.
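To make the ordering problem concrete, here is a minimal pure-Python sketch of the sequence described above. It is a toy model, not vLLM code: `ToyCustomAllReduce`, `dummy_runs`, and the string "buffers" are all hypothetical names, and the model assumes that buffers used during capture only become usable once the `capture` context exits. The simulated error stands in for the real IMA.

```python
from contextlib import contextmanager


class ToyCustomAllReduce:
    """Toy stand-in for illustration only; NOT vLLM's CustomAllReduce."""

    def __init__(self):
        self.pending = []        # buffers seen during capture, not yet usable
        self.registered = set()  # buffers that are safe to use in a replay

    @contextmanager
    def capture(self):
        try:
            yield
        finally:
            # Assumption of this sketch: buffers become usable only here,
            # when the capture context exits.
            self.registered.update(self.pending)
            self.pending.clear()

    def all_reduce(self, buf, replaying):
        if not replaying:
            # Captured into a graph; remember the buffer for later.
            self.pending.append(buf)
        elif buf not in self.registered:
            raise RuntimeError(f"illegal memory access (simulated): {buf}")


def dummy_runs(ar, num_tokens_list, lora_cases, spec_decoding):
    """Simulate the warm-up loop; returns what happened to the draft graph."""
    draft_graphs = {}  # keyed by num_tokens only, not by activate_lora
    events = []
    with ar.capture():
        for n in num_tokens_list:
            for activate_lora in lora_cases:
                # The target graph is specialized on activate_lora, so each
                # dummy run captures a fresh graph for it.
                ar.all_reduce(f"target-{n}-{activate_lora}", replaying=False)
                if not spec_decoding:
                    continue
                if n in draft_graphs:
                    # Second dummy run for the same num_tokens: the draft
                    # graph already exists, so it is replayed in-context.
                    ar.all_reduce(draft_graphs[n], replaying=True)
                    events.append(("replay", n))
                else:
                    draft_graphs[n] = f"draft-{n}"
                    ar.all_reduce(draft_graphs[n], replaying=False)
                    events.append(("capture", n))
    return events
```

In this model, a single dummy run per `num_tokens` only ever captures, so nothing fails. With two dummy runs per `num_tokens` and spec decoding enabled, the second run replays the draft graph while the context is still open and hits the simulated error.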

A temporary fix was introduced in https://github.com/vllm-project/vllm/pull/28318, but I'm creating this issue to track a longer-term resolution.

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
