Description
🐛 Describe the bug
The custom AllReduce kernel fails with an illegal memory access error if it is called within a captured cudagraph (that is, if the graph is replayed) before the CustomAllReduce.capture context manager exits.
This was previously not a problem because no cudagraphs were replayed until after CustomAllReduce.capture exited. However, after #25914 enabled LoRA cudagraph specialization, the dummy run is executed twice for each num_tokens (once with activate_lora=True and once with activate_lora=False).
If spec decoding is enabled, this second dummy run triggers a replay of the draft model cudagraph (since that graph does not depend on the value of activate_lora), which results in an illegal memory access error.
A temporary fix for this was introduced by #28318, but I'm creating this issue to track a longer term resolution.
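To make the ordering problem concrete, here is a minimal pure-Python sketch of the failure mode. It is a toy model, not vLLM code: `ToyAllReduce`, `record`, `replay`, and `warmup` are hypothetical names, and the sketch assumes that the capture context defers some buffer registration until it exits, which is why a replay inside the context fails. The real kernel raises a CUDA illegal memory access; the toy raises `RuntimeError` instead.

```python
from contextlib import contextmanager


class ToyAllReduce:
    """Hypothetical stand-in for a custom all-reduce (not vLLM's class)."""

    def __init__(self):
        self.pending = []        # buffers seen during capture, not yet registered
        self.registered = set()  # buffers that are safe to use on replay

    @contextmanager
    def capture(self):
        try:
            yield
        finally:
            # Assumption: registration is deferred until the context exits.
            self.registered.update(self.pending)
            self.pending.clear()

    def record(self, buf):
        """Simulates capturing a graph that uses `buf`."""
        self.pending.append(buf)

    def replay(self, buf):
        """Simulates replaying a captured graph that uses `buf`."""
        if buf not in self.registered:
            raise RuntimeError(f"simulated illegal memory access: {buf!r}")
        return f"all-reduce on {buf}"


def warmup(ar, lora_cases):
    """One dummy run per LoRA case; the draft graph is captured once,
    then replayed on every later run because it ignores activate_lora."""
    draft_captured = False
    log = []
    for activate_lora in lora_cases:
        ar.record(f"target[lora={activate_lora}]")
        if not draft_captured:
            ar.record("draft")
            draft_captured = True
            log.append("capture draft")
        else:
            log.append(ar.replay("draft"))
    return log
```

In this model, calling `warmup(ar, [False])` inside `with ar.capture():` succeeds because nothing is replayed, while `warmup(ar, [True, False])` raises on the second dummy run. Once the context has exited, `ar.replay("draft")` works, which mirrors the pre-#25914 behavior described above.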
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.