
Conversation

@ayushsatyam146 (Contributor) commented Aug 31, 2025

Fixes #23789. Moves CUDA graph padding before attention metadata construction so that the metadata tensor sizes match the padded execution environment. This fixes crashes in FlashMLA and other attention backends caused by metadata/execution size mismatches. A minimal sketch of the reordering follows the list below.

  • Pass padded token count to _prepare_inputs()
  • Update slot_mapping and num_actual_tokens to use padded sizes
  • Eliminates the need for backend-specific workarounds
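
A minimal sketch of the reordering described above; `compute_cudagraph_padded_size` and this `_prepare_inputs` signature are illustrative stand-ins, not vLLM's exact internals:

```python
# Sketch only: compute_cudagraph_padded_size() and this _prepare_inputs()
# signature are hypothetical stand-ins for the real vLLM internals.
def execute_model(self, scheduler_output):
    num_scheduled_tokens = scheduler_output.total_num_scheduled_tokens

    # Compute the CUDA graph padding *before* building attention metadata,
    # so the metadata is sized to what the graph will actually replay with.
    num_input_tokens = self.compute_cudagraph_padded_size(num_scheduled_tokens)

    # slot_mapping and num_actual_tokens inside the metadata now use the
    # padded size, avoiding the metadata/execution mismatch that crashed
    # FlashMLA and other backends.
    attn_metadata = self._prepare_inputs(scheduler_output, num_input_tokens)
    ...
```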

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses a critical crash issue related to CUDA graph padding by ensuring attention metadata tensor sizes match the execution environment. The changes are well-targeted and logical. A new num_input_tokens parameter is introduced to _prepare_inputs to carry the padded token count. The call to _prepare_inputs is correctly moved to after padding is calculated in execute_model. The slot_mapping and num_actual_tokens are now correctly sized using this padded count. The implementation appears robust and correctly handles the backward compatibility case where no padding is applied. Overall, this is a solid fix.
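
As a hedged illustration of that backward-compatible default (the parameter name comes from the review above; the body and the buffer name are assumptions):

```python
# Hypothetical sketch of the backward-compatible path: with no padded
# size supplied, metadata is built against the actual token count.
def _prepare_inputs(self, scheduler_output, num_input_tokens=None):
    num_scheduled_tokens = scheduler_output.total_num_scheduled_tokens
    if num_input_tokens is None:
        # No CUDA graph padding applied: behaves exactly as before.
        num_input_tokens = num_scheduled_tokens

    # Slice metadata tensors to the (possibly padded) size so they agree
    # with what the model actually executes; slot_mapping_buffer is an
    # assumed name for the preallocated buffer.
    slot_mapping = self.slot_mapping_buffer[:num_input_tokens]
    num_actual_tokens = num_input_tokens
    ...
```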

@ayushsatyam146 force-pushed the cudagraph-fix branch 2 times, most recently from c8c4642 to 6e87b60 on September 1, 2025 03:15
@LucasWilkinson (Collaborator) left a comment

Thanks for taking this on! I think this might actually be a bit trickier than I initially thought; a couple of comments (see the sketch after the list):

  1. If we are dispatching to full CUDA graphs, we should pad num_reqs to match what the graph was captured for, since this information is used by the attention backends too; this may require changes to the CudagraphDispatcher.
  2. If we are dispatching to a piecewise CUDA graph, we don't need to pad num_reqs, and it's probably better not to (to avoid wasted work in attention).
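
A rough sketch of the distinction being drawn here; the "full"/"piecewise" mode strings and the `captured` shape descriptor are stand-ins, not the actual CudagraphDispatcher API:

```python
# Illustrative only: mode strings and the `captured` descriptor are
# stand-ins for vLLM's CudagraphDispatcher types.
def compute_padded_sizes(mode, captured, num_tokens, num_reqs):
    if mode == "full":
        # The whole step, attention included, replays inside the graph,
        # so the backends see the captured request count too: pad both.
        return captured.num_tokens, captured.num_reqs
    if mode == "piecewise":
        # Attention runs eagerly between graph pieces; padding num_reqs
        # would only add wasted work there, so pad tokens only.
        return captured.num_tokens, num_reqs
    # Eager execution: no padding at all.
    return num_tokens, num_reqs
```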

mergify bot commented Sep 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ayushsatyam146.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Sep 1, 2025
@ayushsatyam146 (Contributor, Author)

Thanks for the review @LucasWilkinson. I will work on the suggested changes. Do you have any idea why I'm getting CI failures on this consistently?

@LucasWilkinson (Collaborator)

> Do you have any idea why I'm getting CI failures on this consistently?

The CI can be flaky; I would just keep merging/rebasing off main pretty regularly. Tests are usually fixed on main in pretty short order.

@ayushsatyam146 marked this pull request as draft on September 2, 2025 05:36
@ayushsatyam146 marked this pull request as ready for review on September 2, 2025 08:12
mergify bot removed the needs-rebase label Sep 2, 2025
@ayushsatyam146 force-pushed the cudagraph-fix branch 3 times, most recently from 098e689 to 6691ed4 on September 3, 2025 14:07
@ayushsatyam146 (Contributor, Author)

@LucasWilkinson I have accommodated your suggestions in the new changes. Can you please take a look? Thanks!

@ayushsatyam146 force-pushed the cudagraph-fix branch 2 times, most recently from 8241342 to 6d8a083 on September 5, 2025 17:29
- Add missing Optional import to forward_context.py
- Add missing NamedTuple import to cudagraph_dispatcher.py
- Add missing Optional import to gpu_model_runner.py
- Fix indentation in gpu_model_runner.py (try-except block)
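
For reference, the import fixes listed above amount to lines like these (exact placement within each file is assumed):

```python
# forward_context.py and gpu_model_runner.py
from typing import Optional

# cudagraph_dispatcher.py
from typing import NamedTuple
```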
mergify bot commented Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ayushsatyam146.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Nov 11, 2025
mergify bot removed the needs-rebase label Nov 12, 2025
mergify bot commented Nov 12, 2025

Documentation preview: https://vllm--24002.org.readthedocs.build/en/24002/

mergify bot added the documentation label Nov 12, 2025
mergify bot commented Nov 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ayushsatyam146.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

documentation, needs-rebase, nvidia, v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Attention]: Pad for cudagraphs before constructing attention metadata

5 participants