[Cherry-Pick][CI] Support multi-step mtp with cudagraph (#5886) #5898
Conversation
Thanks for your contribution!
Pull request overview
This is a cherry-pick PR that adds support for multi-step MTP (Multi-Token Prediction) with CUDA Graph optimization. The changes simplify the CUDA Graph capture logic and fix compatibility issues when using CUDA Graph in multi-step execution scenarios.
- Simplifies CUDA Graph capture by removing the separate Draft Model capture logic and using a unified approach
- Fixes CUDA Graph error 700 by replacing `paddle.clone` with a `copy_` operation (see the sketch after this list)
- Adds logic to prevent CUDA Graph capture on substeps after the first in multi-step execution
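For context, a minimal sketch of why the `copy_` change matters (not code from the PR; the buffer name and size here are illustrative): `paddle.clone` allocates a fresh tensor on every call, so a replayed CUDA Graph can end up referencing a stale device address, which is one way CUDA error 700 (an illegal memory access) surfaces. Copying in place into a buffer allocated once, outside capture, keeps the address stable across replays:

```python
import paddle

MAX_BATCH = 64  # illustrative; the real buffer matches the runner's batch dimension

# Allocate the destination once, outside any CUDA Graph capture, so its
# device address stays fixed for every graph replay.
last_seq_lens = paddle.zeros([MAX_BATCH], dtype="int32")

def snapshot_seq_lens(seq_lens_this_time: paddle.Tensor) -> None:
    # Graph-unsafe: paddle.clone() returns a newly allocated tensor on each
    # call, so a captured graph would replay against a stale allocation:
    #   last_seq_lens = paddle.clone(seq_lens_this_time)

    # Graph-safe: copy_ writes in place into the stable buffer; the second
    # argument (False) requests a non-blocking copy, as in the PR diff.
    last_seq_lens.copy_(seq_lens_this_time, False)
```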
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| fastdeploy/worker/gpu_model_runner.py | Removes complex Draft Model CUDA Graph capture logic, updates expected_decode_len calculation, and simplifies log messages |
| fastdeploy/spec_decode/mtp.py | Adds substep tracking to _initialize_forward_meta, implements conditional CUDA Graph capture for multi-step scenarios, and fixes tensor copy operation for CUDA Graph compatibility |
```diff
 )
 logger.info(
-    f"Warm up the Target model with the num_tokens:{capture_size}, expected_decode_len:{self.speculative_config.num_speculative_tokens}"
+    f"Warm up the model with the num_tokens:{capture_size}, expected_decode_len:{self.speculative_config.num_speculative_tokens}"
```
Copilot AI · Jan 6, 2026
The log message displays an incorrect `expected_decode_len` value. The actual value used is `self.speculative_config.num_speculative_tokens * 2 + 1` (line 2108), but the log message shows `self.speculative_config.num_speculative_tokens`. These should match for accurate logging.
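One way to apply the suggestion, as a hedged sketch (the warm-up callee `_dummy_run` and its parameters are assumptions, not shown in the excerpt): compute the value once and reuse it in both the call and the log, so the two cannot drift apart:

```python
# Compute once so the warm-up call and the log line always agree.
expected_decode_len = self.speculative_config.num_speculative_tokens * 2 + 1
logger.info(
    f"Warm up the model with the num_tokens:{capture_size}, "
    f"expected_decode_len:{expected_decode_len}"
)
self._dummy_run(num_tokens=capture_size, expected_decode_len=expected_decode_len)
```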
```diff
 if self.num_model_steps > 1:
-    self.last_seq_lens_this_time = paddle.clone(self.model_inputs["seq_lens_this_time"])
+    self.last_seq_lens_this_time.copy_(self.model_inputs["seq_lens_this_time"], False)
```
Copilot AI · Jan 6, 2026
The implementation inconsistency between the CUDA and XPU code paths could lead to maintenance issues. In the CUDA path (line 967), `paddle.clone` is replaced with `copy_` to avoid CUDA Graph error 700, but the XPU path at line 1091 still uses `paddle.clone`. For consistency, and to prevent potential issues if XPU also supports CUDA Graph in the future, consider using the same approach in both code paths.
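A sketch of what aligning the two paths could look like (the surrounding XPU code is not shown in the excerpt, so the structure here is assumed):

```python
if self.num_model_steps > 1:
    # Same pattern on both the CUDA and XPU paths: reuse the pre-allocated
    # buffer instead of paddle.clone, so a future XPU graph-capture mode
    # would not hit the same stale-address failure.
    self.last_seq_lens_this_time.copy_(self.model_inputs["seq_lens_this_time"], False)
```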
```diff
-    def _initialize_forward_meta(self, step_use_cudagraph: bool = False):
+    def _initialize_forward_meta(self, step_use_cudagraph: bool = False, is_dummy_run: bool = False, substep: int = 0):
         """
         Initialize forward meta and attention meta data
```
Copilot AI · Jan 6, 2026
The new parameters `is_dummy_run` and `substep` added to the function signature lack documentation in the docstring. The docstring should describe what these parameters represent, their expected types, and when they should be used, to maintain consistency with Python documentation best practices.
Suggested change:

```diff
-        Initialize forward meta and attention meta data
+        Initialize forward meta and attention meta data.
+
+        Args:
+            step_use_cudagraph (bool): Whether this step is allowed to use CUDA Graph
+                capture. The final decision also depends on runtime configuration such
+                as ``self.draft_model_use_cudagraph`` and the values of ``is_dummy_run``
+                and ``substep``.
+            is_dummy_run (bool): Indicates that the current call is part of a dummy
+                warm-up run used only for CUDA Graph capture and not a real decoding
+                step. When ``True`` together with ``substep > 0``, CUDA Graph capture
+                is disabled for this call.
+            substep (int): The index of the current substep in multi-step execution,
+                starting from 0. Only the first substep (0) can be captured by CUDA
+                Graph; later substeps reuse the captured graph and therefore should not
+                be captured again.
```
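For illustration, a minimal sketch of the gating this docstring describes (only the names quoted above come from the PR; the body is otherwise an assumption, not the actual implementation):

```python
def _initialize_forward_meta(self, step_use_cudagraph: bool = False,
                             is_dummy_run: bool = False, substep: int = 0):
    # Only the first substep of a multi-step MTP iteration may be captured;
    # during a dummy warm-up run, later substeps replay the graph captured
    # at substep 0, so capturing them again must be suppressed.
    if is_dummy_run and substep > 0:
        step_use_cudagraph = False

    # Fold in the runtime switch mentioned in the docstring.
    use_cudagraph = step_use_cudagraph and self.draft_model_use_cudagraph
    ...  # build forward/attention metadata using use_cudagraph (omitted)
```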
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           release/2.4    #5898   +/-   ##
==============================================
  Coverage             ?   57.88%
==============================================
  Files                ?      329
  Lines                ?    40768
  Branches             ?     6205
==============================================
  Hits                 ?    23597
  Misses               ?    15322
  Partials             ?     1849
```
Merged commit fb59f56 into PaddlePaddle:release/2.4.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least one PR type tag: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a PR to a `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.