[Cherry-Pick][CI]Support multi-step mtp with cudagraph(#5886) #5897
Conversation
Thanks for your contribution!
Pull request overview
This cherry-pick PR aims to support multi-step MTP (Multi-Token Prediction) with CUDA Graph by modifying the capture process and fixing CUDA Graph compatibility issues.
- Simplified CUDA Graph capture by removing the separate draft-model capture logic in `gpu_model_runner.py`
- Modified the expected decode length calculation for MTP warmup
- Enhanced `_initialize_forward_meta` to conditionally enable CUDA Graph based on the substep during dummy runs
- Fixed CUDA error 700 by replacing `paddle.clone` with `copy_` in CUDA Graph mode
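The last point follows a general CUDA Graph constraint: every replay must touch the same device addresses, so out-of-place allocation inside the captured region is unsafe. A minimal sketch of the pattern (the tensors here are illustrative, not taken from the PR):

```python
import paddle

src = paddle.rand([8, 128])

# Unsafe under CUDA Graph capture: paddle.clone allocates a fresh tensor
# on every call, so a replayed graph can end up touching a stale address
# (surfacing as CUDA error 700, an illegal memory access).
# out = paddle.clone(src)

# Graph-safe: allocate the destination once, outside the captured region,
# and copy in place so the device address stays fixed across replays.
dst = paddle.empty_like(src)
dst.copy_(src, False)
```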
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| fastdeploy/worker/gpu_model_runner.py | Removed complex draft model CUDA Graph capture logic, updated expected_decode_len calculation for MTP, simplified warmup logging |
| fastdeploy/spec_decode/mtp.py | Added parameters to _initialize_forward_meta for multi-step CUDA Graph support, replaced paddle.clone with copy_ to avoid CUDA error 700, added documentation about CUDA Graph capture requirements |
```diff
  # Initialize forward meta data
- self._initialize_forward_meta(step_use_cudagraph=step_use_cudagraph)
+ self._initialize_forward_meta(
+     step_use_cudagraph=step_use_cudagraph, is_dummy_run=is_dummy_run, substep=substep
```
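For context, a sketch of how the callee might gate CUDA Graph on these new arguments; the parameter names come from the diff, but the gating logic shown is an assumption:

```python
def _initialize_forward_meta(self, step_use_cudagraph=False, is_dummy_run=False, substep=0):
    """Build per-substep forward metadata (hypothetical sketch).

    During a dummy (warmup) run, only the substep being captured should
    run under CUDA Graph; the remaining substeps fall back to eager mode.
    """
    use_cudagraph = step_use_cudagraph
    if is_dummy_run:
        # Assumption: capture only happens on the first substep of a warmup run.
        use_cudagraph = step_use_cudagraph and substep == 0
    self.forward_meta.step_use_cudagraph = use_cudagraph
```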
Copilot AI · Jan 6, 2026

The variable `is_dummy_run` is used but not defined in the `_propose` method, which will cause a `NameError` at runtime when `_initialize_forward_meta` is called. The `_propose` signature only includes `step_use_cudagraph` as a parameter, yet `is_dummy_run` is being passed to `_initialize_forward_meta`. Either add `is_dummy_run` as a parameter to `_propose` or derive it from existing state/attributes.
Suggested change:

```diff
- step_use_cudagraph=step_use_cudagraph, is_dummy_run=is_dummy_run, substep=substep
+ step_use_cudagraph=step_use_cudagraph, substep=substep
```
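The other fix the comment mentions is to thread the flag through explicitly. A hypothetical sketch (the substep loop and `num_substeps` attribute are assumptions, not taken from the diff):

```python
def _propose(self, step_use_cudagraph: bool = False, is_dummy_run: bool = False):
    # Accepting is_dummy_run here makes it defined at the call site below.
    for substep in range(self.num_substeps):  # num_substeps is illustrative
        # Initialize forward meta data
        self._initialize_forward_meta(
            step_use_cudagraph=step_use_cudagraph,
            is_dummy_run=is_dummy_run,
            substep=substep,
        )
```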
```diff
  logger.info(
-     f"Warm up the Target model with the num_tokens:{capture_size}, expected_decode_len:{self.speculative_config.num_speculative_tokens}"
+     f"Warm up the model with the num_tokens:{capture_size}, expected_decode_len:{self.speculative_config.num_speculative_tokens}"
```
Copilot AI · Jan 6, 2026

The log message is inconsistent with the actual `expected_decode_len` value passed to `_dummy_run`. The log says `expected_decode_len:{self.speculative_config.num_speculative_tokens}`, but the parameter actually passed on line 1953 is `self.speculative_config.num_speculative_tokens * 2 + 1`. The log message should reflect the value actually used, to avoid confusion during debugging.
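A sketch of a log line consistent with the value the comment says is actually passed; the `_dummy_run` keyword names here are assumptions:

```python
expected_decode_len = self.speculative_config.num_speculative_tokens * 2 + 1
logger.info(
    f"Warm up the model with the num_tokens:{capture_size}, "
    f"expected_decode_len:{expected_decode_len}"
)
self._dummy_run(num_tokens=capture_size, expected_decode_len=expected_decode_len)
```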
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##    release/online/20251131    #5897   +/-   ##
==================================================
  Coverage          ?    58.50%
==================================================
  Files             ?       320
  Lines             ?     39181
  Branches          ?      5909
==================================================
  Hits              ?     22923
  Misses            ?     14425
  Partials          ?      1833
```
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Force-pushed from a0a518f to a61c3fc (Compare)
gongshaotian left a comment

LGTM
Merged 43dc335 into PaddlePaddle:release/online/20251131
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least one tag to the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- When submitting to a `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.