[Cherry-Pick][CI] Support multi-step mtp with cudagraph (#5886) #5898
Conversation
Thanks for your contribution!
Pull request overview
This is a cherry-pick PR that adds support for multi-step MTP (Multi-Token Prediction) with CUDA Graph optimization. The changes simplify the CUDA Graph capture logic and fix compatibility issues when using CUDA Graph in multi-step execution scenarios.
- Simplifies CUDA Graph capture by removing the separate Draft Model capture logic and using a unified approach
- Fixes CUDA Graph error 700 by replacing `paddle.clone` with a `copy_` operation (see the sketch after this list)
- Adds logic to prevent CUDA Graph capture on substeps after the first in multi-step execution
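For context, a minimal sketch of why the `copy_` change matters (not code from the PR; the buffer name and size here are illustrative): `paddle.clone` allocates a fresh tensor on every call, so a replayed CUDA Graph can end up referencing a stale device address, which is one way CUDA error 700 (an illegal memory access) surfaces. Copying in place into a buffer allocated once, outside capture, keeps the address stable across replays:

```python
import paddle

MAX_BATCH = 64  # illustrative; the real buffer matches the runner's batch dimension

# Allocate the destination once, outside any CUDA Graph capture, so its
# device address stays fixed for every graph replay.
last_seq_lens = paddle.zeros([MAX_BATCH], dtype="int32")

def snapshot_seq_lens(seq_lens_this_time: paddle.Tensor) -> None:
    # Graph-unsafe: paddle.clone() returns a newly allocated tensor on each
    # call, so a captured graph would replay against a stale allocation:
    #   last_seq_lens = paddle.clone(seq_lens_this_time)

    # Graph-safe: copy_ writes in place into the stable buffer; the second
    # argument (False) requests a non-blocking copy, as in the PR diff.
    last_seq_lens.copy_(seq_lens_this_time, False)
```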
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| fastdeploy/worker/gpu_model_runner.py | Removes complex Draft Model CUDA Graph capture logic, updates expected_decode_len calculation, and simplifies log messages |
| fastdeploy/spec_decode/mtp.py | Adds substep tracking to _initialize_forward_meta, implements conditional CUDA Graph capture for multi-step scenarios, and fixes tensor copy operation for CUDA Graph compatibility |
```diff
 )
 logger.info(
-    f"Warm up the Target model with the num_tokens:{capture_size}, expected_decode_len:{self.speculative_config.num_speculative_tokens}"
+    f"Warm up the model with the num_tokens:{capture_size}, expected_decode_len:{self.speculative_config.num_speculative_tokens}"
```
Copilot AI · Jan 6, 2026
The log message displays an incorrect `expected_decode_len` value. The actual value used is `self.speculative_config.num_speculative_tokens * 2 + 1` (line 2108), but the log message shows `self.speculative_config.num_speculative_tokens`. These should match for accurate logging.
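One way to apply the suggestion, as a hedged sketch (the warm-up callee `_dummy_run` and its parameters are assumptions, not shown in the excerpt): compute the value once and reuse it in both the call and the log, so the two cannot drift apart:

```python
# Compute once so the warm-up call and the log line always agree.
expected_decode_len = self.speculative_config.num_speculative_tokens * 2 + 1
logger.info(
    f"Warm up the model with the num_tokens:{capture_size}, "
    f"expected_decode_len:{expected_decode_len}"
)
self._dummy_run(num_tokens=capture_size, expected_decode_len=expected_decode_len)
```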
```diff
 if self.num_model_steps > 1:
-    self.last_seq_lens_this_time = paddle.clone(self.model_inputs["seq_lens_this_time"])
+    self.last_seq_lens_this_time.copy_(self.model_inputs["seq_lens_this_time"], False)
```
Copilot AI · Jan 6, 2026
The implementation inconsistency between the CUDA and XPU code paths could lead to maintenance issues. In the CUDA path (line 967), `paddle.clone` is replaced with `copy_` to avoid CUDA Graph error 700, but the XPU path at line 1091 still uses `paddle.clone`. For consistency, and to prevent potential issues if XPU also supports CUDA Graph in the future, consider using the same approach in both code paths.
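A sketch of what aligning the two paths could look like (the surrounding XPU code is not shown in the excerpt, so the structure here is assumed):

```python
if self.num_model_steps > 1:
    # Same pattern on both the CUDA and XPU paths: reuse the pre-allocated
    # buffer instead of paddle.clone, so a future XPU graph-capture mode
    # would not hit the same stale-address failure.
    self.last_seq_lens_this_time.copy_(self.model_inputs["seq_lens_this_time"], False)
```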
```diff
-    def _initialize_forward_meta(self, step_use_cudagraph: bool = False):
+    def _initialize_forward_meta(self, step_use_cudagraph: bool = False, is_dummy_run: bool = False, substep: int = 0):
         """
         Initialize forward meta and attention meta data
```
Copilot AI · Jan 6, 2026
The new parameters `is_dummy_run` and `substep` added to the function signature lack documentation in the docstring. The docstring should describe what these parameters represent, their expected types, and when they should be used, to maintain consistency with Python documentation best practices.
Suggested change:

```diff
-        Initialize forward meta and attention meta data
+        Initialize forward meta and attention meta data.
+
+        Args:
+            step_use_cudagraph (bool): Whether this step is allowed to use CUDA Graph
+                capture. The final decision also depends on runtime configuration such
+                as ``self.draft_model_use_cudagraph`` and the values of ``is_dummy_run``
+                and ``substep``.
+            is_dummy_run (bool): Indicates that the current call is part of a dummy
+                warm-up run used only for CUDA Graph capture and not a real decoding
+                step. When ``True`` together with ``substep > 0``, CUDA Graph capture
+                is disabled for this call.
+            substep (int): The index of the current substep in multi-step execution,
+                starting from 0. Only the first substep (0) can be captured by CUDA
+                Graph; later substeps reuse the captured graph and therefore should not
+                be captured again.
```
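For illustration, a minimal sketch of the gating this docstring describes (only the names quoted above come from the PR; the body is otherwise an assumption, not the actual implementation):

```python
def _initialize_forward_meta(self, step_use_cudagraph: bool = False,
                             is_dummy_run: bool = False, substep: int = 0):
    # Only the first substep of a multi-step MTP iteration may be captured;
    # during a dummy warm-up run, later substeps replay the graph captured
    # at substep 0, so capturing them again must be suppressed.
    if is_dummy_run and substep > 0:
        step_use_cudagraph = False

    # Fold in the runtime switch mentioned in the docstring.
    use_cudagraph = step_use_cudagraph and self.draft_model_use_cudagraph
    ...  # build forward/attention metadata using use_cudagraph (omitted)
```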
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           release/2.4    #5898   +/-   ##
==============================================
  Coverage             ?   57.88%
==============================================
  Files                ?      329
  Lines                ?    40768
  Branches             ?     6205
==============================================
  Hits                 ?    23597
  Misses               ?    15322
  Partials             ?     1849
```
Merged commit fb59f56 into PaddlePaddle:release/2.4.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least one PR type tag: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a PR to a `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.