
[RL] R3 Support Overlap Schedule#7674

Draft
gongshaotian wants to merge 7 commits into PaddlePaddle:release/2.6 from
gongshaotian:r3_overlap_schedule_2.6

Conversation

@gongshaotian
Collaborator

@gongshaotian gongshaotian commented Apr 29, 2026

Motivation

R3 (Rollout Routing Replay) previously performed routing data transfer via a synchronous D2H copy
(gpu_routing_buffer.cpu().numpy()) inside post_process, which blocked the CPU and was incompatible with Overlap
Schedule. This PR redesigns the R3 D2H pipeline to be fully async and overlap-friendly.

TODO:

  • Rename gpu_routing_buffer to device_routing_buffer
  • Reuse the state of radix tree nodes when swapping the KV cache with the routing cache.
  • Cherry-pick to Develop

Modifications

  • RoutedExpertsCapturer: removed block_table dependency and pending_update_positions; added routing_staging_buf,
    slot_mapping_staging_buf (GPU), cpu_routing_buf, cpu_slot_mapping_buf (CPU pinned), and _pending_save state
    • prepare_pending_save(): enqueues D2D (GPU → staging) and async D2H (staging → CPU pinned) on the current CUDA
      stream before post_process_event.record(); returns immediately without blocking the CPU
    • flush_pending_save(): pure CPU scatter from pinned buffers to SharedMemory; called after
      post_process_event.synchronize() in both execute_model and execute_model_overlap, guaranteeing all transfers are
      complete
    • post_process_normal / post_process_speculate: replaced compute_slot_mapping_flat + save_captured_routing with
      prepare_pending_save, using slot_mapping_buffer from share_inputs directly instead of recomputing from positions
    • gpu_model_runner.py: extended _compute_slot_mapping to also run when R3 is enabled (previously only triggered
      for MLA/DSA backends); removed the pre-forward get_token_positions call; added flush_pending_save() at the
      synchronize point for both normal and overlap execution paths

Usage or Command

None

Accuracy Tests

None

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented Apr 29, 2026

Thanks for your contribution!


@PaddlePaddle-bot

PaddlePaddle-bot commented Apr 29, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-07 20:13:51

CI report generated from the code below (refreshed every 30 minutes):


1 Task Overview

⚠️ 2 Required tasks failed and 1 Required task is still running; these need attention first.

Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped
36(0) | 36 | 29 | 5 | 1 | 1 | 0

2 Task Status Summary

2.1 Required tasks: 7/10 passed

Required tasks block merging; failures need attention first.

Status | Task | Duration | Root cause | Fix suggestion | Log | Rerun
❌ | Approval | 10s | PR issue: missing approvals; adding a custom op requires approval from both a FastDeploy and a PaddlePaddle RD | Request approval from the designated RDs (e.g. dangqingqing/jiangjiajun) | Job | -
❌ | Run Four Cards Tests / run_4_cards_tests | 12m33s | PR issue: test_r3_accuracy/lm_head_fp32 HTTP connection dropped; the service likely crashed | Check the R3 Overlap Schedule implementation and investigate the server-side crash | Job | -
⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | - | -
✅ | The remaining 7 required tasks passed | - | - | - | - | -

2.2 Optional tasks — 22/26 passed

Optional tasks do not block merging; failures are informational only.

Status | Task | Duration | Log | Rerun
❌ | Run iluvatar Tests / run_iluvatar_cases | 12m50s | Job | -
❌ | Check PR Template | 13s | Job | -
❌ | Trigger Jenkins for PR | 39s | Job | -
⏸️ | CI_HPU | - | - | -
✅ | The remaining 22 optional tasks passed | - | - | -

3 Failure Details (required tasks only)

Approval — code conventions / approval workflow (confidence: high)

Approval

  • Status: ❌ failed
  • Error type: code conventions
  • Confidence: high
  • Root-cause summary: the PR is missing the required RD approvals; adding a custom op requires approval from both a FastDeploy and a PaddlePaddle RD
  • Analyzer: generic analysis (fallback)

Root-cause details:
scripts/check_approval.sh found 3 unmet approval rules: (1) adding a custom op requires approval from at least one FastDeploy RD (dangqingqing/jiangjiajun/dengkaipeng); (2) adding a custom op also requires approval from at least one PaddlePaddle RD (gaoxiang/mayongqiang); (3) a PR targeting the release/2.6 branch must follow the Cherry-Pick workflow, i.e. its title must contain [Cherry-Pick] and the original develop PR number (e.g. #5010).

Key log:

==> PR title: [RL] R3 Support Overlap Schedule
0. You must have one FastDeploy RD (...) approval for adding custom op.
1. You must have one PaddlePaddle RD (...) approval for adding custom op.
2. Cherry-Pick PR must come from develop and the title must contain [Cherry-Pick] ...
There are 3 approved errors.
##[error]Process completed with exit code 6.

Fix suggestions:

  1. Have at least one of qingqing01 (dangqingqing), Jiang-Jia-Jun (jiangjiajun), or heavengate (dengkaipeng) approve this PR (custom op requirement)
  2. Have at least one of jeff41404 (gaoxiang) or yongqiangma (mayongqiang) approve this PR (PaddlePaddle RD requirement)
  3. If this must land in release/2.6, confirm whether the PR should be converted to a Cherry-Pick, adding [Cherry-Pick] and the original develop PR number to the title

Fix summary: request approval from the designated FastDeploy and PaddlePaddle RDs, and confirm the Cherry-Pick workflow

Link: view logs

Run Four Cards Tests / run_4_cards_tests — test-case failure (confidence: medium)

Run Four Cards Tests / run_4_cards_tests

  • Status: ❌ failed
  • Error type: test-case failure
  • Confidence: medium
  • Root-cause summary: test_r3_accuracy/lm_head_fp32 HTTP connection dropped; the R3 Overlap Schedule change likely crashed the service
  • Analyzer: generic analysis (fallback)

Failed cases:

Test | Error | Root cause
test_GLM_45_AIR_mtp_tp4.py::test_lm_head_fp32 | http.client.RemoteDisconnected | the server forcibly closed the connection
test_GLM_45_AIR_mtp_tp4.py::test_r3_accuracy | httpx ConnectError | the server could not accept a connection

Root-cause details:
Both tests fail at the HTTP request stage with RemoteDisconnected or httpcore connect_tcp errors, indicating that the inference service had already crashed or was not listening when the test requests were made. The PR introduces the R3 Overlap Schedule feature, and test_r3_accuracy exercises that feature directly, so the new scheduling logic most likely caused the server to exit abnormally. test_lm_head_fp32 is also affected, presumably because the crashed service could no longer answer any request.

Key log:

FAILED tests/e2e/4cards_cases/test_GLM_45_AIR_mtp_tp4.py::test_lm_head_fp32
  http.client.RemoteDisconnected: Remote end closed connection without response
FAILED tests/e2e/4cards_cases/test_GLM_45_AIR_mtp_tp4.py::test_r3_accuracy
  httpx._transports.default - ConnectError
========================= 2 failed in 80.38s (0:01:20) =========================

Fix suggestions:

  1. Review the R3 Overlap Schedule code changes for anything that could crash the inference service (see the service startup logic for test_GLM_45_AIR_mtp_tp4.py and test_r3_accuracy in tests/e2e/4cards_cases/)
  2. Reproduce the 4-card run of test_GLM_45_AIR_mtp_tp4.py locally and check the server logs for CUDA/NCCL crash messages

Fix summary: check the R3 Overlap Schedule implementation and investigate the GLM inference service crash

Related change: the PR title "[RL] R3 Support Overlap Schedule" is directly related to test_r3_accuracy

Link: view logs

@codecov-commenter

codecov-commenter commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 16.27907% with 36 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@66dea60). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...model_executor/layers/moe/routing_indices_cache.py 10.34% 26 Missing ⚠️
fastdeploy/model_executor/pre_and_post_process.py 0.00% 6 Missing ⚠️
fastdeploy/worker/gpu_model_runner.py 50.00% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7674   +/-   ##
==============================================
  Coverage               ?   72.48%           
==============================================
  Files                  ?      378           
  Lines                  ?    53876           
  Branches               ?     8422           
==============================================
  Hits                   ?    39052           
  Misses                 ?    12048           
  Partials               ?     2776           
Flag Coverage Δ
GPU 72.48% <16.27%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.



@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-07 19:27:26

📋 Review Summary

PR overview: switches the R3 (Rollout Routing Replay) D2H data transfer from a synchronous blocking copy to an asynchronous pipeline, compatible with Overlap Schedule
Change scope: fastdeploy/model_executor/layers/moe/, fastdeploy/worker/gpu_model_runner.py, fastdeploy/model_executor/pre_and_post_process.py, custom_ops/gpu_ops/
Impact tags: [RL] [OP] [Executor]

📝 PR Convention Check

The PR targets release/2.6, so by convention its title should use the Cherry-Pick format; meanwhile "Cherry-pick to Develop" is still listed as a TODO in the PR body, violating the "merge into develop first, then cherry-pick to release" workflow.

Suggested title (copy-paste ready):

  • [Cherry-Pick][RL] R3 Support Overlap Schedule (#<original PR number>)

Suggested PR description (copy-paste ready; must reproduce the full structure of the checklist §D2 template):

## Motivation
R3 (Rollout Routing Replay) previously performed routing data transfer via a synchronous D2H copy (`gpu_routing_buffer.cpu().numpy()`) inside post_process, which blocked the CPU and was incompatible with Overlap Schedule. This PR redesigns the R3 D2H pipeline to be fully async and overlap-friendly.

## Modifications
- **RoutedExpertsCapturer** (`routing_indices_cache.py`): removed `block_table` dependency and `pending_update_positions`; added `routing_staging_buf`, `slot_mapping_staging_buf` (GPU staging), `cpu_routing_buf`, `cpu_slot_mapping_buf` (CPU pinned), and `_pending_save` state
  - `prepare_pending_save()`: enqueues D2D (GPU → staging) and async D2H (staging → CPU pinned) before `post_process_event.record()`; returns immediately without blocking CPU
  - `flush_pending_save()`: pure CPU scatter from pinned buffers to SharedMemory; called after `post_process_event.synchronize()` in both normal and overlap paths
  - `post_process_normal` / `post_process_speculate`: replaced `compute_slot_mapping_flat + save_captured_routing` with `prepare_pending_save`, using `slot_mapping_buffer` from `share_inputs` directly
- **gpu_model_runner.py**: extended `_compute_position_ids_and_slot_mapping` to run when R3 is enabled; removed pre-forward `get_token_positions` call; added `flush_pending_save()` at synchronize points for both `execute_model_normal` and `execute_model_overlap`
- **Custom Op**: renamed `get_position_ids_and_mask_encoder_batch``get_position_ids`; updated `setup_ops.py` (3 locations) and all call sites

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

Level | File | Summary
📝 PR convention | — | Target branch is release/2.6 but the title lacks the [Cherry-Pick] prefix; the body TODO "Cherry-pick to Develop" is unfinished
🟡 Suggestion | custom_ops/gpu_ops/get_position_ids.cu:19 | seq_lens_encoder is passed into the kernel but never used
🟡 Suggestion | tests/operators/test_get_position_ids_and_mask_encoder_batch.py:20 | the test file name was not updated with the op rename

Overall Assessment

The asynchronous D2H pipeline design is sound: the prepare_pending_save + flush_pending_save split cleanly separates the GPU transfers from the CPU consumption on either side of the CUDA event synchronization point. Please bring the PR up to convention (Cherry-Pick format) and remove the dead parameter from the new CUDA kernel.

#include "paddle/extension.h"

__global__ void GetPositionIdsKernel(const int* __restrict__ seq_lens_encoder,
const int* __restrict__ seq_lens_decoder,

🟡 Suggestion: seq_lens_encoder is passed into GetPositionIdsKernel but never used in the kernel body (all offsets in the kernel are computed from seq_lens_this_time and seq_lens_decoder).

Consider removing the parameter, or adding a comment explaining why it is kept (e.g. ABI/API compatibility). If it must stay, add (void)seq_lens_encoder; to avoid compiler warnings.

// Current: the parameter is declared but never read
__global__ void GetPositionIdsKernel(const int* __restrict__ seq_lens_encoder,  // unused
                                     const int* __restrict__ seq_lens_decoder,
                                     ...

import paddle

from fastdeploy.model_executor.ops.gpu import get_position_ids_and_mask_encoder_batch
from fastdeploy.model_executor.ops.gpu import get_position_ids

🟡 Suggestion: the file name test_get_position_ids_and_mask_encoder_batch.py was not updated when the op was renamed and no longer matches the new op name get_position_ids. Consider renaming the file to test_get_position_ids.py so the test file stays consistent with the op under test.
