
[RL] R3 Support Overlap Schedule#7674

Draft
gongshaotian wants to merge 7 commits into PaddlePaddle:release/2.6 from
gongshaotian:r3_overlap_schedule_2.6

Conversation

@gongshaotian
Collaborator

@gongshaotian gongshaotian commented Apr 29, 2026

Motivation

R3 (Rollout Routing Replay) previously performed routing data transfer via a synchronous D2H copy
(gpu_routing_buffer.cpu().numpy()) inside post_process, which blocked the CPU and was incompatible with Overlap
Schedule. This PR redesigns the R3 D2H pipeline to be fully async and overlap-friendly.

TODO:

  • Rename gpu_routing_buffer to device_routing_buffer
  • Reuse the state of radix tree nodes when swapping the KV cache with the routing cache.
  • Cherry-pick to Develop

Modifications

  • RoutedExpertsCapturer: removed block_table dependency and pending_update_positions; added routing_staging_buf,
    slot_mapping_staging_buf (GPU), cpu_routing_buf, cpu_slot_mapping_buf (CPU pinned), and _pending_save state
    • prepare_pending_save(): enqueues D2D (GPU → staging) and async D2H (staging → CPU pinned) on the current CUDA
      stream before post_process_event.record(); returns immediately without blocking the CPU
    • flush_pending_save(): pure CPU scatter from pinned buffers to SharedMemory; called after
      post_process_event.synchronize() in both execute_model and execute_model_overlap, guaranteeing all transfers are
      complete
    • post_process_normal / post_process_speculate: replaced compute_slot_mapping_flat + save_captured_routing with
      prepare_pending_save, using slot_mapping_buffer from share_inputs directly instead of recomputing from positions
    • gpu_model_runner.py: extended _compute_slot_mapping to also run when R3 is enabled (previously only triggered
      for MLA/DSA backends); removed the pre-forward get_token_positions call; added flush_pending_save() at the
      synchronize point for both normal and overlap execution paths

Usage or Command

None

Accuracy Tests

None

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented Apr 29, 2026

Thanks for your contribution!


@PaddlePaddle-bot

PaddlePaddle-bot commented Apr 29, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-07 20:13:51

CI report generated from the code below (refreshed every 30 minutes):


1 Task Overview

⚠️ 2 Required tasks failed and 1 Required task is still running; these need attention first.

Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped
36(0) | 36 | 29 | 5 | 1 | 1 | 0

2 Task Status Summary

2.1 Required tasks: 7/10 passed

Required tasks block merging; failures need attention first.

Status | Task | Duration | Root cause | Fix suggestion | Log | Rerun
❌ | Approval | 10s | PR issue: missing approvals; adding a custom op requires approval from both a FastDeploy and a PaddlePaddle RD | Request approval from the designated RDs (e.g. dangqingqing/jiangjiajun) | Job | -
❌ | Run Four Cards Tests / run_4_cards_tests | 12m33s | PR issue: test_r3_accuracy/lm_head_fp32 HTTP connection dropped; the service likely crashed | Check the R3 Overlap Schedule implementation and investigate the server-side crash | Job | -
⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | - | -
✅ | The remaining 7 required tasks passed | - | - | - | - | -

2.2 Optional tasks — 22/26 passed

Optional tasks do not block merging; failures are informational only.

Status | Task | Duration | Log | Rerun
❌ | Run iluvatar Tests / run_iluvatar_cases | 12m50s | Job | -
❌ | Check PR Template | 13s | Job | -
❌ | Trigger Jenkins for PR | 39s | Job | -
⏸️ | CI_HPU | - | - | -
✅ | The remaining 22 optional tasks passed | - | - | -

3 Failure Details (required tasks only)

Approval — code conventions / approval workflow (confidence: high)

Approval

  • Status: ❌ failed
  • Error type: code conventions
  • Confidence: high
  • Root-cause summary: the PR is missing the required RD approvals; adding a custom op requires approval from both a FastDeploy and a PaddlePaddle RD
  • Analyzer: generic analysis (fallback)

Root-cause details:
scripts/check_approval.sh found 3 unmet approval rules: (1) adding a custom op requires approval from at least one FastDeploy RD (dangqingqing/jiangjiajun/dengkaipeng); (2) adding a custom op also requires approval from at least one PaddlePaddle RD (gaoxiang/mayongqiang); (3) a PR targeting the release/2.6 branch must follow the Cherry-Pick workflow, i.e. its title must contain [Cherry-Pick] and the original develop PR number (e.g. #5010).

Key log:

==> PR title: [RL] R3 Support Overlap Schedule
0. You must have one FastDeploy RD (...) approval for adding custom op.
1. You must have one PaddlePaddle RD (...) approval for adding custom op.
2. Cherry-Pick PR must come from develop and the title must contain [Cherry-Pick] ...
There are 3 approved errors.
##[error]Process completed with exit code 6.

Fix suggestions:

  1. Have at least one of qingqing01 (dangqingqing), Jiang-Jia-Jun (jiangjiajun), or heavengate (dengkaipeng) approve this PR (custom op requirement)
  2. Have at least one of jeff41404 (gaoxiang) or yongqiangma (mayongqiang) approve this PR (PaddlePaddle RD requirement)
  3. If this must land in release/2.6, confirm whether the PR should be converted to a Cherry-Pick, adding [Cherry-Pick] and the original develop PR number to the title

Fix summary: request approval from the designated FastDeploy and PaddlePaddle RDs, and confirm the Cherry-Pick workflow

Link: view logs

Run Four Cards Tests / run_4_cards_tests — test-case failure (confidence: medium)

Run Four Cards Tests / run_4_cards_tests

  • Status: ❌ failed
  • Error type: test-case failure
  • Confidence: medium
  • Root-cause summary: test_r3_accuracy/lm_head_fp32 HTTP connection dropped; the R3 Overlap Schedule change likely crashed the service
  • Analyzer: generic analysis (fallback)

Failed cases:

Test | Error | Root cause
test_GLM_45_AIR_mtp_tp4.py::test_lm_head_fp32 | http.client.RemoteDisconnected | the server forcibly closed the connection
test_GLM_45_AIR_mtp_tp4.py::test_r3_accuracy | httpx ConnectError | the server could not accept a connection

Root-cause details:
Both tests fail at the HTTP request stage with RemoteDisconnected or httpcore connect_tcp errors, indicating that the inference service had already crashed or was not listening when the test requests were made. The PR introduces the R3 Overlap Schedule feature, and test_r3_accuracy exercises that feature directly, so the new scheduling logic most likely caused the server to exit abnormally. test_lm_head_fp32 is also affected, presumably because the crashed service could no longer answer any request.

Key log:

FAILED tests/e2e/4cards_cases/test_GLM_45_AIR_mtp_tp4.py::test_lm_head_fp32
  http.client.RemoteDisconnected: Remote end closed connection without response
FAILED tests/e2e/4cards_cases/test_GLM_45_AIR_mtp_tp4.py::test_r3_accuracy
  httpx._transports.default - ConnectError
========================= 2 failed in 80.38s (0:01:20) =========================

Fix suggestions:

  1. Review the R3 Overlap Schedule code changes for anything that could crash the inference service (see the service startup logic for test_GLM_45_AIR_mtp_tp4.py and test_r3_accuracy in tests/e2e/4cards_cases/)
  2. Reproduce the 4-card run of test_GLM_45_AIR_mtp_tp4.py locally and check the server logs for CUDA/NCCL crash messages

Fix summary: check the R3 Overlap Schedule implementation and investigate the GLM inference service crash

Related change: the PR title "[RL] R3 Support Overlap Schedule" is directly related to test_r3_accuracy

Link: view logs

@codecov-commenter

codecov-commenter commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 16.27907% with 36 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@66dea60). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...model_executor/layers/moe/routing_indices_cache.py 10.34% 26 Missing ⚠️
fastdeploy/model_executor/pre_and_post_process.py 0.00% 6 Missing ⚠️
fastdeploy/worker/gpu_model_runner.py 50.00% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7674   +/-   ##
==============================================
  Coverage               ?   72.48%           
==============================================
  Files                  ?      378           
  Lines                  ?    53876           
  Branches               ?     8422           
==============================================
  Hits                   ?    39052           
  Misses                 ?    12048           
  Partials               ?     2776           
Flag Coverage Δ
GPU 72.48% <16.27%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.



@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-07 19:27:26

📋 Review Summary

PR overview: switches the R3 (Rollout Routing Replay) D2H data transfer from a synchronous blocking copy to an asynchronous pipeline, compatible with Overlap Schedule
Change scope: fastdeploy/model_executor/layers/moe/, fastdeploy/worker/gpu_model_runner.py, fastdeploy/model_executor/pre_and_post_process.py, custom_ops/gpu_ops/
Impact tags: [RL] [OP] [Executor]

📝 PR Convention Check

The PR targets release/2.6, so by convention its title should use the Cherry-Pick format; meanwhile "Cherry-pick to Develop" is still listed as a TODO in the PR body, violating the "merge into develop first, then cherry-pick to release" workflow.

Suggested title (copy-paste ready):

  • [Cherry-Pick][RL] R3 Support Overlap Schedule (#<original PR number>)

Suggested PR description (copy-paste ready; must reproduce the full structure of the checklist §D2 template):

## Motivation
R3 (Rollout Routing Replay) previously performed routing data transfer via a synchronous D2H copy (`gpu_routing_buffer.cpu().numpy()`) inside post_process, which blocked the CPU and was incompatible with Overlap Schedule. This PR redesigns the R3 D2H pipeline to be fully async and overlap-friendly.

## Modifications
- **RoutedExpertsCapturer** (`routing_indices_cache.py`): removed `block_table` dependency and `pending_update_positions`; added `routing_staging_buf`, `slot_mapping_staging_buf` (GPU staging), `cpu_routing_buf`, `cpu_slot_mapping_buf` (CPU pinned), and `_pending_save` state
  - `prepare_pending_save()`: enqueues D2D (GPU → staging) and async D2H (staging → CPU pinned) before `post_process_event.record()`; returns immediately without blocking CPU
  - `flush_pending_save()`: pure CPU scatter from pinned buffers to SharedMemory; called after `post_process_event.synchronize()` in both normal and overlap paths
  - `post_process_normal` / `post_process_speculate`: replaced `compute_slot_mapping_flat + save_captured_routing` with `prepare_pending_save`, using `slot_mapping_buffer` from `share_inputs` directly
- **gpu_model_runner.py**: extended `_compute_position_ids_and_slot_mapping` to run when R3 is enabled; removed pre-forward `get_token_positions` call; added `flush_pending_save()` at synchronize points for both `execute_model_normal` and `execute_model_overlap`
- **Custom Op**: renamed `get_position_ids_and_mask_encoder_batch``get_position_ids`; updated `setup_ops.py` (3 locations) and all call sites

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

Level | File | Summary
📝 PR convention | — | Target branch is release/2.6 but the title lacks the [Cherry-Pick] prefix; the body TODO "Cherry-pick to Develop" is unfinished
🟡 Suggestion | custom_ops/gpu_ops/get_position_ids.cu:19 | seq_lens_encoder is passed into the kernel but never used
🟡 Suggestion | tests/operators/test_get_position_ids_and_mask_encoder_batch.py:20 | the test file name was not updated with the op rename

Overall Assessment

The asynchronous D2H pipeline design is sound: the prepare_pending_save + flush_pending_save split cleanly separates the GPU transfers from the CPU consumption on either side of the CUDA event synchronization point. Please bring the PR up to convention (Cherry-Pick format) and remove the dead parameter from the new CUDA kernel.

#include "paddle/extension.h"

__global__ void GetPositionIdsKernel(const int* __restrict__ seq_lens_encoder,
const int* __restrict__ seq_lens_decoder,

🟡 Suggestion: seq_lens_encoder is passed into GetPositionIdsKernel but never used in the kernel body (all offsets in the kernel are computed from seq_lens_this_time and seq_lens_decoder).

Consider removing the parameter, or adding a comment explaining why it is kept (e.g. ABI/API compatibility). If it must stay, add (void)seq_lens_encoder; to avoid compiler warnings.

// Current: the parameter is declared but never read
__global__ void GetPositionIdsKernel(const int* __restrict__ seq_lens_encoder,  // unused
                                     const int* __restrict__ seq_lens_decoder,
                                     ...

import paddle

from fastdeploy.model_executor.ops.gpu import get_position_ids_and_mask_encoder_batch
from fastdeploy.model_executor.ops.gpu import get_position_ids

🟡 Suggestion: the file name test_get_position_ids_and_mask_encoder_batch.py was not updated when the op was renamed and no longer matches the new op name get_position_ids. Consider renaming the file to test_get_position_ids.py so the test file stays consistent with the op under test.
