[Cherry-Pick][BugFix] Fix timeout and hang issues in the pause interface during PD separation within the refactored abort_requests and pause APIs(#7837) #7838

Open
qwes5s5 wants to merge 1 commit into PaddlePaddle:release/2.6 from qwes5s5:test_abort_2.6

Conversation

@qwes5s5
Collaborator

@qwes5s5 qwes5s5 commented May 16, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 16, 2026

Thanks for your contribution!

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-16 21:40:40

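📋 Review Summary

PR overview: fixes the timeout/hang in the pause interface caused by ghost requests under the PD disaggregation architecture
Scope of changes: fastdeploy/engine/common_engine.py, fastdeploy/splitwise/splitwise_connector.py
Impact tags: [Engine] [PD Disaggregation] [BugFix]

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | fastdeploy/engine/common_engine.py:1923 | Directly manipulates resource_manager's internal sets, bypassing the lock |
| 📝 PR convention | - | Every section of the PR description is empty; no substantive content has been filled in |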

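📝 PR Convention Check

In the PR body, Motivation, Modifications, Usage or Command, and Accuracy Tests are all empty, and no Checklist item is checked, which does not meet the §D2 template requirements. The title follows the Cherry-Pick convention and needs no change.

Suggested PR description (can be copied directly):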

## Motivation

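In PD disaggregation (Splitwise) deployments, two kinds of hang occur when a Prefill instance triggers pause:
1. Prefill side: `_wait_inflight_drained` waits forever because scheduler-only ghost requests (present in `scheduler.requests` but not in `resource_manager.requests`) can never be cleaned up;
2. Decode side: requests rejected by the Prefill pause gate have already been registered with the Decode scheduler but will never receive a first token, which blocks the Decode pause/abort drain.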

## Modifications

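- `fastdeploy/engine/common_engine.py`
  - `_insert_zmq_task_to_scheduler`: when the Prefill instance is paused, proactively notifies the Decode side via `split_connector.send_drop_signal` so that it reclaims the corresponding scheduler entry
  - `_wait_inflight_drained`: late-arrived requests are now added to the abort set automatically; after a 30s timeout, scheduler-only ghost requests are forcibly cleaned up
  - `_fetch_requests` (Decode side): handles the new `decode_drop` message type by synthesizing a `RequestOutput` with `finished=True` that goes through the normal recycle path
  - `_process_prefilled_requests`: a prefilled ghost output whose request is already in the abort set triggers early recycling, breaking the deadlock
- `fastdeploy/splitwise/splitwise_connector.py`: adds the `send_drop_signal` method and the `_handle_drop` handler, implementing the P→D drop notification protocol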

## Usage or Command

N/A

## Accuracy Tests

N/A (concurrency/hang fix; does not affect model accuracy)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

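Overall Assessment

The fix logic is clear and the P→D drop notification protocol is reasonably designed. The three fallback paths (drop signal, ghost reap in _wait_inflight_drained, prefilled ghost reap) form layered protection and effectively resolve the pause hang under PD disaggregation. One code-encapsulation issue is worth improving, and the PR description needs to be completed.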
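To make the second fallback path concrete, here is a minimal, self-contained sketch of a drain wait that reaps scheduler-only ghosts after a grace period. It is a toy model, not FastDeploy code: `ToyEngine`, its two dictionaries, and the `grace_s`, `poll_s`, and `deadline_s` parameters are hypothetical names chosen for illustration, and the real `_wait_inflight_drained` may be structured differently (the suggested description above mentions a 30s timeout; the sketch uses milliseconds so it runs quickly).

```python
import time


class ToyEngine:
    """Toy engine with a scheduler view and a resource-manager view (hypothetical)."""

    def __init__(self, scheduler_reqs, resource_reqs):
        # Stand-ins for scheduler.requests and resource_manager.requests.
        self.scheduler_requests = dict(scheduler_reqs)
        self.resource_requests = dict(resource_reqs)

    def ghosts(self):
        # Scheduler-only entries: no resource-side work exists that could finish them.
        return set(self.scheduler_requests) - set(self.resource_requests)

    def wait_inflight_drained(self, grace_s=0.05, poll_s=0.01, deadline_s=1.0):
        """Block until both views are empty; return the ids reaped as ghosts."""
        start = time.monotonic()
        reaped = set()
        while self.scheduler_requests or self.resource_requests:
            elapsed = time.monotonic() - start
            if elapsed >= deadline_s:
                raise TimeoutError("in-flight requests did not drain")
            if elapsed >= grace_s:
                # Grace period expired: force-remove ghosts so the wait can end.
                for req_id in self.ghosts():
                    del self.scheduler_requests[req_id]
                    reaped.add(req_id)
            time.sleep(poll_s)
        return reaped


engine = ToyEngine({"ghost-1": None}, {})
reaped_ids = engine.wait_inflight_drained()
```

Without the reap step, the loop above would spin until the deadline for `ghost-1`, which is the hang this PR targets. Requests present in both views are never reaped by this sketch, so genuine in-flight work still has to drain on its own.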

    self.resource_manager.pre_recycle_resource(req_id)
except Exception as e:
    self.llm_logger.warning(f"pre_recycle_resource({req_id}) failed: {e}")
self.resource_manager.waiting_abort_req_id_set.discard(req_id)

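🟡 Suggestion: _process_prefilled_requests directly manipulates resource_manager's internal sets, bypassing encapsulation and the lock

In resource_manager_v1.py, every modification to waiting_abort_req_id_set / to_be_aborted_req_id_set is made under self.lock (see methods such as recycle_abort_task and add_abort_req_ids). Here .discard() is called directly without holding the lock, which breaks encapsulation consistency and leaves a potential race under high concurrency.

Consider adding a recycle_ghost_resource method to resource_manager_v1.py that merges pre_recycle_resource and the discard on both sets into a single operation under the lock: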

def recycle_ghost_resource(self, request_id: str):
    """Recycle a ghost request that was never registered in scheduler."""
    with self.lock:
        if request_id in self.requests:
            req = self.requests[request_id]
            self.tasks_list[req.idx] = None
            self.stop_flags[req.idx] = True
            self._free_blocks(req)
            del self.requests[request_id]
            if request_id in self.req_dict:
                del self.req_dict[request_id]
        self.waiting_abort_req_id_set.discard(request_id)
        self.to_be_aborted_req_id_set.discard(request_id)
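For illustration, the call site could then shrink to a single method call. The sketch below is self-contained and uses a toy stand-in: `ToyResourceManager` and `reap_prefilled_ghost` are hypothetical names, the toy omits block freeing and slot bookkeeping, and the `add_abort_req_ids` signature shown is an assumption rather than the real one in resource_manager_v1.py.

```python
import threading


class ToyResourceManager:
    """Toy stand-in (hypothetical); only models the lock and the abort sets."""

    def __init__(self):
        self.lock = threading.Lock()
        self.requests = {}
        self.waiting_abort_req_id_set = set()
        self.to_be_aborted_req_id_set = set()

    def add_abort_req_ids(self, req_ids):
        # Assumed signature: mutations of the abort sets happen under the lock.
        with self.lock:
            self.waiting_abort_req_id_set.update(req_ids)

    def recycle_ghost_resource(self, request_id):
        # One critical section covers the request table and both sets.
        with self.lock:
            self.requests.pop(request_id, None)
            self.waiting_abort_req_id_set.discard(request_id)
            self.to_be_aborted_req_id_set.discard(request_id)


def reap_prefilled_ghost(resource_manager, req_id, logger=None):
    """Call-site sketch: no direct access to the manager's internal sets."""
    try:
        resource_manager.recycle_ghost_resource(req_id)
    except Exception as e:  # mirrors the broad catch in the reviewed hunk
        if logger is not None:
            logger.warning(f"recycle_ghost_resource({req_id}) failed: {e}")


rm = ToyResourceManager()
rm.requests["req-1"] = object()
rm.add_abort_req_ids(["req-1"])
reap_prefilled_ghost(rm, "req-1")
reap_prefilled_ghost(rm, "req-1")  # idempotent: a second call is a no-op
```

Because `pop` and `discard` tolerate missing keys, the helper is safe to call more than once for the same id, which matters when several fallback paths may try to reap the same ghost.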

@PaddlePaddle-bot

PaddlePaddle-bot commented May 16, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-18 11:56:03

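The CI report is generated from the following code (updated every 30 minutes):

1 Task Overview

1 Required task failed and must be handled before the PR can be merged.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 36(0) | 36 | 31 | 5 | 0 | 0 | 0 |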

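2 Task Status Summary

2.1 Required tasks: 9/10 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h12m | PR issue: 135 lines added, diff coverage is only 16%, below the 80% threshold | Add unit tests for send_drop_signal and the ghost reaping logic | Job | - |
| ✅ | The other 9 required tasks passed | - | - | - | - | - |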

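2.2 Optional tasks: 22/26 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 19m41s | Job | - |
| ❌ | Check PR Template | 13s | Job | - |
| ❌ | CI_HPU | 1h5m | Job | - |
| ❌ | Trigger Jenkins for PR | 1m14s | Job | - |
| ✅ | The other 22 optional tasks passed | - | - | - |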

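3 Failure Details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage below threshold (confidence: high)

  • Status: ❌ Failed
  • Error type: coverage below threshold
  • Confidence: high
  • Root cause summary: the PR adds 135 lines of code; diff coverage is only 16%, below the 80% threshold
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed cases: none (all unit tests passed; the failure comes from the coverage check)

Root cause details:
This PR adds 135 lines of code to common_engine.py and splitwise_connector.py, mainly covering send_drop_signal, _handle_drop, and the _wait_inflight_drained improvements for PD disaggregation. The PR adds no corresponding unit tests, so diff coverage is only 16% (only 10 of the 61 changed lines are covered by tests), far below the mandatory 80% threshold, and CI fails with exit code 9.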

Key log:

COVERAGE_EXIT_CODE: 9
GPU Patch Coverage Details:
  total_percent_covered: 16%  (total_num_lines: 61, violations: 51)
  fastdeploy/engine/common_engine.py:          17.07%  (34 lines uncovered: L1140-1145, L1301-1336, L1810-1825, L1896-1926)
  fastdeploy/splitwise/splitwise_connector.py: 15.0%   (17 lines uncovered: L246-258, L407, L444-449)
##[error]Process completed with exit code 9.
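The percentages in this log can be checked with a few lines of arithmetic. In the sketch below, the overall totals (61 lines, 51 violations) come straight from the log; the per-file line totals (41 and 20) are not printed in the log and are an inference from the reported percentages and uncovered counts, so treat them as an assumption. `percent_covered` is a hypothetical helper name.

```python
import math


def percent_covered(total_lines, uncovered_lines):
    """Share of measured diff lines that were executed by tests, in percent."""
    return 100.0 * (total_lines - uncovered_lines) / total_lines


# Overall figures taken from the log: 61 measured lines, 51 violations.
overall = percent_covered(61, 51)

# Per-file totals are inferred (assumption): 34 + 17 = 51 and 41 + 20 = 61.
engine = percent_covered(41, 34)
connector = percent_covered(20, 17)

# Covered lines needed to reach the 80% threshold, and the current gap.
needed = math.ceil(80 * 61 / 100)
gap = needed - (61 - 51)
```

Under these numbers, roughly 39 more of the 61 measured lines would have to be exercised by tests (or exempted) before the 80% check could pass.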

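Suggested fixes:

  1. Add unit tests for send_drop_signal (L246-258) and _handle_drop (L444-449) in fastdeploy/splitwise/splitwise_connector.py
  2. Add unit tests for the decode_drop handling logic (L1810-1825) and ghost reaping (L1896-1926) in fastdeploy/engine/common_engine.py
  3. If the scenario is hard to reproduce in a unit test, explain why in the PR description and request a coverage exemption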
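As a starting point for the first suggestion, the sketch below shows the general shape such tests could take with `unittest.mock`. It does not import FastDeploy: `FakeConnector` is a hypothetical stand-in, and its constructor arguments, method signatures, and message layout are assumptions, since the real signatures of `send_drop_signal` and `_handle_drop` are not shown in this conversation. Real tests would target the actual class in splitwise_connector.py and mock its transport.

```python
import unittest
from unittest import mock


class FakeConnector:
    """Hypothetical stand-in; NOT the real connector class from FastDeploy."""

    def __init__(self, send_fn, engine_queue):
        self._send = send_fn          # transport callable (assumed)
        self._queue = engine_queue    # list the decode engine would poll (assumed)

    def send_drop_signal(self, addr, req_id):
        # Prefill side: tell the decode instance to drop a request.
        self._send(addr, {"type": "decode_drop", "request_id": req_id})

    def _handle_drop(self, payload):
        # Decode side: hand the drop notice over to the engine loop.
        self._queue.append(("decode_drop", payload["request_id"]))


class DropSignalTest(unittest.TestCase):
    def test_send_drop_signal_emits_decode_drop(self):
        send = mock.Mock()
        FakeConnector(send, []).send_drop_signal("decode-0", "req-1")
        send.assert_called_once_with(
            "decode-0", {"type": "decode_drop", "request_id": "req-1"}
        )

    def test_handle_drop_enqueues_for_engine(self):
        queue = []
        FakeConnector(mock.Mock(), queue)._handle_drop({"request_id": "req-1"})
        self.assertEqual(queue, [("decode_drop", "req-1")])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DropSignalTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Mocking the transport keeps the tests free of sockets and multi-process setup, which is usually what makes PD disaggregation paths hard to cover.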

Fix summary: add unit tests for send_drop_signal and the ghost reaping logic, or request an exemption

Related changes: fastdeploy/engine/common_engine.py (+116 lines), fastdeploy/splitwise/splitwise_connector.py (+19 lines)

Link: View log

@codecov-commenter

Codecov Report

❌ Patch coverage is 8.19672% with 56 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@d71bdda).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/engine/common_engine.py | 7.31% | 34 Missing and 4 partials ⚠️ |
| fastdeploy/splitwise/splitwise_connector.py | 10.00% | 17 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7838   +/-   ##
==============================================
  Coverage               ?   72.95%           
==============================================
  Files                  ?      381           
  Lines                  ?    54204           
  Branches               ?     8470           
==============================================
  Hits                   ?    39546           
  Misses                 ?    11878           
  Partials               ?     2780           
Flag Coverage Δ
GPU 72.95% <8.19%> (?)
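The 8.19672% patch figure here and the 16% diff coverage in the CI log can be reconciled with simple arithmetic. The sketch below assumes both tools measured the same 61 patch lines (that total appears only in the CI log, not in this report) and that partially covered lines count as not hit in the patch percentage while the CI check counts only fully missed lines as violations. Those assumptions are one reading that matches the published numbers, not a statement about either tool's internals.

```python
# Missing and partial counts as listed in the Codecov table.
files = {
    "fastdeploy/engine/common_engine.py": {"missing": 34, "partials": 4},
    "fastdeploy/splitwise/splitwise_connector.py": {"missing": 17, "partials": 1},
}

patch_lines = 61  # assumption: same total as total_num_lines in the CI log

missing = sum(f["missing"] for f in files.values())
partials = sum(f["partials"] for f in files.values())

# Patch percentage when partials are treated as not hit.
not_hit = missing + partials
codecov_pct = 100.0 * (patch_lines - not_hit) / patch_lines

# Percentage when only fully missed lines count against coverage.
ci_pct = 100.0 * (patch_lines - missing) / patch_lines
```

Under this reading the two reports agree on the underlying data: 51 lines are never executed, 5 are only partially executed, and 5 are fully covered.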

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-17 03:25:24

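The CI report is generated from the following code (updated every 30 minutes):

1 Task Overview

1 Required task failed (the coverage check did not pass); it blocks merging and must be handled first.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 35(0) | 35 | 30 | 5 | 0 | 0 | 0 |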

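2 Task Status Summary

2.1 Required tasks: 8/9 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | run_tests_with_coverage | 1h12m | PR issue: new code is not covered; diff coverage is below the 80% threshold | Add unit tests for the new methods or request an exemption | Job | - |
| ✅ | The other 8 required tasks passed | - | - | - | - | - |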

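2.2 Optional tasks: 22/26 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 19m41s | Job | - |
| ❌ | Check PR Template | 13s | Job | - |
| ❌ | CI_HPU | 1h5m | Job | - |
| ❌ | Trigger Jenkins for PR | 1m14s | Job | - |
| ✅ | The other 22 optional tasks passed | - | - | - |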

3 Failure Details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage below threshold (confidence: high)

  • Status: ❌ Failed
  • Error type: coverage below threshold
  • Confidence: high
  • Root cause summary: new code is not covered; diff coverage is below the 80% threshold
  • Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
The PR adds about 114 lines to fastdeploy/engine/common_engine.py (ghost reap logic, decode_drop message handling, and the reworked _process_prefilled_requests) and two new methods, send_drop_signal() and _handle_drop(), to fastdeploy/splitwise/splitwise_connector.py (about 21 lines). None of these new code paths has corresponding unit tests, so diff coverage falls below the 80% threshold and CI exits with code 9.

Key log:

Step: Verify Code Coverage Threshold (80%) — ❌ FAILED
[FAILURE]: Process completed with exit code 9.

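Suggested fixes:

  1. Write unit tests for the new send_drop_signal() and _handle_drop() methods in fastdeploy/splitwise/splitwise_connector.py
  2. Write unit tests for the ghost reap logic in _wait_inflight_drained() and for the decode_drop message handling path in fastdeploy/engine/common_engine.py
  3. If these paths cannot be covered in the current version, explain why in the PR description and request a coverage exemption from CI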
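For the second suggestion, the decode_drop path can be exercised without a running engine if the handling logic is factored into a small function. The sketch below is a toy model under stated assumptions: `ToyOutput` stands in for `RequestOutput`, `handle_fetched_message` and the message layout (`type`, `request_id`) are hypothetical, and the real `_fetch_requests` may look quite different.

```python
from dataclasses import dataclass


@dataclass
class ToyOutput:
    """Toy stand-in for RequestOutput (hypothetical fields)."""
    request_id: str
    finished: bool
    error_msg: str = ""


def handle_fetched_message(msg, scheduler_requests, recycle):
    """Return True if msg was a drop notice and has been consumed."""
    if msg.get("type") != "decode_drop":
        return False  # not a drop notice: leave it to the normal path
    req_id = msg["request_id"]
    if req_id in scheduler_requests:
        # Synthesize a finished output so the usual recycle path reclaims the entry.
        recycle(ToyOutput(req_id, finished=True, error_msg="dropped by prefill"))
    return True


scheduler_requests = {"req-1": object()}
recycled = []


def recycle(output):
    # Toy recycle path: record the output and free the scheduler entry.
    recycled.append(output)
    scheduler_requests.pop(output.request_id, None)


consumed = handle_fetched_message(
    {"type": "decode_drop", "request_id": "req-1"}, scheduler_requests, recycle
)
```

Tests for the real code would likely assert the same three things: a known id is recycled with `finished=True`, an unknown id is ignored without error, and ordinary messages are not swallowed.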

Fix summary: add unit tests for send_drop_signal(), _handle_drop(), and the other new methods

Related changes:

  • fastdeploy/engine/common_engine.py: _insert_zmq_task_to_scheduler() (L1129+), _wait_inflight_drained() (L1294+), _fetch_requests() (L1799+), _process_prefilled_requests() (L1888+)
  • fastdeploy/splitwise/splitwise_connector.py: send_drop_signal() (L236+), _handle_drop() (L435+)

Link: View log

3 participants