[BugFix] Fix timeout and hang issues in the pause interface during PD separation within the refactored abort_requests and pause APIs #7837

Open: qwes5s5 wants to merge 1 commit into PaddlePaddle:develop from qwes5s5:fix_refact_abort

qwes5s5 (Collaborator) commented May 16, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


In the splitwise (PD-disaggregated) architecture, when requests are aborted or rejected by the paused gate, the following issues occur:

  1. Ghost requests blocking drain: Requests rejected by the prefill side still retain scheduler entries on the decode side, becoming "ghost" requests that cause _wait_inflight_drained to hang indefinitely.
  2. Late-arrived requests not handled: Requests arriving during the drain phase are not added to the abort set.
  3. Ghost prefilled outputs not recycled: Some prefilled outputs have no corresponding scheduler entry but are never cleaned up.

These issues prevent engine shutdown and pause operations from completing normally.

Modifications

1. Add drop signal mechanism (splitwise_connector.py)

  • Add send_drop_signal() method: prefill side notifies decode side that a request has been dropped
  • Add _handle_drop() method: decode side handles drop signal, puts decode_drop message into engine worker queue
  • Support drop message type for prefill/decode communication
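The drop-signal flow in item 1 can be pictured with a minimal sketch. It assumes an in-memory queue in place of the real prefill/decode transport. The class `ToyConnector`, the message dictionary layout, and both queues are hypothetical; only the names `send_drop_signal`, `_handle_drop`, `_process_message`, and `decode_drop` come from this PR.

```python
import queue


class ToyConnector:
    """Hypothetical stand-in for the splitwise connector (not FastDeploy's class)."""

    def __init__(self):
        # Stands in for the network link from prefill to decode.
        self.wire = queue.Queue()
        # Stands in for the decode-side engine worker queue.
        self.engine_worker_queue = queue.Queue()

    def send_drop_signal(self, addr, request_id):
        # Prefill side: tell the decode instance at `addr` that the request was dropped.
        self.wire.put({"type": "drop", "addr": addr, "request_id": request_id})

    def _process_message(self):
        # Decode side: route one incoming message by its type.
        msg = self.wire.get_nowait()
        if msg["type"] == "drop":
            self._handle_drop(msg["request_id"])

    def _handle_drop(self, request_id):
        # Decode side: hand a decode_drop item to the engine loop.
        self.engine_worker_queue.put(("decode_drop", request_id))


conn = ToyConnector()
conn.send_drop_signal("decode-0", "req-1")   # prefill side
conn._process_message()                      # decode side
item = conn.engine_worker_queue.get_nowait()
```

In this toy model a single object plays both roles; in a real deployment the two halves would live in different processes.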

2. Handle decode_drop message in engine (common_engine.py)

  • Add decode_drop message handling branch in _handle_disaggregated_tasks()
  • Synthesize RequestOutput with error_code=499 upon receiving drop signal, following normal _recycle path to reclaim scheduler entry
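A rough sketch of why a synthesized output is enough to free the decode-side entry: if the normal result path already releases scheduler entries for finished outputs, feeding it an error result reuses that path. `ToyRequestOutput`, `ToyDecodeEngine`, and their fields are hypothetical simplifications, not the engine's real types; the `error_code=499` value and the `decode_drop` name are taken from this PR.

```python
from dataclasses import dataclass


@dataclass
class ToyRequestOutput:
    """Hypothetical stand-in for RequestOutput; only the fields used here."""
    request_id: str
    finished: bool
    error_code: int


class ToyDecodeEngine:
    def __init__(self):
        # Requests the decode-side scheduler is currently tracking.
        self.scheduler_entries = {"req-1", "req-2"}
        self.recycled = []

    def _recycle(self, output):
        # Normal result path: a finished output releases its scheduler entry.
        if output.finished:
            self.scheduler_entries.discard(output.request_id)
            self.recycled.append((output.request_id, output.error_code))

    def handle_task(self, kind, request_id):
        if kind == "decode_drop":
            # Synthesize an error result instead of adding a special cleanup path.
            self._recycle(ToyRequestOutput(request_id, finished=True, error_code=499))


engine = ToyDecodeEngine()
engine.handle_task("decode_drop", "req-1")
remaining = engine.scheduler_entries
```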

3. Enhance _wait_inflight_drained() robustness

  • Late-arrived request handling: Automatically add requests arriving during drain phase to abort set
  • Ghost cleanup: Reap scheduler-only ghost requests after 30 seconds to avoid indefinite blocking
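The two behaviours in item 3 can be sketched as a polling loop. This is a simplified illustration under stated assumptions, not the engine's `_wait_inflight_drained`: the callbacks `get_inflight`, `get_scheduler_only`, and `reap` are hypothetical, and the loop reaps on every poll once the grace period has passed rather than using a cooldown. The clock and sleep functions are injectable so the example runs instantly.

```python
import time


def wait_inflight_drained(get_inflight, get_scheduler_only, reap, abort_set,
                          ghost_reap_after=30.0, poll_interval=0.1,
                          clock=time.monotonic, sleep=time.sleep):
    """Block until nothing is in flight; return the total seconds waited."""
    start = clock()
    while True:
        inflight = set(get_inflight())
        if not inflight:
            return clock() - start
        # Late arrivals: anything in flight that is not yet marked for abort.
        abort_set |= inflight - abort_set
        # Ghosts: after the grace period, reap entries only the scheduler knows about.
        if clock() - start >= ghost_reap_after:
            ghosts = set(get_scheduler_only())
            if ghosts:
                reap(ghosts)
        sleep(poll_interval)


# Simulated run with a fake clock that advances 10 s per poll.
inflight = {"a", "late", "ghost"}
abort_set = {"a"}
fake_now = [0.0]


def fake_sleep(_):
    fake_now[0] += 10.0
    # Real requests finish after the first poll; the ghost never does.
    inflight.difference_update({"a", "late"})


elapsed = wait_inflight_drained(
    get_inflight=lambda: inflight,
    get_scheduler_only=lambda: inflight & {"ghost"},
    reap=inflight.difference_update,
    abort_set=abort_set,
    clock=lambda: fake_now[0],
    sleep=fake_sleep,
)
```

Without the reap step, the simulated ghost would keep `inflight` non-empty forever, which is the hang this PR describes.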

4. Fix ghost recycling in _process_prefilled_request_outputs()

  • Detect prefilled outputs not registered in scheduler but marked for abort
  • Call pre_recycle_resource() to clean up resources, remove from abort sets and tokens_counter
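The detection rule in item 4 amounts to a two-part membership test. The sketch below assumes plain sets and a dict for the bookkeeping; the function name, its parameters, and the data shapes are hypothetical, while `pre_recycle_resource` and `tokens_counter` are names from this PR.

```python
def process_prefilled_outputs(outputs, scheduler_ids, abort_ids, tokens_counter,
                              pre_recycle_resource):
    """Return the outputs to process normally; recycle ghost outputs in place."""
    kept = []
    for request_id in outputs:
        is_ghost = request_id not in scheduler_ids and request_id in abort_ids
        if is_ghost:
            # No scheduler entry will ever finish this request, so clean up here.
            pre_recycle_resource(request_id)
            abort_ids.discard(request_id)
            tokens_counter.pop(request_id, None)
            continue
        kept.append(request_id)
    return kept


recycled = []
abort_ids = {"ghost"}
tokens_counter = {"r1": 5, "ghost": 3}
kept = process_prefilled_outputs(
    outputs=["r1", "ghost"],
    scheduler_ids={"r1"},
    abort_ids=abort_ids,
    tokens_counter=tokens_counter,
    pre_recycle_resource=recycled.append,
)
```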

5. Send drop signal on prefill abort

  • When prefill role receives abort request, notify decode side via split_connector
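As an illustration of the guard in item 5, the sketch below sends a drop signal only when the instance runs in the prefill role and the request carries disaggregation info. `RecordingConnector`, `notify_decode_on_abort`, the request dictionary layout, and the `decode_addr` key are assumptions made for this example; the `(addr, request_id)` argument order is inferred from the log line quoted in the CI output, not confirmed against the code.

```python
class RecordingConnector:
    """Hypothetical connector that records the drop signals it is asked to send."""

    def __init__(self):
        self.sent = []

    def send_drop_signal(self, addr, request_id):
        self.sent.append((addr, request_id))


def notify_decode_on_abort(role, request, connector):
    """Send a drop signal only for a prefill node whose request has a decode peer."""
    info = request.get("disaggregate_info")
    if role != "prefill" or not info:
        return False
    connector.send_drop_signal(info["decode_addr"], request["request_id"])
    return True


connector = RecordingConnector()
pd_request = {"request_id": "req-1", "disaggregate_info": {"decode_addr": "10.0.0.2:9000"}}
plain_request = {"request_id": "req-2"}

sent_pd = notify_decode_on_abort("prefill", pd_request, connector)
sent_plain = notify_decode_on_abort("prefill", plain_request, connector)
sent_decode = notify_decode_on_abort("decode", pd_request, connector)
```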

Usage or Command

This fix affects splitwise deployment mode only; no additional commands are required. Verification steps:

  1. Start splitwise service
  2. Send requests and abort during prefill phase or trigger paused gate
  3. Observe that engine shutdown/pause completes normally without drain timeout

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-16 21:30:03

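📋 Review Summary

PR overview: Fixes several scenarios in the PD-disaggregated (splitwise) architecture where pause/abort operations leave _wait_inflight_drained blocked indefinitely because of ghost requests. The fix covers cleanup of ghost scheduler entries, handling of late-arrived requests, and a cross-side drop signal mechanism.

Scope of changes: fastdeploy/engine/common_engine.py, fastdeploy/splitwise/splitwise_connector.py

Impact tags: [Engine] [PD Disaggregation]

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | fastdeploy/engine/common_engine.py:1551 | Reusing the start_time variable makes the elapsed time in the warning log inaccurate after a ghost reap |
| 📝 PR conventions | - | The ## Accuracy Tests section is empty (template comment only), and no Checklist items are checked |

📝 PR Conventions Check

The PR title format is compliant (it includes the official [BugFix] tag). The PR description has structural problems, however: the ## Accuracy Tests section contains only the template comment (effectively empty; it should say N/A), and every Checklist entry is [ ] (at least items 1 and 4 should be checked).

Title suggestion: the current title is compliant; no change needed.

PR description suggestion (can be copied directly; fixes the empty section and the Checklist):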

## Motivation

In splitwise (PD disaggregated) architecture, when requests are aborted or rejected by the paused gate, the following issues occur:

1. **Ghost requests blocking drain**: Requests rejected by prefill side still retain scheduler entries on decode side, becoming "ghost" requests that cause `_wait_inflight_drained` to hang indefinitely.
2. **Late-arrived requests not handled**: Requests arriving during drain phase are not properly added to abort set.
3. **Ghost prefilled outputs not recycled**: Some prefilled outputs have no corresponding scheduler entry but are not cleaned up.

These issues prevent engine shutdown or pause operations from completing normally.

## Modifications

1. **splitwise_connector.py**: Add `send_drop_signal()` method and `_handle_drop()` handler; prefill side notifies decode side when a request is dropped via the paused gate. Add "drop" message type routing in `_process_message`.
2. **common_engine.py** (`_handle_disaggregated_tasks`): Add `decode_drop` message handling — synthesize `RequestOutput` with error_code=499 so the decode side recycles its scheduler entry via the normal `put_results → _recycle` path.
3. **common_engine.py** (`_wait_inflight_drained`): Add late-arrived request detection (adds them to abort set on each iteration); add scheduler-only ghost cleanup after 30 s to prevent indefinite blocking.
4. **common_engine.py** (`_process_prefilled_requests`): Detect prefilled outputs not registered in scheduler but marked for abort; call `pre_recycle_resource()` and clean up abort sets and tokens_counter to break the deadlock.
5. **common_engine.py** (`_insert_zmq_task_to_scheduler`): When prefill role rejects a request through the paused gate and `disaggregate_info` is present, call `send_drop_signal()` to prevent ghost scheduler entries on the decode side.

## Usage or Command

This fix affects splitwise deployment mode only. No additional configuration required.

Verification steps:
1. Start splitwise service
2. Send requests and abort during prefill phase or trigger paused gate
3. Observe that engine shutdown/pause completes normally without drain timeout

## Accuracy Tests

N/A — this change fixes abort/drain logic only and does not affect model outputs.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

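Overall Assessment

The overall fix approach is clear: each of the five concurrency scenarios has corresponding handling, and the decode_drop message mechanism and ghost cleanup logic are reasonably designed. The reuse of the start_time variable in _wait_inflight_drained should be fixed so that log-based diagnosis in production stays accurate.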

f"{scheduler_only_ids}"
)
# Reset to avoid re-reaping on the next tick
start_time = now

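🟡 Suggestion: start_time is used both as the ghost-reap cooldown timer and as the basis for the total drain elapsed time in the log, so resetting it makes the log less accurate.

The first ghost reap happens at T=30s, where start_time = now resets the variable. From then on, the elapsed value printed in the warning log every 30 seconds is the time since the last reap rather than the total time since the drain started, which seriously hampers production troubleshooting.

Suggested fix: track the two with separate variables: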

start_time = time.monotonic()          # never reset; used for the elapsed log
next_warn_time = start_time + 30
GHOST_REAP_AFTER = 30.0
ghost_reap_start = start_time          # separate cooldown timer for ghost reaping

while ...:
    now = time.monotonic()
    ...
    if now - ghost_reap_start >= GHOST_REAP_AFTER:
        ...
        # Reset to avoid re-reaping on the next tick
        ghost_reap_start = now          # reset only the reap timer

    if now >= next_warn_time:
        self.llm_logger.warning(
            f"elapsed: {now - start_time:.3f} seconds, ..."  # always shows total elapsed time
        )
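The effect of sharing one timer can be reproduced in a small simulation. This is an illustrative sketch, not the engine code: the `simulate` function and the tick values are made up for the example, and it models only the timer arithmetic, with no reaping or logging.

```python
GHOST_REAP_AFTER = 30.0


def simulate(ticks, separate_timers):
    """Return the elapsed value a warning log would print at each tick."""
    start_time = ticks[0]
    ghost_reap_start = start_time
    logged = []
    for now in ticks[1:]:
        reap_base = ghost_reap_start if separate_timers else start_time
        if now - reap_base >= GHOST_REAP_AFTER:
            if separate_timers:
                ghost_reap_start = now   # reset only the reap timer
            else:
                start_time = now         # shared variable: also resets the log baseline
        logged.append(now - start_time)
    return logged


ticks = [0.0, 30.0, 45.0, 60.0]
shared = simulate(ticks, separate_timers=False)
split = simulate(ticks, separate_timers=True)
```

With the shared variable the logged elapsed time drops back to zero at each reap, while the split version keeps growing with the real drain duration.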


codecov-commenter commented May 16, 2026

Codecov Report

❌ Patch coverage is 8.19672% with 56 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@12c6ae0). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/engine/common_engine.py | 7.31% | 34 Missing and 4 partials ⚠️ |
| fastdeploy/splitwise/splitwise_connector.py | 10.00% | 17 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7837   +/-   ##
==========================================
  Coverage           ?   63.22%           
==========================================
  Files              ?      462           
  Lines              ?    64336           
  Branches           ?     9864           
==========================================
  Hits               ?    40675           
  Misses             ?    20892           
  Partials           ?     2769           

| Flag | Coverage Δ |
| --- | --- |
| GPU | 72.31% <8.19%> (?) |
| XPU | 7.11% <3.27%> (?) |



PaddlePaddle-bot commented May 16, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-18 13:36:26

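This CI report was generated from the following code (refreshed every 30 minutes):


1 Task Overview

2 required tasks failed and must be fixed before the PR can be merged.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 41 (0) | 41 | 38 | 3 | 0 | 0 | 0 |

2 Task Status Summary

2.1 Required tasks: 8/10 passed

Required tasks block merging; failures must be addressed first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h21m | PR issue: diff coverage is only 16%, below the 80% threshold | Add unit tests for common_engine.py and splitwise_connector.py | Job | - |
| ❌ | Approval | 8s | PR issue: changes to logging behavior require approval from a designated RD | Ask xyxinyang or zyyzghb to approve the PR | Job | - |
| ✅ | The other 8 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 30/31 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | CI_HPU | 1h4m | Job | - |
| ✅ | The other 30 optional tasks passed | - | - | - |

3 Failure Details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage check failed (confidence: high)

run_tests_with_coverage

  • Status: ❌ Failed
  • Error type: coverage check failed
  • Confidence: high
  • Root cause summary: diff coverage of the code added by this PR is only 16%, far below the 80% threshold
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed cases: none (all unit tests passed; the failure is due to insufficient coverage)

Root cause details:
This PR adds 61 lines of code, of which only 10 are covered by tests (16% overall), so the task fails the 80% threshold. The main files are fastdeploy/engine/common_engine.py (17.07% covered, 34 lines uncovered) and fastdeploy/splitwise/splitwise_connector.py (15% covered, 17 lines uncovered), corresponding to the new abort/pause logic in this PR.

Key log: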

Coverage generation failed (exit code 9)
GPU Patch Coverage Details:
  fastdeploy/engine/common_engine.py: 17.07% (34 violations, lines 1352-1357,1516-1551,2036-2152)
  fastdeploy/splitwise/splitwise_connector.py: 15.0% (17 violations, lines 244-256,405,442-447)
  total_percent_covered: 16%  (51/61 lines uncovered)
##[error]Process completed with exit code 9.
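The figures in this log are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the reported numbers, not the CI job's actual script; it assumes each per-file percentage is covered lines divided by changed lines, and the helper `changed_lines` is introduced here for illustration.

```python
THRESHOLD = 80  # percent, as stated in the CI report


def changed_lines(uncovered, percent_covered):
    """Recover the number of changed lines from uncovered count and coverage percent."""
    return round(uncovered / (1 - percent_covered / 100))


# (uncovered lines, reported percent covered) per file, copied from the log above.
files = {
    "fastdeploy/engine/common_engine.py": (34, 17.07),
    "fastdeploy/splitwise/splitwise_connector.py": (17, 15.0),
}

total = sum(changed_lines(u, p) for u, p in files.values())
uncovered = sum(u for u, _ in files.values())
covered = total - uncovered
percent = 100 * covered / total
passes = percent >= THRESHOLD
```

Under that assumption the two files account for 41 and 20 changed lines, so 10 of 61 lines are covered, which truncates to the reported 16%.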

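Suggested fixes:

  1. Add unit tests for lines 1352–1357, 1516–1551, and 2036–2152 of fastdeploy/engine/common_engine.py to cover the new abort/pause logic
  2. Add test cases for lines 244–256, 405, and 442–447 of fastdeploy/splitwise/splitwise_connector.py
  3. If tests cannot be added in the short term, request a coverage exemption according to project rules

Fix summary: add unit tests for common_engine.py and splitwise_connector.py

Related changes: fastdeploy/engine/common_engine.py L1352-2152 (abort/pause logic), fastdeploy/splitwise/splitwise_connector.py L244-447
Link: View log

Approval: PR approval check failed (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: approval check
  • Confidence: high
  • Root cause summary: the PR adds logging calls, which require approval from xyxinyang or zyyzghb
  • Analyzer: ci_analyze_infra

Key log:
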
Detected log modification in diff:
+  self.llm_logger.info(f"Pause drain: late-arrived requests added to abort set: {late_ids}")
+  self.logger.info(f"send_drop_signal: addr={addr}, request_id={request_id}")
+  self.logger.info(f"_handle_drop: request_id={request_id}")
0. You must have one FastDeploy RD (xyxinyang(zhouchong), zyyzghb(zhangyongyue)) approval
   for modifying logging behavior (.info/.debug/.error/log_request).
There are 1 approved errors.
##[error]Process completed with exit code 6.

Suggested fixes:

  1. On the GitHub PR page, use Request Review to invite xyxinyang (zhouchong) or zyyzghb (zhangyongyue) to review and approve this PR

Fix summary: ask xyxinyang or zyyzghb to review and approve the PR

Link: View log


🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-17 03:29:15

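This CI report was generated from the following code (refreshed every 30 minutes):


1 Task Overview

❌ 2 required tasks failed and block merging; 1 optional task also failed (does not affect merging).

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 36 (0) | 36 | 33 | 3 | 0 | 0 | 0 |

2 Task Status Summary

2.1 Required tasks: 7/9 passed

Required tasks block merging; failures must be addressed first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Approval | 8s | PR issue: the PR adds .info log statements, which require approval from a designated RD | Ask xyxinyang/zyyzghb to approve | Job | - |
| ❌ | run_tests_with_coverage | 1h21m | PR issue: coverage of new code is below the 80% threshold | Add unit tests or request a coverage exemption | Job | - |
| ✅ | The other 7 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 26/27 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | CI_HPU | 1h4m | Job | - |
| ✅ | The other 26 optional tasks passed | - | - | - |

3 Failure Details (required only)

Approval: PR approval process (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: PR approval process
  • Confidence: high
  • Root cause summary: the PR adds .info log statements, which require approval from a designated RD
  • Analyzer: ci_analyze_infra

Key log: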

Detected log modification in diff:
+  self.llm_logger.info(f"Pause drain: late-arrived requests added to abort set: {late_ids}")
+  self.llm_logger.info(...)
+  self.logger.info(f"send_drop_signal: addr={addr}, request_id={request_id}")
+  self.logger.info(f"_handle_drop: request_id={request_id}")
0. You must have one FastDeploy RD (xyxinyang(zhouchong), zyyzghb(zhangyongyue)) approval
   for modifying logging behavior (.info/.debug/.error/log_request).
There are 1 approved errors.

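Root cause details:
This PR adds several .info log statements in the pause/abort code. check_approval.sh detects a logging behavior change in the diff, and project rules require an explicit approve from a designated FastDeploy RD before CI can pass. Exit code 6 indicates that 1 approval requirement is still unmet.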
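For intuition, a check of this kind can be approximated by scanning the added lines of a diff for logging calls. The sketch below is a guess at the general idea only and does not reproduce check_approval.sh, whose implementation is not shown here; the regular expression is derived from the pattern list in the log message (.info/.debug/.error/log_request), and `added_log_lines` is a hypothetical helper.

```python
import re

# Patterns taken from the CI message: .info / .debug / .error / log_request
LOG_CALL = re.compile(r"\.(info|debug|error)\(|log_request\(")


def added_log_lines(diff_text):
    """Return added diff lines (without the leading '+') that contain a logging call."""
    hits = []
    for line in diff_text.splitlines():
        is_added = line.startswith("+") and not line.startswith("+++")
        if is_added and LOG_CALL.search(line):
            hits.append(line[1:].strip())
    return hits


diff = """\
+++ b/fastdeploy/splitwise/splitwise_connector.py
+  self.logger.info(f"_handle_drop: request_id={request_id}")
+  request_ids.add(request_id)
-  self.logger.debug("old message")
"""

hits = added_log_lines(diff)
needs_approval = bool(hits)
```

Only the added `.info` line is flagged; the removed `.debug` line and the file header are ignored in this sketch.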

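Suggested fixes:

  1. @-mention xyxinyang (zhouchong) or zyyzghb (zhangyongyue) on the PR and ask them to review and approve the logging changes

Fix summary: ask xyxinyang/zyyzghb to review and approve

Link: View log

run_tests_with_coverage: coverage below threshold (confidence: medium)

run_tests_with_coverage

  • Status: ❌ Failed
  • Error type: coverage below threshold
  • Confidence: medium (log download failed; inferred from step information)
  • Root cause summary: coverage of the code added by this PR is below the 80% threshold; all unit tests passed
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed step: Verify Code Coverage Threshold (80%), exit code 9

Root cause details:
This job has 8 steps. "Run FastDeploy Unit Tests and Coverage" (step 3), "Upload coverage" (step 4), and "Check Unit Test Success" (step 5) all succeeded, which shows that every unit test passed. The job then failed at "Verify Code Coverage Threshold (80%)" (step 6) with exit code 9, which means the incremental coverage of the lines added or modified by this PR is below the required 80%.

Key log: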

Step 3: Run FastDeploy Unit Tests and Coverage — success
Step 5: Check Unit Test Success              — success
Step 6: Verify Code Coverage Threshold (80%) — failure (exit code 9)

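Suggested fixes:

  1. Add unit tests for the pause/abort code added by this PR (functions such as send_drop_signal and _handle_drop) so that incremental coverage reaches 80%
  2. If these code paths are hard to unit test, request a coverage exemption via check_cov_skip / Check bypass

Fix summary: add unit tests for the new pause/abort code

Link: View log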


🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-17 07:12:11

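This CI report was generated from the following code (refreshed every 30 minutes):


1 Task Overview

⚠️ Failed to fetch CI status: repeated GitHub API calls timed out (TLS handshake timeout / command timeout), so the current CI status could not be retrieved.

Go directly to the CI details page for the latest status.


2 Notes

This automated CI status analysis could not complete because of network problems. It will be retried in the next update cycle (in about 30 minutes).

To check CI status immediately:

  1. Visit the PR checks page
  2. Or manually trigger a re-analysis in a PR comment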

4 participants