
[XPU] fix interrupt_requests for mix#7741

Open
cmcamdy wants to merge 1 commit into PaddlePaddle:develop from cmcamdy:fix_interrupt

Conversation

@cmcamdy
Collaborator

@cmcamdy cmcamdy commented May 7, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@cmcamdy cmcamdy deployed to Metax_ci May 7, 2026 11:41 — with GitHub Actions Active
@paddle-bot

paddle-bot Bot commented May 7, 2026

Thanks for your contribution!

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-07 20:05:29

This CI report was generated from the following code (refreshed every 30 minutes):


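1 Job overview

1 Required job has failed (Approval); 6 more Required jobs are running and 1 is pending. Wait for their results and resolve the Approval review issue.

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 34(0) | 34 | 22 | 2 | 7 | 3 | 0 |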

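2 Job status summary

2.1 Required jobs: 1/9 passed

Required jobs block merging; failures must be handled first.

| Status | Job | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| | Approval | 7s | PR issue: new logger calls were added, so a designated RD must approve the logging-behavior change | Ask xyxinyang or zyyzghb to approve this PR | Job | - |
| | Extracted partial CE model tasks to run in CI. / run_ce_cases | - | Running | - | Job | - |
| | Run Base Tests / base_tests | - | Running | - | Job | - |
| | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | Job | - |
| | Run Stable Tests / stable_tests | - | Running | - | Job | - |
| | xpu_4cards_case_test / run_xpu_4cards_cases | - | Running | - | Job | - |
| | xpu_8cards_case_test / run_xpu_8cards_cases | - | Running | - | Job | - |
| ⏸️ | Run Four Cards Tests / run_4_cards_tests | - | Pending | - | - | - |
| | The remaining 1 required job passed (Run FastDeploy LogProb Tests / run_tests_logprob) | - | - | - | - | - |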

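2.2 Optional jobs: 21/25 passed

Optional jobs do not block merging; failures are for reference only.

| Status | Job | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| | Check PR Template | 13s | Job | - |
| | Trigger Jenkins for PR | - | Job | - |
| ⏸️ | Run iluvatar Tests / run_iluvatar_cases | - | - | - |
| ⏸️ | CI_HPU | - | - | - |
| | The remaining 21 optional jobs passed | - | - | - |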

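3 Failure details (required jobs only)

Approval: code conventions (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: code conventions
  • Confidence: high
  • Root cause summary: the PR adds logger.info/error calls, so a designated RD must approve the logging-behavior change
  • Analyzer: generic analysis (fallback)

Root cause details:
PR #7741 adds three logging calls (two logger.info and one logger.error) to the interrupt_requests handling logic, which triggers the FastDeploy repository's approval rule for logging behavior. According to the check_approval.sh script, any change to logging behavior (.info/.debug/.error/log_request) must be approved on the PR by a designated RD before this check can pass.

Key log: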

Detected log modification in diff:
+    logger.info(f"Processing interrupt_requests for req_ids: {req_ids}")
+    logger.info(f"interrupt_requests completed for req_ids: {req_ids}, result: {abort_result}")
+    logger.error(...)
0. You must have one FastDeploy RD (xyxinyang(zhouchong), zyyzghb(zhangyongyue)) approval
   for modifying logging behavior (.info/.debug/.error/log_request).
There are 1 approved errors.

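Suggested fix:

  1. Ask one of the designated approvers, xyxinyang (zhouchong) or zyyzghb (zhangyongyue), to approve this PR on GitHub.

Fix summary: ask xyxinyang or zyyzghb to approve this PR's logging-behavior change

Related change: the PR adds logger.info and logger.error calls to the XPU interrupt_requests handling logic
Link: View log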

req_ids = task.get("req_ids", [])
logger.info(f"Processing interrupt_requests for req_ids: {req_ids}")
try:
    abort_result = self.engine._control_abort_requests(
Collaborator

@qwes5s5 qwes5s5 May 7, 2026


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-07 20:07:49

📋 Review summary

PR overview: in PD-disaggregation mix mode, adds the missing interrupt_requests control-command branch to _recv_external_module_control_instruct.
Scope of change: fastdeploy/splitwise/internal_adapter_utils.py
Impact tags: [PD Disaggregation] [XPU]
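
The shape of the branch described in this overview can be sketched as follows. The sketch is not the PR's code: `FakeEngine`, `FakeControlServer`, `handle_interrupt_requests`, the `task_id` key, the reply payload, and the argument order of `response_for_control_cmd` are all assumptions. Only the names `_control_abort_requests`, `response_for_control_cmd`, and `task.get("req_ids", [])` come from this thread.

```python
import logging

logger = logging.getLogger("adapter_sketch")


class FakeEngine:
    """Stand-in engine; the real _control_abort_requests is not shown here."""

    def _control_abort_requests(self, request):
        req_ids = request["req_ids"]
        if not req_ids:
            raise ValueError("no req_ids given")
        return {"aborted": list(req_ids)}


class FakeControlServer:
    """Stand-in reply channel that simply records what would be sent back."""

    def __init__(self):
        self.replies = []

    def response_for_control_cmd(self, task_id, result):  # assumed signature
        self.replies.append((task_id, result))


def handle_interrupt_requests(engine, server, task):
    """Run the abort, then always reply, even when the abort raises."""
    task_id = task["task_id"]  # hypothetical key
    req_ids = task.get("req_ids", [])
    try:
        result = {"success": True,
                  "result": engine._control_abort_requests({"req_ids": req_ids})}
    except Exception as exc:
        logger.error("interrupt_requests failed for %s: %s", req_ids, exc)
        result = {"success": False, "result": str(exc)}
    server.response_for_control_cmd(task_id, result)
    return result
```

The point of the pattern is that the caller gets exactly one reply per command whether or not the abort succeeds, so an external module is never left waiting on a failed interrupt.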

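📝 PR convention check

The [XPU] tag in the PR title corresponds to XPU hardware-specific files (xpu_worker.py and similar), whereas the file changed here lives under fastdeploy/splitwise/, which the architecture document maps to the [PD Disaggregation] impact area, so the tag does not fully match the module actually changed. In addition, every section of the PR description is still a template placeholder and needs to be filled in.

Suggested title (can be copied as is):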

  • [BugFix] fix interrupt_requests handler in splitwise internal adapter for XPU mix mode

Suggested PR description (can be copied as is; it must reproduce the full structure of the checklist §D2 template):

## Motivation
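In the XPU scenario with PD disaggregation (mix mode), `internal_adapter_utils.py` lacks a handling branch for the `interrupt_requests` control command, so a request-interrupt instruction sent by an external module is neither executed nor answered correctly.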

## Modifications
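- `fastdeploy/splitwise/internal_adapter_utils.py`: add an `interrupt_requests` command branch to the `_recv_external_module_control_instruct` method. The branch calls `engine._control_abort_requests` to interrupt the requests and returns the result to the caller through `recv_control_cmd_server.response_for_control_cmd`; if an exception occurs, it catches the error, logs it, and still replies with the error message.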

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

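Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | fastdeploy/splitwise/internal_adapter_utils.py:112 | type() is used to build a fake ControlRequest object dynamically; prefer the project's existing ControlRequest class |

Overall assessment

The approach to filling in the missing functionality is correct, and the exception handling and logging are thorough. The main suggestion is to replace the dynamic type() construction with the project's official ControlRequest class, which is clearer and consistent with the other call sites. The PR description also needs to be completed.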

logger.info(f"Processing interrupt_requests for req_ids: {req_ids}")
try:
    abort_result = self.engine._control_abort_requests(
        type("ControlRequest", (), {"get_args": lambda _: {"req_ids": req_ids}})()

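🟡 Suggestion: building a fake ControlRequest object dynamically with type() is a hack and is hard to maintain.

The project already has an official ControlRequest class (fastdeploy/engine/request.py); consider using it directly: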

from fastdeploy.engine.request import ControlRequest

abort_result = self.engine._control_abort_requests(
    ControlRequest(
        request_id=task_id_str,
        method="interrupt_requests",
        args={"req_ids": req_ids}
    )
)

This makes the code clearer and more readable, and keeps it consistent with the other call sites (such as the abort_requests implementation in api_server.py).
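
To make the difference concrete, the sketch below puts the two constructions side by side. The `ControlRequest` dataclass here is a hypothetical stand-in, not the class from fastdeploy/engine/request.py; its fields are taken only from the keyword arguments in the suggestion above, and `control_abort_requests` is a stub that assumes the consumer calls nothing but `get_args()`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ControlRequest:
    """Hypothetical stand-in; the real class may have other fields and methods."""
    request_id: str
    method: str
    args: dict[str, Any] = field(default_factory=dict)

    def get_args(self) -> dict[str, Any]:
        return self.args


def control_abort_requests(request) -> dict[str, Any]:
    """Stub consumer that only relies on get_args()."""
    return {"aborted": list(request.get_args().get("req_ids", []))}


req_ids = ["req-1", "req-2"]

# Ad hoc object as in the PR: an anonymous class whose only member is get_args.
hack = type("ControlRequest", (), {"get_args": lambda _: {"req_ids": req_ids}})()

# Explicit object: same behavior for this consumer, plus identifying metadata.
formal = ControlRequest(request_id="task-0", method="interrupt_requests",
                        args={"req_ids": req_ids})

same_result = control_abort_requests(hack) == control_abort_requests(formal)
```

Both objects satisfy a consumer that only calls `get_args()`, but the ad hoc one carries no `request_id` or `method`, so any later code that reads those attributes would fail on it.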

@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 12 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@d70f33d). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/splitwise/internal_adapter_utils.py | 0.00% | 12 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7741   +/-   ##
==========================================
  Coverage           ?   71.60%           
==========================================
  Files              ?      396           
  Lines              ?    55581           
  Branches           ?     8689           
==========================================
  Hits               ?    39799           
  Misses             ?    13043           
  Partials           ?     2739           
Flag Coverage Δ
GPU 71.60% <0.00%> (?)
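
The figures in this report can be checked with a few lines of arithmetic. The sketch assumes the percentage is rounded down to two decimals, which would explain why the report shows 71.60% rather than 71.61%; the helper name `coverage_percent` is hypothetical.

```python
from decimal import Decimal, ROUND_DOWN

# Counts copied from the coverage table above.
hits, misses, partials = 39799, 13043, 2739
lines = hits + misses + partials  # every tracked line is a hit, miss, or partial


def coverage_percent(covered: int, total: int) -> Decimal:
    """Percentage of covered lines, truncated (not rounded) to two decimals."""
    pct = Decimal(covered) * 100 / Decimal(total)
    return pct.quantize(Decimal("0.01"), rounding=ROUND_DOWN)


project = coverage_percent(hits, lines)  # whole-project coverage
patch = coverage_percent(0, 12)          # 12 changed lines, none executed
```

The three counts add up to the reported 55581 lines, and the patch figure is 0% because none of the 12 changed lines was executed by the tests.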

Flags with carried forward coverage won't be shown. Click here to find out more.

