[XPU] add build_sampling_params op.#7738
Jiajun-Ji wants to merge 2 commits into PaddlePaddle:develop
Conversation
Thanks for your contribution!
Pull request overview
This PR adds a `build_sampling_params` custom op for the XPU backend, replacing the previous Python sampling-parameter padding logic with an XPU kernel, and moves the `infer_seed` update into the op so that the seed stepping matches the GPU strategy (especially in the speculative decoding scenario).
Changes:
- Add the XPU `build_sampling_params` kernel + plugin wrapper + Paddle static op, and wire it into the XPU speculative verify (TARGET_MATCH) path.
- Introduce `increment_value` on the XPU ModelRunner side (aligned with GPU: 4 for non-speculative, `(num_speculative_tokens + 1) * 4` for speculative; see the sketch after this list) and adjust when `infer_seed` is updated.
- Add the `custom_ops/xpu_ops/test/test_build_sampling_params.py` unit test, which compares against a Python reference implementation and covers several batch shapes plus seed wrap-around.
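A minimal sketch of the increment choice described above; the attribute names (`self.speculative_decoding`, `self.speculative_config.num_speculative_tokens`) are assumptions and may differ from the actual fields in `xpu_model_runner.py`:

```python
# Hypothetical sketch: derive the per-step seed increment, aligned with the GPU runner.
if self.speculative_decoding:
    # speculative decoding: each verify step consumes (num_speculative_tokens + 1) draws
    increment_value = (self.speculative_config.num_speculative_tokens + 1) * 4
else:
    increment_value = 4
```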
PR metadata check (needs completion)
- The title already carries the `[XPU]` tag and matches the required format.
- The "Modifications / Usage or Command / Accuracy Tests" sections of the description are not filled in. If this op can affect sampling results or reproducibility, please add an accuracy comparison plus the corresponding run command and environment info; if no unit test is added or XPU CI cannot be run, state the reason as well (this PR does add a unit test file, but the description should still explain how to run it).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| fastdeploy/worker/xpu_model_runner.py | Computes and passes down `increment_value`, and adjusts the `infer_seed` update logic in the speculative scenario |
| fastdeploy/model_executor/layers/sample/sampler.py | Switches the XPU verify (TARGET_MATCH) path to `build_sampling_params` and passes `increment_value` through |
| custom_ops/xpu_ops/test/test_build_sampling_params.py | New XPU op unit test, validated against a Python reference implementation (see the reference sketch after this table) |
| custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp | New plugin wrapper (CPU + XPU3 dispatch) |
| custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu | New Kunlun3 XPU kernel implementation |
| custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h | Exports the `build_sampling_params` declaration |
| custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc | New Paddle static op registration and call bridging |
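For orientation, here is a rough Python reference of what the op is expected to compute, based only on this review's description. The expansion rule (one output token per prefill/encoder request, `seq_lens_this_time[i]` tokens otherwise), the `MAX_INFER_SEED` value, and the in-place seed stepping are all assumptions, not the authoritative semantics:

```python
import numpy as np

MAX_INFER_SEED = 2**63 - 1  # assumption: the real wrap-around bound lives in the model runner


def build_sampling_params_ref(top_p, top_k, seed, seq_lens_this_time, seq_lens_encoder,
                              token_num_output, increment_value):
    """Pad per-request sampling params to per-output-token arrays (hypothetical reference)."""
    out_p, out_k, out_seed = [], [], []
    for bi, n in enumerate(seq_lens_this_time):
        if n <= 0:
            continue  # finished or empty slot contributes no output tokens
        # assumption: prefill (encoder) requests emit one token, decode requests emit n
        repeat = 1 if seq_lens_encoder[bi] > 0 else int(n)
        out_p += [top_p[bi]] * repeat
        out_k += [top_k[bi]] * repeat
        out_seed += [seed[bi]] * repeat
    # assumption: the op also steps the per-request seed in place, wrapping at MAX_INFER_SEED
    for bi in range(len(seed)):
        seed[bi] = (seed[bi] + increment_value) % MAX_INFER_SEED
    return (np.asarray(out_p)[:token_num_output],
            np.asarray(out_k)[:token_num_output],
            np.asarray(out_seed, dtype=np.int64)[:token_num_output])
```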
```python
# 7. Update 'infer_seed' and step_paddle()
self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
if not self.speculative_decoding:
```
```python
    share_inputs["seq_lens_this_time"],
    share_inputs["seq_lens_encoder"],
    token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
    increment_value=increment_value,
```
```cpp
api::Context* ctx = xpu_ctx->x_context();
if (top_p.is_cpu()) {
  ctx = new api::Context(api::kCPU);
```
```cpp
// Shared prefix-sum buffer: each cluster computes its own pad_start via
// a two-pass scan over seq_lens_this_time / seq_lens_encoder.
// A global scratch area where core 0 of cluster 0 could publish per-batch
// start offsets is not available here, so instead we compute pad_start with
// a sequential scan in core 0 of each cluster.
// Because clusters run concurrently we cannot share a global accumulator;
// instead each cluster independently sums the first `bi` entries.
// This is O(bs) per cluster but bs is typically small (<=512).
```
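Expressed as scalar host-side logic, the scan the comment describes amounts to the following. This is a hypothetical Python rendering of what each cluster's core 0 computes, using the same encoder/decoder expansion assumption as the reference sketch above:

```python
def pad_start_for_batch(bi, seq_lens_this_time, seq_lens_encoder):
    """Sum the output-token counts of the first `bi` requests (assumed semantics)."""
    start = 0
    for i in range(bi):
        n = seq_lens_this_time[i]
        if n <= 0:
            continue  # finished / empty slots contribute nothing
        # assumption: prefill (encoder) requests emit one token, decode requests emit n
        start += 1 if seq_lens_encoder[i] > 0 else n
    return start
```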
CI report generated from the code below (refreshed every 30 minutes):
1 Task overview
2 Task status summary
  2.1 Required tasks: 3/10 passed
  2.2 Optional tasks: 22/26 passed
3 Failure details (required tasks only)
  - Pre Commit (code style, confidence: high)
    Fix summary: fix the undefined variables at sampler.py L1078-1081.
  - Approval (process issue, confidence: high)
    Fix summary: ask the relevant reviewers to complete the PR approval.
    Link: view logs
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-07 19:38:42
📋 Review Summary
PR overview: replaces the Python `padding_sampling_params` implementation on XPU with the `build_sampling_params` XPU kernel, and moves the `infer_seed` update logic into the kernel, aligning the `increment_value` stepping with the GPU.
Change scope: custom_ops/xpu_ops/, fastdeploy/model_executor/layers/sample/sampler.py, fastdeploy/worker/xpu_model_runner.py
Impact tags: [XPU] [OP]
📝 PR Convention Check
The title [XPU] add build_sampling_params op. contains a valid tag and is correctly formatted, but the ## Modifications and ## Usage or Command sections of the PR description are empty and no checklist items are checked, which does not meet the template requirements.
Suggested title (ready to copy):
[XPU][OP] Add build_sampling_params XPU kernel op
Suggested PR description (ready to copy; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Replace the Python implementation of `padding_sampling_params` on XPU with the XPU kernel `build_sampling_params`, move the `infer_seed` update logic into `build_sampling_params`, and align the `infer_seed` `increment_value` stepping with the GPU implementation (`(num_speculative_tokens + 1) * 4` with speculative decoding, `4` for normal inference).
## Modifications
- `custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc`: add the Paddle registration entry for the XPU custom op
- `custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h`: add the `build_sampling_params` function declaration
- `custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu`: add the kernel implementation for KunlunXPU3
- `custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp`: add the CPU reference implementation and the XPU3 wrapper
- `custom_ops/xpu_ops/test/test_build_sampling_params.py`: add unit tests covering pure-decoder, pure-encoder, mixed, single-item, and seed wrap-around scenarios
- `fastdeploy/model_executor/layers/sample/sampler.py`: `_verify_and_sample_xpu` now calls `build_sampling_params`; the `padding_sampling_params` call is removed from `_normal_sample_xpu`
- `fastdeploy/worker/xpu_model_runner.py`: compute `increment_value` dynamically depending on whether speculative decoding is enabled; the speculative decoding path no longer updates `infer_seed` externally
## Usage or Command
N/A
## Accuracy Tests
Accuracy comparison screenshots are attached to this PR; the XPU kernel output matches the Python reference implementation.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/model_executor/layers/sample/sampler.py:1073 | `_normal_sample_xpu` uses the undefined variables top_p/top_k/topp_seed; XPU NAIVE mode is guaranteed to crash |
| 🔴 Bug | custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc:38 | `new api::Context(api::kCPU)` has no matching delete; every CPU-path call leaks memory |
| 📝 Accidental commit | benchmarks/error_output.txt:1 | A benchmark error-log file should not be committed to the repository |
Overall assessment
The XPU kernel design is sound, the CPU reference implementation matches the XPU3 kernel logic, and test coverage is good. However, removing `padding_sampling_params` from `_normal_sample_xpu` introduces an obvious P0 runtime crash, which must be fixed before this can be merged.
```python
    paddle.reshape(share_inputs["seq_lens_this_time"], shape=[-1]),
    paddle.reshape(share_inputs["seq_lens_encoder"], shape=[-1]),
)
_, next_tokens = top_k_top_p_sampling(
```
🔴 Bug: `_normal_sample_xpu` uses the undefined variables `top_p`, `top_k`, and `topp_seed`, which raises a NameError at runtime.
This PR removes the `padding_sampling_params` call from the function but does not replace it with `build_sampling_params` inside `_normal_sample_xpu`, so these three variables are never defined. In XPU NAIVE mode (spec_method is None or NAIVE) the function crashes as soon as it is entered.
Suggested fix: add the following before the `top_k_top_p_sampling` call (mirroring `_verify_and_sample_xpu`):
```python
top_p, top_k, topp_seed = build_sampling_params(
    sampling_metadata.top_p,
    sampling_metadata.top_k,
    sampling_metadata.seed,
    share_inputs["seq_lens_this_time"],
    share_inputs["seq_lens_encoder"],
    token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
    increment_value=increment_value,
)
```
The function signature also needs an `increment_value: int` parameter, passed in at the call site in `forward_xpu`; a minimal sketch follows.
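A rough sketch of that signature change, assuming the surrounding parameters of `_normal_sample_xpu` (the actual signature is not shown in this diff, so everything except `increment_value` is illustrative):

```python
# Hypothetical signature: only the added increment_value parameter is prescribed
# by the review comment; the other parameters are placeholders.
def _normal_sample_xpu(self, logits, sampling_metadata, share_inputs, increment_value: int):
    top_p, top_k, topp_seed = build_sampling_params(
        sampling_metadata.top_p,
        sampling_metadata.top_k,
        sampling_metadata.seed,
        share_inputs["seq_lens_this_time"],
        share_inputs["seq_lens_encoder"],
        token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
        increment_value=increment_value,
    )
    ...

# And at the call site in forward_xpu, pass the value computed by the model runner:
# next_tokens = self._normal_sample_xpu(logits, sampling_metadata, share_inputs, increment_value)
```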
```cpp
auto xpu_ctx = static_cast<const phi::XPUContext*>(dev_ctx);
api::Context* ctx = xpu_ctx->x_context();
if (top_p.is_cpu()) {
  ctx = new api::Context(api::kCPU);
```
🔴 Bug: `ctx = new api::Context(api::kCPU)` leaks memory: every time the op is called with a CPU tensor a new `api::Context` object is allocated, but it is never deleted before the function returns.
Suggested fix: manage the context with RAII:
```cpp
std::unique_ptr<api::Context> cpu_ctx;
if (top_p.is_cpu()) {
  cpu_ctx = std::make_unique<api::Context>(api::kCPU);
  ctx = cpu_ctx.get();
}
```
```text
RequestFuncOutput(no=2347, request_id='None', generated_text='', reasoning_content='', success=False, latency=0.0, end_timestamp=0.0, output_tokens=0, ttft=0.0, arrival_time=[], itl=[], tpot=0.0, prompt_len=0, prompt_tokens=0, reasoning_tokens=0, res_ttft=0, error='{"error":{"message":"request[chatcmpl-814e8d96-3da8-46b0-b4da-31925c313041] generator error: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192), Traceback (most recent call last):\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/openai/serving_chat.py\\", line 168, in create_chat_completion\\n prompt_token_ids = await self.engine_client.format_and_add_data(current_req_dict)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 300, in format_and_add_data\\n await self.add_requests(request)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 390, in add_requests\\n raise EngineError(error_msg, error_code=400)\\nfastdeploy.utils.EngineError: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192)\\n","type":"invalid_request_error","param":null,"code":null}}', metrics={}, tool_calls=[], output_ids=[])
```
📝 Accidental commit: this file is an error log produced by a benchmark run and should not be committed to the repository. Suggest deleting it and adding benchmarks/error_output.txt (or benchmarks/*.txt) to .gitignore.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
```
@@            Coverage Diff             @@
##           develop    #7738    +/-   ##
==========================================
  Coverage         ?   71.61%
==========================================
  Files            ?      396
  Lines            ?    55568
  Branches         ?     8688
==========================================
  Hits             ?    39794
  Misses           ?    13036
  Partials         ?     2738
```
Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
## Motivation
Replace the Python implementation of `padding_sampling_params` on XPU with the XPU kernel `build_sampling_params`, move the `infer_seed` update into `build_sampling_params`, and align the `infer_seed` `increment_value` stepping with the GPU implementation.
## Modifications

## Usage or Command

## Accuracy Tests

## Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.