[XPU] add build_sampling_params op.#7738
Jiajun-Ji wants to merge 2 commits into PaddlePaddle:develop
Conversation
Thanks for your contribution!
Pull request overview
This PR adds a `build_sampling_params` custom op for the XPU backend, replacing the previous Python sampling-parameter padding logic with an XPU kernel, and moves the `infer_seed` update into the op so that the seed stepping matches the GPU strategy (especially in the speculative decoding scenario).
Changes:
- Add the XPU `build_sampling_params` kernel + plugin wrapper + Paddle static op, and wire it into the XPU speculative verify (TARGET_MATCH) path.
- Introduce `increment_value` on the XPU ModelRunner side (aligned with GPU: 4 for non-speculative, `(num_speculative_tokens + 1) * 4` for speculative; see the sketch after this list) and adjust when `infer_seed` is updated.
- Add the `custom_ops/xpu_ops/test/test_build_sampling_params.py` unit test, which compares against a Python reference implementation and covers several batch shapes plus seed wrap-around.
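A minimal sketch of the increment choice described above; the attribute names (`self.speculative_decoding`, `self.speculative_config.num_speculative_tokens`) are assumptions and may differ from the actual fields in `xpu_model_runner.py`:

```python
# Hypothetical sketch: derive the per-step seed increment, aligned with the GPU runner.
if self.speculative_decoding:
    # speculative decoding: each verify step consumes (num_speculative_tokens + 1) draws
    increment_value = (self.speculative_config.num_speculative_tokens + 1) * 4
else:
    increment_value = 4
```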
PR metadata check (needs completion)
- The title already carries the `[XPU]` tag and matches the required format.
- The "Modifications / Usage or Command / Accuracy Tests" sections of the description are not filled in. If this op can affect sampling results or reproducibility, please add an accuracy comparison plus the corresponding run command and environment info; if no unit test is added or XPU CI cannot be run, state the reason as well (this PR does add a unit test file, but the description should still explain how to run it).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| fastdeploy/worker/xpu_model_runner.py | Computes and passes down `increment_value`, and adjusts the `infer_seed` update logic in the speculative scenario |
| fastdeploy/model_executor/layers/sample/sampler.py | Switches the XPU verify (TARGET_MATCH) path to `build_sampling_params` and passes `increment_value` through |
| custom_ops/xpu_ops/test/test_build_sampling_params.py | New XPU op unit test, validated against a Python reference implementation (see the reference sketch after this table) |
| custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp | New plugin wrapper (CPU + XPU3 dispatch) |
| custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu | New Kunlun3 XPU kernel implementation |
| custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h | Exports the `build_sampling_params` declaration |
| custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc | New Paddle static op registration and call bridging |
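For orientation, here is a rough Python reference of what the op is expected to compute, based only on this review's description. The expansion rule (one output token per prefill/encoder request, `seq_lens_this_time[i]` tokens otherwise), the `MAX_INFER_SEED` value, and the in-place seed stepping are all assumptions, not the authoritative semantics:

```python
import numpy as np

MAX_INFER_SEED = 2**63 - 1  # assumption: the real wrap-around bound lives in the model runner


def build_sampling_params_ref(top_p, top_k, seed, seq_lens_this_time, seq_lens_encoder,
                              token_num_output, increment_value):
    """Pad per-request sampling params to per-output-token arrays (hypothetical reference)."""
    out_p, out_k, out_seed = [], [], []
    for bi, n in enumerate(seq_lens_this_time):
        if n <= 0:
            continue  # finished or empty slot contributes no output tokens
        # assumption: prefill (encoder) requests emit one token, decode requests emit n
        repeat = 1 if seq_lens_encoder[bi] > 0 else int(n)
        out_p += [top_p[bi]] * repeat
        out_k += [top_k[bi]] * repeat
        out_seed += [seed[bi]] * repeat
    # assumption: the op also steps the per-request seed in place, wrapping at MAX_INFER_SEED
    for bi in range(len(seed)):
        seed[bi] = (seed[bi] + increment_value) % MAX_INFER_SEED
    return (np.asarray(out_p)[:token_num_output],
            np.asarray(out_k)[:token_num_output],
            np.asarray(out_seed, dtype=np.int64)[:token_num_output])
```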
```python
# 7. Update 'infer_seed' and step_paddle()
self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
if not self.speculative_decoding:
```
```python
    share_inputs["seq_lens_this_time"],
    share_inputs["seq_lens_encoder"],
    token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
    increment_value=increment_value,
```
```cpp
api::Context* ctx = xpu_ctx->x_context();
if (top_p.is_cpu()) {
  ctx = new api::Context(api::kCPU);
```
```cpp
// Shared prefix-sum buffer: each cluster computes its own pad_start via
// a two-pass scan over seq_lens_this_time / seq_lens_encoder.
// A global scratch area where core 0 of cluster 0 could publish per-batch
// start offsets is not available here, so instead we compute pad_start with
// a sequential scan in core 0 of each cluster.
// Because clusters run concurrently we cannot share a global accumulator;
// instead each cluster independently sums the first `bi` entries.
// This is O(bs) per cluster but bs is typically small (<=512).
```
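Expressed as scalar host-side logic, the scan the comment describes amounts to the following. This is a hypothetical Python rendering of what each cluster's core 0 computes, using the same encoder/decoder expansion assumption as the reference sketch above:

```python
def pad_start_for_batch(bi, seq_lens_this_time, seq_lens_encoder):
    """Sum the output-token counts of the first `bi` requests (assumed semantics)."""
    start = 0
    for i in range(bi):
        n = seq_lens_this_time[i]
        if n <= 0:
            continue  # finished / empty slots contribute nothing
        # assumption: prefill (encoder) requests emit one token, decode requests emit n
        start += 1 if seq_lens_encoder[i] > 0 else n
    return start
```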
CI report generated from the code below (refreshed every 30 minutes):
1 Task overview
2 Task status summary
  2.1 Required tasks: 3/10 passed
  2.2 Optional tasks: 22/26 passed
3 Failure details (required tasks only)
  - Pre Commit (code style, confidence: high)
    Fix summary: fix the undefined variables at sampler.py L1078-1081.
  - Approval (process issue, confidence: high)
    Fix summary: ask the relevant reviewers to complete the PR approval.
    Link: view logs
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-07 19:38:42
📋 Review Summary
PR overview: replaces the Python `padding_sampling_params` implementation on XPU with the `build_sampling_params` XPU kernel, and moves the `infer_seed` update logic into the kernel, aligning the `increment_value` stepping with the GPU.
Change scope: custom_ops/xpu_ops/, fastdeploy/model_executor/layers/sample/sampler.py, fastdeploy/worker/xpu_model_runner.py
Impact tags: [XPU] [OP]
📝 PR Convention Check
The title [XPU] add build_sampling_params op. contains a valid tag and is correctly formatted, but the ## Modifications and ## Usage or Command sections of the PR description are empty and no checklist items are checked, which does not meet the template requirements.
Suggested title (ready to copy):
[XPU][OP] Add build_sampling_params XPU kernel op
Suggested PR description (ready to copy; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Replace the Python implementation of `padding_sampling_params` on XPU with the XPU kernel `build_sampling_params`, move the `infer_seed` update logic into `build_sampling_params`, and align the `infer_seed` `increment_value` stepping with the GPU implementation (`(num_speculative_tokens + 1) * 4` with speculative decoding, `4` for normal inference).
## Modifications
- `custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc`: add the Paddle registration entry for the XPU custom op
- `custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h`: add the `build_sampling_params` function declaration
- `custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu`: add the kernel implementation for KunlunXPU3
- `custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp`: add the CPU reference implementation and the XPU3 wrapper
- `custom_ops/xpu_ops/test/test_build_sampling_params.py`: add unit tests covering pure-decoder, pure-encoder, mixed, single-item, and seed wrap-around scenarios
- `fastdeploy/model_executor/layers/sample/sampler.py`: `_verify_and_sample_xpu` now calls `build_sampling_params`; the `padding_sampling_params` call is removed from `_normal_sample_xpu`
- `fastdeploy/worker/xpu_model_runner.py`: compute `increment_value` dynamically depending on whether speculative decoding is enabled; the speculative decoding path no longer updates `infer_seed` externally
## Usage or Command
N/A
## Accuracy Tests
Accuracy comparison screenshots are attached to this PR; the XPU kernel output matches the Python reference implementation.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/model_executor/layers/sample/sampler.py:1073 | `_normal_sample_xpu` uses the undefined variables top_p/top_k/topp_seed; XPU NAIVE mode is guaranteed to crash |
| 🔴 Bug | custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc:38 | `new api::Context(api::kCPU)` has no matching delete; every CPU-path call leaks memory |
| 📝 Accidental commit | benchmarks/error_output.txt:1 | A benchmark error-log file should not be committed to the repository |
Overall assessment
The XPU kernel design is sound, the CPU reference implementation matches the XPU3 kernel logic, and test coverage is good. However, removing `padding_sampling_params` from `_normal_sample_xpu` introduces an obvious P0 runtime crash, which must be fixed before this can be merged.
```python
    paddle.reshape(share_inputs["seq_lens_this_time"], shape=[-1]),
    paddle.reshape(share_inputs["seq_lens_encoder"], shape=[-1]),
)
_, next_tokens = top_k_top_p_sampling(
```
🔴 Bug: `_normal_sample_xpu` uses the undefined variables `top_p`, `top_k`, and `topp_seed`, which raises a NameError at runtime.
This PR removes the `padding_sampling_params` call from the function but does not replace it with `build_sampling_params` inside `_normal_sample_xpu`, so these three variables are never defined. In XPU NAIVE mode (spec_method is None or NAIVE) the function crashes as soon as it is entered.
Suggested fix: add the following before the `top_k_top_p_sampling` call (mirroring `_verify_and_sample_xpu`):
```python
top_p, top_k, topp_seed = build_sampling_params(
    sampling_metadata.top_p,
    sampling_metadata.top_k,
    sampling_metadata.seed,
    share_inputs["seq_lens_this_time"],
    share_inputs["seq_lens_encoder"],
    token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
    increment_value=increment_value,
)
```
The function signature also needs an `increment_value: int` parameter, passed in at the call site in `forward_xpu`; a minimal sketch follows.
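A rough sketch of that signature change, assuming the surrounding parameters of `_normal_sample_xpu` (the actual signature is not shown in this diff, so everything except `increment_value` is illustrative):

```python
# Hypothetical signature: only the added increment_value parameter is prescribed
# by the review comment; the other parameters are placeholders.
def _normal_sample_xpu(self, logits, sampling_metadata, share_inputs, increment_value: int):
    top_p, top_k, topp_seed = build_sampling_params(
        sampling_metadata.top_p,
        sampling_metadata.top_k,
        sampling_metadata.seed,
        share_inputs["seq_lens_this_time"],
        share_inputs["seq_lens_encoder"],
        token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
        increment_value=increment_value,
    )
    ...

# And at the call site in forward_xpu, pass the value computed by the model runner:
# next_tokens = self._normal_sample_xpu(logits, sampling_metadata, share_inputs, increment_value)
```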
```cpp
auto xpu_ctx = static_cast<const phi::XPUContext*>(dev_ctx);
api::Context* ctx = xpu_ctx->x_context();
if (top_p.is_cpu()) {
  ctx = new api::Context(api::kCPU);
```
🔴 Bug: `ctx = new api::Context(api::kCPU)` leaks memory: every time the op is called with a CPU tensor a new `api::Context` object is allocated, but it is never deleted before the function returns.
Suggested fix: manage the context with RAII:
```cpp
std::unique_ptr<api::Context> cpu_ctx;
if (top_p.is_cpu()) {
  cpu_ctx = std::make_unique<api::Context>(api::kCPU);
  ctx = cpu_ctx.get();
}
```
```text
RequestFuncOutput(no=2347, request_id='None', generated_text='', reasoning_content='', success=False, latency=0.0, end_timestamp=0.0, output_tokens=0, ttft=0.0, arrival_time=[], itl=[], tpot=0.0, prompt_len=0, prompt_tokens=0, reasoning_tokens=0, res_ttft=0, error='{"error":{"message":"request[chatcmpl-814e8d96-3da8-46b0-b4da-31925c313041] generator error: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192), Traceback (most recent call last):\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/openai/serving_chat.py\\", line 168, in create_chat_completion\\n prompt_token_ids = await self.engine_client.format_and_add_data(current_req_dict)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 300, in format_and_add_data\\n await self.add_requests(request)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 390, in add_requests\\n raise EngineError(error_msg, error_code=400)\\nfastdeploy.utils.EngineError: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192)\\n","type":"invalid_request_error","param":null,"code":null}}', metrics={}, tool_calls=[], output_ids=[])
```
📝 Accidental commit: this file is an error log produced by a benchmark run and should not be committed to the repository. Suggest deleting it and adding benchmarks/error_output.txt (or benchmarks/*.txt) to .gitignore.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
```
@@            Coverage Diff             @@
##           develop    #7738    +/-   ##
==========================================
  Coverage         ?   71.61%
==========================================
  Files            ?      396
  Lines            ?    55568
  Branches         ?     8688
==========================================
  Hits             ?    39794
  Misses           ?    13036
  Partials         ?     2738
```
Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
## Motivation
Replace the Python implementation of `padding_sampling_params` on XPU with the XPU kernel `build_sampling_params`, move the `infer_seed` update into `build_sampling_params`, and align the `infer_seed` `increment_value` stepping with the GPU implementation.
## Modifications

## Usage or Command

## Accuracy Tests

## Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.