【Models】add fleet model fallback #7534
xiaoguoguo626807 wants to merge 81 commits into PaddlePaddle:develop
Conversation
root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Thanks for your contribution!
…e#6992) (PaddlePaddle#7176) * abort api bug fix * bug fix --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
…ions (PaddlePaddle#7542) * fix skip_x_record_stream * fix * optim
…d2h copy (PaddlePaddle#7431) * inplace_copy: encoder_batch_idx/decoder_batch_idx bs == 9 ok * inplace_copy: encoder_seq_lod/decoder_seq_lod bs == 9 ok * inplace_copy: all bs == 9 ok * inplace_copy: all cpu bs == 9 ok * inplace_copy: len_info_cpu bs == 9 ok * finished and rm unused code * prefix_block_tables reuse * refine * improve performance * remove block_table copy to cpu * fix unit test * fix * resolve conflict * refine code * fix * fix * fix * fix * fix * try fix unit tests * fix * tmp save * fix unit test * get_infer_param try less return values * add yinwei fix --------- Co-authored-by: yinwei <yinwei_hust@163.com>
…essing (PaddlePaddle#7485) * [NewFeature] support mm runner * [NewFeature] support mm runner part1 * support mm runner part2 * support mm runner part3 * support mm runner part4
* commit * commit * commit * commit * commit * commit * commit * commit
… in RL (PaddlePaddle#7522) * fix mtp clear graph bugs in rl
* add completions * add unit test * add unit test
… request count (PaddlePaddle#7499) * [Scheduler][BugFix] Fix token_budget calculation to use actual decode request count

## Motivation
The current `token_budget` calculation has two problems:
1. **Over-deduction**: the budget is pre-deducted by `max_num_seqs * tokens_per_seq` instead of by the actual number of decode-phase requests in the running queue, so the tokens available for prefill are underestimated.
2. **Repeated deduction inside the loop**: the decode branch always executes `token_budget -= 1`; under spec decode (`tokens_per_seq > 1`) each decode request deducts only 1, under-deducting by `num_speculative_tokens`. In addition, once prefill requests in the running queue exhaust the budget, decode requests behind them are skipped early by the loop exit condition `token_budget > 0`, so scheduling misses them.

## Modifications
- `resource_manager_v1.py`
  - Add an internal `_is_decoding(request)` method that encapsulates the `num_computed_tokens >= need_prefill_tokens` check and use it consistently
  - Before scheduling, count the actual number of decode requests in the running queue (`num_running_decode_reqs`) and pre-deduct the budget once by `num_running_decode_reqs * tokens_per_seq`, replacing the old `max_num_seqs * tokens_per_seq`
  - Remove the `token_budget -= 1` inside the decode branch (the whole amount is now pre-deducted before the loop)
  - Change the loop exit condition: decode requests are no longer gated by `token_budget <= 0`; only prefill requests exit when the budget is exhausted
- `config.py`
  - Fix the validity check of `max_num_batched_tokens` to account for spec decode, where `tokens_per_seq = num_speculative_tokens + 1`; it now checks `max_num_batched_tokens >= max_num_seqs * tokens_per_seq`

## Usage or Command
```bash
# Normal launch (no spec decode, behavior unchanged)
python -m fastdeploy.entrypoints.openai.api_server \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    ...

# Spec decode (tokens_per_seq = num_speculative_tokens + 1)
# Ensure max_num_batched_tokens >= max_num_seqs * tokens_per_seq, otherwise startup fails
python -m fastdeploy.entrypoints.openai.api_server \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    --num-speculative-tokens 4 \
    ...
```

* [FDConfig][BugFix] Fix AttributeError when speculative_config is SimpleNamespace without num_speculative_tokens

## Motivation
When tests construct `speculative_config` as `SimpleNamespace(method=None)`, the `check()` method in `config.py` directly accesses `self.speculative_config.num_speculative_tokens`, raising `AttributeError: 'types.SimpleNamespace' object has no attribute 'num_speculative_tokens'`. Affected test files:
- tests/v1/test_resource_manager_v1.py
- tests/eplb/test_eplb_utils.py
- tests/eplb/test_experts_manager.py
- tests/v1/cache_manager/test_prefix_cache.py
- tests/v1/test_schedule_output.py

## Modifications
- `fastdeploy/config.py`: fall back with `getattr(..., "num_speculative_tokens", 0)` so a speculative_config object missing the attribute no longer crashes
- Test files: change `speculative_config=SimpleNamespace(method=None)` to `speculative_config=None`, matching the semantics of the no-speculative-decoding case

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
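The budget pre-deduction described in the PaddlePaddle#7499 commit message can be sketched as follows. This is a minimal illustration, not the actual `resource_manager_v1.py` API: the function name, request fields, and return shape are all assumptions.

```python
# Hypothetical sketch of the fix: pre-deduct the budget once by the actual
# number of decode-phase requests, and never budget-gate decode requests.
def schedule_step(running, max_num_batched_tokens, tokens_per_seq):
    def is_decoding(req):
        # mirrors the _is_decoding() check described above
        return req["num_computed_tokens"] >= req["need_prefill_tokens"]

    num_running_decode_reqs = sum(1 for r in running if is_decoding(r))
    # one up-front deduction replaces the old max_num_seqs * tokens_per_seq
    token_budget = max_num_batched_tokens - num_running_decode_reqs * tokens_per_seq

    scheduled = []
    for req in running:
        if is_decoding(req):
            # decode: already paid for above, never skipped by budget
            scheduled.append(req["id"])
        elif token_budget > 0:
            # prefill: gated by the remaining budget
            chunk = min(token_budget,
                        req["need_prefill_tokens"] - req["num_computed_tokens"])
            token_budget -= chunk
            scheduled.append(req["id"])
    return scheduled, token_budget
```

Note how a decode request placed after budget-exhausting prefill requests is still scheduled, which is exactly the miss the original loop exit condition caused.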
… files (PaddlePaddle#7432) Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
…addlePaddle#7553) * Revert "[CI] Temporarily pin paddlepaddle-gpu to 3.5.0.dev20260417 (PaddlePaddle#7486)" This reverts commit c9783a8. * [CI] Mark flash attention and related tests as multi_gpu
… fails (PaddlePaddle#7556) * [BugFix][Metax][KVCache] fix: resolve None callable error when import fails * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * [Metax][FIX] fix ci error caused by pr#7428 --------- Co-authored-by: Guanyu Chen (i26275) <i26275@metax-tech.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Cache queue support ipc * fix
…Paddle#7247) * merge develop * add limit_thinking_content_length_kernel kernel * add test * fix code style * fix_eos_token_id_len_check * fix plugin * support model runner * fix kernel * add reasoning_phase_token_constraint * [XPU] Refactor get_padding_offset to single kernel. (PaddlePaddle#7029) * [XPU] Refactor get_padding_offset to single kernel. * add unittest. * fix codestyle. * remove cum_offsets_now. * remove max_len. * fix xpu pre process * fix code style * fix get padding offset * fix reasoning phase token constraint && add status print for test * add xpu reasoning_phase_token_constraint support in sampler * fix_get_padding_offset * fix_get_padding_offset * fix code style * update model runner * fix limit content length kernel * fix code style * fix cpu wapper * fix code style && rm cum_offsets_out * fix code style * support not have <<tool_call>> * add cpu ctx delete * fix_test * fix cpu ctx * fix_test * fix_test --------- Co-authored-by: Jiajun Ji <jiajunji_ee@163.com> Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
…addle#7568) * set draft_model_use_cudagraph default to true and fix non-mtp cudaGraph in spec-decoding * optimize server log * fix format
* rl support mix_quant * code check
* support blackwell gemm in ll * add attr * opt quant
…ePaddle#7600) * support logprob overlap
…dlePaddle#7471) * [BugFix][KVCache] Fix inference slowdown when enabling CPU cache on Blackwell GPU

On Blackwell GPUs, enabling the CPU cache (num_cpu_blocks > 0) caused a clear inference slowdown. The root cause is that `create_cache_tensor` treated `num_cpu_blocks > 0` as a condition to skip creating the GPU cache tensors, so GPU cache tensor initialization was wrongly skipped on Blackwell.
- `fastdeploy/worker/gpu_model_runner.py`: remove the `num_cpu_blocks > 0` condition from the `create_cache_tensor` check (two places: `init_cache` and `clear_cache`), so GPU cache tensors are still created when the CPU cache is enabled
- `fastdeploy/cache_manager/prefix_cache_manager.py`: move the `--create_cache_tensor` argument out of the non-splitwise conditional and place it under the `kvcache_storage_backend` configuration path, making the logic clearer

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --num-cpu-blocks <N> \
    ...
```

* [BugFix][KVCache] Enlarge prealloc threshold for speculative decoding

## Motivation
Under speculative decoding, each scheduling step consumes `num_spec_tokens` slots at once. The original `prealloc_dec_block_slot_num_threshold` was too small, so block preallocation was not triggered early enough, hurting inference performance.

## Modifications
During `FDConfig` initialization, when speculative decoding is enabled, enlarge `prealloc_dec_block_slot_num_threshold` to the original value times `num_spec_tokens`, while ensuring it does not exceed the enc_dec_block capacity.

## Usage or Command
No extra configuration is needed when speculative decoding is enabled; the threshold adjusts automatically:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --speculative-config '{"method": "draft_model", "num_speculative_tokens": 4}' \
    ...
```

* [BugFix][KVCache][FDConfig] Fix prealloc threshold and create_cache_tensor for splitwise

## Motivation
Two bug fixes:
1. Under speculative decoding, the enlargement factor for prealloc_dec_block_slot_num_threshold should be (num_spec_tokens + 1) rather than num_spec_tokens, so preallocation triggers early enough.
2. When kvcache_storage_backend is enabled, the --create_cache_tensor argument should only be passed in non-splitwise mode, to keep splitwise P nodes from wrongly creating cache tensors.

## Modifications
- fastdeploy/config.py: correct the enlargement factor to (num_spec_tokens + 1), and log the value before and after the change
- fastdeploy/cache_manager/prefix_cache_manager.py: append --create_cache_tensor only in non-splitwise mode

## Usage or Command
```bash
# Launch the service (with speculative decoding + kvcache storage)
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model_path> \
    --speculative-model <draft_model_path> \
    --num-speculative-tokens 3 \
    --kvcache-storage-backend <backend>
```

* [BugFix][SpecDecode] Fix create_cache_tensor condition in MTPProposer

## Motivation
The `create_cache_tensor` condition included `num_cpu_blocks > 0`, breaking MTP's kv cache creation in the Blackwell CPU cache scenario.

## Modifications
Remove the `num_cpu_blocks > 0` condition from the `create_cache_tensor` check, keeping only the `kvcache_storage_backend` and `splitwise_role` checks, so the redundant condition no longer interferes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [BugFix][KVCache] Address review comments: fix negative cap, sync runner fixes, update comments

## Modifications
1. **config.py**: Fix `enc_dec_block_num=0` causing a negative upper bound for `prealloc_dec_block_slot_num_threshold`. Use `max(0, ...)` to guard against the negative cap. Also fix the comment to say `num_spec_tokens + 1` (matching the code).
2. **xpu_model_runner.py / metax_model_runner.py**: Sync the same fix from gpu_model_runner.py: remove `num_cpu_blocks > 0` from the `create_cache_tensor` condition. CPU cache enablement should not prevent GPU runners from creating GPU cache tensors on XPU/Metax platforms either.
3. **gpu_model_runner.py / mtp.py / xpu_model_runner.py / metax_model_runner.py**: Update stale comments to clarify that the CPU cache does NOT prevent GPU cache tensor creation; the cache transfer manager handles the CPU<->GPU swap on top.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
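The threshold adjustment described in this commit message, with the (num_spec_tokens + 1) factor and the max(0, ...) cap guard, can be sketched as below. The helper name and the exact capacity formula are assumptions; the real bound lives in FDConfig.

```python
def adjust_prealloc_threshold(base_threshold, num_spec_tokens,
                              enc_dec_block_num, block_size):
    # Enlarge for speculative decoding: each step may consume up to
    # num_spec_tokens + 1 slots (the draft tokens plus the bonus token).
    scaled = base_threshold * (num_spec_tokens + 1)
    # Cap at the enc/dec block capacity; max(0, ...) guards against a
    # negative cap when enc_dec_block_num == 0.
    cap = max(0, enc_dec_block_num * block_size)
    return min(scaled, cap)
```

With `enc_dec_block_num=0` the function returns 0 instead of a negative threshold, which is the review fix described above.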
…dlePaddle#7671) * support routed_scaling_factor_learnable
…rted (PaddlePaddle#7633) * [BugFix] fix preempted token id not returned when a full batch is aborted * [fix] changed fake_sampled_token_ids shape and filled value * [test] add test * [chore] move code place * [test] add more tests and docstring
…lePaddle#7668) * [Router] Support launch golang-router by python command * Update fastdeploy/golang_router/launch.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update fastdeploy/golang_router/launch.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix(golang_router): fix launch.py bugs and add unit tests Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/57636bb1-779a-417f-934c-07a1462ed41c Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> * fix(build.sh, docs): detect host arch for fd-router download; update router docs Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/7a6cb757-5f4d-4c45-9272-e1e3da43ede4 Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> * [Router] Move fd-router download from build.sh to setup.py - Remove download_fd_router from build.sh (setup.py handles it via CustomBdistWheel.run) - Add download_fd_router to setup.py with aarch64 support - Always register CustomBdistWheel in cmdclass (not gated by rdma_comm_supported) - Add fd-router binary to .gitignore Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Router] Revert doc changes for router.md Will update docs in a separate PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Build] Add BUILD_WHEEL=2 mode to skip custom ops compilation When only Python/build scripts are changed, use `bash build.sh 2` to package the wheel directly without recompiling custom ops, significantly reducing build time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Router] Add deprecation warning to Python Router Print a warning when launching the Python Router, recommending the Golang Router for production use. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Fix] Suppress noisy warnings and replace pkg_resources - Suppress transformers/paddleformers/setuptools warnings on startup - Replace pkg_resources with importlib.metadata to fix ModuleNotFoundError - Change sm_version print to logging.debug Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Test] Fix unit tests for eval.py and golang_router_launch - test_eval.py: replace pkg_resources with importlib.metadata mocks - test_golang_router_launch.py: use patch.object on _launch_module to avoid AttributeError from stub module resolution Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style: run pre-commit to fix code formatting issues Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/91173bf4-9b99-4cf4-b95b-0758fed8abfa --------- Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…_router.launch) (PaddlePaddle#7673) * docs: update router documentation to use Python CLI (python -m fastdeploy.golang_router.launch) Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/955dfc67-4288-4687-bd5a-b7b232fa97e7 Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> * docs: fix duplicate link in best_practices/Disaggregated.md Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/955dfc67-4288-4687-bd5a-b7b232fa97e7 Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
…tion (additional fixes) (PaddlePaddle#7684)
…ort int32) (PaddlePaddle#7648) * fix infer seed * fix infer seed for mtp * fix offset * fix offset
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-07 14:07:03
📋 Review Summary
PR overview: adds PaddleFleet as a model inference backend (--model-impl paddlefleet), reusing FastDeploy's KV Cache and high-performance attention computation by replacing core_attention in the TransformerLayer.
Change scope: model_executor/models/paddleformers/, config.py, engine/args_utils.py, custom_ops/gpu_ops/
Impact tags: [Models] [OP] [FDConfig] [Engine]
📝 PR Convention Check
The title uses Chinese brackets 【Models】, which does not match the required [Models] English square-bracket format; the Accuracy Tests section contains only the template comment with no actual content; Add unit tests is unchecked with no reason given.
Suggested title (copy-paste ready):
[Models] add fleet model fallback
Suggested PR description (copy-paste ready; must reproduce the full structure of the checklist §D2 template):
## Motivation
Adds PaddleFleet as a model inference backend (`--model-impl paddlefleet`). By replacing `core_attention` in the PaddleFleet TransformerLayer with the FastDeploy attention kernel, FastDeploy's KV Cache and high-performance attention computation are reused on top of the PaddleFleet model structure.
## Modifications
- `config.py`: add `paddlefleet` to the `ModelImpl` type definition
- `engine/args_utils.py`: support the `--model-impl paddlefleet` CLI argument and add validation logic
- `model_executor/models/paddleformers/base_fleet.py`: add the `PaddleFleetModelBase` base class, the `FastDeployAttention` layer, and the `patch_paddlefleet_core_attention` replacement function
- `model_executor/models/paddleformers/__init__.py`: register the `PaddleFleetForCausalLM` model class
- `custom_ops/gpu_ops/fused_cast_sigmoid_bias.cu`: add a fused MoE gating op (cast + sigmoid + bias)
- `custom_ops/gpu_ops/speculate_decoding/build_sampling_params_logprob.cu`: add an op that builds logprob sampling parameters
- `custom_ops/gpu_ops/speculate_decoding/speculate_logprob_utils.cu`: rename and extend `SpeculateGetTargetLogits` to `SpeculateGetAcceptTokensAndLogits` (the interface gains 2 parameters)
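As a rough reference for the fused gating op listed above, the un-fused computation it replaces could look like the NumPy snippet below. The operation order (cast, then sigmoid, then bias add) is inferred from the file name; the CUDA kernel defines the actual semantics.

```python
import numpy as np

def fused_cast_sigmoid_bias_ref(gating_logits, bias):
    # Cast low-precision gating logits up to float32 ...
    x = gating_logits.astype(np.float32)
    # ... apply the sigmoid ...
    y = 1.0 / (1.0 + np.exp(-x))
    # ... then add the per-expert gating bias.
    return y + bias.astype(np.float32)
```

Fusing these three elementwise steps into one kernel avoids two intermediate tensor round-trips through GPU memory.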
## Usage or Command
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model /path/to/model \
--model-impl paddlefleet
```
## Accuracy Tests
N/A (this PR adds the PaddleFleet inference backend; logits alignment data against a reference implementation is not yet provided)
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | custom_ops/gpu_ops/speculate_decoding/build_sampling_params_logprob.cu:1 | New CUDA file not added to the setup_ops.py build list; will fail at runtime |
| 📝 PR Convention | - | Title uses Chinese brackets; Accuracy Tests section is empty; Add unit tests lacks a reason |
Overall Assessment
The overall architecture of the PaddleFleet backend is sound, and the three synchronized changes (config.py, args_utils.py, model registration) are all complete. However, build_sampling_params_logprob.cu not being registered in setup_ops.py is a blocking bug and must be fixed before merge.
@@ -0,0 +1,129 @@
// Copyright (c) 2026 PaddlePaddle Authors. All Rights Reserved.
🔴 Bug: new file added but not registered in the build list
build_sampling_params_logprob.cu is a CUDA kernel file added in this PR, but the source file list in custom_ops/setup_ops.py has no entry for it.
Following the handling of fused_cast_sigmoid_bias.cu (already correctly added to setup_ops.py), the entry needs to be added to the corresponding source list in setup_ops.py; otherwise the kernel will not be compiled, and the Python-side call to build_sampling_params_logprob (in sampler.py) will fail at runtime.
Suggested addition to setup_ops.py (next to the other speculate_decoding kernels in the same directory):
"gpu_ops/speculate_decoding/build_sampling_params_logprob.cu",
The CI report is generated from the code below (refreshed every 30 minutes).
1. Task overview: 2 required tasks failed and 6 required tasks are still running; wait for them to finish before evaluating merge conditions.
2. Task status summary
2.1 Required tasks: 2/18 passed
2.2 Optional tasks: 32/45 passed
3. Failure details (required only): Approval (infrastructure; confidence: high)
Root cause details:
Key logs:
Fix suggestions:
Fix suggestion summary: please have the relevant reviewers approve this PR. Link: view logs
Motivation
Adds PaddleFleet as a model inference backend (`--model-impl paddlefleet`). By replacing `core_attention` in the PaddleFleet TransformerLayer with the FastDeploy attention kernel, FastDeploy's KV Cache and high-performance attention computation are reused on top of the PaddleFleet model structure.
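The core_attention replacement could be sketched roughly as below. All class and attribute names here are illustrative; the real implementation lives in `base_fleet.py` and will differ in detail (e.g. how the backend receives cache metadata).

```python
from types import SimpleNamespace

class FastDeployAttention:
    """Stand-in attention layer that routes a TransformerLayer's core
    attention through a FastDeploy attention backend."""

    def __init__(self, fd_attn_backend):
        self.backend = fd_attn_backend

    def __call__(self, q, k, v, **kwargs):
        # The backend is assumed to manage the KV cache internally.
        return self.backend(q, k, v)

def patch_paddlefleet_core_attention(model, fd_attn_backend):
    """Swap core_attention on every transformer layer of the model."""
    for layer in model.layers:
        layer.core_attention = FastDeployAttention(fd_attn_backend)
    return model
```

With this patching approach, PaddleFleet's own layer structure and weights are kept intact, while the attention math and KV cache management come from FastDeploy.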
Modifications
- `config.py`: add `paddlefleet` to the `ModelImpl` type definition
- `engine/args_utils.py`: support the `--model-impl paddlefleet` CLI argument
- `model_executor/models/paddleformers/base_fleet.py`: add the `PaddleFleetModelBase` base class and the `FastDeployAttention` replacement logic
- `model_executor/models/paddleformers/__init__.py`: register the `PaddleFleetForCausalLM` model class
- `worker/worker_process.py`: add the paddlefleet option in sync
Usage or Command
```bash
python -m fastdeploy.entrypoints.openai.api_server \
  --model /path/to/model \
  --model-impl paddlefleet
```
Accuracy Tests
Checklist
- Add at least a tag in the PR title. Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.