【Models】add fleet model fallback #7534
xiaoguoguo626807 wants to merge 81 commits into PaddlePaddle:develop
Conversation
root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Thanks for your contribution!
…e#6992) (PaddlePaddle#7176) * abort api bug fix * bug fix --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
…ions (PaddlePaddle#7542) * fix skip_x_record_stream * fix * optim
…d2h copy (PaddlePaddle#7431) * inplace_copy: encoder_batch_idx/decoder_batch_idx bs == 9 ok * inplace_copy: encoder_seq_lod/decoder_seq_lod bs == 9 ok * inplace_copy: all bs == 9 ok * inplace_copy: all cpu bs == 9 ok * inplace_copy: len_info_cpu bs == 9 ok * finished and rm unused code * prefix_block_tables reuse * refine * improve performance * remove block_table copy to cpu * fix unit test * fix * resolve conflict * refine code * fix * fix * fix * fix * fix * try fix unit tests * fix * tmp save * fix unit test * get_infer_param try less return values * add yinwei fix --------- Co-authored-by: yinwei <yinwei_hust@163.com>
…essing (PaddlePaddle#7485) * [NewFeature] support mm runner * [NewFeature] support mm runner part1 * support mm runner part2 * support mm runner part3 * support mm runner part4
* commit * commit * commit * commit * commit * commit * commit * commit
… in RL (PaddlePaddle#7522) * fix mtp clear graph bugs in rl
* add completions * add unit test * add unit test
… request count (PaddlePaddle#7499) * [Scheduler][BugFix] Fix token_budget calculation to use actual decode request count

## Motivation
The current `token_budget` calculation has two problems:
1. **Over-deduction**: the budget is pre-deducted by `max_num_seqs * tokens_per_seq` instead of by the actual number of decode-phase requests in the running queue, so the tokens available for prefill are underestimated.
2. **Repeated deduction inside the loop**: the decode branch always executes `token_budget -= 1`; under spec decode (`tokens_per_seq > 1`) each decode request deducts only 1, under-deducting by `num_speculative_tokens`. In addition, once prefill requests in the running queue exhaust the budget, decode requests behind them are skipped early by the loop exit condition `token_budget > 0`, so scheduling misses them.

## Modifications
- `resource_manager_v1.py`
  - Add an internal `_is_decoding(request)` method that encapsulates the `num_computed_tokens >= need_prefill_tokens` check and use it consistently
  - Before scheduling, count the actual number of decode requests in the running queue (`num_running_decode_reqs`) and pre-deduct the budget once by `num_running_decode_reqs * tokens_per_seq`, replacing the old `max_num_seqs * tokens_per_seq`
  - Remove the `token_budget -= 1` inside the decode branch (the whole amount is now pre-deducted before the loop)
  - Change the loop exit condition: decode requests are no longer gated by `token_budget <= 0`; only prefill requests exit when the budget is exhausted
- `config.py`
  - Fix the validity check of `max_num_batched_tokens` to account for spec decode, where `tokens_per_seq = num_speculative_tokens + 1`; it now checks `max_num_batched_tokens >= max_num_seqs * tokens_per_seq`

## Usage or Command
```bash
# Normal launch (no spec decode, behavior unchanged)
python -m fastdeploy.entrypoints.openai.api_server \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    ...

# Spec decode (tokens_per_seq = num_speculative_tokens + 1)
# Ensure max_num_batched_tokens >= max_num_seqs * tokens_per_seq, otherwise startup fails
python -m fastdeploy.entrypoints.openai.api_server \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    --num-speculative-tokens 4 \
    ...
```

* [FDConfig][BugFix] Fix AttributeError when speculative_config is SimpleNamespace without num_speculative_tokens

## Motivation
When tests construct `speculative_config` as `SimpleNamespace(method=None)`, the `check()` method in `config.py` directly accesses `self.speculative_config.num_speculative_tokens`, raising `AttributeError: 'types.SimpleNamespace' object has no attribute 'num_speculative_tokens'`. Affected test files:
- tests/v1/test_resource_manager_v1.py
- tests/eplb/test_eplb_utils.py
- tests/eplb/test_experts_manager.py
- tests/v1/cache_manager/test_prefix_cache.py
- tests/v1/test_schedule_output.py

## Modifications
- `fastdeploy/config.py`: fall back with `getattr(..., "num_speculative_tokens", 0)` so a speculative_config object missing the attribute no longer crashes
- Test files: change `speculative_config=SimpleNamespace(method=None)` to `speculative_config=None`, matching the semantics of the no-speculative-decoding case

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
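The budget pre-deduction described in the PaddlePaddle#7499 commit message can be sketched as follows. This is a minimal illustration, not the actual `resource_manager_v1.py` API: the function name, request fields, and return shape are all assumptions.

```python
# Hypothetical sketch of the fix: pre-deduct the budget once by the actual
# number of decode-phase requests, and never budget-gate decode requests.
def schedule_step(running, max_num_batched_tokens, tokens_per_seq):
    def is_decoding(req):
        # mirrors the _is_decoding() check described above
        return req["num_computed_tokens"] >= req["need_prefill_tokens"]

    num_running_decode_reqs = sum(1 for r in running if is_decoding(r))
    # one up-front deduction replaces the old max_num_seqs * tokens_per_seq
    token_budget = max_num_batched_tokens - num_running_decode_reqs * tokens_per_seq

    scheduled = []
    for req in running:
        if is_decoding(req):
            # decode: already paid for above, never skipped by budget
            scheduled.append(req["id"])
        elif token_budget > 0:
            # prefill: gated by the remaining budget
            chunk = min(token_budget,
                        req["need_prefill_tokens"] - req["num_computed_tokens"])
            token_budget -= chunk
            scheduled.append(req["id"])
    return scheduled, token_budget
```

Note how a decode request placed after budget-exhausting prefill requests is still scheduled, which is exactly the miss the original loop exit condition caused.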
… files (PaddlePaddle#7432) Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
…addlePaddle#7553) * Revert "[CI] Temporarily pin paddlepaddle-gpu to 3.5.0.dev20260417 (PaddlePaddle#7486)" This reverts commit c9783a8. * [CI] Mark flash attention and related tests as multi_gpu
… fails (PaddlePaddle#7556) * [BugFix][Metax][KVCache] fix: resolve None callable error when import fails * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * [Metax][FIX] fix ci error caused by pr#7428 --------- Co-authored-by: Guanyu Chen (i26275) <i26275@metax-tech.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Cache queue support ipc * fix
…Paddle#7247) * merge develop * add limit_thinking_content_length_kernel kernel * add test * fix code style * fix_eos_token_id_len_check * fix plugin * support model runner * fix kernel * add reasoning_phase_token_constraint * [XPU] Refactor get_padding_offset to single kernel. (PaddlePaddle#7029) * [XPU] Refactor get_padding_offset to single kernel. * add unittest. * fix codestyle. * remove cum_offsets_now. * remove max_len. * fix xpu pre process * fix code style * fix get padding offset * fix reasoning phase token constraint && add status print for test * add xpu reasoning_phase_token_constraint support in sampler * fix_get_padding_offset * fix_get_padding_offset * fix code style * update model runner * fix limit content length kernel * fix code style * fix cpu wapper * fix code style && rm cum_offsets_out * fix code style * support not have <<tool_call>> * add cpu ctx delete * fix_test * fix cpu ctx * fix_test * fix_test --------- Co-authored-by: Jiajun Ji <jiajunji_ee@163.com> Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
…addle#7568) * set draft_model_use_cudagraph default to true and fix non-mtp cudaGraph in spec-decoding * optimize server log * fix format
* rl support mix_quant * code check
* support blackwell gemm in ll * add attr * opt quant
…ePaddle#7600) * support logprob overlap
…dlePaddle#7471) * [BugFix][KVCache] Fix inference slowdown when enabling CPU cache on Blackwell GPU

On Blackwell GPUs, enabling the CPU cache (num_cpu_blocks > 0) caused a clear inference slowdown. The root cause is that `create_cache_tensor` treated `num_cpu_blocks > 0` as a condition to skip creating the GPU cache tensors, so GPU cache tensor initialization was wrongly skipped on Blackwell.
- `fastdeploy/worker/gpu_model_runner.py`: remove the `num_cpu_blocks > 0` condition from the `create_cache_tensor` check (two places: `init_cache` and `clear_cache`), so GPU cache tensors are still created when the CPU cache is enabled
- `fastdeploy/cache_manager/prefix_cache_manager.py`: move the `--create_cache_tensor` argument out of the non-splitwise conditional and place it under the `kvcache_storage_backend` configuration path, making the logic clearer

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --num-cpu-blocks <N> \
    ...
```

* [BugFix][KVCache] Enlarge prealloc threshold for speculative decoding

## Motivation
Under speculative decoding, each scheduling step consumes `num_spec_tokens` slots at once. The original `prealloc_dec_block_slot_num_threshold` was too small, so block preallocation was not triggered early enough, hurting inference performance.

## Modifications
During `FDConfig` initialization, when speculative decoding is enabled, enlarge `prealloc_dec_block_slot_num_threshold` to the original value times `num_spec_tokens`, while ensuring it does not exceed the enc_dec_block capacity.

## Usage or Command
No extra configuration is needed when speculative decoding is enabled; the threshold adjusts automatically:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --speculative-config '{"method": "draft_model", "num_speculative_tokens": 4}' \
    ...
```

* [BugFix][KVCache][FDConfig] Fix prealloc threshold and create_cache_tensor for splitwise

## Motivation
Two bug fixes:
1. Under speculative decoding, the enlargement factor for prealloc_dec_block_slot_num_threshold should be (num_spec_tokens + 1) rather than num_spec_tokens, so preallocation triggers early enough.
2. When kvcache_storage_backend is enabled, the --create_cache_tensor argument should only be passed in non-splitwise mode, to keep splitwise P nodes from wrongly creating cache tensors.

## Modifications
- fastdeploy/config.py: correct the enlargement factor to (num_spec_tokens + 1), and log the value before and after the change
- fastdeploy/cache_manager/prefix_cache_manager.py: append --create_cache_tensor only in non-splitwise mode

## Usage or Command
```bash
# Launch the service (with speculative decoding + kvcache storage)
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model_path> \
    --speculative-model <draft_model_path> \
    --num-speculative-tokens 3 \
    --kvcache-storage-backend <backend>
```

* [BugFix][SpecDecode] Fix create_cache_tensor condition in MTPProposer

## Motivation
The `create_cache_tensor` condition included `num_cpu_blocks > 0`, breaking MTP's kv cache creation in the Blackwell CPU cache scenario.

## Modifications
Remove the `num_cpu_blocks > 0` condition from the `create_cache_tensor` check, keeping only the `kvcache_storage_backend` and `splitwise_role` checks, so the redundant condition no longer interferes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [BugFix][KVCache] Address review comments: fix negative cap, sync runner fixes, update comments

## Modifications
1. **config.py**: Fix `enc_dec_block_num=0` causing a negative upper bound for `prealloc_dec_block_slot_num_threshold`. Use `max(0, ...)` to guard against the negative cap. Also fix the comment to say `num_spec_tokens + 1` (matching the code).
2. **xpu_model_runner.py / metax_model_runner.py**: Sync the same fix from gpu_model_runner.py: remove `num_cpu_blocks > 0` from the `create_cache_tensor` condition. CPU cache enablement should not prevent GPU runners from creating GPU cache tensors on XPU/Metax platforms either.
3. **gpu_model_runner.py / mtp.py / xpu_model_runner.py / metax_model_runner.py**: Update stale comments to clarify that the CPU cache does NOT prevent GPU cache tensor creation; the cache transfer manager handles the CPU<->GPU swap on top.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
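The threshold adjustment described in this commit message, with the (num_spec_tokens + 1) factor and the max(0, ...) cap guard, can be sketched as below. The helper name and the exact capacity formula are assumptions; the real bound lives in FDConfig.

```python
def adjust_prealloc_threshold(base_threshold, num_spec_tokens,
                              enc_dec_block_num, block_size):
    # Enlarge for speculative decoding: each step may consume up to
    # num_spec_tokens + 1 slots (the draft tokens plus the bonus token).
    scaled = base_threshold * (num_spec_tokens + 1)
    # Cap at the enc/dec block capacity; max(0, ...) guards against a
    # negative cap when enc_dec_block_num == 0.
    cap = max(0, enc_dec_block_num * block_size)
    return min(scaled, cap)
```

With `enc_dec_block_num=0` the function returns 0 instead of a negative threshold, which is the review fix described above.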
…dlePaddle#7671) * support routed_scaling_factor_learnable
…rted (PaddlePaddle#7633) * [BugFix] fix preempted token id not returned when a full batch is aborted * [fix] changed fake_sampled_token_ids shape and filled value * [test] add test * [chore] move code place * [test] add more tests and docstring
…lePaddle#7668) * [Router] Support launch golang-router by python command * Update fastdeploy/golang_router/launch.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update fastdeploy/golang_router/launch.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix(golang_router): fix launch.py bugs and add unit tests Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/57636bb1-779a-417f-934c-07a1462ed41c Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> * fix(build.sh, docs): detect host arch for fd-router download; update router docs Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/7a6cb757-5f4d-4c45-9272-e1e3da43ede4 Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> * [Router] Move fd-router download from build.sh to setup.py - Remove download_fd_router from build.sh (setup.py handles it via CustomBdistWheel.run) - Add download_fd_router to setup.py with aarch64 support - Always register CustomBdistWheel in cmdclass (not gated by rdma_comm_supported) - Add fd-router binary to .gitignore Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Router] Revert doc changes for router.md Will update docs in a separate PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Build] Add BUILD_WHEEL=2 mode to skip custom ops compilation When only Python/build scripts are changed, use `bash build.sh 2` to package the wheel directly without recompiling custom ops, significantly reducing build time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Router] Add deprecation warning to Python Router Print a warning when launching the Python Router, recommending the Golang Router for production use. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Fix] Suppress noisy warnings and replace pkg_resources - Suppress transformers/paddleformers/setuptools warnings on startup - Replace pkg_resources with importlib.metadata to fix ModuleNotFoundError - Change sm_version print to logging.debug Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Test] Fix unit tests for eval.py and golang_router_launch - test_eval.py: replace pkg_resources with importlib.metadata mocks - test_golang_router_launch.py: use patch.object on _launch_module to avoid AttributeError from stub module resolution Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style: run pre-commit to fix code formatting issues Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/91173bf4-9b99-4cf4-b95b-0758fed8abfa --------- Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…_router.launch) (PaddlePaddle#7673) * docs: update router documentation to use Python CLI (python -m fastdeploy.golang_router.launch) Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/955dfc67-4288-4687-bd5a-b7b232fa97e7 Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> * docs: fix duplicate link in best_practices/Disaggregated.md Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/955dfc67-4288-4687-bd5a-b7b232fa97e7 Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
…tion (additional fixes) (PaddlePaddle#7684)
…ort int32) (PaddlePaddle#7648) * fix infer seed * fix infer seed for mtp * fix offset * fix offset
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-07 14:07:03
📋 Review Summary
PR overview: adds PaddleFleet as a model inference backend (--model-impl paddlefleet), reusing FastDeploy's KV Cache and high-performance attention computation by replacing core_attention in the TransformerLayer.
Change scope: model_executor/models/paddleformers/, config.py, engine/args_utils.py, custom_ops/gpu_ops/
Impact tags: [Models] [OP] [FDConfig] [Engine]
📝 PR Convention Check
The title uses Chinese brackets 【Models】, which does not match the required [Models] English square-bracket format; the Accuracy Tests section contains only the template comment with no actual content; Add unit tests is unchecked with no reason given.
Suggested title (copy-paste ready):
[Models] add fleet model fallback
Suggested PR description (copy-paste ready; must reproduce the full structure of the checklist §D2 template):
## Motivation
Adds PaddleFleet as a model inference backend (`--model-impl paddlefleet`). By replacing `core_attention` in the PaddleFleet TransformerLayer with the FastDeploy attention kernel, FastDeploy's KV Cache and high-performance attention computation are reused on top of the PaddleFleet model structure.
## Modifications
- `config.py`: add `paddlefleet` to the `ModelImpl` type definition
- `engine/args_utils.py`: support the `--model-impl paddlefleet` CLI argument and add validation logic
- `model_executor/models/paddleformers/base_fleet.py`: add the `PaddleFleetModelBase` base class, the `FastDeployAttention` layer, and the `patch_paddlefleet_core_attention` replacement function
- `model_executor/models/paddleformers/__init__.py`: register the `PaddleFleetForCausalLM` model class
- `custom_ops/gpu_ops/fused_cast_sigmoid_bias.cu`: add a fused MoE gating op (cast + sigmoid + bias)
- `custom_ops/gpu_ops/speculate_decoding/build_sampling_params_logprob.cu`: add an op that builds logprob sampling parameters
- `custom_ops/gpu_ops/speculate_decoding/speculate_logprob_utils.cu`: rename and extend `SpeculateGetTargetLogits` to `SpeculateGetAcceptTokensAndLogits` (the interface gains 2 parameters)
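As a rough reference for the fused gating op listed above, the un-fused computation it replaces could look like the NumPy snippet below. The operation order (cast, then sigmoid, then bias add) is inferred from the file name; the CUDA kernel defines the actual semantics.

```python
import numpy as np

def fused_cast_sigmoid_bias_ref(gating_logits, bias):
    # Cast low-precision gating logits up to float32 ...
    x = gating_logits.astype(np.float32)
    # ... apply the sigmoid ...
    y = 1.0 / (1.0 + np.exp(-x))
    # ... then add the per-expert gating bias.
    return y + bias.astype(np.float32)
```

Fusing these three elementwise steps into one kernel avoids two intermediate tensor round-trips through GPU memory.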
## Usage or Command
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model /path/to/model \
--model-impl paddlefleet
```
## Accuracy Tests
N/A (this PR adds the PaddleFleet inference backend; logits alignment data against a reference implementation is not yet provided)
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | custom_ops/gpu_ops/speculate_decoding/build_sampling_params_logprob.cu:1 | New CUDA file not added to the setup_ops.py build list; will fail at runtime |
| 📝 PR Convention | - | Title uses Chinese brackets; Accuracy Tests section is empty; Add unit tests lacks a reason |
Overall Assessment
The overall architecture of the PaddleFleet backend is sound, and the three synchronized changes (config.py, args_utils.py, model registration) are all complete. However, build_sampling_params_logprob.cu not being registered in setup_ops.py is a blocking bug and must be fixed before merge.
@@ -0,0 +1,129 @@
// Copyright (c) 2026 PaddlePaddle Authors. All Rights Reserved.
🔴 Bug: new file added but not registered in the build list
build_sampling_params_logprob.cu is a CUDA kernel file added in this PR, but the source file list in custom_ops/setup_ops.py has no entry for it.
Following the handling of fused_cast_sigmoid_bias.cu (already correctly added to setup_ops.py), the entry needs to be added to the corresponding source list in setup_ops.py; otherwise the kernel will not be compiled, and the Python-side call to build_sampling_params_logprob (in sampler.py) will fail at runtime.
Suggested addition to setup_ops.py (next to the other speculate_decoding kernels in the same directory):
"gpu_ops/speculate_decoding/build_sampling_params_logprob.cu",
The CI report is generated from the code below (refreshed every 30 minutes).
1. Task overview: 2 required tasks failed and 6 required tasks are still running; wait for them to finish before evaluating merge conditions.
2. Task status summary
2.1 Required tasks: 2/18 passed
2.2 Optional tasks: 32/45 passed
3. Failure details (required only): Approval (infrastructure; confidence: high)
Root cause details:
Key logs:
Fix suggestions:
Fix suggestion summary: please have the relevant reviewers approve this PR. Link: view logs
Motivation
Adds PaddleFleet as a model inference backend (`--model-impl paddlefleet`). By replacing `core_attention` in the PaddleFleet TransformerLayer with the FastDeploy attention kernel, FastDeploy's KV Cache and high-performance attention computation are reused on top of the PaddleFleet model structure.
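The core_attention replacement could be sketched roughly as below. All class and attribute names here are illustrative; the real implementation lives in `base_fleet.py` and will differ in detail (e.g. how the backend receives cache metadata).

```python
from types import SimpleNamespace

class FastDeployAttention:
    """Stand-in attention layer that routes a TransformerLayer's core
    attention through a FastDeploy attention backend."""

    def __init__(self, fd_attn_backend):
        self.backend = fd_attn_backend

    def __call__(self, q, k, v, **kwargs):
        # The backend is assumed to manage the KV cache internally.
        return self.backend(q, k, v)

def patch_paddlefleet_core_attention(model, fd_attn_backend):
    """Swap core_attention on every transformer layer of the model."""
    for layer in model.layers:
        layer.core_attention = FastDeployAttention(fd_attn_backend)
    return model
```

With this patching approach, PaddleFleet's own layer structure and weights are kept intact, while the attention math and KV cache management come from FastDeploy.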
Modifications
- `config.py`: add `paddlefleet` to the `ModelImpl` type definition
- `engine/args_utils.py`: support the `--model-impl paddlefleet` CLI argument
- `model_executor/models/paddleformers/base_fleet.py`: add the `PaddleFleetModelBase` base class and the `FastDeployAttention` replacement logic
- `model_executor/models/paddleformers/__init__.py`: register the `PaddleFleetForCausalLM` model class
- `worker/worker_process.py`: add the paddlefleet option in sync
Usage or Command
```bash
python -m fastdeploy.entrypoints.openai.api_server \
  --model /path/to/model \
  --model-impl paddlefleet
```
Accuracy Tests
Checklist
- Add at least a tag in the PR title. Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.