【Models】add fleet model fallback #7730
xiaoguoguo626807 wants to merge 5174 commits into PaddlePaddle:develop
Conversation
…Generation (PaddlePaddle#7086) Add clear_grpah_opt_backend method that delegates to the underlying model to clear cuda graph optimization backend. Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
…le#6731) * [CI] [Hackathon 10th Spring No.34] add unit tests for async_expert_loader * [CI] [Hackathon 10th Spring No.34] add unit tests for async_expert_loader --------- Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com> Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: "liuruian" <liuruian@baidu.com>
* [BugFix] reset exist tasks signal in clear_data * [Fix] fix stale exist tasks signal after weight update * [Chore] downgrade detected new requests log to DEBUG level * [fix] adjust continue place
* remove ENABLE_V1_DATA_PROCESSOR * fix unit test * fix unit test
…efine PD Disaggregation (PaddlePaddle#7107) * Write the cache of preempted req to storage * up * fix
…e#7085) * [CI] Optimize test execution with single-GPU parallelism and log collection * remove export CUDA_VISIBLE_DEVICES * fix path error * fix log_* path and debug * [CI] Optimize test execution with single-GPU parallelism and log collection
* [Feature] Config eviction_duration * [Feature] Config eviction_duration * [Feature] Config eviction_duration * [Feature] Config eviction_duration --------- Co-authored-by: mouxin <mouxin@baidu.com>
* add docs for disaggregated deployment * pre-commit run for style check * update docs
* [Feature] Config eviction_duration * [Feature] Config eviction_duration * [Feature] Config eviction_duration * [Feature] Config eviction_duration * [Feature] Fix mixed cache-aware --------- Co-authored-by: mouxin <mouxin@baidu.com>
* [XPU] support speculate_pre_process * merge develop * fix codestyle * fix mtp, support cu_seqlens_q_output * fix mtp, support cu_seqlens_q_output * fix test --------- Co-authored-by: lizan1999 <lizan03@baidu.com>
… decoding operators (PaddlePaddle#7121) - Fix accept_idx calculation in spec_set_value_by_stop_seqs - Fix condition check from < to <= for token matching - Fix accept_tokens indexing logic - Remove unnecessary -1 in current_step comparison for max_think_len Co-authored-by: guanshihui <guanshihui@baidu.com>
* support deepgeem for sm103 * add assert * modify code style * add assert * modify sm version condition * remove assert
* fix tool parser
…peculate…" (PaddlePaddle#7133) This reverts commit 9c0c5d6.
…ight update and add unit tests (PaddlePaddle#7083) * [test] add a few unit tests * [feat] update key prefix when model weights are updated * [test] try to fix test_worker_process
…dlePaddle#7471) * [BugFix][KVCache] Fix inference slowdown when enabling CPU cache on Blackwell GPU

On Blackwell (B-series) GPUs, enabling the CPU cache (num_cpu_blocks > 0) caused a noticeable inference slowdown. Root cause: the check in `create_cache_tensor` treated `num_cpu_blocks > 0` as a condition for skipping GPU cache tensor creation, so GPU cache tensor initialization was incorrectly skipped on Blackwell.

- `fastdeploy/worker/gpu_model_runner.py`: remove the `num_cpu_blocks > 0` condition from the `create_cache_tensor` check (two places: `init_cache` and `clear_cache`), so GPU cache tensors are still created when the CPU cache is enabled
- `fastdeploy/cache_manager/prefix_cache_manager.py`: move the `--create_cache_tensor` argument out of the non-splitwise conditional and consolidate it under the `kvcache_storage_backend` configuration path for clearer logic

```bash
python -m fastdeploy.entrypoints.openai.api_server \
  --num-cpu-blocks <N> \
  ...
```

* [BugFix][KVCache] Enlarge prealloc threshold for speculative decoding

## Motivation
Under speculative decoding, each scheduling step consumes `num_spec_tokens` slots at once. The original `prealloc_dec_block_slot_num_threshold` was too small, so block preallocation was not triggered early enough, hurting inference performance.

## Modifications
During `FDConfig` initialization, when speculative decoding is enabled, enlarge `prealloc_dec_block_slot_num_threshold` to its original value multiplied by `num_spec_tokens`, while ensuring it does not exceed the enc_dec_block capacity cap.

## Usage or Command
No extra configuration is needed when speculative decoding is enabled; the threshold adjusts automatically:

```bash
python -m fastdeploy.entrypoints.openai.api_server \
  --speculative-config '{"method": "draft_model", "num_speculative_tokens": 4}' \
  ...
```

* [BugFix][KVCache][FDConfig] Fix prealloc threshold and create_cache_tensor for splitwise

## Motivation
Two bug fixes:
1. Under speculative decoding, the scaling factor for prealloc_dec_block_slot_num_threshold should be (num_spec_tokens + 1) rather than num_spec_tokens, so preallocation triggers early enough.
2. When kvcache_storage_backend is enabled, the --create_cache_tensor argument should only be passed in non-splitwise mode, so splitwise P nodes do not incorrectly create cache tensors.

## Modifications
- fastdeploy/config.py: correct the prealloc scaling factor to (num_spec_tokens + 1), and log the value before and after the change
- fastdeploy/cache_manager/prefix_cache_manager.py: append --create_cache_tensor only in non-splitwise mode

## Usage or Command
```bash
# Launch the service (with speculative decoding + kvcache storage)
python -m fastdeploy.entrypoints.openai.api_server \
  --model <model_path> \
  --speculative-model <draft_model_path> \
  --num-speculative-tokens 3 \
  --kvcache-storage-backend <backend>
```

* [BugFix][SpecDecode] Fix create_cache_tensor condition in MTPProposer

## Motivation
The `create_cache_tensor` condition included `num_cpu_blocks > 0`, which broke MTP's kv cache creation logic in the Blackwell CPU cache scenario.

## Modifications
Remove the `num_cpu_blocks > 0` condition from the `create_cache_tensor` check, keeping only the `kvcache_storage_backend` and `splitwise_role` checks, so the redundant condition no longer interferes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [BugFix][KVCache] Address review comments: fix negative cap, sync runner fixes, update comments

## Modifications
1. **config.py**: Fix `enc_dec_block_num=0` causing a negative upper bound for `prealloc_dec_block_slot_num_threshold`. Use `max(0, ...)` to guard against a negative cap. Also fix the comment to say `num_spec_tokens + 1` (matching the code).
2. **xpu_model_runner.py / metax_model_runner.py**: Sync the same fix from gpu_model_runner.py — remove `num_cpu_blocks > 0` from the `create_cache_tensor` condition. Enabling the CPU cache should not prevent GPU runners from creating GPU cache tensors on XPU/Metax platforms either.
3. **gpu_model_runner.py / mtp.py / xpu_model_runner.py / metax_model_runner.py**: Update stale comments to clarify that the CPU cache does NOT prevent GPU cache tensor creation; the cache transfer manager handles the CPU<->GPU swap on top.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
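The threshold adjustment described in the commits above can be sketched as follows. Function and parameter names are assumed for illustration; the actual `FDConfig` logic may differ in detail:

```python
def adjust_prealloc_threshold(base_threshold: int,
                              num_spec_tokens: int,
                              enc_dec_block_num: int,
                              block_size: int) -> int:
    """Scale the decode-block preallocation threshold for speculative decoding.

    Each scheduling step consumes num_spec_tokens + 1 slots (the draft tokens
    plus the bonus token), so the threshold is scaled by that factor and capped
    at the enc_dec_block capacity; max(0, ...) guards against a negative cap
    when enc_dec_block_num == 0.
    """
    scaled = base_threshold * (num_spec_tokens + 1)
    cap = max(0, enc_dec_block_num * block_size)
    return min(scaled, cap)

print(adjust_prealloc_threshold(4, 3, 2, 64))   # 16: scaled value under the cap
print(adjust_prealloc_threshold(4, 3, 0, 64))   # 0: zero capacity clamps the threshold
```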
…dlePaddle#7671) * support routed_scaling_factor_learnable
…rted (PaddlePaddle#7633) * [BugFix] fix preempted token id not returned when a full batch is aborted * [fix] changed fake_sampled_token_ids shape and filled value * [test] add test * [chore] move code place * [test] add more tests and docstring
…lePaddle#7668) * [Router] Support launch golang-router by python command * Update fastdeploy/golang_router/launch.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update fastdeploy/golang_router/launch.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix(golang_router): fix launch.py bugs and add unit tests Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/57636bb1-779a-417f-934c-07a1462ed41c Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> * fix(build.sh, docs): detect host arch for fd-router download; update router docs Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/7a6cb757-5f4d-4c45-9272-e1e3da43ede4 Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> * [Router] Move fd-router download from build.sh to setup.py - Remove download_fd_router from build.sh (setup.py handles it via CustomBdistWheel.run) - Add download_fd_router to setup.py with aarch64 support - Always register CustomBdistWheel in cmdclass (not gated by rdma_comm_supported) - Add fd-router binary to .gitignore Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Router] Revert doc changes for router.md Will update docs in a separate PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Build] Add BUILD_WHEEL=2 mode to skip custom ops compilation When only Python/build scripts are changed, use `bash build.sh 2` to package the wheel directly without recompiling custom ops, significantly reducing build time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Router] Add deprecation warning to Python Router Print a warning when launching the Python Router, recommending the Golang Router for production use. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Fix] Suppress noisy warnings and replace pkg_resources - Suppress transformers/paddleformers/setuptools warnings on startup - Replace pkg_resources with importlib.metadata to fix ModuleNotFoundError - Change sm_version print to logging.debug Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [Test] Fix unit tests for eval.py and golang_router_launch - test_eval.py: replace pkg_resources with importlib.metadata mocks - test_golang_router_launch.py: use patch.object on _launch_module to avoid AttributeError from stub module resolution Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style: run pre-commit to fix code formatting issues Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/91173bf4-9b99-4cf4-b95b-0758fed8abfa --------- Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…_router.launch) (PaddlePaddle#7673) * docs: update router documentation to use Python CLI (python -m fastdeploy.golang_router.launch) Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/955dfc67-4288-4687-bd5a-b7b232fa97e7 Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> * docs: fix duplicate link in best_practices/Disaggregated.md Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/955dfc67-4288-4687-bd5a-b7b232fa97e7 Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
…tion (additional fixes) (PaddlePaddle#7684)
…ort int32) (PaddlePaddle#7648) * fix infer seed * fix infer seed for mtp * fix offset * fix offset
Thanks for your contribution!
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-07 14:30:35
📋 Review Summary
PR overview: adds PaddleFleet as a model inference backend (--model-impl paddlefleet). By replacing core_attention in the TransformerLayer with the FastDeploy Attention kernel, it reuses FastDeploy's KV Cache and high-performance Attention.
Change scope: model_executor/models/paddleformers/, config.py, engine/args_utils.py, worker/worker_process.py, requirements.txt
Impact tags: [Models] [FDConfig] [Engine]
📝 PR Convention Check
The title uses full-width brackets 【Models】, which violates the D1 convention requiring ASCII square brackets [Models]. The Accuracy Tests section contains only the comment placeholder and is otherwise empty, yet "Provide accuracy results" is checked in the Checklist — a contradiction.
Suggested title (copy-paste ready):
[Models] add PaddleFleet model fallback backend
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Add PaddleFleet as a model inference backend (`--model-impl paddlefleet`). By replacing core_attention in the PaddleFleet TransformerLayer with the FastDeploy Attention kernel, FastDeploy's KV Cache and high-performance Attention computation are reused on top of the PaddleFleet model structure.
## Modifications
- `config.py`: add `paddlefleet` to the `ModelImpl` type definition
- `engine/args_utils.py`: support the `--model-impl paddlefleet` CLI argument; update the list of valid values and the docstring
- `model_executor/models/paddleformers/base_fleet.py`: add the `PaddleFleetModelBase` base class and the `FastDeployAttention` replacement logic; implement the `patch_paddlefleet_core_attention` function
- `model_executor/models/paddleformers/__init__.py`: register the `PaddleFleetForCausalLM` model class
- `model_executor/graph_optimization/decorator.py`: fix `__call__` to support positional arguments `*args`
- `model_executor/layers/rotary_embedding.py`: `get_rope_impl` now parses the `PaddleFleetForCausalLM` architecture name
- `worker/worker_process.py`: add the `paddlefleet` option accordingly
## Usage or Command
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model /path/to/model \
--model-impl paddlefleet
```
## Accuracy Tests
N/A (this PR integrates a new backend framework and has no accuracy comparison data yet; a follow-up logits alignment test against the native FastDeploy backend is needed)
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | base_fleet.py:119 | `assert` used for runtime validation; skipped under `-O`, so `forward_meta=None` passes through silently and causes hard-to-trace downstream errors |
| 🟡 Suggestion | base_fleet.py:372,395 | `print()` left in the `forward` hot path; every inference emits console output, severely hurting throughput |
| ❓ Question | requirements.txt | Pins paddleformers to a specific nightly offline whl (cu126 only); unsuitable for long-term maintenance |
| ❓ Question | base_fleet.py | The comment says "PaddleFleet starts from 1" for `layer_number`, yet `fd_layer_id = layer_number` is assigned directly (with another comment saying "0-indexed") — off-by-one risk |
🔴 base_fleet.py:119 — assert used for runtime validation
assert is skipped entirely under Python's -O optimization mode; when forward_meta is None, None silently flows into fd_attention.forward() and produces a hard-to-trace AttributeError. Following the pattern in base.py:285 in the same directory, change it to an explicit raise ValueError.
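The hazard is easy to reproduce. This standalone snippet runs the same assert with and without `-O` to show the check vanishing under optimization:

```python
import subprocess
import sys
import textwrap

snippet = textwrap.dedent("""
    forward_meta = None
    assert forward_meta is not None, "forward_meta must be provided"
    print("assert was skipped")
""")

# Without -O the assert raises AssertionError; with -O it is compiled away,
# so execution continues as if the check never existed.
normal = subprocess.run([sys.executable, "-c", snippet], capture_output=True, text=True)
optimized = subprocess.run([sys.executable, "-O", "-c", snippet], capture_output=True, text=True)

print(normal.returncode != 0)                    # True: AssertionError fired
print("assert was skipped" in optimized.stdout)  # True: the check vanished
```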
🟡 base_fleet.py:372,395 — print() left in the hot path
Two print() calls remain in the @paddle.no_grad() def forward() hot path; every inference triggers string formatting and console I/O (line 395 additionally serializes the entire Tensor). Remove them, or replace them with logger.debug():
```python
# line 372: remove, or change to
logger.debug("forward_meta: %s", forward_meta)
# line 395: remove, or change to
logger.debug("position_ids: %s", position_ids)
```

❓ requirements.txt — nightly wheel pinning
paddleformers[paddlefleet] @ https://paddle-whl.bj.bcebos.com/nightly/cu126/paddleformers/paddleformers-1.1.0.post20260430-py3-none-any.whl has the following problems:
- The URL is hard-coded; once the link goes stale the package cannot be installed
- It only covers CUDA 12.6; users on other CUDA versions cannot use it
- A nightly build carries no version guarantee and is unsuitable for production
Switch to an official release version, or state in the PR description that this is a temporary measure and outline a follow-up plan.
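If a pin is unavoidable in the interim, one hedged alternative (version bounds here are illustrative, assuming paddleformers publishes stable releases to a regular package index) is a ranged requirement instead of a nightly URL:

```
# requirements.txt — illustrative bounds only; replace with the actual stable release range
paddleformers[paddlefleet]>=1.1.0,<1.2.0
```

A version range survives the nightly bucket being cleaned up and lets pip resolve the right wheel per platform.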
❓ layer_number off-by-one 疑问
注释 # Get layer_number (PaddleFleet starts from 1) 表明 layer_number 从 1 起,但随后 fd_layer_id = layer_number 的注释是 # Get FastDeploy layer ID (0-indexed),两者矛盾。若 FastDeploy KV Cache 按 0-indexed 分配(layer 0 ~ N-1),则 fd_layer_id 应为 layer_number - 1,否则 layer 0 的 KV Cache 永远不会被使用,最后一层可能越界。请确认 Attention(fd_config, layer_id=fd_layer_id) 中 layer_id 的预期取值范围。
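If the 1-indexed reading is correct, the conversion the reviewer expects looks like the sketch below (the function name is hypothetical; the `- 1` is the point):

```python
def fleet_layer_to_fd_layer_id(layer_number: int, num_layers: int) -> int:
    """Map PaddleFleet's 1-indexed layer_number to FastDeploy's 0-indexed layer_id."""
    if not 1 <= layer_number <= num_layers:
        raise ValueError(f"layer_number {layer_number} outside 1..{num_layers}")
    return layer_number - 1

print(fleet_layer_to_fd_layer_id(1, 32))   # 0: the first layer maps to KV cache slot 0
print(fleet_layer_to_fd_layer_id(32, 32))  # 31: the last layer stays in bounds
```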
Overall Assessment
The overall design is clean: patch_paddlefleet_core_attention embeds FastDeploy Attention into the PaddleFleet model structure non-intrusively, which is a sound approach. However, one P0 issue (replacing assert with raise ValueError) must be fixed, two leftover print() debug statements in the hot path hurt performance, and the nightly wheel pin in requirements.txt and the layer_number indexing question need the author's confirmation before merge.
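The non-intrusive patching approach can be illustrated with a toy sketch. All classes and functions here are stand-ins, not the actual FastDeploy/PaddleFleet APIs: each layer's core_attention attribute is replaced in place with a closure bound to that layer's id.

```python
class TransformerLayer:
    """Stand-in for a PaddleFleet transformer layer with a swappable attention callable."""

    def __init__(self) -> None:
        self.core_attention = lambda q, k, v: ("native", q, k, v)


def make_fd_attention(layer_id: int):
    # Closure capturing the per-layer id, mimicking Attention(fd_config, layer_id=...)
    def fd_attention(q, k, v):
        return ("fastdeploy", layer_id, q, k, v)
    return fd_attention


def patch_core_attention(layers) -> None:
    # Replace each layer's attention attribute in place; the layer class is untouched.
    for layer_id, layer in enumerate(layers):
        layer.core_attention = make_fd_attention(layer_id)


layers = [TransformerLayer() for _ in range(3)]
patch_core_attention(layers)
print(layers[0].core_attention(1, 2, 3))  # ('fastdeploy', 0, 1, 2, 3)
```

Patching the instance attribute rather than the class is what keeps the upstream PaddleFleet code unmodified.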
```python
    """
    # Try to get forward_meta from config (PaddleFleet does not pass this parameter when calling)
    forward_meta = getattr(self.config, "forward_meta", None)
    assert forward_meta is not None, "forward_meta must be provided"
```
🔴 Bug: assert is used for runtime validation. Under Python's -O optimization mode the assertion is skipped entirely, so when forward_meta is None, None silently flows into fd_attention.forward() and surfaces later as a hard-to-trace AttributeError.
Following the pattern in base.py:285 in the same directory, change it to an explicit raise ValueError:
```python
if forward_meta is None:
    raise ValueError("forward_meta must be provided")
```
The CI report is generated from the code above (refreshed every 30 minutes):
1. Task overview: no Required checks are configured and no tasks are failing; 1 optional task is still running, so CI has not fully completed.
2. Task status summary
   2.1 Required tasks: 0/0 passed
   2.2 Optional tasks: 1/3 passed
3. Failure details (required only): no failed required tasks.
Motivation
Add PaddleFleet as a model inference backend (--model-impl paddlefleet). By replacing core_attention in the PaddleFleet TransformerLayer with the FastDeploy Attention kernel, FastDeploy's KV Cache and high-performance Attention computation are reused on top of the PaddleFleet model structure.
Modifications
- config.py: add paddlefleet to the ModelImpl type definition
- engine/args_utils.py: support the --model-impl paddlefleet CLI argument
- model_executor/models/paddleformers/base_fleet.py: add the PaddleFleetModelBase base class and the FastDeployAttention replacement logic
- model_executor/models/paddleformers/__init__.py: register the PaddleFleetForCausalLM model class
- worker/worker_process.py: add the paddlefleet option accordingly

Usage or Command
```bash
python -m fastdeploy.entrypoints.openai.api_server \
  --model /path/to/model \
  --model-impl paddlefleet
```

Accuracy Tests
Checklist
- Add at least a tag in the PR title.
  - Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Format your code, run pre-commit before commit.
- If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.