
【Models】add fleet model fallback #7730

Closed
xiaoguoguo626807 wants to merge 5174 commits into PaddlePaddle:develop from xiaoguoguo626807:fd

Conversation


@xiaoguoguo626807 xiaoguoguo626807 commented May 7, 2026

Motivation

Add PaddleFleet as a model inference backend (--model-impl paddlefleet). By replacing core_attention in the PaddleFleet TransformerLayer with the FastDeploy Attention kernel, the PaddleFleet model structure can reuse FastDeploy's KV Cache and high-performance Attention computation.
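For orientation, a minimal sketch of the substitution described above, with hypothetical attribute names and import path rather than the actual base_fleet.py implementation:

```python
# Sketch only: illustrates the core_attention substitution this PR performs.
# `transformer_layers` and the import path are assumptions, not the real API;
# the actual logic lives in model_executor/models/paddleformers/base_fleet.py.
from fastdeploy.model_executor.layers.attention import Attention  # assumed path

def patch_core_attention(model, fd_config):
    """Swap each PaddleFleet TransformerLayer's core_attention for a
    FastDeploy Attention kernel so FastDeploy's KV Cache is reused."""
    for layer_id, layer in enumerate(model.transformer_layers):
        layer.core_attention = Attention(fd_config, layer_id=layer_id)
```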

Modifications

  • config.py: add paddlefleet to the ModelImpl type definition
  • engine/args_utils.py: support the --model-impl paddlefleet CLI argument
  • model_executor/models/paddleformers/base_fleet.py: add the PaddleFleetModelBase base class and the FastDeployAttention replacement logic
  • model_executor/models/paddleformers/__init__.py: register the PaddleFleetForCausalLM model class
  • worker/worker_process.py: add the matching paddlefleet option

Usage or Command

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --model-impl paddlefleet
```

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

liyonghua0910 and others added 30 commits March 31, 2026 10:52
…Generation (PaddlePaddle#7086)

Add clear_grpah_opt_backend method that delegates to the underlying model
to clear cuda graph optimization backend.

Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
…le#6731)

* [CI]【Hackathon 10th Spring No.34】Supplement unit tests for async_expert_loader

* [CI]【Hackathon 10th Spring No.34】Supplement unit tests for async_expert_loader
---------

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: “liuruian” <liuruian@baidu.com>
* [BugFix] reset exist tasks signal in clear_data

* [Fix] fix stale exist tasks signal after weight update

* [Chore] downgrade detected new requests log to DEBUG level

* [fix] adjust continue place
* remove ENABLE_V1_DATA_PROCESSOR

* fix unit test

* fix unit test
…efine PD Disaggregation (PaddlePaddle#7107)

* Write the cache of preempted req to storage

* up

* fix
…e#7085)

* [CI] Optimize test execution with single-GPU parallelism and log collection

* remove export CUDA_VISIBLE_DEVICES

* fix path error

* fix log_* path and debug

* [CI] Optimize test execution with single-GPU parallelism and log collection
* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

---------

Co-authored-by: mouxin <mouxin@baidu.com>
* add docs for disaggregated deployment

* pre-commit run for style check

* update docs
* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

* [Feature] Fix mixed cache-aware

---------

Co-authored-by: mouxin <mouxin@baidu.com>
* [XPU] support speculate_pre_process

* merge develop

* fix codestyle

* fix mtp, support cu_seqlens_q_output

* fix mtp, support cu_seqlens_q_output

* fix test

---------

Co-authored-by: lizan1999 <lizan03@baidu.com>
… decoding operators (PaddlePaddle#7121)

- Fix accept_idx calculation in spec_set_value_by_stop_seqs
- Fix condition check from < to <= for token matching
- Fix accept_tokens indexing logic
- Remove unnecessary -1 in current_step comparison for max_think_len

Co-authored-by: guanshihui] <guanshihui@baidu.com>
* support deepgeem for sm103

* add assert

* modify code style

* add assert

* modify sm version condition

* remove assert
…ight update and add unit tests (PaddlePaddle#7083)

* [test] add a few unit tests

* [feat] update key prefix when model weights are updated

* [test] try to fix test_worker_process
kevincheng2 and others added 17 commits April 29, 2026 19:51
…dlePaddle#7471)

* [BugFix][KVCache] Fix inference slowdown when enabling CPU cache on Blackwell GPU

Enabling the CPU cache (num_cpu_blocks > 0) on Blackwell (B-series) GPUs caused a noticeable inference slowdown.
The root cause: the `create_cache_tensor` check treated `num_cpu_blocks > 0` as a reason to skip GPU cache tensor creation, so GPU cache tensor initialization was incorrectly skipped on Blackwell GPUs.

- `fastdeploy/worker/gpu_model_runner.py`: remove the `num_cpu_blocks > 0` condition from the `create_cache_tensor` check (two places: `init_cache` and `clear_cache`) so GPU cache tensors are still created when the CPU cache is enabled; a sketch of this condition change follows the command below
- `fastdeploy/cache_manager/prefix_cache_manager.py`: move the `--create_cache_tensor` argument out of the non-splitwise conditional and unify it under the `kvcache_storage_backend` configuration path for clearer logic

```bash
python -m fastdeploy.entrypoints.openai.api_server \
  --num-cpu-blocks <N> \
  ...
```
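A minimal sketch of the condition change described above; the function and argument names are assumptions, not the actual gpu_model_runner.py code:

```python
# Sketch only: hypothetical shape of the create_cache_tensor decision.
def should_create_gpu_cache_tensor(kvcache_storage_backend, num_cpu_blocks):
    # Before the fix, an enabled CPU cache also skipped GPU tensor creation:
    #   return not (kvcache_storage_backend or num_cpu_blocks > 0)
    # After the fix, only the storage backend decides; the cache transfer
    # manager handles the CPU<->GPU swap on top of the GPU tensors.
    return not kvcache_storage_backend
```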

* [BugFix][KVCache] Enlarge prealloc threshold for speculative decoding

## Motivation

In speculative decoding scenarios, each scheduling step consumes `num_spec_tokens` slots at once. The original `prealloc_dec_block_slot_num_threshold` was too small, so block preallocation was not triggered promptly enough, hurting inference performance.

## Modifications

During `FDConfig` initialization, when speculative decoding is enabled, scale `prealloc_dec_block_slot_num_threshold` up by a factor of `num_spec_tokens`, while ensuring it does not exceed the enc_dec_block capacity cap.

## Usage or Command

When speculative decoding is enabled, no extra configuration is needed; the threshold is adjusted automatically:

```bash
python -m fastdeploy.entrypoints.openai.api_server \
  --speculative-config '{"method": "draft_model", "num_speculative_tokens": 4}' \
  ...
```

* [BugFix][KVCache][FDConfig] Fix prealloc threshold and create_cache_tensor for splitwise

## Motivation

Two bug fixes:
1. In speculative decoding scenarios, the scaling factor for prealloc_dec_block_slot_num_threshold should be (num_spec_tokens + 1) rather than num_spec_tokens, so that preallocation triggers early enough.
2. When kvcache_storage_backend is enabled, the --create_cache_tensor argument should only be passed in non-splitwise mode, to prevent splitwise P nodes from incorrectly creating cache tensors.

## Modifications

- fastdeploy/config.py: correct the prealloc scaling factor to (num_spec_tokens + 1) and log the value before and after the change
- fastdeploy/cache_manager/prefix_cache_manager.py: append --create_cache_tensor only in non-splitwise mode

## Usage or Command

```bash
# Launch the service (with speculative decoding + kvcache storage)
python -m fastdeploy.entrypoints.openai.api_server \
  --model <model_path> \
  --speculative-model <draft_model_path> \
  --num-speculative-tokens 3 \
  --kvcache-storage-backend <backend>
```

* [BugFix][SpecDecode] Fix create_cache_tensor condition in MTPProposer

## Motivation

The `create_cache_tensor` check included the `num_cpu_blocks > 0` condition, which broke MTP's kv cache creation logic in the Blackwell-GPU CPU-cache scenario.

## Modifications

Remove the `num_cpu_blocks > 0` condition from the `create_cache_tensor` check, keeping only the `kvcache_storage_backend` and `splitwise_role` checks, to avoid interference from the redundant condition.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [BugFix][KVCache] Address review comments: fix negative cap, sync runner fixes, update comments

## Modifications

1. **config.py**: Fix `enc_dec_block_num=0` causing a negative upper bound for
   `prealloc_dec_block_slot_num_threshold`. Use `max(0, ...)` to guard against a
   negative cap. Also fix the comment to say `num_spec_tokens + 1` (matching the
   code); see the sketch after this list.

2. **xpu_model_runner.py / metax_model_runner.py**: Sync the same fix from
   gpu_model_runner.py — remove `num_cpu_blocks > 0` from `create_cache_tensor`
   condition. CPU cache enablement should not prevent GPU runners from creating
   GPU cache tensors on XPU/Metax platforms either.

3. **gpu_model_runner.py / mtp.py / xpu_model_runner.py / metax_model_runner.py**:
   Update stale comments to clarify that CPU cache does NOT prevent GPU cache
   tensor creation; cache transfer manager handles CPU<->GPU swap on top.
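A minimal sketch of the resulting threshold adjustment, with hypothetical names standing in for the FDConfig fields (the real logic lives in fastdeploy/config.py):

```python
# Sketch only: combine the (num_spec_tokens + 1) scaling with the
# max(0, ...) guard described above. Names are assumptions, not FDConfig's API.
def adjust_prealloc_threshold(threshold, num_spec_tokens, enc_dec_block_capacity):
    # Each speculative step consumes (num_spec_tokens + 1) slots per request,
    # so scale the preallocation threshold accordingly, capped by the
    # enc_dec_block capacity; guard against a negative cap when
    # enc_dec_block_num == 0.
    cap = max(0, enc_dec_block_capacity)
    return min(threshold * (num_spec_tokens + 1), cap)
```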

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…rted (PaddlePaddle#7633)

* [BugFix] fix preempted token id not returned when a full batch is aborted

* [fix] changed fake_sampled_token_ids shape and filled value

* [test] add test

* [chore] move code place

* [test] add more tests and docstring
…lePaddle#7668)

* [Router] Support launch golang-router by python command

* Update fastdeploy/golang_router/launch.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fastdeploy/golang_router/launch.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix(golang_router): fix launch.py bugs and add unit tests

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/57636bb1-779a-417f-934c-07a1462ed41c

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* fix(build.sh, docs): detect host arch for fd-router download; update router docs

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/7a6cb757-5f4d-4c45-9272-e1e3da43ede4

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* [Router] Move fd-router download from build.sh to setup.py

- Remove download_fd_router from build.sh (setup.py handles it via CustomBdistWheel.run; see the sketch below)
- Add download_fd_router to setup.py with aarch64 support
- Always register CustomBdistWheel in cmdclass (not gated by rdma_comm_supported)
- Add fd-router binary to .gitignore
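A minimal sketch of the setup.py arrangement described above, under the assumption that download_fd_router and the URL template look roughly like this (not the actual setup.py):

```python
# Sketch only: hook the fd-router download into wheel packaging.
import platform
import urllib.request

from setuptools import setup
from wheel.bdist_wheel import bdist_wheel

FD_ROUTER_URL = "https://example.com/fd-router-{arch}"  # hypothetical URL template

def download_fd_router():
    # Hypothetical helper: pick the binary matching the host architecture.
    arch = "aarch64" if platform.machine() == "aarch64" else "x86_64"
    urllib.request.urlretrieve(FD_ROUTER_URL.format(arch=arch), "fd-router")

class CustomBdistWheel(bdist_wheel):
    def run(self):
        download_fd_router()
        super().run()

# Register unconditionally, not gated by rdma_comm_supported.
setup(cmdclass={"bdist_wheel": CustomBdistWheel})
```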

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [Router] Revert doc changes for router.md

Will update docs in a separate PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [Build] Add BUILD_WHEEL=2 mode to skip custom ops compilation

When only Python/build scripts are changed, use `bash build.sh 2` to
package the wheel directly without recompiling custom ops, significantly
reducing build time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [Router] Add deprecation warning to Python Router

Print a warning when launching the Python Router, recommending
the Golang Router for production use.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [Fix] Suppress noisy warnings and replace pkg_resources

- Suppress transformers/paddleformers/setuptools warnings on startup
- Replace pkg_resources with importlib.metadata to fix ModuleNotFoundError (see the sketch below)
- Change sm_version print to logging.debug
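A minimal sketch of the pkg_resources → importlib.metadata migration this commit describes; the distribution name is an assumption for illustration:

```python
# Before (deprecated, pulls in setuptools at runtime):
#   import pkg_resources
#   v = pkg_resources.get_distribution("fastdeploy").version
# After (standard library since Python 3.8):
from importlib.metadata import version

v = version("fastdeploy")  # hypothetical distribution name
```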

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [Test] Fix unit tests for eval.py and golang_router_launch

- test_eval.py: replace pkg_resources with importlib.metadata mocks
- test_golang_router_launch.py: use patch.object on _launch_module to
  avoid AttributeError from stub module resolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: run pre-commit to fix code formatting issues

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/91173bf4-9b99-4cf4-b95b-0758fed8abfa

---------

Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…_router.launch) (PaddlePaddle#7673)

* docs: update router documentation to use Python CLI (python -m fastdeploy.golang_router.launch)

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/955dfc67-4288-4687-bd5a-b7b232fa97e7

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* docs: fix duplicate link in best_practices/Disaggregated.md

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/955dfc67-4288-4687-bd5a-b7b232fa97e7

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
…ort int32) (PaddlePaddle#7648)

* fix infer seed

* fix infer seed for mtp

* fix offset

* fix offset

CLAassistant commented May 7, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
29 out of 31 committers have signed the CLA.

✅ Deleter-D
✅ EmmonsCurse
✅ xjkmfa
✅ ChowMingSing
✅ cmcamdy
✅ plusNew001
✅ wuyujiji
✅ chang-wenbin
✅ juncaipeng
✅ Jiang-Jia-Jun
✅ iosmers
✅ BingooYang
✅ ApplEOFDiscord
✅ jackyYang6
✅ kevincheng2
✅ zhoutianzi666
✅ zoooo0820
✅ qwes5s5
✅ Dryoung95
✅ zhupengyang
✅ ckl117
✅ liyonghua0910
✅ lizhenyun01
✅ Jiajun-Ji
✅ luukunn
✅ xyxinyang
✅ xiaoguoguo626807
✅ gongshaotian
✅ K11OntheBoat
❌ rain7996
❌ Copilot
You have signed the CLA already but the status is still pending? Let us recheck it.


paddle-bot Bot commented May 7, 2026

Thanks for your contribution!

@xiaoguoguo626807 xiaoguoguo626807 changed the title Fd 【Models】add fleet model fallback May 7, 2026
@PaddlePaddle-bot PaddlePaddle-bot left a comment (this comment was marked as outdated)

🤖 Paddle-CI-Agent | pr_review | 2026-05-07 14:30:35

📋 Review Summary

PR overview: adds PaddleFleet as a model inference backend (--model-impl paddlefleet); by replacing core_attention in the TransformerLayer with the FastDeploy Attention kernel, it reuses the KV Cache and high-performance Attention.

Scope of changes: model_executor/models/paddleformers/, config.py, engine/args_utils.py, worker/worker_process.py, requirements.txt

Impact tags: [Models] [FDConfig] [Engine]


📝 PR Convention Check

The title uses Chinese full-width brackets 【Models】, which does not match the ASCII square-bracket format [Models] required by rule D1. The Accuracy Tests section contains only a comment placeholder and is otherwise empty, yet "Provide accuracy results" is checked in the Checklist, which is contradictory.

Suggested title (copy-paste ready):

  • [Models] add PaddleFleet model fallback backend

Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):

## Motivation
Add PaddleFleet as a model inference backend (`--model-impl paddlefleet`). By replacing core_attention in the PaddleFleet TransformerLayer with the FastDeploy Attention kernel, the PaddleFleet model structure reuses FastDeploy's KV Cache and high-performance Attention computation.

## Modifications
- `config.py`: add `paddlefleet` to the `ModelImpl` type definition
- `engine/args_utils.py`: support the `--model-impl paddlefleet` CLI argument; update the list of valid values and the docstring
- `model_executor/models/paddleformers/base_fleet.py`: add the `PaddleFleetModelBase` base class and the `FastDeployAttention` replacement logic; implement the `patch_paddlefleet_core_attention` function
- `model_executor/models/paddleformers/__init__.py`: register the `PaddleFleetForCausalLM` model class
- `model_executor/graph_optimization/decorator.py`: fix `__call__` to support positional arguments `*args`
- `model_executor/layers/rotary_embedding.py`: add architecture-name parsing for `PaddleFleetForCausalLM` in `get_rope_impl`
- `worker/worker_process.py`: add the matching `paddlefleet` option

## Usage or Command
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --model-impl paddlefleet
```

## Accuracy Tests
N/A (this PR integrates a new backend framework and accuracy comparison data is not yet available; a logits-alignment test against the native FastDeploy backend still needs to be added)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

Severity | File | Summary
--- | --- | ---
🔴 Bug | base_fleet.py:119 | assert is used for runtime validation; under -O optimized mode assertions are skipped, so forward_meta=None passes through silently and triggers hard-to-trace downstream errors
🟡 Suggestion | base_fleet.py:372,395 | print() left in the forward hot path produces console output on every inference and severely hurts throughput
❓ Question | requirements.txt | paddleformers is pinned to a specific nightly offline wheel (cu126 only), unsuitable for long-term maintenance
❓ Question | base_fleet.py | the layer_number comment says "PaddleFleet starts from 1", yet fd_layer_id = layer_number is assigned directly (with another comment saying "0-indexed"); off-by-one risk

🔴 base_fleet.py:119 — assert used for runtime validation

assert is skipped entirely under Python's -O optimized mode, so when forward_meta is None the None silently flows into fd_attention.forward(), producing a hard-to-trace AttributeError. Follow the pattern of base.py:285 in the same directory and raise ValueError explicitly.

🟡 base_fleet.py:372,395 — print() left in the hot path

Two print() calls remain in the @paddle.no_grad() def forward() hot path; every inference triggers string formatting and console I/O (line 395 even serializes the whole Tensor). Delete them or replace them with logger.debug():

```python
# line 372: delete, or change to
logger.debug("forward_meta: %s", forward_meta)
# line 395: delete, or change to
logger.debug("position_ids: %s", position_ids)
```

requirements.txt — nightly wheel pinning

paddleformers[paddlefleet] @ https://paddle-whl.bj.bcebos.com/nightly/cu126/paddleformers/paddleformers-1.1.0.post20260430-py3-none-any.whl has the following problems:

  1. A hard-coded URL; installation breaks once the link goes stale
  2. Covers CUDA 12.6 only; users on other CUDA versions cannot install it
  3. A nightly build carries no version guarantees and is unsuitable for production

Prefer an official release version, or state in the PR description that this is a temporary measure and outline the follow-up plan; a hypothetical pin is sketched below.
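For illustration only, a hypothetical requirements.txt entry pinning an official release instead of the nightly wheel (the exact released version is an assumption):

```
# Hypothetical: pin an official release rather than a hard-coded nightly wheel URL
paddleformers[paddlefleet]==1.1.0
```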

layer_number off-by-one question

The comment # Get layer_number (PaddleFleet starts from 1) indicates layer_number starts at 1, yet the subsequent fd_layer_id = layer_number carries the comment # Get FastDeploy layer ID (0-indexed); the two contradict each other. If the FastDeploy KV Cache is allocated 0-indexed (layers 0 through N-1), fd_layer_id should be layer_number - 1; otherwise layer 0's KV Cache is never used and the last layer may go out of bounds. Please confirm the expected range of layer_id in Attention(fd_config, layer_id=fd_layer_id). A sketch of the suspected fix follows.
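A minimal sketch of the mapping the reviewer suspects is needed, pending the author's confirmation of the layer_id convention (names mirror the snippet quoted above):

```python
# Hypothetical fix, if FastDeploy's layer_id is confirmed to be 0-indexed:
# map PaddleFleet's 1-indexed layer_number down by one before constructing
# the FastDeploy Attention layer.
fd_layer_id = layer_number - 1  # layer_number comes from PaddleFleet (1-indexed)
fd_attention = Attention(fd_config, layer_id=fd_layer_id)
```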


Overall Assessment

The overall design is clear: patch_paddlefleet_core_attention embeds FastDeploy Attention into the PaddleFleet model structure non-invasively, which is a sound approach. However, one P0 issue (assert instead of raise ValueError) must be fixed; two print() debug statements remain in the hot path and hurt performance; and the requirements.txt nightly wheel pinning and the layer_number indexing question need the author's confirmation before this can be merged.

"""
# Try to get forward_meta from config (PaddleFleet does not pass this parameter when calling)
forward_meta = getattr(self.config, "forward_meta", None)
assert forward_meta is not None, "forward_meta must be provided"

🔴 Bug: assert is used for runtime validation. Under Python's -O optimized mode the assertion is skipped entirely, so when forward_meta is None the None silently flows into fd_attention.forward(), raising a hard-to-trace AttributeError.

Follow the approach of base.py:285 in the same directory and raise ValueError explicitly:

```python
if forward_meta is None:
    raise ValueError("forward_meta must be provided")
```

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-07 14:53:06

The CI report is generated from the code below (refreshed every 30 minutes):


1 Task Overview

No Required checks are configured and there are currently no failing tasks; 1 optional task is still running, so CI has not fully completed.

Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped
--- | --- | --- | --- | --- | --- | ---
3 (0) | 3 | 1 | 0 | 1 | 0 | 1

2 Task Status Summary

2.1 Required tasks: 0/0 passed

No required tasks are configured (no Required Checks set in the Branch Protection Rules), so no task blocks merging.

2.2 Optional tasks — 1/3 passed

Optional tasks do not block merging; failures are for reference only.

Status | Task | Duration | Log | Rerun
--- | --- | --- | --- | ---
⏳ | Trigger Jenkins for PR | - | Job | -
✅ | Remove skip-ci labels on new commits | 5s | - | -
⏭️ | cherry-pick (skipped) | - | - | -

3 Failure Details (required only)

No failed required tasks.

@xiaoguoguo626807 xiaoguoguo626807 deleted the fd branch May 7, 2026 07:05
