[Feature] Add TritonBF16MoEMethod for BF16 MoE inference #7734
xuanyuanminzheng wants to merge 4 commits into
Conversation
Thanks for your contribution!
CI report generated from the following code (updated every 30 minutes):
1. Task overview: CI is still running — 5 required tasks running, 1 required task pending; no required tasks are failing at the moment.
2. Task status summary
   2.1 Required tasks: 4/10 passed
   2.2 Optional tasks: 24/28 passed
3. Failure details (required only): no required task failures.
Pull request overview
This PR adds a native Triton kernel backend (TritonBF16MoEMethod) for the BF16 unquantized MoE scenario, enabled on the CUDA platform via the environment variable FD_MOE_BACKEND=triton, filling in the BF16 unquantized inference path that the existing Cutlass/quantized paths do not cover.
Changes:
- New BF16 Triton MoE kernel: fused_moe_kernel_bf16, supporting BF16 compute/accumulation and fused routed-weight multiplication.
- New TritonBF16MoEMethod: implements the BF16 FusedMoE forward pass (routing → preprocess → GEMM1 → SwiGLU → GEMM2 (+ router weight) → top-k reduce; see the sketch after this list).
- Extends get_moe_method and the __init__.py exports, and adds unit tests / precision-comparison tests.
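For orientation, the computation that the fused Triton path implements can be written as a dense pure-Paddle reference. This is a readability sketch only, not the PR's code: it loops over experts instead of doing the preprocess/token-grouping step, and the gate/up half ordering and softmax-before-topk routing are assumptions.

```python
import paddle
import paddle.nn.functional as F


def moe_reference_forward(x, gate_logits, up_gate_w, down_w, top_k):
    """Dense reference for the BF16 MoE forward described above (sketch)."""
    # 1. Routing: scores in float32, top-k expert selection per token.
    probs = F.softmax(gate_logits.cast("float32"), axis=-1)
    topk_weights, topk_ids = paddle.topk(probs, k=top_k, axis=-1)

    # Scatter the top-k weights into a dense [num_tokens, num_experts] map
    # (zero where an expert was not selected).
    dense_w = paddle.zeros_like(probs)
    dense_w = paddle.put_along_axis(dense_w, topk_ids, topk_weights, axis=1)

    num_experts = up_gate_w.shape[0]
    out = paddle.zeros_like(x)
    for e in range(num_experts):
        h = paddle.matmul(x, up_gate_w[e])                 # GEMM1: [n, 2 * inter]
        gate_half, up_half = paddle.chunk(h, 2, axis=-1)   # half ordering assumed
        act = F.silu(gate_half) * up_half                  # SwiGLU
        y = paddle.matmul(act, down_w[e])                  # GEMM2: [n, hidden]
        # Router-weight multiply + top-k reduce (sum over selected experts).
        out = out + y * dense_w[:, e : e + 1].cast(x.dtype)
    return out
```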
It may also be worth confirming whether the documentation (e.g. the environment variable reference) should spell out the prerequisites for FD_MOE_BACKEND=triton (Triton available + BF16 dtype) and its applicable scope.
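A small guard expressing those prerequisites could look like the following. This is a sketch of the check the documentation would describe, not code from the PR; the function name and dtype string are assumptions.

```python
def can_use_triton_bf16_backend(weight_dtype: str) -> bool:
    """Return True only if Triton is importable and weights are unquantized BF16."""
    try:
        import triton  # noqa: F401  # Triton must be installed for this backend
    except ImportError:
        return False
    return weight_dtype == "bfloat16"
```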
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/layers/test_fused_moe_triton_backend.py | Adds TritonBF16MoEMethod unit tests and a Triton vs. Cutlass BF16 precision-comparison test |
| fastdeploy/model_executor/layers/moe/triton_moe_kernels.py | Adds the BF16 Triton fused MoE GEMM kernel (fused_moe_kernel_bf16) |
| fastdeploy/model_executor/layers/moe/moe.py | Adds an FD_MOE_BACKEND=triton branch that selects TritonBF16MoEMethod |
| fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py | Adds the TritonBF16MoEMethod implementation and the Triton/ops imports |
| fastdeploy/model_executor/layers/moe/__init__.py | Exports TritonBF16MoEMethod |
```python
b = tl.load(
    b_ptrs,
    mask=offs_k[:, None] < K - k * BLOCK_SIZE_K,
    other=0.0,
)
```
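For readers unfamiliar with the masking idiom in this fragment, the standalone demo kernel below shows the same K-tail masking in isolation: on the last K tile, rows past K are masked so the load never reads out of bounds. This is a self-contained illustration, not the PR's MoE kernel; all names are assumptions.

```python
import triton
import triton.language as tl


@triton.jit
def column_sum_kernel(b_ptr, out_ptr, K, N, stride_bk, stride_bn,
                      BLOCK_SIZE_K: tl.constexpr, BLOCK_SIZE_N: tl.constexpr):
    # Sum the columns of a [K, N] matrix, one BLOCK_SIZE_N slice per program.
    pid_n = tl.program_id(axis=0)
    offs_n = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
    offs_k = tl.arange(0, BLOCK_SIZE_K)
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_SIZE_N,), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
        # Same K-tail mask as in the fused MoE fragment above: on the final
        # partial tile, out-of-range rows load 0.0 instead of reading past K.
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
        acc += tl.sum(b, axis=0)
        b_ptrs += BLOCK_SIZE_K * stride_bk

    tl.store(out_ptr + offs_n, acc, mask=offs_n < N)
```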
```diff
 try:
     import triton.language as tl

     from fastdeploy.model_executor.ops.gpu import tritonmoe_preprocess_func

-    from .triton_moe_kernels import fused_moe_kernel_paddle
+    from .triton_moe_kernels import fused_moe_kernel_bf16, fused_moe_kernel_paddle
```
```python
if current_platform.is_cuda():
    moe_backend = envs.FD_MOE_BACKEND.lower()
    if moe_backend == "triton":
        from .fused_moe_triton_backend import TritonBF16MoEMethod
```
```python
@pytest.mark.skipif(not paddle.is_compiled_with_cuda(), reason="requires CUDA")
class TestTritonBF16MoEPrecision:
    """
    Precision tests: Triton BF16 path vs. Cutlass BF16 path.
```
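A typical body for such a precision comparison might look like the sketch below. It is illustrative only: the `run_moe` helper (which would build the FusedMoE layer with the requested backend), the shapes, and the BF16 tolerances are assumptions, not the PR's actual test.

```python
import numpy as np
import paddle


def assert_backends_match(run_moe, hidden_size=512, num_tokens=32, seed=0):
    """Compare Triton and Cutlass MoE outputs on identical BF16 inputs."""
    paddle.seed(seed)
    x = paddle.randn([num_tokens, hidden_size]).cast("bfloat16")

    out_triton = run_moe(x, backend="triton").cast("float32").numpy()
    out_cutlass = run_moe(x, backend="cutlass").cast("float32").numpy()

    # BF16 carries ~8 mantissa bits, so tolerances are necessarily loose.
    np.testing.assert_allclose(out_triton, out_cutlass, rtol=2e-2, atol=2e-2)
```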
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff            @@
##            develop    #7734  +/-  ##
========================================
  Coverage          ?   63.45%
========================================
  Files             ?      461
  Lines             ?    64145
  Branches          ?     9814
========================================
  Hits              ?    40705
  Misses            ?    20644
  Partials          ?     2796
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
```python
# --- 1. Routing ---
gate_out = gate(x)
gate_out = gate_out.cast("float32")
```
```python
def create_weights(self, layer: nn.Layer, **extra_weight_attrs):
    """
    Reuse UnquantizedFusedMoEMethod weight creation logic.
    Weight shapes on CUDA (non-torch format):
        up_gate_proj_weight: [E, hidden_size, moe_intermediate_size * 2] (K-major)
        down_proj_weight: [E, moe_intermediate_size, hidden_size] (K-major)
    The Triton kernel reads B as [E, K, N] which maps directly to these shapes.
    """
    from fastdeploy.model_executor.layers.moe.fused_moe_backend_base import (
        UnquantizedFusedMoEMethod,
    )

    UnquantizedFusedMoEMethod.create_weights(self, layer, **extra_weight_attrs)

def process_weights_after_loading(self, layer: nn.Layer):
    from fastdeploy.model_executor.layers.moe.fused_moe_backend_base import (
        UnquantizedFusedMoEMethod,
    )

    UnquantizedFusedMoEMethod.process_weights_after_loading(self, layer)
```
Could TritonBF16MoEMethod simply inherit from UnquantizedFusedMoEMethod? That seems like it would be cleaner.
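A minimal sketch of what that refactor could look like; only the inheritance idea is shown, and the overridden method name/signature is an assumption rather than the PR's actual interface.

```python
from fastdeploy.model_executor.layers.moe.fused_moe_backend_base import (
    UnquantizedFusedMoEMethod,
)


class TritonBF16MoEMethod(UnquantizedFusedMoEMethod):
    """Inherit weight handling; override only the Triton-specific forward."""

    # create_weights / process_weights_after_loading are inherited as-is,
    # so the explicit delegation calls above become unnecessary.

    def apply(self, layer, x, gate):
        # Triton BF16 path: routing -> preprocess -> GEMM1 -> SwiGLU
        # -> GEMM2 (+ router weight) -> top-k reduce.
        raise NotImplementedError("sketch only")
```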
```python
)


class TritonBF16MoEMethod(QuantMethodBase):
```
Personally I'd suggest not emphasizing BF16 in the name; in inference scenarios BF16 precision is the default assumption anyway, so simply TritonMoEMethod might be better.
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-14 17:12:44
📋 Review Summary
PR overview: adds TritonMoEMethod for the BF16 unquantized MoE scenario, activating a native Triton BF16 kernel inference path via the environment variable FD_MOE_BACKEND=triton.
Scope of changes: fastdeploy/model_executor/layers/moe/ (new class, new kernel, routing branch), tests/layers/
Impact tag: [OP]
📝 PR Convention Check
The title format is compliant ([Feature] is an official tag), and the PR description contains all required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist); structure check passed.
Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fused_moe_triton_backend.py:27 | The triton_moe_kernels import was moved out of try/except, so the whole package fails to load when triton is not installed |
| 🟡 Suggestion | triton_moe_kernels.py:245 | The naive_block_assignment parameter is commented out, but the docstring still describes it in full; the documentation no longer matches the interface |
Issue details

🔴 Bug — unprotected import (P0)
fused_moe_triton_backend.py:25-28
This refactor moves the previously protected `from .triton_moe_kernels import fused_moe_kernel_paddle` out of the try/except block, and the newly added fused_moe_kernel_bf16 is likewise imported outside the protection:

```python
# current (unprotected)
from fastdeploy.model_executor.layers.moe.triton_moe_kernels import (
    fused_moe_kernel_bf16,
    fused_moe_kernel_paddle,
)
```

Meanwhile, triton_moe_kernels.py has unconditional `import triton` / `import triton.language as tl` statements at the top (lines 17-18). If triton is not installed, loading the fused_moe_triton_backend.py module raises ImportError; the exception propagates through the direct import in __init__.py and makes the entire fastdeploy.model_executor.layers.moe package unloadable, affecting every MoE scenario (including the existing Cutlass path).
Suggested fix:

```python
try:
    import triton.language as tl
    from fastdeploy.model_executor.ops.gpu import tritonmoe_preprocess_func
    from .triton_moe_kernels import fused_moe_kernel_bf16, fused_moe_kernel_paddle
except ImportError:
    pass
```

🟡 Suggestion — stale docstring description (P1)
pass🟡 建议 — docstring 残留描述(P1)
triton_moe_kernels.py:258
The fused_moe_kernel_bf16 docstring describes the naive_block_assignment=True mode in detail, but that parameter has been commented out (line 245) and the related implementation is fully commented out as well (lines 284-292). Since the parameter no longer exists in the function signature, the docstring describes a ghost interface and is likely to mislead users.
Suggestion: delete the "When naive_block_assignment=True..." paragraph from the docstring, or replace it with:
# TODO: naive_block_assignment mode not yet implemented
Overall assessment
The new TritonMoEMethod is well designed overall: the BF16 GEMM1/GEMM2 routing flow, the router-weight fusion, and the tile-heuristic configuration are aligned with the vLLM reference implementation, and test coverage is fairly thorough. However, moving the triton_moe_kernels import out of the try/except protection is a regression introduced by this refactor; it makes the whole moe package unusable when triton is not installed and should be fixed before merging.
Motivation
Add a native Triton kernel backend (TritonBF16MoEMethod) for the BF16 unquantized MoE scenario, activated via the environment variable FD_MOE_BACKEND=triton. The existing Cutlass/quantized paths cannot directly handle BF16 unquantized weights; this PR adds that path and supports a broader range of BF16 model inference scenarios.
Modifications
- fused_moe_triton_backend.py: add the TritonBF16MoEMethod class, inheriting from QuantMethodBase, implementing the full BF16 FusedMoE forward flow (routing → preprocess → Triton GEMM1 → SwiGLU → Triton GEMM2 + router-weight fusion)
- triton_moe_kernels.py: add the fused_moe_kernel_bf16 Triton kernel, supporting BF16 accumulation, int64 token indices to prevent overflow, and fused router-weight multiplication (MUL_ROUTED_WEIGHT)
- moe.py: add an FD_MOE_BACKEND=triton branch in get_moe_method that returns TritonBF16MoEMethod
- __init__.py: export TritonBF16MoEMethod

Usage or Command
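A minimal usage sketch. Only the environment variable comes from this PR; the entry point, model path, and generate call are placeholders for illustration.

```python
import os

# Select the Triton BF16 MoE backend before FastDeploy builds its MoE layers.
os.environ["FD_MOE_BACKEND"] = "triton"

from fastdeploy import LLM  # entry point assumed for illustration

llm = LLM(model="/path/to/bf16-moe-model")   # placeholder model path
outputs = llm.generate(["Hello, world"])      # placeholder call
```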
Accuracy Tests
Checklist
- PR type tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.