
[Feature] Add TritonBF16MoEMethod for BF16 MoE inference#7734

Open
xuanyuanminzheng wants to merge 4 commits into PaddlePaddle:develop from xuanyuanminzheng:develop

Conversation

xuanyuanminzheng (Collaborator) commented May 7, 2026

Motivation

Add a native Triton kernel backend (TritonBF16MoEMethod) for the BF16 unquantized MoE scenario, activated via the environment variable FD_MOE_BACKEND=triton. The existing Cutlass/quantized paths cannot directly handle unquantized BF16 weights; this PR adds that path and supports a broader range of BF16 model inference scenarios.

Modifications

  • fused_moe_triton_backend.py: add the TritonBF16MoEMethod class, inheriting from QuantMethodBase and implementing the full BF16 FusedMoE forward flow (routing → preprocess → Triton GEMM1 → SwiGLU → Triton GEMM2 with fused router weights)
  • triton_moe_kernels.py: add the fused_moe_kernel_bf16 Triton kernel, supporting BF16 accumulation, int64 token indexing to prevent overflow, and fused router-weight multiplication (MUL_ROUTED_WEIGHT)
  • moe.py: add an FD_MOE_BACKEND=triton branch in get_moe_method that returns TritonBF16MoEMethod (see the sketch after this list)
  • __init__.py: export TritonBF16MoEMethod
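
For orientation, a minimal sketch of the new dispatch, mirroring the diff excerpt quoted in the review threads below; the bare function signature, the absolute import, and the constructor argument are illustrative assumptions, not the PR's exact code:

```python
# Sketch of the FD_MOE_BACKEND dispatch in get_moe_method (moe.py).
# `current_platform` / `envs` follow the usage shown in the review diff;
# the constructor argument is an assumption.
from fastdeploy import envs
from fastdeploy.platforms import current_platform


def get_moe_method():
    if current_platform.is_cuda():
        moe_backend = envs.FD_MOE_BACKEND.lower()
        if moe_backend == "triton":
            # Imported lazily so the Triton path is only touched on request.
            from fastdeploy.model_executor.layers.moe.fused_moe_triton_backend import (
                TritonBF16MoEMethod,
            )

            return TritonBF16MoEMethod(None)
    # ... existing Cutlass / quantized branches remain unchanged
```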

Usage or Command

export FD_MOE_BACKEND=triton
# Launch the inference service (with a BF16 MoE model); the Triton BF16 backend is then used automatically
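
Before launching, a quick smoke check (a sketch only) can confirm the switch is visible to FastDeploy; this assumes the envs accessor used in the dispatch branch and sets the variable before any fastdeploy import:

```python
# Sketch: set the backend switch first, then confirm the accessor that
# the dispatch branch reads (per the review diff) sees the value.
import os

os.environ["FD_MOE_BACKEND"] = "triton"

from fastdeploy import envs  # assumed accessor module

assert envs.FD_MOE_BACKEND.lower() == "triton"
```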

Accuracy Tests

[screenshot: accuracy test results]

Checklist

  • [√] Add at least one tag in the PR title.
    • Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • [√] Format your code; run pre-commit before committing.
  • [√] Add unit tests. Please state the reason in this PR if no unit tests are added.
  • [√] Provide accuracy results.
  • [√] If the current PR is submitted to a release branch, make sure it has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot (Bot) commented May 7, 2026

Thanks for your contribution!

PaddlePaddle-bot commented May 7, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-14 17:14:52

The CI report is generated from the code below (refreshed every 30 minutes):


1 Task overview

CI is still running: 5 required tasks are in progress and 1 required task is pending; there are currently no failed required tasks.

Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped
--- | --- | --- | --- | --- | --- | ---
38 (0) | 38 | 28 | 1 | 7 | 2 | 0

2 Task status summary

2.1 Required tasks: 4/10 passed

Required tasks block merging; failures must be addressed with priority.

Status | Task | Duration | Root cause | Suggested fix | Log | Rerun
--- | --- | --- | --- | --- | --- | ---
⏳ Running | Extracted partial CE model tasks to run in CI. / run_ce_cases | - | - | - | CI details | -
⏳ Running | Run Base Tests / base_tests | - | - | - | CI details | -
⏳ Running | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | - | - | CI details | -
⏳ Running | xpu_4cards_case_test / run_xpu_4cards_cases | - | - | - | CI details | -
⏳ Running | xpu_8cards_case_test / run_xpu_8cards_cases | - | - | - | CI details | -
⏸️ Pending | Run Four Cards Tests / run_4_cards_tests | - | - | - | - | -
The remaining 4 required tasks passed | | | | | |

2.2 Optional tasks — 24/28 passed

Optional tasks do not block merging; failures are informational only.

Status | Task | Duration | Log | Rerun
--- | --- | --- | --- | ---
❌ Failed | Trigger Jenkins for PR | 14m50s | Job | -
⏳ Running | Run iluvatar Tests / run_iluvatar_cases | - | CI details | -
⏳ Running | xpu_unit_test / run_xpu_unit_test | - | CI details | -
⏸️ Pending | CI_HPU | - | - | -
The remaining 24 optional tasks passed | | | |

3 Failure details (required only)

No failed required tasks.


Copilot AI (Contributor) left a comment

Pull request overview

This PR adds a native Triton kernel backend (TritonBF16MoEMethod) for the BF16 unquantized MoE scenario, enabled on CUDA via the environment variable FD_MOE_BACKEND=triton, filling in the BF16 unquantized inference path that the existing Cutlass/quantized paths cannot cover.

Changes:

  • Add a BF16 Triton MoE kernel, fused_moe_kernel_bf16, supporting BF16 compute/accumulation and fused routed-weight multiplication.
  • Add TritonBF16MoEMethod, implementing the BF16 FusedMoE forward (routing → preprocess → GEMM1 → SwiGLU → GEMM2 (+router weight) → topk reduce); a dense reference sketch of this pipeline follows below.
  • Extend get_moe_method and the __init__.py exports, and add unit tests plus a precision comparison test.

It is also worth confirming whether the documentation (e.g. the environment variable reference) should state the prerequisites for FD_MOE_BACKEND=triton (Triton available + BF16 dtype) and its scope of applicability.
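
To make the fused pipeline concrete, here is a dense, loop-over-experts reference (a sketch, not the PR's kernel code) of the math the Triton kernels fuse, written in Paddle. It assumes the weight layout documented later in this review (up_gate_proj: [E, hidden, 2*intermediate], K-major; down_proj: [E, intermediate, hidden]); the softmax-top-k routing and all names are illustrative:

```python
# Dense reference (sketch) of routing -> GEMM1 -> SwiGLU -> GEMM2 with
# fused router weights and top-k reduce. For illustration only.
import paddle
import paddle.nn.functional as F


def moe_reference(x, gate_w, up_gate_proj, down_proj, top_k):
    # x: [T, H] bf16 tokens; gate_w: [H, E] float32 router weight.
    T, H = x.shape
    E = up_gate_proj.shape[0]
    probs = F.softmax(paddle.matmul(x.cast("float32"), gate_w), axis=-1)
    topk_w, topk_ids = paddle.topk(probs, k=top_k, axis=-1)      # [T, k]
    # Dense routing matrix: route_w[t, e] = top-k weight if selected, else 0.
    route_w = paddle.zeros([T, E], dtype="float32")
    route_w = paddle.put_along_axis(route_w, topk_ids, topk_w, axis=1)
    out = paddle.zeros([T, H], dtype="float32")
    for e in range(E):
        h = paddle.matmul(x, up_gate_proj[e])                    # GEMM1: [T, 2I]
        gate_h, up_h = paddle.chunk(h, 2, axis=-1)
        act = F.silu(gate_h) * up_h                              # SwiGLU
        y = paddle.matmul(act, down_proj[e]).cast("float32")     # GEMM2: [T, H]
        out += y * route_w[:, e : e + 1]                         # router weight fusion + reduce
    return out.cast(x.dtype)
```

The Triton path computes the same result without dense per-expert GEMMs over all tokens: tokens are grouped per expert in the preprocess step, and the router-weight multiply is fused into GEMM2 (MUL_ROUTED_WEIGHT).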

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

File | Description
--- | ---
tests/layers/test_fused_moe_triton_backend.py | New TritonBF16MoEMethod unit tests and a Triton vs. Cutlass BF16 precision comparison test
fastdeploy/model_executor/layers/moe/triton_moe_kernels.py | New BF16 Triton fused MoE GEMM kernel (fused_moe_kernel_bf16)
fastdeploy/model_executor/layers/moe/moe.py | New FD_MOE_BACKEND=triton branch selecting TritonBF16MoEMethod
fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py | New TritonBF16MoEMethod implementation and Triton/ops imports
fastdeploy/model_executor/layers/moe/__init__.py | Export TritonBF16MoEMethod

Comment thread fastdeploy/model_executor/layers/moe/triton_moe_kernels.py
Comment on lines +296 to +300
b = tl.load(
    b_ptrs,
    mask=offs_k[:, None] < K - k * BLOCK_SIZE_K,
    other=0.0,
)
Comment on lines +37 to +42
try:
    import triton.language as tl

    from fastdeploy.model_executor.ops.gpu import tritonmoe_preprocess_func

-   from .triton_moe_kernels import fused_moe_kernel_paddle
+   from .triton_moe_kernels import fused_moe_kernel_bf16, fused_moe_kernel_paddle
Comment thread fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py
Comment on lines 56 to +60
if current_platform.is_cuda():
    moe_backend = envs.FD_MOE_BACKEND.lower()
    if moe_backend == "triton":
        from .fused_moe_triton_backend import TritonBF16MoEMethod

Comment on lines +1265 to +1269
@pytest.mark.skipif(not paddle.is_compiled_with_cuda(), reason="requires CUDA")
class TestTritonBF16MoEPrecision:
    """
    Precision tests: Triton BF16 path vs. Cutlass BF16 path.
@codecov-commenter

codecov-commenter commented May 7, 2026

Codecov Report

❌ Patch coverage is 98.55072% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@cb2d7c0). Learn more about missing BASE report.

Files with missing lines | Patch % | Lines
--- | --- | ---
...el_executor/layers/moe/fused_moe_triton_backend.py | 98.43% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7734   +/-   ##
==========================================
  Coverage           ?   63.45%           
==========================================
  Files              ?      461           
  Lines              ?    64145           
  Branches           ?     9814           
==========================================
  Hits               ?    40705           
  Misses             ?    20644           
  Partials           ?     2796           
Flag | Coverage Δ
--- | ---
GPU | 72.62% <98.55%> (?)
XPU | 7.13% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.



Comment thread fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py

# --- 1. Routing ---
gate_out = gate(x)
gate_out = gate_out.cast("float32")
qingqing01 previously approved these changes May 8, 2026
Comment on lines +1908 to +1927
def create_weights(self, layer: nn.Layer, **extra_weight_attrs):
    """
    Reuse UnquantizedFusedMoEMethod weight creation logic.
    Weight shapes on CUDA (non-torch format):
        up_gate_proj_weight: [E, hidden_size, moe_intermediate_size * 2] (K-major)
        down_proj_weight: [E, moe_intermediate_size, hidden_size] (K-major)
    The Triton kernel reads B as [E, K, N] which maps directly to these shapes.
    """
    from fastdeploy.model_executor.layers.moe.fused_moe_backend_base import (
        UnquantizedFusedMoEMethod,
    )

    UnquantizedFusedMoEMethod.create_weights(self, layer, **extra_weight_attrs)

def process_weights_after_loading(self, layer: nn.Layer):
    from fastdeploy.model_executor.layers.moe.fused_moe_backend_base import (
        UnquantizedFusedMoEMethod,
    )

    UnquantizedFusedMoEMethod.process_weights_after_loading(self, layer)
Collaborator: Could TritonBF16MoEMethod simply inherit from UnquantizedFusedMoEMethod? That seems like it would be cleaner. (A sketch of this suggestion follows.)
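
A sketch of this suggestion (hypothetical, not the PR's current code): with inheritance the explicit delegation above disappears, and only the Triton forward path is overridden; the apply method name here is an assumption:

```python
# Hypothetical refactor per the review comment above.
from fastdeploy.model_executor.layers.moe.fused_moe_backend_base import (
    UnquantizedFusedMoEMethod,
)


class TritonBF16MoEMethod(UnquantizedFusedMoEMethod):
    # create_weights / process_weights_after_loading inherited unchanged;
    # only the Triton GEMM forward needs overriding.
    def apply(self, layer, x, gate):
        ...  # routing -> preprocess -> Triton GEMM1 -> SwiGLU -> Triton GEMM2
```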

)


class TritonBF16MoEMethod(QuantMethodBase):
Collaborator: I'd suggest not emphasizing BF16 in the class name; in inference scenarios BF16 precision is generally the default assumption, so why not simply TritonMoEMethod?


@PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-14 17:12:44

📋 Review Summary

PR overview: Adds TritonMoEMethod for the BF16 unquantized MoE scenario; the environment variable FD_MOE_BACKEND=triton activates the native Triton BF16 kernel inference path.

Scope of changes: fastdeploy/model_executor/layers/moe/ (new class, new kernel, routing branch), tests/layers/

Impact tag: [OP]


📝 PR Convention Check

The title format is compliant ([Feature] is an official tag), and the PR description contains all required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist); the structure check passed.


Issues

Severity | File | Summary
--- | --- | ---
🔴 Bug | fused_moe_triton_backend.py:27 | triton_moe_kernels import moved out of try/except; if triton is not installed, the whole package fails to load
🟡 Suggestion | triton_moe_kernels.py:245 | The naive_block_assignment parameter is commented out, but the docstring still describes it in full; documentation and interface disagree

Issue details

🔴 Bug — unguarded import (P0)

fused_moe_triton_backend.py:25-28

This refactor moved from .triton_moe_kernels import fused_moe_kernel_paddle out of the original try/except block, and the newly added fused_moe_kernel_bf16 is likewise imported outside the guard:

# current (unguarded)
from fastdeploy.model_executor.layers.moe.triton_moe_kernels import (
    fused_moe_kernel_bf16,
    fused_moe_kernel_paddle,
)

triton_moe_kernels.py unconditionally executes import triton / import triton.language as tl at the top (lines 17-18). If triton is not installed, an ImportError is raised as soon as fused_moe_triton_backend.py is loaded; the exception propagates through the direct import in __init__.py, making the entire fastdeploy.model_executor.layers.moe package unloadable and affecting all MoE scenarios (including the existing Cutlass path).

Suggested fix:

try:
    import triton.language as tl
    from fastdeploy.model_executor.ops.gpu import tritonmoe_preprocess_func
    from .triton_moe_kernels import fused_moe_kernel_bf16, fused_moe_kernel_paddle
except ImportError:
    pass

🟡 Suggestion — stale docstring (P1)

triton_moe_kernels.py:258

The docstring of fused_moe_kernel_bf16 describes in detail how naive_block_assignment=True works, but that parameter has been commented out (line 245) and the related implementation is fully commented out as well (lines 284-292). Since the parameter no longer exists in the function signature, the docstring describes a ghost interface and can easily mislead users.

Suggestion: delete the "When naive_block_assignment=True..." paragraph from the docstring, or replace it with:

# TODO: naive_block_assignment mode not yet implemented

Overall assessment

The new TritonMoEMethod is soundly designed overall: the BF16 GEMM1/GEMM2 routing flow, the router weight fusion, and the tile heuristics all align with the vLLM reference implementation, and test coverage is fairly thorough. However, moving the triton_moe_kernels import out of the try/except guard is a regression introduced by this refactor: it makes the entire moe package unusable in environments without triton and must be fixed before merging.
