[Feature] Add TritonBF16MoEMethod for BF16 MoE inference #7734
xuanyuanminzheng wants to merge 4 commits into
Conversation
Thanks for your contribution!
CI report generated from the following code (updated every 30 minutes):
1. Task overview: CI is still running — 5 required tasks running, 1 required task pending; no required tasks are failing at the moment.
2. Task status summary
   2.1 Required tasks: 4/10 passed
   2.2 Optional tasks: 24/28 passed
3. Failure details (required only): no required task failures.
Pull request overview
This PR adds a native Triton kernel backend (TritonBF16MoEMethod) for the BF16 unquantized MoE scenario, enabled on the CUDA platform via the environment variable FD_MOE_BACKEND=triton, filling in the BF16 unquantized inference path that the existing Cutlass/quantized paths do not cover.
Changes:
- New BF16 Triton MoE kernel: fused_moe_kernel_bf16, supporting BF16 compute/accumulation and fused routed-weight multiplication.
- New TritonBF16MoEMethod: implements the BF16 FusedMoE forward pass (routing → preprocess → GEMM1 → SwiGLU → GEMM2 (+ router weight) → top-k reduce; see the sketch after this list).
- Extends get_moe_method and the __init__.py exports, and adds unit tests / precision-comparison tests.
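For orientation, the computation that the fused Triton path implements can be written as a dense pure-Paddle reference. This is a readability sketch only, not the PR's code: it loops over experts instead of doing the preprocess/token-grouping step, and the gate/up half ordering and softmax-before-topk routing are assumptions.

```python
import paddle
import paddle.nn.functional as F


def moe_reference_forward(x, gate_logits, up_gate_w, down_w, top_k):
    """Dense reference for the BF16 MoE forward described above (sketch)."""
    # 1. Routing: scores in float32, top-k expert selection per token.
    probs = F.softmax(gate_logits.cast("float32"), axis=-1)
    topk_weights, topk_ids = paddle.topk(probs, k=top_k, axis=-1)

    # Scatter the top-k weights into a dense [num_tokens, num_experts] map
    # (zero where an expert was not selected).
    dense_w = paddle.zeros_like(probs)
    dense_w = paddle.put_along_axis(dense_w, topk_ids, topk_weights, axis=1)

    num_experts = up_gate_w.shape[0]
    out = paddle.zeros_like(x)
    for e in range(num_experts):
        h = paddle.matmul(x, up_gate_w[e])                 # GEMM1: [n, 2 * inter]
        gate_half, up_half = paddle.chunk(h, 2, axis=-1)   # half ordering assumed
        act = F.silu(gate_half) * up_half                  # SwiGLU
        y = paddle.matmul(act, down_w[e])                  # GEMM2: [n, hidden]
        # Router-weight multiply + top-k reduce (sum over selected experts).
        out = out + y * dense_w[:, e : e + 1].cast(x.dtype)
    return out
```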
It may also be worth confirming whether the documentation (e.g. the environment variable reference) should spell out the prerequisites for FD_MOE_BACKEND=triton (Triton available + BF16 dtype) and its applicable scope.
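A small guard expressing those prerequisites could look like the following. This is a sketch of the check the documentation would describe, not code from the PR; the function name and dtype string are assumptions.

```python
def can_use_triton_bf16_backend(weight_dtype: str) -> bool:
    """Return True only if Triton is importable and weights are unquantized BF16."""
    try:
        import triton  # noqa: F401  # Triton must be installed for this backend
    except ImportError:
        return False
    return weight_dtype == "bfloat16"
```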
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/layers/test_fused_moe_triton_backend.py | Adds TritonBF16MoEMethod unit tests and a Triton vs. Cutlass BF16 precision-comparison test |
| fastdeploy/model_executor/layers/moe/triton_moe_kernels.py | Adds the BF16 Triton fused MoE GEMM kernel (fused_moe_kernel_bf16) |
| fastdeploy/model_executor/layers/moe/moe.py | Adds an FD_MOE_BACKEND=triton branch that selects TritonBF16MoEMethod |
| fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py | Adds the TritonBF16MoEMethod implementation and the Triton/ops imports |
| fastdeploy/model_executor/layers/moe/__init__.py | Exports TritonBF16MoEMethod |
```python
b = tl.load(
    b_ptrs,
    mask=offs_k[:, None] < K - k * BLOCK_SIZE_K,
    other=0.0,
)
```
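For readers unfamiliar with the masking idiom in this fragment, the standalone demo kernel below shows the same K-tail masking in isolation: on the last K tile, rows past K are masked so the load never reads out of bounds. This is a self-contained illustration, not the PR's MoE kernel; all names are assumptions.

```python
import triton
import triton.language as tl


@triton.jit
def column_sum_kernel(b_ptr, out_ptr, K, N, stride_bk, stride_bn,
                      BLOCK_SIZE_K: tl.constexpr, BLOCK_SIZE_N: tl.constexpr):
    # Sum the columns of a [K, N] matrix, one BLOCK_SIZE_N slice per program.
    pid_n = tl.program_id(axis=0)
    offs_n = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
    offs_k = tl.arange(0, BLOCK_SIZE_K)
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_SIZE_N,), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
        # Same K-tail mask as in the fused MoE fragment above: on the final
        # partial tile, out-of-range rows load 0.0 instead of reading past K.
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
        acc += tl.sum(b, axis=0)
        b_ptrs += BLOCK_SIZE_K * stride_bk

    tl.store(out_ptr + offs_n, acc, mask=offs_n < N)
```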
```diff
 try:
     import triton.language as tl

     from fastdeploy.model_executor.ops.gpu import tritonmoe_preprocess_func

-    from .triton_moe_kernels import fused_moe_kernel_paddle
+    from .triton_moe_kernels import fused_moe_kernel_bf16, fused_moe_kernel_paddle
```
```python
if current_platform.is_cuda():
    moe_backend = envs.FD_MOE_BACKEND.lower()
    if moe_backend == "triton":
        from .fused_moe_triton_backend import TritonBF16MoEMethod
```
```python
@pytest.mark.skipif(not paddle.is_compiled_with_cuda(), reason="requires CUDA")
class TestTritonBF16MoEPrecision:
    """
    Precision tests: Triton BF16 path vs. Cutlass BF16 path.
```
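A typical body for such a precision comparison might look like the sketch below. It is illustrative only: the `run_moe` helper (which would build the FusedMoE layer with the requested backend), the shapes, and the BF16 tolerances are assumptions, not the PR's actual test.

```python
import numpy as np
import paddle


def assert_backends_match(run_moe, hidden_size=512, num_tokens=32, seed=0):
    """Compare Triton and Cutlass MoE outputs on identical BF16 inputs."""
    paddle.seed(seed)
    x = paddle.randn([num_tokens, hidden_size]).cast("bfloat16")

    out_triton = run_moe(x, backend="triton").cast("float32").numpy()
    out_cutlass = run_moe(x, backend="cutlass").cast("float32").numpy()

    # BF16 carries ~8 mantissa bits, so tolerances are necessarily loose.
    np.testing.assert_allclose(out_triton, out_cutlass, rtol=2e-2, atol=2e-2)
```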
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff            @@
##            develop    #7734  +/-  ##
========================================
  Coverage          ?   63.45%
========================================
  Files             ?      461
  Lines             ?    64145
  Branches          ?     9814
========================================
  Hits              ?    40705
  Misses            ?    20644
  Partials          ?     2796
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
```python
# --- 1. Routing ---
gate_out = gate(x)
gate_out = gate_out.cast("float32")
```
```python
def create_weights(self, layer: nn.Layer, **extra_weight_attrs):
    """
    Reuse UnquantizedFusedMoEMethod weight creation logic.
    Weight shapes on CUDA (non-torch format):
        up_gate_proj_weight: [E, hidden_size, moe_intermediate_size * 2] (K-major)
        down_proj_weight: [E, moe_intermediate_size, hidden_size] (K-major)
    The Triton kernel reads B as [E, K, N] which maps directly to these shapes.
    """
    from fastdeploy.model_executor.layers.moe.fused_moe_backend_base import (
        UnquantizedFusedMoEMethod,
    )

    UnquantizedFusedMoEMethod.create_weights(self, layer, **extra_weight_attrs)

def process_weights_after_loading(self, layer: nn.Layer):
    from fastdeploy.model_executor.layers.moe.fused_moe_backend_base import (
        UnquantizedFusedMoEMethod,
    )

    UnquantizedFusedMoEMethod.process_weights_after_loading(self, layer)
```
Could TritonBF16MoEMethod simply inherit from UnquantizedFusedMoEMethod? That seems like it would be cleaner.
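A minimal sketch of what that refactor could look like; only the inheritance idea is shown, and the overridden method name/signature is an assumption rather than the PR's actual interface.

```python
from fastdeploy.model_executor.layers.moe.fused_moe_backend_base import (
    UnquantizedFusedMoEMethod,
)


class TritonBF16MoEMethod(UnquantizedFusedMoEMethod):
    """Inherit weight handling; override only the Triton-specific forward."""

    # create_weights / process_weights_after_loading are inherited as-is,
    # so the explicit delegation calls above become unnecessary.

    def apply(self, layer, x, gate):
        # Triton BF16 path: routing -> preprocess -> GEMM1 -> SwiGLU
        # -> GEMM2 (+ router weight) -> top-k reduce.
        raise NotImplementedError("sketch only")
```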
```python
)


class TritonBF16MoEMethod(QuantMethodBase):
```
Personally I'd suggest not emphasizing BF16 in the name; in inference scenarios BF16 precision is the default assumption anyway, so simply TritonMoEMethod might be better.
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-14 17:12:44
📋 Review Summary
PR overview: adds TritonMoEMethod for the BF16 unquantized MoE scenario, activating a native Triton BF16 kernel inference path via the environment variable FD_MOE_BACKEND=triton.
Scope of changes: fastdeploy/model_executor/layers/moe/ (new class, new kernel, routing branch), tests/layers/
Impact tag: [OP]
📝 PR Convention Check
The title format is compliant ([Feature] is an official tag), and the PR description contains all required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist); structure check passed.
Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fused_moe_triton_backend.py:27 | The triton_moe_kernels import was moved out of try/except, so the whole package fails to load when triton is not installed |
| 🟡 Suggestion | triton_moe_kernels.py:245 | The naive_block_assignment parameter is commented out, but the docstring still describes it in full; the documentation no longer matches the interface |
Issue details

🔴 Bug — unprotected import (P0)
fused_moe_triton_backend.py:25-28
This refactor moves the previously protected `from .triton_moe_kernels import fused_moe_kernel_paddle` out of the try/except block, and the newly added fused_moe_kernel_bf16 is likewise imported outside the protection:

```python
# current (unprotected)
from fastdeploy.model_executor.layers.moe.triton_moe_kernels import (
    fused_moe_kernel_bf16,
    fused_moe_kernel_paddle,
)
```

Meanwhile, triton_moe_kernels.py has unconditional `import triton` / `import triton.language as tl` statements at the top (lines 17-18). If triton is not installed, loading the fused_moe_triton_backend.py module raises ImportError; the exception propagates through the direct import in __init__.py and makes the entire fastdeploy.model_executor.layers.moe package unloadable, affecting every MoE scenario (including the existing Cutlass path).
Suggested fix:

```python
try:
    import triton.language as tl
    from fastdeploy.model_executor.ops.gpu import tritonmoe_preprocess_func
    from .triton_moe_kernels import fused_moe_kernel_bf16, fused_moe_kernel_paddle
except ImportError:
    pass
```

🟡 Suggestion — stale docstring description (P1)
pass🟡 建议 — docstring 残留描述(P1)
triton_moe_kernels.py:258
The fused_moe_kernel_bf16 docstring describes the naive_block_assignment=True mode in detail, but that parameter has been commented out (line 245) and the related implementation is fully commented out as well (lines 284-292). Since the parameter no longer exists in the function signature, the docstring describes a ghost interface and is likely to mislead users.
Suggestion: delete the "When naive_block_assignment=True..." paragraph from the docstring, or replace it with:
# TODO: naive_block_assignment mode not yet implemented
Overall assessment
The new TritonMoEMethod is well designed overall: the BF16 GEMM1/GEMM2 routing flow, the router-weight fusion, and the tile-heuristic configuration are aligned with the vLLM reference implementation, and test coverage is fairly thorough. However, moving the triton_moe_kernels import out of the try/except protection is a regression introduced by this refactor; it makes the whole moe package unusable when triton is not installed and should be fixed before merging.
Motivation
Add a native Triton kernel backend (TritonBF16MoEMethod) for the BF16 unquantized MoE scenario, activated via the environment variable FD_MOE_BACKEND=triton. The existing Cutlass/quantized paths cannot directly handle BF16 unquantized weights; this PR adds that path and supports a broader range of BF16 model inference scenarios.
Modifications
- fused_moe_triton_backend.py: add the TritonBF16MoEMethod class, inheriting from QuantMethodBase, implementing the full BF16 FusedMoE forward flow (routing → preprocess → Triton GEMM1 → SwiGLU → Triton GEMM2 + router-weight fusion)
- triton_moe_kernels.py: add the fused_moe_kernel_bf16 Triton kernel, supporting BF16 accumulation, int64 token indices to prevent overflow, and fused router-weight multiplication (MUL_ROUTED_WEIGHT)
- moe.py: add an FD_MOE_BACKEND=triton branch in get_moe_method that returns TritonBF16MoEMethod
- __init__.py: export TritonBF16MoEMethod

Usage or Command
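A minimal usage sketch. Only the environment variable comes from this PR; the entry point, model path, and generate call are placeholders for illustration.

```python
import os

# Select the Triton BF16 MoE backend before FastDeploy builds its MoE layers.
os.environ["FD_MOE_BACKEND"] = "triton"

from fastdeploy import LLM  # entry point assumed for illustration

llm = LLM(model="/path/to/bf16-moe-model")   # placeholder model path
outputs = llm.generate(["Hello, world"])      # placeholder call
```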
Accuracy Tests
Checklist
- PR type tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.