[RL] MoE BF16 support m_grouped_bf16_gemm_nn_contiguous in EP #7527
ckl117 merged 1 commit into PaddlePaddle:develop
Conversation
Thanks for your contribution!
83afc3e to dbb856b
PaddlePaddle-bot
left a comment
AI Code Review | 2026-04-21 16:12:39

Review
PR overview: introduces m_grouped_bf16_gemm_nn_contiguous (based on DeepGemm) in the Cutlass MoE backend to replace paddle.incubate.nn.functional.batched_gemm, while keeping a fallback path.
Affected file: model_executor/layers/moe/fused_moe_cutlass_backend.py
Impact tag: OP

PR description check
The required Motivation / Modifications / Usage or Command sections of the PR description are unfilled; please add the motivation and the concrete changes. The wording "del paddle.batch_gemm" in the title is inaccurate: batched_gemm is kept as a fallback rather than deleted.

Suggested title (copy-paste ready):
[RL] Cutlass MoE backend: use DeepGemm grouped_gemm to replace batched_gemm

Description template:
## Motivation
In the EP prefill path of the Cutlass MoE backend, use the DeepGemm `m_grouped_bf16_gemm_nn_contiguous` kernel provided by paddlefleet_ops to replace `paddle.incubate.nn.functional.batched_gemm`, for better BF16 GEMM performance. When paddlefleet_ops is unavailable, keep the original batched_gemm as a fallback.
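The intended semantics of the grouped kernel can be sketched with a NumPy reference (a hedged stand-in for illustration only; the shapes and the per-expert loop are assumptions, while the real kernel is a fused BF16 GEMM operating on expert-contiguous inputs):

```python
import numpy as np

def grouped_gemm_nn_ref(x, y, expert_idx_per_token):
    """Reference semantics: row i of x is multiplied by the weight
    matrix of the expert that token i was routed to.

    x:                    [num_tokens, k]     activations, grouped by expert
    y:                    [num_experts, k, n] per-expert weight matrices
    expert_idx_per_token: [num_tokens]        expert id for each row of x
    """
    out = np.zeros((x.shape[0], y.shape[-1]), dtype=x.dtype)
    for e in range(y.shape[0]):
        rows = expert_idx_per_token == e  # boolean mask of this expert's tokens
        out[rows] = x[rows] @ y[e]
    return out
```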
## Modifications
1. Add a `m_grouped_bf16_gemm_nn_contiguous` wrapper that calls the DeepGemm kernel
2. Add a `return_expert_indices=True` argument to `moe_permute` to obtain `m_indices`
3. Change the two `batched_gemm` call sites into a conditional branch: use DeepGemm when available, fall back to batched_gemm
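The conditional branch in the third point can be sketched as a small dispatcher (a hypothetical helper, not the PR's actual code; real call sites pass Paddle tensors and the actual kernel and fallback callables):

```python
def dispatch_grouped_gemm(x, y, m_indices, *, deepgemm_available,
                          deepgemm_fn, batched_gemm_fn):
    """Prefer the DeepGemm grouped kernel when paddlefleet_ops is
    importable; otherwise keep the original batched_gemm path."""
    if deepgemm_available:
        return deepgemm_fn(x, y, m_indices)
    return batched_gemm_fn(x, y)
```

Keeping the availability check at the call boundary means the fallback path stays byte-for-byte the old behavior when paddlefleet_ops is absent.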
Findings

| Level | File | Description |
|---|---|---|
| 🟡 Suggestion | fused_moe_cutlass_backend.py:56 | Consider extracting this wrapper into a common module for reuse across backends |
| ❓ Question | fused_moe_cutlass_backend.py:57 | paddle.empty leaves the output uninitialized; confirm the underlying kernel writes it completely |
Overall assessment
The change is small and well-scoped, and the conditional branch keeps good fallback compatibility. The main suggestions are to complete the PR description and fix the title wording.
def m_grouped_bf16_gemm_nn_contiguous(x, y, expert_idx_per_token):
🟡 Suggestion: this function risks duplicating similar wrappers that may exist in fused_moe_deepgemm_backend.py / fused_moe_blackwell_backend.py.
Consider extracting the utility into a common module (such as moe/utils.py or moe/__init__.py) to avoid repeated definitions and ease future maintenance.
def m_grouped_bf16_gemm_nn_contiguous(x, y, expert_idx_per_token):
    out = paddle.empty([x.shape[0], y.shape[-1]], dtype=x.dtype)
❓ Question: the output tensor created by paddle.empty is uninitialized. Please confirm that the underlying paddlefleet_ops.deep_gemm.m_grouped_bf16_gemm_nn_contiguous writes every element of out; otherwise, when the token count is 0 or expert_idx_per_token is empty, the function may return a tensor containing uninitialized memory.
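The hazard this question points at can be made concrete with a NumPy stand-in (a hypothetical kernel for illustration; the real code uses Paddle tensors): rows the kernel never writes keep whatever bytes an `empty` allocation handed back, whereas a zeroed allocation is always well-defined.

```python
import numpy as np

def partial_write_kernel(out, x, y, expert_idx_per_token):
    # Simulates a grouped kernel that only writes rows routed to a
    # known expert; rows with out-of-range ids are never touched.
    for e in range(y.shape[0]):
        rows = expert_idx_per_token == e
        out[rows] = x[rows] @ y[e]

x = np.ones((4, 2), dtype=np.float32)
y = np.ones((2, 2, 3), dtype=np.float32)
idx = np.array([0, 1, 5, 5])              # two tokens routed nowhere

out = np.zeros((4, 3), dtype=np.float32)  # zeros: unwritten rows stay 0.0
partial_write_kernel(out, x, y, idx)      # with np.empty they would be garbage
```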
Codecov Report ❌ Patch coverage is

@@ Coverage Diff @@
##        develop    #7527   +/- ##
==========================================
  Coverage      ?   72.89%
==========================================
  Files         ?      419
  Lines         ?    57483
  Branches      ?     9004
==========================================
  Hits          ?    41902
  Misses        ?    12755
  Partials      ?     2826

Flags with carried forward coverage won't be shown.
EmmonsCurse
left a comment
LGTM~ Skip coverage check as it mainly relies on tests with paddlefleet.
Motivation
BF16 MoE uses m_grouped_bf16_gemm_nn_contiguous in EP prefill to align with training precision.
Depends on env:
Modifications
Usage or Command
Accuracy Tests
Checklist