
[1/2] Add Kernel support for Cutlass based Fused FP4 MoE#6093

Merged
zhyncs merged 15 commits into sgl-project:main from pavanimajety:cutlass_fp4_moe
Jun 2, 2025

Conversation

Collaborator

@pavanimajety pavanimajety commented May 7, 2025

This PR adds Cutlass-based fused NVFP4 MoE kernel support.

Currently measured performance:

[--------------------------------------------------------------------------------------------- FP4 MOE vs FP8 Triton ---------------------------------------------------------------------------------------------]
                                                                                                                       |  triton_moe  |  triton_moe_cuda_graphs  |  cutlass_moe_fp4  |  cutlass_moe_fp4_cuda_graphs
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((4, 2048, 7168))     |     10.7     |            9.9           |        12.3       |              10.5
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((8, 2048, 7168))     |     17.5     |           16.7           |        16.4       |              14.5
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((16, 2048, 7168))    |     29.5     |           28.8           |        23.4       |              21.5
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((32, 2048, 7168))    |     45.8     |           45.1           |        32.8       |              31.2
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((64, 2048, 7168))    |     60.9     |           60.0           |        41.5       |              40.3
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((128, 2048, 7168))   |     69.5     |           68.8           |        47.8       |              45.3
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((256, 2048, 7168))   |     72.3     |           71.4           |        50.9       |              49.9
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((512, 2048, 7168))   |     80.9     |           80.1           |        59.6       |              57.8
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((1024, 2048, 7168))  |     93.7     |           92.8           |        75.9       |              74.7
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((2048, 2048, 7168))  |    142.3     |          142.0           |       109.2       |             107.6

Times are in milliseconds (ms).

A follow-up PR will add model support for nvidia/DeepSeek-R1-FP4.
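For context on the number format: NVFP4 stores elements in FP4 E2M1 (2 exponent bits, 1 mantissa bit) together with per-block scale factors; the kernel above consumes tensors already quantized this way. The sketch below is only an illustration of rounding a value onto the E2M1 grid — it is not code from this PR, and real quantization also applies block scaling, which is omitted here.

```python
# Hedged illustration (not from this PR): the non-negative values
# representable in FP4 E2M1, the element format used by NVFP4.
# Real NVFP4 additionally applies a per-block scale factor before
# rounding; this sketch shows only the element-level grid.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_e2m1(x: float) -> float:
    """Round x to the nearest E2M1-representable value.

    Ties resolve toward the smaller magnitude for simplicity; hardware
    typically uses round-to-nearest-even.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 6.0)  # clamp to the E2M1 maximum magnitude
    return sign * min(E2M1_GRID, key=lambda g: abs(g - mag))

print(quantize_e2m1(2.4))   # → 2.0 (nearest grid point)
print(quantize_e2m1(-7.0))  # → -6.0 (clamped to max magnitude)
```

Each FP4 element occupies only 4 bits, which is what makes the memory-bound small-batch MoE cases in the table above faster than the FP8 Triton baseline.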

Motivation

Modifications

Checklist



5 participants