
[1/2] Add Kernel support for Cutlass based Fused FP4 MoE#6093

Merged
zhyncs merged 15 commits into sgl-project:main from pavanimajety:cutlass_fp4_moe
Jun 2, 2025

Conversation

Collaborator

@pavanimajety pavanimajety commented May 7, 2025

This PR adds Cutlass-based fused NVFP4 MoE kernel support.

Currently measured performance:

[--------------------------------------------------------------------------------------------- FP4 MOE vs FP8 Triton ---------------------------------------------------------------------------------------------]
                                                                                                                       |  triton_moe  |  triton_moe_cuda_graphs  |  cutlass_moe_fp4  |  cutlass_moe_fp4_cuda_graphs
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((4, 2048, 7168))     |     10.7     |            9.9           |        12.3       |              10.5
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((8, 2048, 7168))     |     17.5     |           16.7           |        16.4       |              14.5
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((16, 2048, 7168))    |     29.5     |           28.8           |        23.4       |              21.5
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((32, 2048, 7168))    |     45.8     |           45.1           |        32.8       |              31.2
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((64, 2048, 7168))    |     60.9     |           60.0           |        41.5       |              40.3
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((128, 2048, 7168))   |     69.5     |           68.8           |        47.8       |              45.3
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((256, 2048, 7168))   |     72.3     |           71.4           |        50.9       |              49.9
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((512, 2048, 7168))   |     80.9     |           80.1           |        59.6       |              57.8
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((1024, 2048, 7168))  |     93.7     |           92.8           |        75.9       |              74.7
      nvidia/DeepSeek-R1-FP4, num_experts=256, topk=8, per_act_token=False per_out_ch=False, MKN=((2048, 2048, 7168))  |    142.3     |          142.0           |       109.2       |             107.6

Times are in milliseconds (ms).

A follow-up PR will add model support for nvidia/DeepSeek-R1-FP4.
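For context on the number format: NVFP4 stores elements in FP4 E2M1 (2 exponent bits, 1 mantissa bit) together with per-block scale factors; the kernel above consumes tensors already quantized this way. The sketch below is only an illustration of rounding a value onto the E2M1 grid — it is not code from this PR, and real quantization also applies block scaling, which is omitted here.

```python
# Hedged illustration (not from this PR): the non-negative values
# representable in FP4 E2M1, the element format used by NVFP4.
# Real NVFP4 additionally applies a per-block scale factor before
# rounding; this sketch shows only the element-level grid.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_e2m1(x: float) -> float:
    """Round x to the nearest E2M1-representable value.

    Ties resolve toward the smaller magnitude for simplicity; hardware
    typically uses round-to-nearest-even.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 6.0)  # clamp to the E2M1 maximum magnitude
    return sign * min(E2M1_GRID, key=lambda g: abs(g - mag))

print(quantize_e2m1(2.4))   # → 2.0 (nearest grid point)
print(quantize_e2m1(-7.0))  # → -6.0 (clamped to max magnitude)
```

Each FP4 element occupies only 4 bits, which is what makes the memory-bound small-batch MoE cases in the table above faster than the FP8 Triton baseline.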

Motivation

Modifications

Checklist



5 participants