Conversation

@ElizaWszola ElizaWszola commented Feb 27, 2025

CUTLASS implementation of fp8 MoE kernel.

Tested with:

from vllm import LLM

llm = LLM(model="nm-testing/DeepSeek-Coder-V2-Lite-Instruct-FP8",
          trust_remote_code=True,
          tensor_parallel_size=2)

Benchmark (DeepSeek V2 Lite, total time of 25 runs):

[--------------------------------------------------------------------------------------------------------- Quant Matmul ---------------------------------------------------------------------------------------------------------]
                                                                                                                                    |  triton_moe  |  triton_moe_cuda_graphs  |  grouped_gemm_moe  |  grouped_gemm_moe_cuda_graphs
1 threads: -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      nm-testing/deepseekv2-lite, num_experts=64, topk=6, per_act_token=False per_out_ch=False, MKN=((1, 2048, 1408))               |      3.6     |            2.6           |         3.6        |               3.3            
      nm-testing/deepseekv2-lite, num_experts=64, topk=6, per_act_token=False per_out_ch=False, MKN=((4, 2048, 1408))               |      6.8     |            6.7           |         4.7        |               4.3            
      nm-testing/deepseekv2-lite, num_experts=64, topk=6, per_act_token=False per_out_ch=False, MKN=((8, 2048, 1408))               |     10.1     |           10.0           |         5.6        |               5.1            
      nm-testing/deepseekv2-lite, num_experts=64, topk=6, per_act_token=False per_out_ch=False, MKN=((16, 2048, 1408))              |     15.0     |           14.9           |         6.8        |               6.3            
      nm-testing/deepseekv2-lite, num_experts=64, topk=6, per_act_token=False per_out_ch=False, MKN=((32, 2048, 1408))              |     16.9     |           16.8           |         7.3        |               6.9            
      nm-testing/deepseekv2-lite, num_experts=64, topk=6, per_act_token=False per_out_ch=False, MKN=((64, 2048, 1408))              |     17.0     |           16.9           |         7.6        |               7.1            
      nm-testing/deepseekv2-lite, num_experts=64, topk=6, per_act_token=False per_out_ch=False, MKN=((128, 2048, 1408))             |      8.5     |            8.4           |         8.1        |               7.6            
      nm-testing/deepseekv2-lite, num_experts=64, topk=6, per_act_token=False per_out_ch=False, MKN=((256, 2048, 1408))             |      9.1     |            9.0           |         9.0        |               8.5            
      nm-testing/deepseekv2-lite, num_experts=64, topk=6, per_act_token=False per_out_ch=False, MKN=((512, 2048, 1408))             |     10.9     |           10.8           |        10.6        |              10.1          
(times are in ms)  
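The table layout above is `torch.utils.benchmark`'s Compare output. As a rough, dependency-free illustration of the "total time of 25 runs" measurement (the harness shape here is assumed, not the actual benchmark script; `triton_moe` / `grouped_gemm_moe` are placeholder names):

```python
import time

def total_time_ms(fn, runs=25, warmup=3):
    """Total wall-clock time of `runs` calls to fn, in milliseconds."""
    for _ in range(warmup):      # warm up caches / compilation before timing
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0

# Usage sketch (names are placeholders for the real kernels):
#   t_triton  = total_time_ms(lambda: triton_moe(x))
#   t_cutlass = total_time_ms(lambda: grouped_gemm_moe(x))
```

For GPU kernels the real harness must also synchronize the device around the timed region, which `torch.utils.benchmark` handles internally.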

ElizaWszola and others added 30 commits December 6, 2024 14:36
Signed-off-by: ElizaWszola <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Comment on lines +111 to +117
template <typename Descriptor, typename T>
static auto args_from_tensor(const T* const* data_ptr, bool do_broadcast) {
  using Arguments = typename Descriptor::Arguments;
  static_assert(std::is_same_v<Descriptor, ColOrScalarLoadArray<T>> ||
                std::is_same_v<Descriptor, RowOrScalarLoadArray<T>>);
  return Arguments{data_ptr, do_broadcast};
}
Consider revisiting this interface in a follow up?

@tlrmchlsmth tlrmchlsmth left a comment

Spotted some issues, mainly around the CUDA version and compute capability checks

    torch::Tensor const& problem_sizes, torch::Tensor const& a_strides,
    torch::Tensor const& b_strides, torch::Tensor const& c_strides) {
  int32_t version_num = get_sm_version_num();
#if defined ENABLE_SCALED_MM_SM90 && ENABLE_SCALED_MM_SM90

Needs to be ENABLE_CUTLASS_MOE_SM90 now

@tlrmchlsmth tlrmchlsmth left a comment

Looks good to me now!

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 26, 2025
@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) March 27, 2025 00:51
@robertgshaw2-redhat robertgshaw2-redhat merged commit 9239bf7 into vllm-project:main Mar 27, 2025
67 checks passed
@LucasWilkinson LucasWilkinson left a comment

Sorry, I had some pending review comments that I forgot to submit; submitting them now for posterity. Most were for future PRs anyway.

 def weak_ref_tensors(
     tensors: Union[torch.Tensor, list[torch.Tensor], tuple[torch.Tensor]]
-) -> Union[torch.Tensor, list[torch.Tensor], tuple[torch.Tensor]]:
+) -> Union[torch.Tensor, list[Any], tuple[Any], Any]:
nit: Does a type union containing Any do anything more than Any?
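To illustrate the nit (a quick sketch, not code from the PR, using int as a stand-in for torch.Tensor): a static checker treats Union[X, Any] the same as plain Any, since Any is assignable to and from every type; at runtime, typing merely keeps both members:

```python
from typing import Any, Union, get_args

# At runtime, typing keeps Any as an ordinary union member rather than
# collapsing the union:
members = get_args(Union[int, Any])

# To a static checker, however, Union[int, Any] accepts exactly the same
# values as plain Any, because Any is compatible with everything -- so the
# extra union members document intent but add no actual type checking.
def f(x: Union[int, Any]) -> None:
    pass
```

So the longer annotation only buys readability, which may still be a reasonable trade-off for documenting the expected shapes.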


if quant_config._is_wNa16_group_channel(weight_quant, input_quant):
    return CompressedTensorsWNA16MoEMethod(quant_config)
elif (quant_config._is_fp8_w8a8_sm90(weight_quant, input_quant)

nit: for a future PR, we should abstract this more, like https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization/kernels/scaled_mm, to make it easier to adopt as a backend for non-compressed-tensors checkpoints
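The scaled_mm abstraction being referenced boils down to a "can_implement / first match wins" selection over candidate kernel backends. A hedged Python sketch of that shape (class and function names are illustrative, not vLLM's actual API):

```python
class MoEKernelBackend:
    """One candidate backend; subclasses report whether they support a config."""
    @classmethod
    def can_implement(cls, weight_quant, input_quant, capability):
        raise NotImplementedError

class CutlassFp8MoE(MoEKernelBackend):
    @classmethod
    def can_implement(cls, weight_quant, input_quant, capability):
        # e.g. requires fp8 w8a8 weights/activations and an SM90 (Hopper) GPU
        return weight_quant == "fp8" and input_quant == "fp8" and capability >= 90

class TritonMoE(MoEKernelBackend):
    @classmethod
    def can_implement(cls, weight_quant, input_quant, capability):
        return True  # generic fallback

def choose_moe_backend(weight_quant, input_quant, capability,
                       backends=(CutlassFp8MoE, TritonMoE)):
    # The first backend that can implement the configuration wins.
    for backend in backends:
        if backend.can_implement(weight_quant, input_quant, capability):
            return backend
    raise ValueError("no MoE backend supports this configuration")
```

Dispatching this way keeps checkpoint-format code (compressed-tensors, fp8, etc.) from hard-coding per-kernel predicates like `_is_fp8_w8a8_sm90`.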

  class StrideMNL = Stride<_0,_1,_0>,
  int Alignment = 128 / sizeof_bits_v<Element>
>
struct Sm90RowOrScalarBroadcastArray {
future PR: we should see if we can use the now-upstream IsArrayOfPointers support in Sm90RowBroadcast to eliminate this file

@li2haipeng

Thanks for the PR! It seems the PR currently only supports quant_method=compressed_tensor. I'm wondering if you have tested it on DeepSeek V3 or other quant_method=fp8 models? Are there plans to support them?

@tlrmchlsmth

> Thanks for the PR! Seems now the PR only supports quant_method=compressed_tensor. I'm wondering if you have tested it on DeepSeek V3 or other quant_method=fp8 models? Do we have plans to support them?

DeepSeekV3's blocked per-token quantization isn't supported by these kernels yet and requires additional work, but this is on our roadmap.
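For context, a hedged sketch (not vLLM code) of why this needs kernel work: per-tensor fp8 quantization carries a single scale for the whole weight matrix, while DeepSeek-V3-style blocked quantization keeps one scale per tile, so the GEMM must rescale per tile rather than once per matmul. 448.0 is the largest normal value of float8_e4m3fn; the block size is a parameter here (DeepSeek V3 uses 128x128 tiles for weights):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest normal value representable in float8_e4m3fn

def per_tensor_scale(w):
    """One scale for the entire tensor (AutoFP8-style per-tensor quant)."""
    return np.abs(w).max() / FP8_E4M3_MAX

def blocked_scales(w, block=128):
    """One scale per (block x block) tile (DeepSeek-V3-style blocked quant).

    A kernel consuming these must apply a different scale per tile inside
    the GEMM epilogue -- the extra work these kernels don't implement yet.
    """
    rows, cols = w.shape
    scales = np.empty((rows // block, cols // block))
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            scales[i // block, j // block] = np.abs(tile).max() / FP8_E4M3_MAX
    return scales
```

Per-tensor scales, by contrast, fold into a single epilogue multiply, which is why supporting AutoFP8-style checkpoints is a much smaller step than blocked quantization.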

Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
@shixianc shixianc commented Jun 7, 2025

> DeepSeekV3's blocked per-token quantization isn't supported by these kernels yet and requires additional work, but this is on our roadmap

What about per-tensor quantization? We have an fp8 model (quant_method=fp8) quantized through AutoFP8. Could you give some suggestions? I'd like to contribute, but I'd like to hear your thoughts first.
