
Conversation


@tjtanaavllm tjtanaavllm commented Nov 17, 2025

Motivation

This also addresses issue #1417.
Since commit a7f63e3 only adds preshuffle support for mxfp4, the bpreshuffle argument has been set to False for gfx942, which does not support fp4.

The fused MoE kernels have been retuned because the original configuration had an accuracy issue: we were getting an lm_eval score of 0.
Please let me know if the kernel usage or tuning procedure is incorrect, as the generated tuning file only has one kernel entry.

Technical Details

Retuning procedure that we have executed:

  1. Clean untuned_fmoe.csv and tuned_fmoe.csv.

  2. Add the following entries into untuned_fmoe.csv.

token,model_dim,inter_dim,expert,topk,act_type,dtype,q_dtype_a,q_dtype_w,q_type,use_g1u1,doweight_stage1
1024,7168,256,256,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fnuz,torch.float8_e4m3fnuz,QuantType.per_Token,1,0
512,7168,256,256,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fnuz,torch.float8_e4m3fnuz,QuantType.per_Token,1,0
256,7168,256,256,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fnuz,torch.float8_e4m3fnuz,QuantType.per_Token,1,0
128,7168,256,256,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fnuz,torch.float8_e4m3fnuz,QuantType.per_Token,1,0
64,7168,256,256,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fnuz,torch.float8_e4m3fnuz,QuantType.per_Token,1,0
32,7168,256,256,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fnuz,torch.float8_e4m3fnuz,QuantType.per_Token,1,0
16,7168,256,256,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fnuz,torch.float8_e4m3fnuz,QuantType.per_Token,1,0
  3. Run AITER_REBUILD=1 python3 hsa/gfx942/fmoe_2stages/tune.py at the root of the repository.
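The three steps above can be sketched as a small shell script. The aiter/configs location of the two CSVs is an assumption (the PR names only the files), and only two of the seven shape rows are written out here; the rest follow the same pattern:

```shell
# Sketch of the retuning procedure above.
# Assumption: the tuning CSVs live under aiter/configs/ (the PR names only the files).
CONFIG_DIR="aiter/configs"
mkdir -p "$CONFIG_DIR"

# Step 1: clean both CSVs (truncate to empty).
: > "$CONFIG_DIR/tuned_fmoe.csv"

# Step 2: write the header plus the shapes to tune (two of the seven rows shown).
cat > "$CONFIG_DIR/untuned_fmoe.csv" <<'EOF'
token,model_dim,inter_dim,expert,topk,act_type,dtype,q_dtype_a,q_dtype_w,q_type,use_g1u1,doweight_stage1
1024,7168,256,256,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fnuz,torch.float8_e4m3fnuz,QuantType.per_Token,1,0
16,7168,256,256,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fnuz,torch.float8_e4m3fnuz,QuantType.per_Token,1,0
EOF

# Step 3: rebuild and tune (commented out here: needs a ROCm gfx942 machine).
# AITER_REBUILD=1 python3 hsa/gfx942/fmoe_2stages/tune.py
```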

Test Plan

Run an end-to-end lm_eval test for the PTPC deepseek-r1 model.

Test Result

local-completions (model=EmbeddedLLM/deepseek-r1-FP8-Dynamic,base_url=http://127.0.0.1:8000/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|   |0.9447|±  |0.0063|
|     |       |strict-match    |     5|exact_match|   |0.9447|±  |0.0063|
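For reference, the table above is lm-evaluation-harness output for a local-completions run. An invocation reconstructed from the reported settings (model, base_url, 5-shot, batch size 100) might look like the following; the exact command is not in the PR, so treat every flag here as an assumption:

```shell
# Hypothetical lm_eval command reconstructed from the reported settings;
# not taken verbatim from the PR. Requires an OpenAI-compatible server
# already listening on 127.0.0.1:8000.
CMD="lm_eval --model local-completions \
  --model_args model=EmbeddedLLM/deepseek-r1-FP8-Dynamic,base_url=http://127.0.0.1:8000/v1/completions \
  --tasks gsm8k --num_fewshot 5 --batch_size 100"
echo "$CMD"
```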

Submission Checklist

Signed-off-by: tjtanaavllm <[email protected]>
@tjtanaavllm tjtanaavllm marked this pull request as draft November 17, 2025 10:56
@tjtanaavllm tjtanaavllm marked this pull request as ready for review November 17, 2025 11:01
@valarLip valarLip requested a review from yzhou103 November 17, 2025 11:51
Collaborator

@valarLip valarLip left a comment


LGTM

@yzhou103
Contributor

yzhou103 commented Nov 18, 2025

I fixed a problem with tuning the CK solutions of fmoe in #1405. You can retune the shapes with this procedure:

  1. Clean untuned_fmoe.csv and add the shapes you want to tune.

  2. Run AITER_REBUILD=1 python3 hsa/gfx942/fmoe_2stages/tune.py --all at the root of the repository. It will update the tuned shapes in tuned_fmoe.csv.

@tjtanaavllm
Author

I fixed a problem of tuning with the CK solutions of fmoe in #1405. And you can retune the shapes with this procedure:

Ok. Let me try and get back to you.

@tjtanaavllm
Author

tjtanaavllm commented Nov 18, 2025

@yzhou103
Do you know why there is this issue with the CK fmoe kernels?


Error in process:431120 info:((80, 1024, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage1', '_ZN5aiter44fmoe_stage1_bf16_pertokenFp8_g1u1_32x512_pf2E', 32): 'NoneType' object is not subscriptable
Error in process:431120 info:((80, 16, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage1', 'moe_ck2stages_gemm1_256x64x128x128_1x4_MulABScale_v3_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16', 64): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_
Error in process:431120 info:((80, 16, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage2', 'moe_ck2stages_gemm2_256x128x128x128_1x4_MulABScaleExpertWeight_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16', 128): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_
Error in process:431118 info:((80, 512, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage1', '_ZN5aiter48fmoe_stage1_bf16_pertokenFp8_g1u1_16x512_2tg_pf2E', 16): 'NoneType' object is not subscriptable
Error in process:431118 info:((80, 16, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage1', 'moe_ck2stages_gemm1_256x32x64x128_1x4_MulABScale_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16', 32): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_
Error in process:431118 info:((80, 16, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage2', 'moe_ck2stages_gemm2_256x64x128x256_1x4_MulABScaleExpertWeight_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16', 64): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_
Error in process:431118 info:((80, 32, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage1', 'moe_ck2stages_gemm1_256x64x64x128_1x4_MulABScale_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16', 64): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_
Error in process:431118 info:((80, 32, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage2', 'moe_ck2stages_gemm2_256x64x128x128_1x4_MulABScaleExpertWeight_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16', 64): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_
Error in process:431118 info:((80, 64, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage1', 'moe_ck2stages_gemm1_256x64x64x128_1x4_MulABScale_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16', 64): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_
Error in process:431118 info:((80, 64, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage2', 'moe_ck2stages_gemm2_256x64x128x256_1x4_MulABScaleExpertWeight_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16', 64): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_
Error in process:431118 info:((80, 128, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage1', 'moe_ck2stages_gemm1_256x32x64x128_1x4_MulABScale_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16', 32): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_
Error in process:431118 info:((80, 128, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage1', 'moe_ck2stages_gemm1_256x64x128x256_1x4_MulABScale_v3_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16', 64): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_
Error in process:431118 info:((80, 128, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage1', 'moe_ck2stages_gemm1_256x128x128x256_1x4_MulABScale_v3_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16', 128): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_
Error in process:431118 info:((80, 256, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage2', 'moe_ck2stages_gemm2_256x32x64x256_1x4_MulABScaleExpertWeight_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16', 32): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_
Error in process:431118 info:((80, 256, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage1', 'moe_ck2stages_gemm1_256x128x64x128_1x4_MulABScale_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16', 128): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_
Error in process:431118 info:((80, 512, 7168, 256, 256, 8, <ActivationType.Silu: 0>, torch.bfloat16, torch.float8_e4m3fnuz, torch.float8_e4m3fnuz, <QuantType.per_Token: 2>, 1, 0), 'stage1', 'moe_ck2stages_gemm1_256x32x64x256_1x4_MulABScale_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16', 32): /app/debugds/update_fmoe_dsv3_ptpc_config/aiter/jit/module_moe_ck2stages_f8_f8_preshuffle_on_b16_silu_per_token_mulWeightStage2.so: undefined symbol: _Z18ck_moe_stage1_gemmIN2ck9f4x2_pk_tES1_ft10MulABScaleLNS0_24BlockGemmPipelineVersionE2ELi256ELi64ELi64ELi128ELi2ELi2ELb0ELb0ELb0ELi1EEvRKP12ihipStream_tiiiiiRPvS9_S9_S9_S9_S9_S9_S9_St8optionalIS8_ESB_

I can't reproduce this problem. Have you updated your code base (ck submodule) and cleaned the build?
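A cleanup along those lines might look like the following sketch; the aiter/jit module path is taken from the error messages above, and the retune command from the earlier comment (both are assumptions about the local layout):

```shell
# Sketch: refresh the ck submodule and clear AITER's JIT cache so stale
# .so files (the source of the "undefined symbol" errors) get rebuilt.
git submodule update --init --recursive 2>/dev/null || true  # no-op outside a checkout
rm -rf aiter/jit/module_moe_ck2stages_*.so                   # drop stale JIT modules
# AITER_REBUILD=1 python3 hsa/gfx942/fmoe_2stages/tune.py --all  # then retune
echo "jit cache cleared"
```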
