@antsaukk commented Nov 7, 2025

Motivation

This change merges the kernel selection API asm_moe_tkw1 from fused_moe_bf16_asm.py into fused_moe.py. The merge is required to unlock kernel selection for the tkw1 case from the predefined tuned_fmoe.csv config, which improves performance in low-concurrency cases for a top-tier customer model.

Technical Details

To unlock kernel selection from the tuned_fmoe.csv config for asm_moe_tkw1, calls to asm_moe_tkw1 are forwarded into fused_moe. The selection logic of asm_moe_tkw1 is left unchanged but has been extracted into a new function, fused_moe_stage1_tkw1, in the fused_moe.py module. The function interfaces have been extended accordingly to accommodate the new arguments.
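A minimal sketch of the resulting dispatch, assuming heavily simplified signatures (the real aiter functions take many more arguments such as weights, scales, and expert ids; the parameter names and the _lookup_tuned_kernel helper below are hypothetical placeholders):

    def _lookup_tuned_kernel(m, dim, hdim, num_experts):
        """Hypothetical stand-in for the tuned_fmoe.csv lookup."""
        # Mirrors the checklist below: m=1 selects the 32x64 tile,
        # larger token counts select the 32x128 tile.
        return "32x64" if m <= 1 else "32x128"

    def fused_moe_stage1_tkw1(m, dim, hdim, num_experts):
        """Relocated tkw1 selection logic (behavior unchanged, only
        moved from fused_moe_bf16_asm.py into fused_moe.py)."""
        kernel = _lookup_tuned_kernel(m, dim, hdim, num_experts)
        # ... launch the selected one-stage asm kernel here ...
        return kernel

    def fused_moe(m, dim, hdim, num_experts, *, tkw1=False):
        """Unified entry point: asm_moe_tkw1 calls are forwarded here."""
        if tkw1:
            return fused_moe_stage1_tkw1(m, dim, hdim, num_experts)
        # ... existing fused_moe paths (e.g. 2-stage with ck_moe_stage2) ...
        return "default"

With this shape of dispatch, callers that previously imported asm_moe_tkw1 from fused_moe_bf16_asm.py can route through fused_moe instead and pick up the tuned-config selection automatically.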

Test Plan

The functionality has been tested with aiter/op_tests to verify that the selection logic works as intended. In addition, tests within the model were run to confirm end-to-end correctness.

Test Result

The above-mentioned tests confirm that correctness is preserved.

Submission Checklist

At least the following test cases should be verified (a driver sketch for running them all follows the list):

  1. AITER_LOG_MORE=1 python ./op_tests/test_moe_tkw1.py -m 1 -dim 5120 -hdim 1024 -e 128 -d bf16 (uses 1-stage fmoe; selects the 32x64 kernel)
  2. AITER_LOG_MORE=1 python ./op_tests/test_moe_tkw1.py -m 17 -dim 5120 -hdim 1024 -e 128 -d bf16 (uses 1-stage fmoe; selects the 32x128 kernel)
  3. AITER_LOG_MORE=1 python ./op_tests/test_moe_tkw1.py -m 300 -dim 5120 -hdim 1024 -e 128 -d bf16 (uses 1-stage fmoe; selects the 32x128 kernel)
  4. AITER_LOG_MORE=1 python ./op_tests/test_moe_2stage.py -d bf16 -dim 5120,1024 -t 1 -q 2 -a silu -e 128 -k 1 (uses 2-stage moe, with the 32x512 asm kernel and ck_moe_stage2 selected)
  5. AITER_LOG_MORE=1 python ./op_tests/test_moe_2stage.py -d bf16 -dim 5120,1024 -t 17 -q 2 -a silu -e 128 -k 1 (uses 2-stage moe, with the 32x512 asm kernel and ck_moe_stage2 selected)
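For convenience, all five cases can be driven from a single script. The sketch below is not part of this PR; it only wraps the commands above with subprocess and sets AITER_LOG_MORE=1 so the selected kernels appear in the verbose log output:

    import os
    import subprocess

    TKW1 = ["python", "./op_tests/test_moe_tkw1.py",
            "-dim", "5120", "-hdim", "1024", "-e", "128", "-d", "bf16"]
    TWO_STAGE = ["python", "./op_tests/test_moe_2stage.py",
                 "-d", "bf16", "-dim", "5120,1024", "-q", "2",
                 "-a", "silu", "-e", "128", "-k", "1"]

    cases = (
        [TKW1 + ["-m", m] for m in ("1", "17", "300")]
        + [TWO_STAGE + ["-t", t] for t in ("1", "17")]
    )

    env = {**os.environ, "AITER_LOG_MORE": "1"}  # verbose kernel logging
    for cmd in cases:
        subprocess.run(cmd, env=env, check=True)  # stop on first failure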

Additional notes

Please note that the following files are not meant to be merged in this PR:

hsa/gfx950/fmoe/gelu/fmoe_bf16_pertokenFp8_g1u1_gelu_tkw1.csv
hsa/gfx950/fmoe/silu/fmoe_bf16_pertokenFp8_g1u1_silu_tkw1.csv
hsa/gfx950/fmoe/silu/fmoe_bf16_pertokenInt8_g1u1_tkw1_silu_32x64.co
hsa/gfx950/fmoe/gelu/fmoe_bf16_pertokenInt8_g1u1_tkw1_gelu_32x64.co
aiter/configs/tuned_fmoe.csv

They are required here in order to verify correctness. Upon approval, they will be removed from this PR and merged separately.

@antsaukk marked this pull request as draft on November 7, 2025
@antsaukk self-assigned this on November 7, 2025
@antsaukk marked this pull request as ready for review on November 12, 2025
valarLip (Collaborator) previously approved these changes on November 19, 2025:

thanks your contribution, LGTM
