
chore: MoE benchmark effective BW fix for trtllm_block_scale_moe #2341

Merged
yzh119 merged 4 commits into flashinfer-ai:main from rosenrodt:chore-moe-bench-eff-bw
Jan 14, 2026

Conversation

@rosenrodt
Contributor

@rosenrodt rosenrodt commented Jan 13, 2026

📌 Description

The MoE benchmark script overestimates the number of bytes loaded by assuming all experts are active; as a result, I saw effective BW exceed 3x the peak BW of some systems. The fix is to compute the routed experts (topk_ids) on the host side and count the unique number of experts, following the same logic as cutlass_fused_moe.

While investigating the above issue, I also found that initializing routing_bias with rand() results in a very skewed expert distribution (the repro command below gives 18 active out of 128 experts). I'd like to change it to ones()*0.1 for a smoother expert distribution (now giving 114 out of 128), while maintaining the same load/compute behavior in the kernels.

python3 flashinfer_benchmark.py --routine trtllm_fp4_block_scale_moe --num_tokens 32 --hidden_size 7168 --intermediate_size 2048 --num_experts 128 --routing_method deepseek_v3 --top_k 8 --n_group 8 --topk_group 4 --routed_scaling_factor 2.5 --use_routing_bias --use_shuffled_weight --generate_repro_command -vv
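The host-side fix can be sketched as follows. This is an illustrative NumPy approximation with a hypothetical helper name (`count_active_experts`); the actual benchmark works on torch tensors and, for DeepSeekV3, routes through the fused_topk_deepseek kernel before taking the unique count.

```python
import numpy as np

def count_active_experts(routing_logits, routing_bias, top_k):
    """Host-side approximation of the routed-expert count (hypothetical name).

    Mirrors the simple top-k path; in the PR, DeepSeekV3 routing instead
    goes through fused_topk_deepseek before the unique count is taken.
    """
    scores = routing_logits.astype(np.float32)
    if routing_bias is not None:
        scores = scores + routing_bias.astype(np.float32)
    # topk_ids per token, then the number of distinct experts actually hit
    topk_ids = np.argpartition(-scores, top_k - 1, axis=-1)[:, :top_k]
    return int(np.unique(topk_ids).size)

rng = np.random.default_rng(0)
logits = rng.standard_normal((32, 128))   # num_tokens=32, num_experts=128
active = count_active_experts(logits, None, top_k=8)
# At most min(num_experts, num_tokens * top_k) experts can be active.
assert 1 <= active <= min(128, 32 * 8)

# A uniform bias (the proposed ones()*0.1) shifts all scores equally,
# so in this plain top-k sketch it never changes which experts win:
bias = np.full(128, 0.1, dtype=np.float32)
assert count_active_experts(logits, bias, top_k=8) == active
```

The benchmark then charges weight bytes for `active` experts rather than all `num_experts`, which is what previously pushed the reported effective BW past the hardware peak.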

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Added support for nvfp4 and mxfp4 quantization formats in bandwidth calculations.
    • Introduced routing support for DeepSeekV3 method.
  • Improvements

    • Enhanced routing bias initialization for more consistent expert distribution.
    • Expanded routing computation utilities for greater flexibility.
  • Tests

    • Updated benchmark test data to align with new routing and quantization logic.


@coderabbitai
Contributor

coderabbitai bot commented Jan 13, 2026

📝 Walkthrough

Walkthrough

Added a new helper _compute_routing_for_method to compute routing for multiple routing types (including DeepSeekV3), switched routing_bias initialization to a uniform value when enabled, extended bandwidth/format handling for nvfp4/mxfp4, and updated MOE test routines to use the new routing path and propagate active_expert counts and verbose flags.

Changes

Cohort / File(s) Summary
Routing core & imports
benchmarks/routines/moe.py
Imported RoutingMethodType and fused_topk_deepseek; added _compute_routing_for_method(...) to route via DeepSeekV3 or existing methods and return selected expert indices.
Routing bias & test data
benchmarks/routines/moe.py
Changed routing bias initialization to uniform 0.1 when use_routing_bias is enabled; updated create_trtllm_moe_test_data and test data creation to align with new routing logic.
Bandwidth, formats & logging
benchmarks/routines/moe.py
Extended get_effective_bytes to recognize nvfp4 and mxfp4; added verbose output for num_active_experts (verbose >= 2); threaded verbose through calculate_moe_bandwidth and related calls.
Test routines & format adjustments
benchmarks/routines/moe.py
Updated MOE test routines to call _compute_routing_for_method to obtain selected_experts, compute active_experts as unique count, pass active_experts and verbose into bandwidth calculations, and adjust input/weight format expectations (e.g., nvfp4 for FP4 variants; FP8 paths keep fp8).
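The dispatch described above can be sketched as follows. This is a simplified stand-in: the `RoutingMethodType` enum here is abridged, and the DeepSeekV3 branch merely marks where the real helper validates its parameters and calls fused_topk_deepseek.

```python
import numpy as np
from enum import Enum, auto

class RoutingMethodType(Enum):
    # Abridged stand-in for flashinfer.fused_moe.core.RoutingMethodType
    DeepSeekV3 = auto()
    TopK = auto()

def compute_routing_for_method(routing_logits, routing_bias, top_k, method):
    """Return per-token expert ids (sketch of the PR's helper)."""
    if method is RoutingMethodType.DeepSeekV3:
        # The real helper validates n_group/topk_group and delegates to
        # fused_topk_deepseek; that path is omitted in this sketch.
        raise NotImplementedError("DeepSeekV3 routing needs fused_topk_deepseek")
    scores = routing_logits.astype(np.float32)
    if routing_bias is not None:
        scores = scores + routing_bias.astype(np.float32)
    # Plain top-k fallback: ids of the top_k highest-scoring experts per token
    return np.argpartition(-scores, top_k - 1, axis=-1)[:, :top_k]

ids = compute_routing_for_method(
    np.random.default_rng(1).standard_normal((4, 16)), None, 2,
    RoutingMethodType.TopK)
assert ids.shape == (4, 2)
```

The caller only needs the indices; counting `np.unique(ids).size` gives the active-expert figure fed into the bandwidth calculation.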

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested reviewers

  • cyx-6
  • bkryu
  • jiahanc

Poem

🐰 I nibble bits and route with cheer,

DeepSeek hops in, the experts appear,
Bias set steady, no wild skew,
Bandwidth hums — a clearer view,
Hoppity code, the tests run true 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Docstring Coverage ⚠️ Warning — Docstring coverage is 70.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
Title check ✅ Passed — The title 'chore: MoE benchmark effective BW fix for trtllm_block_scale_moe' directly describes the main fix: correcting the effective bandwidth calculation in the MoE benchmark for this routine.
Description check ✅ Passed — The description clearly explains the issue, the fix approach, and the secondary routing_bias initialization fix, and includes a reproducible command. Pre-commit checks are marked complete; however, the test status section required by the template is left unchecked.



@gemini-code-assist
Contributor

Summary of Changes

Hello @rosenrodt, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the Mixture-of-Experts (MoE) benchmarking framework by addressing critical inaccuracies in effective bandwidth calculations. It introduces a more granular approach to determine the memory footprint of different FP4 quantization schemes and ensures that the count of active experts, crucial for bandwidth metrics, is precisely derived from the specific routing algorithm in use. These improvements lead to more robust and accurate performance assessments for MoE models.

Highlights

  • Accurate Bandwidth Calculation for FP4: The effective byte size calculation for FP4 formats has been refined to differentiate between 'nvfp4' and 'mxfp4', adding specific overheads (1/16 and 1/32 respectively) for more precise memory bandwidth estimations.
  • Correct Active Expert Determination: A new utility function, _compute_routing_for_method, has been introduced to accurately determine the number of active experts based on the specific routing method (e.g., DeepSeekV3), ensuring that bandwidth metrics reflect actual expert usage.
  • Integration into MoE Benchmarks: The updated active expert calculation is now integrated into various TRT-LLM MoE benchmarks (FP4 block scale, FP8 block scale, and FP8 per-tensor scale) to provide more reliable performance evaluations.
  • Simplified Format Handling: The logic for input_format and weight_format parameters in the run_cutlass function has been streamlined for improved clarity and consistency.
  • Uniform Routing Bias for Testing: Test data generation for routing bias now uses a uniform distribution, aiming to create less skewed expert distributions for more stable and representative benchmark results.
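Under the per-element overheads quoted above (1/16 for nvfp4, 1/32 for mxfp4, i.e. one 1-byte block scale shared by 16 or 32 elements), the accounting can be sketched as below; `effective_bytes_per_element` is an illustrative name, not the PR's actual function.

```python
def effective_bytes_per_element(fmt, itemsize=2.0):
    """Illustrative per-element byte cost for bandwidth accounting.

    FP4 formats store a 4-bit value (0.5 B) plus one shared 1-byte scale
    per quantization block: 16 elements for nvfp4, 32 for mxfp4.
    """
    if fmt == "nvfp4":
        return 0.5 + 1 / 16   # 0.5625 B/element
    if fmt == "mxfp4":
        return 0.5 + 1 / 32   # 0.53125 B/element
    if fmt == "fp8":
        return 1.0
    return itemsize           # "base": fall back to the dtype's own size

assert effective_bytes_per_element("nvfp4") == 0.5625
assert effective_bytes_per_element("mxfp4") == 0.53125
```

Multiplying this per-element cost by the element count of the weights actually touched (active experts only) gives the corrected byte total.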



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the MoE benchmark to more accurately calculate the effective bandwidth, particularly for the trtllm_block_scale_moe routine. This is achieved by introducing a new function _compute_routing_for_method to determine the actual number of active experts based on the specific routing method, like DeepSeekV3. The changes are logical and improve the accuracy of the benchmark. I've added a few suggestions to improve code consistency and maintainability.

Comment on lines +1358 to +1368

    # Compute selected experts for accurate bandwidth calculation
    # Use the actual routing method to get correct expert assignments
    selected_experts = _compute_routing_for_method(
        routing_logits=routing_logits,
        routing_bias=routing_bias,
        top_k=top_k,
        routing_method_type=routing_method_type,
        n_group=n_group,
        topk_group=topk_group,
        routed_scaling_factor=routed_scaling_factor,
    )

medium

This block of code to compute selected_experts is duplicated in testTrtllmFp4BlockScaleMoe (lines 664-674), testTrtllmFp8BlockScaleMoe (lines 1358-1368), and testTrtllmFp8PerTensorScaleMoe (lines 1626-1636). To improve maintainability and reduce redundancy, consider extracting this logic into a helper function. This would make the code cleaner and easier to update in the future.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
benchmarks/routines/moe.py (3)

1492-1505: Missing verbose parameter for consistency.

The calculate_moe_bandwidth call here doesn't include verbose=args.verbose, unlike testTrtllmFp4BlockScaleMoe (line 873). This means -vv won't print the active expert count for this routine.

♻️ Suggested fix
     tb_per_sec = calculate_moe_bandwidth(
         num_tokens,
         hidden_size,
         intermediate_size,
         num_experts,
         top_k,
         median_time,
         input_dtype,
         weight_dtype,
         input_format="fp8",
         weight_format="fp8",
         routing_logits_dtype=routing_logits.dtype,
         active_experts=int(selected_experts.unique().numel()),
+        verbose=args.verbose,
     )

1723-1736: Missing verbose parameter for consistency.

Same as testTrtllmFp8BlockScaleMoe, the verbose parameter is not passed here. For consistent behavior with testTrtllmFp4BlockScaleMoe, consider adding it.

♻️ Suggested fix
     tb_per_sec = calculate_moe_bandwidth(
         num_tokens,
         hidden_size,
         intermediate_size,
         num_experts,
         top_k,
         median_time,
         input_dtype,
         weight_dtype,
         input_format="fp8",
         weight_format="fp8",
         routing_logits_dtype=routing_logits.dtype,
         active_experts=int(selected_experts.unique().numel()),
+        verbose=args.verbose,
     )

1223-1236: Consider adding verbose parameter here as well.

For consistency with testTrtllmFp4BlockScaleMoe, this call could also pass verbose=args.verbose to enable -vv debug output for active expert count.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2062dec and 32fd692.

📒 Files selected for processing (1)
  • benchmarks/routines/moe.py
🧰 Additional context used
🧬 Code graph analysis (1)
benchmarks/routines/moe.py (3)
flashinfer/fused_moe/fused_routing_dsv3.py (1)
  • fused_topk_deepseek (119-194)
flashinfer/fused_moe/core.py (1)
  • RoutingMethodType (61-75)
csrc/trtllm_fused_moe_kernel_launcher.cu (13)
  • routing_bias (158-164)
  • routing_logits (147-155)
  • args (142-144)
  • args (419-428)
  • args (419-421)
  • args (536-558)
  • args (536-538)
  • args (728-752)
  • args (728-730)
  • args (959-978)
  • args (959-960)
  • args (1133-1160)
  • args (1133-1136)
🪛 Ruff (0.14.10)
benchmarks/routines/moe.py

535-537: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (7)
benchmarks/routines/moe.py (7)

16-18: LGTM!

The new imports for fused_topk_deepseek and RoutingMethodType are correctly added and align with their usage in _compute_routing_for_method.


321-324: LGTM!

Using uniform routing bias (0.1) instead of random normal values creates a more consistent expert distribution across benchmark runs, which is appropriate for reproducible performance measurements.


504-564: Well-structured routing helper function.

The function correctly handles DeepSeekV3 routing using the specialized fused_topk_deepseek kernel and falls back to simple top-k for other routing methods. The parameter validation for DeepSeekV3 is thorough.

One note: the comment on lines 559-561 acknowledges that Llama4 routing is approximated with simple top-k, which is acceptable for bandwidth estimation in benchmarks but worth keeping in mind if benchmark accuracy becomes critical for that routing method.


664-674: Good: Accurate expert selection for bandwidth calculation.

Using the actual routing method to compute selected experts ensures the bandwidth calculation reflects the true number of active experts rather than a theoretical estimate.


869-874: LGTM!

The bandwidth calculation now correctly uses:

  • nvfp4 format for proper byte accounting
  • Actual active expert count from routing
  • Verbose flag for debug output

1232-1236: LGTM!

Using the variant string directly as format correctly handles:

  • "base": falls through to dtype.itemsize (unquantized)
  • "fp8": returns 1.0 byte
  • "nvfp4": returns 0.5 + 1/16 bytes

The active_experts calculation correctly uses the unique expert count from routing.


450-457: The nvfp4 and mxfp4 byte calculations are correct and verified:

  • nvfp4: 0.5 + 1/16 = 0.5625 bytes/element (4-bit values + scale factor for 16-element blocks)
  • mxfp4: 0.5 + 1/32 = 0.53125 bytes/element (4-bit values + scale factor for 32-element blocks)

The scale factor overheads match the flashinfer quantization block sizes used in the codebase. No issues found.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
benchmarks/routines/moe.py (1)

438-438: Consider documenting the new verbose parameter.

The verbose parameter was added to calculate_moe_bandwidth but the docstring (lines 443-447) wasn't updated to document it. Consider adding a brief description for completeness.

📝 Suggested docstring addition
     Args:
         input_format: Override for input representation; None uses dtype.itemsize
         weight_format: Override for weight representation; None uses dtype.itemsize
         routing_logits_dtype: Dtype for routing logits memory accounting (default float32)
+        active_experts: Number of active experts; if None, estimated as min(num_experts, top_k * num_tokens)
+        verbose: Verbosity level for debug output (0=quiet, 2=print active expert count)
     """
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32fd692 and 2b192b3.

📒 Files selected for processing (1)
  • benchmarks/routines/moe.py
🧰 Additional context used
🪛 Ruff (0.14.10)
benchmarks/routines/moe.py

535-537: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (11)
benchmarks/routines/moe.py (11)

16-18: LGTM!

The new imports for fused_topk_deepseek and RoutingMethodType are necessary for the new routing computation helper and are correctly placed with related fused_moe imports.


321-324: LGTM!

Switching from random normal to uniform bias (0.1) is a sensible change for benchmarking. Random routing bias can produce highly skewed expert distributions that aren't representative of real workloads, making benchmark results harder to interpret and compare.


450-457: LGTM!

The effective byte calculations for FP4 formats correctly account for both the 4-bit data (0.5 bytes) and the block scale overhead (1/16 for nvfp4 with 16-element blocks, 1/32 for mxfp4 with 32-element blocks). This aligns with the quantization block sizes used elsewhere in the codebase.


504-563: LGTM!

The new _compute_routing_for_method helper correctly handles DeepSeekV3 routing via fused_topk_deepseek with proper parameter validation, while falling back to simple top-k for other routing methods. The approach of computing routing on the host to count unique experts is the right fix for accurate bandwidth calculation.

One minor note: topk_values (line 544) is allocated but unused after the fused_topk_deepseek call. This is acceptable overhead for benchmark setup code since we only need the indices for counting unique experts.


664-674: LGTM!

This is the core fix for the PR - computing routing on the host before benchmarking to determine actual expert assignments. The parameters correctly mirror those passed to the kernel, ensuring the bandwidth calculation reflects real kernel behavior.


869-874: LGTM!

The bandwidth calculation now correctly uses:

  • nvfp4 format for accurate FP4 byte accounting
  • selected_experts.unique().numel() to count actually activated experts
  • Verbose flag propagation for debugging

This fixes the original issue where effective bandwidth was overstated by assuming all experts were active.


1232-1236: LGTM!

The bandwidth calculation correctly uses the variant string ("base", "fp8", or "nvfp4") as the format specifier, which maps properly to the get_effective_bytes logic. The Cutlass path appropriately uses the existing _compute_routing since it doesn't require DeepSeekV3-specific routing.


1359-1369: LGTM!

Consistent application of the routing computation fix for the FP8 block scale benchmark, ensuring accurate active expert counting.


1505-1506: LGTM!

Active experts and verbose flag correctly propagated for FP8 block scale bandwidth calculation.


1628-1638: LGTM!

Consistent routing computation for the FP8 per-tensor scale benchmark.


1737-1738: LGTM!

Active experts and verbose flag correctly propagated for FP8 per-tensor scale bandwidth calculation.

        return 0.5 + 1 / 16
    elif fmt == "mxfp4":
        return 0.5 + 1 / 32
    elif fmt == "fp8":
Collaborator

Is the weight for fp8 block-scaled? i.e. MXFP8?

Contributor Author

I have not considered block-scale fp8 (DeepSeek style) yet. In that case, it should be 1 fp32 scale every 128x128 block for weight, and 1 fp32 scale every 128x1 for activation.
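As a sketch of that accounting (hypothetical helper names; amortized bytes per element, assuming 4-byte fp32 scales as described above):

```python
# DeepSeek-style block-scaled FP8, per the reply above: one fp32 (4-byte)
# scale per 128x128 weight block, one fp32 scale per 128x1 activation block.
def fp8_blockscale_weight_bytes_per_element():
    return 1.0 + 4.0 / (128 * 128)   # fp8 value + amortized weight scale

def fp8_blockscale_activation_bytes_per_element():
    return 1.0 + 4.0 / 128           # fp8 value + amortized activation scale

assert fp8_blockscale_weight_bytes_per_element() == 1.000244140625
assert fp8_blockscale_activation_bytes_per_element() == 1.03125
```

The weight-side overhead is negligible (~0.02%), while the activation-side scale adds about 3%, so supporting this variant in get_effective_bytes would mostly matter for activation accounting.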

@nv-yunzheq
Collaborator

LGTM.
For the TRTLLM gen MoE case, the bandwidth is still an estimate, since the selected experts are not computed exactly the same way as in the kernel itself. But it's better and closer to reality than what we previously had.

@rosenrodt
Contributor Author

LGTM. For the TRTLLM gen MoE case, the bandwidth is still an estimate, since the selected experts are not computed exactly the same way as in the kernel itself. But it's better and closer to reality than what we previously had.

Routing is computed internally in fairly high precision (bf16 or fp32). Therefore as long as the math is equivalent I think we're good. I will let @ChristinaZ comment if that is really the case for routing + topK in trtllm_block_scale_moe.
