chore: MoE benchmark effective BW fix for trtllm_block_scale_moe #2341
yzh119 merged 4 commits into flashinfer-ai:main
Conversation
📝 Walkthrough: Added a new helper, `_compute_routing_for_method`, for computing routing on the host. Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes.
Summary of Changes

Hello @rosenrodt, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the Mixture-of-Experts (MoE) benchmarking framework by addressing inaccuracies in effective bandwidth calculations. It introduces a more granular approach to determining the memory footprint of different FP4 quantization schemes and ensures that the count of active experts, crucial for bandwidth metrics, is derived from the specific routing algorithm in use. These improvements lead to more robust and accurate performance assessments for MoE models.
Code Review
This pull request refactors the MoE benchmark to more accurately calculate the effective bandwidth, particularly for the trtllm_block_scale_moe routine. This is achieved by introducing a new function _compute_routing_for_method to determine the actual number of active experts based on the specific routing method, like DeepSeekV3. The changes are logical and improve the accuracy of the benchmark. I've added a few suggestions to improve code consistency and maintainability.
```python
# Compute selected experts for accurate bandwidth calculation
# Use the actual routing method to get correct expert assignments
selected_experts = _compute_routing_for_method(
    routing_logits=routing_logits,
    routing_bias=routing_bias,
    top_k=top_k,
    routing_method_type=routing_method_type,
    n_group=n_group,
    topk_group=topk_group,
    routed_scaling_factor=routed_scaling_factor,
)
```
This block of code to compute selected_experts is duplicated in testTrtllmFp4BlockScaleMoe (lines 664-674), testTrtllmFp8BlockScaleMoe (lines 1358-1368), and testTrtllmFp8PerTensorScaleMoe (lines 1626-1636). To improve maintainability and reduce redundancy, consider extracting this logic into a helper function. This would make the code cleaner and easier to update in the future.
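A simplified, self-contained sketch of what such an extracted helper could look like. The name `compute_active_experts` is illustrative, and plain top-k stands in for the full routing-method dispatch; the real helper would delegate to `_compute_routing_for_method` with the parameters shown in the duplicated block:

```python
def compute_active_experts(routing_logits, top_k):
    """Apply plain top-k routing per token and count distinct experts.

    routing_logits: list of per-token logit lists (num_tokens x num_experts).
    Returns the number of unique experts selected, which is the quantity
    the bandwidth formula needs.
    """
    active = set()
    for token_logits in routing_logits:
        # Rank expert indices by logit value, highest first
        ranked = sorted(
            range(len(token_logits)),
            key=lambda e: token_logits[e],
            reverse=True,
        )
        active.update(ranked[:top_k])
    return len(active)

# Two tokens over four experts with top_k=2 select experts {0, 2} and {1, 2}.
logits = [[0.9, 0.1, 0.5, 0.2], [0.1, 0.8, 0.7, 0.0]]
assert compute_active_experts(logits, top_k=2) == 3
```

Each test body would then reduce to a single call, so a future change to the routing logic only needs to be made in one place.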
Actionable comments posted: 0
🧹 Nitpick comments (3)
benchmarks/routines/moe.py (3)
1492-1505: Missing `verbose` parameter for consistency. The `calculate_moe_bandwidth` call here doesn't include `verbose=args.verbose`, unlike `testTrtllmFp4BlockScaleMoe` (line 873). This means `-vv` won't print the active expert count for this routine.

♻️ Suggested fix
```diff
 tb_per_sec = calculate_moe_bandwidth(
     num_tokens,
     hidden_size,
     intermediate_size,
     num_experts,
     top_k,
     median_time,
     input_dtype,
     weight_dtype,
     input_format="fp8",
     weight_format="fp8",
     routing_logits_dtype=routing_logits.dtype,
     active_experts=int(selected_experts.unique().numel()),
+    verbose=args.verbose,
 )
```
1723-1736: Missing `verbose` parameter for consistency. Same as `testTrtllmFp8BlockScaleMoe`, the `verbose` parameter is not passed here. For consistent behavior with `testTrtllmFp4BlockScaleMoe`, consider adding it.

♻️ Suggested fix
```diff
 tb_per_sec = calculate_moe_bandwidth(
     num_tokens,
     hidden_size,
     intermediate_size,
     num_experts,
     top_k,
     median_time,
     input_dtype,
     weight_dtype,
     input_format="fp8",
     weight_format="fp8",
     routing_logits_dtype=routing_logits.dtype,
     active_experts=int(selected_experts.unique().numel()),
+    verbose=args.verbose,
 )
```
1223-1236: Consider adding the `verbose` parameter here as well. For consistency with `testTrtllmFp4BlockScaleMoe`, this call could also pass `verbose=args.verbose` to enable `-vv` debug output for the active expert count.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
benchmarks/routines/moe.py
🧰 Additional context used
🧬 Code graph analysis (1)
benchmarks/routines/moe.py (3)
flashinfer/fused_moe/fused_routing_dsv3.py (1)
- `fused_topk_deepseek` (119-194)

flashinfer/fused_moe/core.py (1)
- `RoutingMethodType` (61-75)

csrc/trtllm_fused_moe_kernel_launcher.cu (13)
- `routing_bias` (158-164), `routing_logits` (147-155), `args` (142-144, 419-428, 419-421, 536-558, 536-538, 728-752, 728-730, 959-978, 959-960, 1133-1160, 1133-1136)
🪛 Ruff (0.14.10)
benchmarks/routines/moe.py
535-537: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (7)
benchmarks/routines/moe.py (7)
16-18: LGTM! The new imports for `fused_topk_deepseek` and `RoutingMethodType` are correctly added and align with their usage in `_compute_routing_for_method`.
321-324: LGTM! Using uniform routing bias (0.1) instead of random normal values creates a more consistent expert distribution across benchmark runs, which is appropriate for reproducible performance measurements.
504-564: Well-structured routing helper function. The function correctly handles DeepSeekV3 routing using the specialized `fused_topk_deepseek` kernel and falls back to simple top-k for other routing methods. The parameter validation for DeepSeekV3 is thorough.

One note: the comment on lines 559-561 acknowledges that Llama4 routing is approximated with simple top-k, which is acceptable for bandwidth estimation in benchmarks but worth keeping in mind if benchmark accuracy becomes critical for that routing method.
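For context, the DeepSeekV3 path restricts top-k selection to the best expert groups before picking experts. The pure-Python model below sketches that group-limited selection only; scoring and bias details of the real `fused_topk_deepseek` kernel are omitted, and group scoring by the single strongest expert is a simplification:

```python
def grouped_topk(scores, n_group, topk_group, top_k):
    """Group-limited top-k: rank expert groups by their strongest expert,
    keep the topk_group best groups, then take the global top-k among the
    experts of the surviving groups. Returns selected expert indices."""
    num_experts = len(scores)
    group_size = num_experts // n_group
    # Score each group by its best expert
    group_best = [
        max(scores[g * group_size:(g + 1) * group_size]) for g in range(n_group)
    ]
    kept = sorted(range(n_group), key=lambda g: group_best[g], reverse=True)[:topk_group]
    # Only experts in surviving groups are eligible for final top-k
    candidates = [e for g in kept for e in range(g * group_size, (g + 1) * group_size)]
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]

# 8 experts in 4 groups; only the 2 best groups survive, then top-2 overall.
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.0, 0.4]
assert grouped_topk(scores, n_group=4, topk_group=2, top_k=2) == [1, 3]
```

The group restriction is why plain top-k is only an approximation for this routing method: an expert with a high score in a discarded group is never selected.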
664-674: Good: Accurate expert selection for bandwidth calculation. Using the actual routing method to compute selected experts ensures the bandwidth calculation reflects the true number of active experts rather than a theoretical estimate.
869-874: LGTM! The bandwidth calculation now correctly uses:
- `nvfp4` format for proper byte accounting
- Actual active expert count from routing
- Verbose flag for debug output
1232-1236: LGTM! Using the variant string directly as the format correctly handles:
- `"base"`: falls through to `dtype.itemsize` (unquantized)
- `"fp8"`: returns 1.0 byte
- `"nvfp4"`: returns 0.5 + 1/16 bytes

The `active_experts` calculation correctly uses the unique expert count from routing.
450-457: The nvfp4 and mxfp4 byte calculations are correct and verified:
- nvfp4: 0.5 + 1/16 = 0.5625 bytes/element (4-bit values + scale factor for 16-element blocks)
- mxfp4: 0.5 + 1/32 = 0.53125 bytes/element (4-bit values + scale factor for 32-element blocks)
The scale factor overheads match the flashinfer quantization block sizes used in the codebase. No issues found.
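The byte accounting verified above can be expressed as a small lookup. This is a sketch mirroring the numbers in the review, not necessarily the exact shape of the benchmark's `get_effective_bytes`:

```python
def effective_bytes_per_element(fmt, dtype_itemsize=2.0):
    """Effective bytes moved per element for each weight/activation format.

    FP4 formats store 4-bit values (0.5 bytes) plus one 1-byte scale per
    quantization block: 16-element blocks for nvfp4, 32-element blocks
    for mxfp4.
    """
    if fmt == "nvfp4":
        return 0.5 + 1 / 16   # 0.5625 bytes/element
    if fmt == "mxfp4":
        return 0.5 + 1 / 32   # 0.53125 bytes/element
    if fmt == "fp8":
        return 1.0
    return dtype_itemsize     # "base": unquantized dtype size

assert effective_bytes_per_element("nvfp4") == 0.5625
assert effective_bytes_per_element("mxfp4") == 0.53125
```

Multiplying these per-element costs by the weight and activation element counts gives the total bytes moved, the numerator of the effective-bandwidth formula.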
Actionable comments posted: 0
🧹 Nitpick comments (1)
benchmarks/routines/moe.py (1)
438-438: Consider documenting the new `verbose` parameter. The `verbose` parameter was added to `calculate_moe_bandwidth` but the docstring (lines 443-447) wasn't updated to document it. Consider adding a brief description for completeness.

📝 Suggested docstring addition
```diff
     Args:
         input_format: Override for input representation; None uses dtype.itemsize
         weight_format: Override for weight representation; None uses dtype.itemsize
         routing_logits_dtype: Dtype for routing logits memory accounting (default float32)
+        active_experts: Number of active experts; if None, estimated as min(num_experts, top_k * num_tokens)
+        verbose: Verbosity level for debug output (0=quiet, 2=print active expert count)
     """
```
🔇 Additional comments (11)
benchmarks/routines/moe.py (11)
16-18: LGTM! The new imports for `fused_topk_deepseek` and `RoutingMethodType` are necessary for the new routing computation helper and are correctly placed with the related fused_moe imports.
321-324: LGTM! Switching from random normal to uniform bias (0.1) is a sensible change for benchmarking. Random routing bias can produce highly skewed expert distributions that aren't representative of real workloads, making benchmark results harder to interpret and compare.
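The skew is easy to reproduce with a toy model. The simulation below is illustrative only (not the benchmark's actual data init): it adds a per-expert bias to Gaussian logits and counts distinct experts reached by top-k routing. A widely spread random bias funnels traffic to a few high-bias experts, while a uniform 0.1 bias lets the logit noise spread tokens across all experts:

```python
import random

def active_expert_count(num_tokens, num_experts, top_k, bias):
    """Count distinct experts selected by per-token top-k over noisy scores."""
    active = set()
    for _ in range(num_tokens):
        scores = [random.gauss(0, 1) + bias[e] for e in range(num_experts)]
        ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
        active.update(ranked[:top_k])
    return len(active)

random.seed(0)
num_experts = 128
skewed_bias = [random.gauss(0, 3) for _ in range(num_experts)]  # rand()-like spread
uniform_bias = [0.1] * num_experts                              # ones() * 0.1
few = active_expert_count(1000, num_experts, 8, skewed_bias)
many = active_expert_count(1000, num_experts, 8, uniform_bias)
assert few < many  # skewed bias activates fewer distinct experts
```

Since the active-expert count now feeds directly into the bandwidth formula, a skewed data init would otherwise make the benchmark under-count the bytes loaded.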
450-457: LGTM! The effective byte calculations for FP4 formats correctly account for both the 4-bit data (0.5 bytes) and the block scale overhead (1/16 for nvfp4 with 16-element blocks, 1/32 for mxfp4 with 32-element blocks). This aligns with the quantization block sizes used elsewhere in the codebase.
504-563: LGTM! The new `_compute_routing_for_method` helper correctly handles DeepSeekV3 routing via `fused_topk_deepseek` with proper parameter validation, while falling back to simple top-k for other routing methods. The approach of computing routing on the host to count unique experts is the right fix for accurate bandwidth calculation.

One minor note: `topk_values` (line 544) is allocated but unused after the `fused_topk_deepseek` call. This is acceptable overhead for benchmark setup code since we only need the indices for counting unique experts.
664-674: LGTM! This is the core fix for the PR: computing routing on the host before benchmarking to determine actual expert assignments. The parameters correctly mirror those passed to the kernel, ensuring the bandwidth calculation reflects real kernel behavior.
869-874: LGTM! The bandwidth calculation now correctly uses:
- `nvfp4` format for accurate FP4 byte accounting
- `selected_experts.unique().numel()` to count actually activated experts
- Verbose flag propagation for debugging

This fixes the original issue where effective bandwidth was overstated by assuming all experts were active.
1232-1236: LGTM! The bandwidth calculation correctly uses the `variant` string ("base", "fp8", or "nvfp4") as the format specifier, which maps properly to the `get_effective_bytes` logic. The Cutlass path appropriately uses the existing `_compute_routing` since it doesn't require DeepSeekV3-specific routing.
1359-1369: LGTM! Consistent application of the routing computation fix for the FP8 block scale benchmark, ensuring accurate active expert counting.

1505-1506: LGTM! Active experts and verbose flag correctly propagated for the FP8 block scale bandwidth calculation.

1628-1638: LGTM! Consistent routing computation for the FP8 per-tensor scale benchmark.

1737-1738: LGTM! Active experts and verbose flag correctly propagated for the FP8 per-tensor scale bandwidth calculation.
```python
        return 0.5 + 1 / 16
    elif fmt == "mxfp4":
        return 0.5 + 1 / 32
    elif fmt == "fp8":
```
Is the weight for fp8 block-scaled? i.e. MXFP8?
I have not considered block-scale fp8 (DeepSeek style) yet. In that case, it should be 1 fp32 scale every 128x128 block for weight, and 1 fp32 scale every 128x1 for activation.
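Following that comment, a DeepSeek-style block-scaled fp8 entry could be accounted for as below. This is a hypothetical extension sketched from the reviewer's stated block shapes, not code in the PR:

```python
def fp8_block_scale_bytes_per_element(block_rows, block_cols):
    """One fp8 byte per element plus one fp32 (4-byte) scale per block."""
    return 1.0 + 4 / (block_rows * block_cols)

# Per the comment: 128x128 scale blocks for weights, 128x1 for activations.
weight_bytes = fp8_block_scale_bytes_per_element(128, 128)      # 1 + 4/16384
activation_bytes = fp8_block_scale_bytes_per_element(128, 1)    # 1 + 4/128
assert activation_bytes == 1.03125
```

The weight-side scale overhead is negligible (~0.02%), but the activation-side overhead (3.125%) is large enough to matter if block-scale fp8 support is added to the bandwidth accounting later.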
LGTM.

Routing is computed internally in fairly high precision (bf16 or fp32). Therefore, as long as the math is equivalent, I think we're good. I will let @ChristinaZ comment on whether that is really the case for routing + topK in trtllm_block_scale_moe.
📌 Description
The MoE benchmark script overestimates the number of bytes loaded by assuming all experts are active. I saw effective BW exceed 3x the peak BW of some systems as a result. The fix is to calculate the routed experts (`topk_ids`) on the host side and count the unique number of experts, following the same logic `cutlass_fused_moe` uses.

While investigating the above issue, I also found that initializing `routing_bias` with `rand()` results in a very skewed expert distribution (the repro cmd gives 18 active out of 128 experts). I'd like to change it to `ones()*0.1` for a smoother expert distribution (now giving 114 out of 128), while maintaining the same load/compute behavior in the kernels.

🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- Installed the hooks with `pre-commit install`.
- Ran `pre-commit run --all-files` and fixed any reported issues.
- Added or updated relevant tests (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit
New Features
Improvements
Tests