chore: MoE benchmark effective BW fix for trtllm_block_scale_moe #2341
yzh119 merged 4 commits into flashinfer-ai:main
Conversation
📝 Walkthrough: Added a new helper, `_compute_routing_for_method`, for computing routing on the host. Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes.
Summary of Changes

Hello @rosenrodt, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the Mixture-of-Experts (MoE) benchmarking framework by addressing inaccuracies in effective bandwidth calculations. It introduces a more granular approach to determining the memory footprint of different FP4 quantization schemes and ensures that the count of active experts, crucial for bandwidth metrics, is derived from the specific routing algorithm in use. These improvements lead to more robust and accurate performance assessments for MoE models.
Code Review
This pull request refactors the MoE benchmark to more accurately calculate the effective bandwidth, particularly for the trtllm_block_scale_moe routine. This is achieved by introducing a new function _compute_routing_for_method to determine the actual number of active experts based on the specific routing method, like DeepSeekV3. The changes are logical and improve the accuracy of the benchmark. I've added a few suggestions to improve code consistency and maintainability.
```python
# Compute selected experts for accurate bandwidth calculation
# Use the actual routing method to get correct expert assignments
selected_experts = _compute_routing_for_method(
    routing_logits=routing_logits,
    routing_bias=routing_bias,
    top_k=top_k,
    routing_method_type=routing_method_type,
    n_group=n_group,
    topk_group=topk_group,
    routed_scaling_factor=routed_scaling_factor,
)
```
This block of code to compute selected_experts is duplicated in testTrtllmFp4BlockScaleMoe (lines 664-674), testTrtllmFp8BlockScaleMoe (lines 1358-1368), and testTrtllmFp8PerTensorScaleMoe (lines 1626-1636). To improve maintainability and reduce redundancy, consider extracting this logic into a helper function. This would make the code cleaner and easier to update in the future.
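A simplified, self-contained sketch of what such an extracted helper could look like. The name `compute_active_experts` is illustrative, and plain top-k stands in for the full routing-method dispatch; the real helper would delegate to `_compute_routing_for_method` with the parameters shown in the duplicated block:

```python
def compute_active_experts(routing_logits, top_k):
    """Apply plain top-k routing per token and count distinct experts.

    routing_logits: list of per-token logit lists (num_tokens x num_experts).
    Returns the number of unique experts selected, which is the quantity
    the bandwidth formula needs.
    """
    active = set()
    for token_logits in routing_logits:
        # Rank expert indices by logit value, highest first
        ranked = sorted(
            range(len(token_logits)),
            key=lambda e: token_logits[e],
            reverse=True,
        )
        active.update(ranked[:top_k])
    return len(active)

# Two tokens over four experts with top_k=2 select experts {0, 2} and {1, 2}.
logits = [[0.9, 0.1, 0.5, 0.2], [0.1, 0.8, 0.7, 0.0]]
assert compute_active_experts(logits, top_k=2) == 3
```

Each test body would then reduce to a single call, so a future change to the routing logic only needs to be made in one place.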
Actionable comments posted: 0
🧹 Nitpick comments (3)
benchmarks/routines/moe.py (3)
1492-1505: Missing `verbose` parameter for consistency. The `calculate_moe_bandwidth` call here doesn't include `verbose=args.verbose`, unlike `testTrtllmFp4BlockScaleMoe` (line 873). This means `-vv` won't print the active expert count for this routine.

♻️ Suggested fix
```diff
 tb_per_sec = calculate_moe_bandwidth(
     num_tokens,
     hidden_size,
     intermediate_size,
     num_experts,
     top_k,
     median_time,
     input_dtype,
     weight_dtype,
     input_format="fp8",
     weight_format="fp8",
     routing_logits_dtype=routing_logits.dtype,
     active_experts=int(selected_experts.unique().numel()),
+    verbose=args.verbose,
 )
```
1723-1736: Missing `verbose` parameter for consistency. Same as `testTrtllmFp8BlockScaleMoe`, the `verbose` parameter is not passed here. For consistent behavior with `testTrtllmFp4BlockScaleMoe`, consider adding it.

♻️ Suggested fix
```diff
 tb_per_sec = calculate_moe_bandwidth(
     num_tokens,
     hidden_size,
     intermediate_size,
     num_experts,
     top_k,
     median_time,
     input_dtype,
     weight_dtype,
     input_format="fp8",
     weight_format="fp8",
     routing_logits_dtype=routing_logits.dtype,
     active_experts=int(selected_experts.unique().numel()),
+    verbose=args.verbose,
 )
```
1223-1236: Consider adding the `verbose` parameter here as well. For consistency with `testTrtllmFp4BlockScaleMoe`, this call could also pass `verbose=args.verbose` to enable `-vv` debug output for the active expert count.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
benchmarks/routines/moe.py
🧰 Additional context used
🧬 Code graph analysis (1)
benchmarks/routines/moe.py (3)
flashinfer/fused_moe/fused_routing_dsv3.py (1)
- `fused_topk_deepseek` (119-194)

flashinfer/fused_moe/core.py (1)
- `RoutingMethodType` (61-75)

csrc/trtllm_fused_moe_kernel_launcher.cu (13)
- `routing_bias` (158-164), `routing_logits` (147-155), `args` (142-144, 419-428, 419-421, 536-558, 536-538, 728-752, 728-730, 959-978, 959-960, 1133-1160, 1133-1136)
🪛 Ruff (0.14.10)
benchmarks/routines/moe.py
535-537: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (7)
benchmarks/routines/moe.py (7)
16-18: LGTM! The new imports for `fused_topk_deepseek` and `RoutingMethodType` are correctly added and align with their usage in `_compute_routing_for_method`.
321-324: LGTM! Using uniform routing bias (0.1) instead of random normal values creates a more consistent expert distribution across benchmark runs, which is appropriate for reproducible performance measurements.
504-564: Well-structured routing helper function. The function correctly handles DeepSeekV3 routing using the specialized `fused_topk_deepseek` kernel and falls back to simple top-k for other routing methods. The parameter validation for DeepSeekV3 is thorough.

One note: the comment on lines 559-561 acknowledges that Llama4 routing is approximated with simple top-k, which is acceptable for bandwidth estimation in benchmarks but worth keeping in mind if benchmark accuracy becomes critical for that routing method.
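For context, the DeepSeekV3 path restricts top-k selection to the best expert groups before picking experts. The pure-Python model below sketches that group-limited selection only; scoring and bias details of the real `fused_topk_deepseek` kernel are omitted, and group scoring by the single strongest expert is a simplification:

```python
def grouped_topk(scores, n_group, topk_group, top_k):
    """Group-limited top-k: rank expert groups by their strongest expert,
    keep the topk_group best groups, then take the global top-k among the
    experts of the surviving groups. Returns selected expert indices."""
    num_experts = len(scores)
    group_size = num_experts // n_group
    # Score each group by its best expert
    group_best = [
        max(scores[g * group_size:(g + 1) * group_size]) for g in range(n_group)
    ]
    kept = sorted(range(n_group), key=lambda g: group_best[g], reverse=True)[:topk_group]
    # Only experts in surviving groups are eligible for final top-k
    candidates = [e for g in kept for e in range(g * group_size, (g + 1) * group_size)]
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]

# 8 experts in 4 groups; only the 2 best groups survive, then top-2 overall.
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.0, 0.4]
assert grouped_topk(scores, n_group=4, topk_group=2, top_k=2) == [1, 3]
```

The group restriction is why plain top-k is only an approximation for this routing method: an expert with a high score in a discarded group is never selected.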
664-674: Good: Accurate expert selection for bandwidth calculation. Using the actual routing method to compute selected experts ensures the bandwidth calculation reflects the true number of active experts rather than a theoretical estimate.
869-874: LGTM! The bandwidth calculation now correctly uses:
- `nvfp4` format for proper byte accounting
- Actual active expert count from routing
- Verbose flag for debug output
1232-1236: LGTM! Using the variant string directly as the format correctly handles:
- `"base"`: falls through to `dtype.itemsize` (unquantized)
- `"fp8"`: returns 1.0 byte
- `"nvfp4"`: returns 0.5 + 1/16 bytes

The `active_experts` calculation correctly uses the unique expert count from routing.
450-457: The nvfp4 and mxfp4 byte calculations are correct and verified:
- nvfp4: 0.5 + 1/16 = 0.5625 bytes/element (4-bit values + scale factor for 16-element blocks)
- mxfp4: 0.5 + 1/32 = 0.53125 bytes/element (4-bit values + scale factor for 32-element blocks)
The scale factor overheads match the flashinfer quantization block sizes used in the codebase. No issues found.
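The byte accounting verified above can be expressed as a small lookup. This is a sketch mirroring the numbers in the review, not necessarily the exact shape of the benchmark's `get_effective_bytes`:

```python
def effective_bytes_per_element(fmt, dtype_itemsize=2.0):
    """Effective bytes moved per element for each weight/activation format.

    FP4 formats store 4-bit values (0.5 bytes) plus one 1-byte scale per
    quantization block: 16-element blocks for nvfp4, 32-element blocks
    for mxfp4.
    """
    if fmt == "nvfp4":
        return 0.5 + 1 / 16   # 0.5625 bytes/element
    if fmt == "mxfp4":
        return 0.5 + 1 / 32   # 0.53125 bytes/element
    if fmt == "fp8":
        return 1.0
    return dtype_itemsize     # "base": unquantized dtype size

assert effective_bytes_per_element("nvfp4") == 0.5625
assert effective_bytes_per_element("mxfp4") == 0.53125
```

Multiplying these per-element costs by the weight and activation element counts gives the total bytes moved, the numerator of the effective-bandwidth formula.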
Actionable comments posted: 0
🧹 Nitpick comments (1)
benchmarks/routines/moe.py (1)
438-438: Consider documenting the new `verbose` parameter. The `verbose` parameter was added to `calculate_moe_bandwidth` but the docstring (lines 443-447) wasn't updated to document it. Consider adding a brief description for completeness.

📝 Suggested docstring addition
```diff
     Args:
         input_format: Override for input representation; None uses dtype.itemsize
         weight_format: Override for weight representation; None uses dtype.itemsize
         routing_logits_dtype: Dtype for routing logits memory accounting (default float32)
+        active_experts: Number of active experts; if None, estimated as min(num_experts, top_k * num_tokens)
+        verbose: Verbosity level for debug output (0=quiet, 2=print active expert count)
     """
```
🔇 Additional comments (11)
benchmarks/routines/moe.py (11)
16-18: LGTM! The new imports for `fused_topk_deepseek` and `RoutingMethodType` are necessary for the new routing computation helper and are correctly placed with the related fused_moe imports.
321-324: LGTM! Switching from random normal to uniform bias (0.1) is a sensible change for benchmarking. Random routing bias can produce highly skewed expert distributions that aren't representative of real workloads, making benchmark results harder to interpret and compare.
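The skew is easy to reproduce with a toy model. The simulation below is illustrative only (not the benchmark's actual data init): it adds a per-expert bias to Gaussian logits and counts distinct experts reached by top-k routing. A widely spread random bias funnels traffic to a few high-bias experts, while a uniform 0.1 bias lets the logit noise spread tokens across all experts:

```python
import random

def active_expert_count(num_tokens, num_experts, top_k, bias):
    """Count distinct experts selected by per-token top-k over noisy scores."""
    active = set()
    for _ in range(num_tokens):
        scores = [random.gauss(0, 1) + bias[e] for e in range(num_experts)]
        ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
        active.update(ranked[:top_k])
    return len(active)

random.seed(0)
num_experts = 128
skewed_bias = [random.gauss(0, 3) for _ in range(num_experts)]  # rand()-like spread
uniform_bias = [0.1] * num_experts                              # ones() * 0.1
few = active_expert_count(1000, num_experts, 8, skewed_bias)
many = active_expert_count(1000, num_experts, 8, uniform_bias)
assert few < many  # skewed bias activates fewer distinct experts
```

Since the active-expert count now feeds directly into the bandwidth formula, a skewed data init would otherwise make the benchmark under-count the bytes loaded.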
450-457: LGTM! The effective byte calculations for FP4 formats correctly account for both the 4-bit data (0.5 bytes) and the block scale overhead (1/16 for nvfp4 with 16-element blocks, 1/32 for mxfp4 with 32-element blocks). This aligns with the quantization block sizes used elsewhere in the codebase.
504-563: LGTM! The new `_compute_routing_for_method` helper correctly handles DeepSeekV3 routing via `fused_topk_deepseek` with proper parameter validation, while falling back to simple top-k for other routing methods. The approach of computing routing on the host to count unique experts is the right fix for accurate bandwidth calculation.

One minor note: `topk_values` (line 544) is allocated but unused after the `fused_topk_deepseek` call. This is acceptable overhead for benchmark setup code since we only need the indices for counting unique experts.
664-674: LGTM! This is the core fix for the PR: computing routing on the host before benchmarking to determine actual expert assignments. The parameters correctly mirror those passed to the kernel, ensuring the bandwidth calculation reflects real kernel behavior.
869-874: LGTM! The bandwidth calculation now correctly uses:
- `nvfp4` format for accurate FP4 byte accounting
- `selected_experts.unique().numel()` to count actually activated experts
- Verbose flag propagation for debugging

This fixes the original issue where effective bandwidth was overstated by assuming all experts were active.
1232-1236: LGTM! The bandwidth calculation correctly uses the `variant` string ("base", "fp8", or "nvfp4") as the format specifier, which maps properly to the `get_effective_bytes` logic. The Cutlass path appropriately uses the existing `_compute_routing` since it doesn't require DeepSeekV3-specific routing.
1359-1369: LGTM! Consistent application of the routing computation fix for the FP8 block scale benchmark, ensuring accurate active expert counting.

1505-1506: LGTM! Active experts and verbose flag correctly propagated for the FP8 block scale bandwidth calculation.

1628-1638: LGTM! Consistent routing computation for the FP8 per-tensor scale benchmark.

1737-1738: LGTM! Active experts and verbose flag correctly propagated for the FP8 per-tensor scale bandwidth calculation.
```python
        return 0.5 + 1 / 16
    elif fmt == "mxfp4":
        return 0.5 + 1 / 32
    elif fmt == "fp8":
```
Is the weight for fp8 block-scaled? i.e. MXFP8?
I have not considered block-scale fp8 (DeepSeek style) yet. In that case, it should be 1 fp32 scale every 128x128 block for weight, and 1 fp32 scale every 128x1 for activation.
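Following that comment, a DeepSeek-style block-scaled fp8 entry could be accounted for as below. This is a hypothetical extension sketched from the reviewer's stated block shapes, not code in the PR:

```python
def fp8_block_scale_bytes_per_element(block_rows, block_cols):
    """One fp8 byte per element plus one fp32 (4-byte) scale per block."""
    return 1.0 + 4 / (block_rows * block_cols)

# Per the comment: 128x128 scale blocks for weights, 128x1 for activations.
weight_bytes = fp8_block_scale_bytes_per_element(128, 128)      # 1 + 4/16384
activation_bytes = fp8_block_scale_bytes_per_element(128, 1)    # 1 + 4/128
assert activation_bytes == 1.03125
```

The weight-side scale overhead is negligible (~0.02%), but the activation-side overhead (3.125%) is large enough to matter if block-scale fp8 support is added to the bandwidth accounting later.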
LGTM.

Routing is computed internally in fairly high precision (bf16 or fp32). Therefore, as long as the math is equivalent, I think we're good. I will let @ChristinaZ comment on whether that is really the case for routing + topK in trtllm_block_scale_moe.
📌 Description
The MoE benchmark script overestimates the number of bytes loaded by assuming all experts are active. I saw effective BW exceed 3x the peak BW of some systems as a result. The fix is to calculate the routed experts (`topk_ids`) on the host side and count the unique number of experts, following the same logic `cutlass_fused_moe` uses.

While investigating the above issue, I also found that initializing `routing_bias` with `rand()` results in a very skewed expert distribution (the repro cmd gives 18 active out of 128 experts). I'd like to change it to `ones()*0.1` for a smoother expert distribution (now giving 114 out of 128), while maintaining the same load/compute behavior in the kernels.

🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- Installed the hooks with `pre-commit install`.
- Ran `pre-commit run --all-files` and fixed any reported issues.
- Added or updated relevant tests (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit
New Features
Improvements
Tests