chore: update benchmark scripts; fix trtllm-gen moe comments #2412
IwakuraRein merged 11 commits into flashinfer-ai:main from
Conversation
📝 Walkthrough

Adds MXInt4 quantization and an MXInt4xBf16 autotuner benchmark path to the fused MoE benchmarking script, refactors FP8/FP4 bench flows to use `functools.partial` with `input_kwargs`, extends CLI quant-mode choices, and clarifies BlockMajorK weight-layout shapes in fused_moe/core docstrings.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI/Runner
    participant Bench as Bench Script
    participant Quant as MXInt4 Quantizer
    participant TRTL as trtllm_mxint4_block_scale_moe
    participant GPU as GPU/Autotuner
    CLI->>Bench: invoke with quant_mode=MxInt4xBf16
    Bench->>Quant: mxint4_quantize(input_tensor)
    Quant-->>Bench: quantized_tensor + scales
    Bench->>TRTL: call trtllm_mxint4_block_scale_moe(quantized_tensor, scales, routing_bias, ...)
    TRTL->>GPU: submit kernels / autotune runs
    GPU-->>TRTL: perf metrics
    TRTL-->>Bench: results
    Bench->>CLI: report timings / best config
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Summary of Changes

Hello @IwakuraRein, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request expands the benchmarking suite by introducing a new script for FP4 matrix multiplication, which now includes the Cutlass backend for performance evaluation. It also enhances the existing Mixture-of-Experts (MoE) benchmark script to incorporate MXINT4 quantization, allowing for more comprehensive performance analysis of different quantization strategies. These additions aim to provide better insights into the efficiency of various low-precision arithmetic implementations.
Code Review

This pull request updates benchmark scripts for MoE and adds a new benchmark for FP4 matrix multiplication. The changes are primarily within benchmark files. I've identified a few issues, including a potential bug in the new `mxint4_quantize` function and an incorrect return type in `bench_mm_fp4.py`. My comments provide suggestions to address these points.
benchmarks/bench_mm_fp4.py (outdated)

```python
if not mm_fp4.is_backend_supported(backend, compute_capability_number):
    print(
        f"Skipping test for {backend} because it is not supported on compute capability {compute_capability_number}."
    )
    return
```
The function `_bench_mm_fp4` is type-hinted to return a `tuple[float, float]`, but the early `return` statements on lines 39, 44, 47, 50, and 53 return `None`. This will lead to a `TypeError` at the call site when attempting to unpack the `None` value. To ensure type consistency and prevent runtime errors, the function should always return a tuple of two floats. For skipped tests, you could return a sentinel value like `(float('inf'), 0.0)`.
Suggested change:

```python
print(
    f"Skipping test for {backend} because it is not supported on compute capability {compute_capability_number}."
)
return float('inf'), 0.0
```

```python
scales = amax / 8.0
x_scaled = x_reshaped * scales.reciprocal()
x_int8 = (
    x_scaled.round().clamp(-8, 7).to(torch.uint8).reshape(-1, sf_vec_size // 2, 2)
```
Casting the float tensor from `x_scaled.round().clamp(-8, 7)` directly to `torch.uint8` will cause all negative values to be clamped to 0, leading to incorrect quantization. To preserve the negative values, you should first cast to `torch.int8` and then use `.view(torch.uint8)` to reinterpret the bits for the subsequent bitwise packing operations.
Suggested change:

```diff
-    x_scaled.round().clamp(-8, 7).to(torch.uint8).reshape(-1, sf_vec_size // 2, 2)
+    x_scaled.round().clamp(-8, 7).to(torch.int8).view(torch.uint8).reshape(-1, sf_vec_size // 2, 2)
```
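The bit-level reason the `int8 → view(uint8)` route works can be sketched without torch; `pack_int4_pair`/`unpack_int4_pair` below are illustrative helpers, not part of the benchmark:

```python
def pack_int4_pair(lo: int, hi: int) -> int:
    """Pack two signed 4-bit values in [-8, 7] into one byte.

    Masking with 0x0F keeps each value's two's-complement low nibble,
    which is what reinterpreting int8 bits as uint8 preserves; a
    value-converting cast of a negative float to uint8 would not.
    """
    assert -8 <= lo <= 7 and -8 <= hi <= 7
    return (lo & 0x0F) | ((hi & 0x0F) << 4)

def unpack_int4_pair(b: int) -> tuple[int, int]:
    """Recover the two signed 4-bit values from a packed byte."""
    def sign_extend(nibble: int) -> int:
        return nibble - 16 if nibble >= 8 else nibble
    return sign_extend(b & 0x0F), sign_extend(b >> 4)

packed = pack_int4_pair(-3, 7)   # -3 -> nibble 0b1101, 7 -> 0b0111
assert unpack_int4_pair(packed) == (-3, 7)  # sign survives the round trip
```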
```python
return x_int4.reshape(*x.shape[:-1], x.shape[-1] // 2), scales.reshape(
    -1, sf_vec_size
)
```
The `scales.reshape(-1, sf_vec_size)` operation can fail if the number of elements in `scales` is not a multiple of `sf_vec_size`. Since the caller of this function already reshapes the returned scales tensor, it's safer to remove this intermediate reshape and return the `scales` tensor directly. This avoids a potential runtime error and simplifies the function.
Suggested change:

```python
return x_int4.reshape(*x.shape[:-1], x.shape[-1] // 2), scales
```
benchmarks/bench_mm_fp4.py (outdated)

```python
print("mx_fp4 is only supported for cudnn and auto backends")
return
```

```python
input = torch.randn([m, k], device="cuda", dtype=torch.bfloat16)
```
The variable name `input` shadows the Python built-in function `input()`. This is generally discouraged as it can lead to confusion and potential bugs if the built-in is needed later in the scope. Consider renaming it to something more descriptive, like `input_tensor`, and applying this change to all its occurrences.
Suggested change:

```python
input_tensor = torch.randn([m, k], device="cuda", dtype=torch.bfloat16)
```

```python
x_reshaped = x.reshape(-1, sf_vec_size)
x_max = x_reshaped.max(dim=-1, keepdim=True)[0].to(torch.float32)
x_min = x_reshaped.min(dim=-1, keepdim=True)[0].to(torch.float32)
x_max = x_max * 8.0 / 7.0
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@benchmarks/bench_mm_fp4.py`:
- Around lines 38-55: The function `_bench_mm_fp4` currently uses early bare returns that yield `None` while its signature declares `-> tuple[float, float]`; update every early-exit path inside `_bench_mm_fp4` (the branches printing "Skipping test..." and the trtllm/mx_fp4 checks) to return a consistent tuple of floats (e.g., `(0.0, 0.0)` or `(float("nan"), float("nan"))`), so callers like `ms, tflops = _bench_mm_fp4(...)` can always unpack safely; keep the existing print/log lines and replace each plain return with the chosen float tuple.
In `@benchmarks/bench_trtllm_gen_fused_moe_autotuner.py`:
- Around lines 392-422: The partial for `trtllm_mxint4_block_scale_moe` binds `hidden_states`, and `input_kwargs` also includes `"hidden_states"`, causing a duplicate-argument `TypeError`; remove the `hidden_states` binding from the partial call (leave it to be passed via `input_kwargs`) so `trtllm_mxint4_block_scale_moe` is only given `hidden_states` once when calling `fn(**input_kwargs)`.
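The duplicate-argument failure mode described above can be reproduced with any function; `moe_stub` below is a hypothetical stand-in for `trtllm_mxint4_block_scale_moe`, assuming the partial bound `hidden_states` positionally:

```python
from functools import partial

def moe_stub(hidden_states, scales):
    """Hypothetical stand-in for trtllm_mxint4_block_scale_moe."""
    return len(hidden_states), len(scales)

# Binding hidden_states positionally in the partial while input_kwargs
# also carries "hidden_states" supplies the argument twice at call time.
fn = partial(moe_stub, [1.0, 2.0])
input_kwargs = {"hidden_states": [1.0, 2.0], "scales": [0.5]}
try:
    fn(**input_kwargs)
    raised = False
except TypeError:  # got multiple values for argument 'hidden_states'
    raised = True
assert raised

# The fix: drop the binding and let input_kwargs provide it once.
fn = partial(moe_stub)
assert fn(**input_kwargs) == (2, 1)
```

Note that keyword-bound partial arguments are silently overridden by call-time keywords, so the `TypeError` only surfaces with positional bindings like the one sketched here.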
🧹 Nitpick comments (2)
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (2)
34-50: Minor dtype difference from test implementation.

Line 45 uses `torch.uint8` for the intermediate `x_int8`, while the test implementation at `tests/moe/test_trtllm_gen_fused_moe.py:590-606` uses `torch.int8`. Both produce the same packed result due to the bitwise masking (`& 0x0F`), so this is functionally equivalent.

The scale tensor reshape at lines 48-50 returns shape `(-1, sf_vec_size)`, which appears to be inconsistent with how it's later reshaped at lines 380-384 and 386-390. The intermediate reshape seems unnecessary since `scales` is already the correct shape from line 42.

✨ Optional: simplify scale return

```diff
-    return x_int4.reshape(*x.shape[:-1], x.shape[-1] // 2), scales.reshape(
-        -1, sf_vec_size
-    )
+    return x_int4.reshape(*x.shape[:-1], x.shape[-1] // 2), scales.squeeze(-1)
```

This would return scales with shape `(-1,)`, which is then reshaped at the call site anyway.
368-368: Minor: inconsistent device specification.

Line 368 uses `device="cuda"` while line 365 defines `device = torch.device("cuda:0")`. For consistency with the rest of the function, use the `device` variable.

✨ Proposed fix

```diff
-    routing_bias = torch.randn(num_experts, device="cuda", dtype=torch.bfloat16)
+    routing_bias = torch.randn(num_experts, device=device, dtype=torch.bfloat16)
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@benchmarks/bench_trtllm_gen_fused_moe_autotuner.py`:
- Around line 34-50: The mxint4_quantize function uses torch.uint8 when
converting rounded/clamped values but the clamped range is signed (−8..7);
change the cast from torch.uint8 to torch.int8 in mxint4_quantize (the block
that creates x_int8) so the tensor uses a signed dtype consistent with the test
implementation, keep the subsequent bitwise masking and packing logic unchanged
and ensure scales reshape remains the same.
🧹 Nitpick comments (4)
benchmarks/bench_mm_fp4.py (2)
114-114: FIXME comment indicates a known issue with Cupti.

The comment notes that Cupti causes a CUDA Illegal Memory Access. Consider opening an issue to track this problem so it can be properly investigated and resolved.
Would you like me to help draft an issue to track this Cupti-related bug?
150-159: Consider exposing `fp4_type` and `res_dtype` as CLI arguments.

The benchmark currently hardcodes `"nvfp4"` for `fp4_type` and `torch.bfloat16` for `res_dtype`, but `_bench_mm_fp4` supports other values like `"mxfp4"`, `"mxfp4_alpha"`, and `torch.float16`. Exposing these as CLI arguments would allow benchmarking all supported configurations.

♻️ Proposed enhancement

```diff
 parser.add_argument(
     "--backend", type=str, nargs="+", default=["cudnn", "trtllm", "cutlass"]
 )
+parser.add_argument(
+    "--fp4-type", type=str, default="nvfp4", choices=["nvfp4", "mxfp4", "mxfp4_alpha"]
+)
 args = parser.parse_args()
 for m, n, k in product(args.m, args.n, args.k):
     print(f"m={m}, n={n}, k={k}".center(100, "-"))
     for backend in args.backend:
         print(f"  {backend}:")
         ms, tflops = _bench_mm_fp4(
-            m, n, k, torch.bfloat16, backend, True, "nvfp4", False
+            m, n, k, torch.bfloat16, backend, True, args.fp4_type, False
         )
```

benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (2)
354-364: Unused parameter `quant_mode` could be validated or documented.

The `quant_mode` parameter is declared but never used in the function body. Since this function only supports `"MxInt4xBf16"`, consider either:

- Adding an assertion to validate the expected mode
- Removing the parameter if it's only for signature consistency

♻️ Option 1: Add validation

```diff
 def bench_trtllm_gen_fused_moe_autotuner_mxint4(
     tune_max_num_tokens: Optional[int],
     quant_mode: Literal["MxInt4xBf16"],
     ...
 ):
+    assert quant_mode == "MxInt4xBf16", f"Unsupported quant_mode: {quant_mode}"
     device = torch.device("cuda:0")
```
484-490: Consider dict-based dispatch for better maintainability.

The nested ternary works but may become harder to maintain as more quantization modes are added. A dictionary-based dispatch would be clearer.

♻️ Proposed refactor

```diff
-    fn = (
-        bench_trtllm_gen_fused_moe_autotuner_fp8
-        if args.quant_mode in ["Fp8-Per-Tensor", "Fp8-Block"]
-        else bench_trtllm_gen_fused_moe_autotuner_mxint4
-        if args.quant_mode == "MxInt4xBf16"
-        else bench_trtllm_gen_fused_moe_autotuner_fp4
-    )
+    dispatch = {
+        "Fp8-Per-Tensor": bench_trtllm_gen_fused_moe_autotuner_fp8,
+        "Fp8-Block": bench_trtllm_gen_fused_moe_autotuner_fp8,
+        "MxInt4xBf16": bench_trtllm_gen_fused_moe_autotuner_mxint4,
+        "NvFP4xNvFP4": bench_trtllm_gen_fused_moe_autotuner_fp4,
+        "MxFP4xMxFP8": bench_trtllm_gen_fused_moe_autotuner_fp4,
+        "MxFP4xBf16": bench_trtllm_gen_fused_moe_autotuner_fp4,
+    }
+    fn = dispatch[args.quant_mode]
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@benchmarks/bench_trtllm_gen_fused_moe_autotuner.py`:
- Around line 354-364: The function bench_trtllm_gen_fused_moe_autotuner_mxint4
declares an unused parameter quant_mode which triggers an ARG001 lint; rename
quant_mode to _quant_mode in the function signature so the parameter intent is
clear and Ruff treats it as intentionally unused, and update any callers or
references if they exist to match the new parameter name.
```python
def bench_trtllm_gen_fused_moe_autotuner_mxint4(
    tune_max_num_tokens: Optional[int],
    quant_mode: Literal["MxInt4xBf16"],
    num_tokens: int,
    num_experts: int,
    hidden_size: int,
    intermediate_size: int,
    top_k: int,
    warmups: int,
    iterations: int,
):
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# First, let's look at the full function to see if quant_mode is used anywhere
# Search for the function and capture its content
sed -n '354,400p' benchmarks/bench_trtllm_gen_fused_moe_autotuner.py
```

Repository: flashinfer-ai/flashinfer
Length of output: 1557
🏁 Script executed:

```shell
#!/bin/bash
# Find where the function ends and check for quant_mode references within it
# First, find the line count of the file
wc -l benchmarks/bench_trtllm_gen_fused_moe_autotuner.py
# Then search for the function and any references to quant_mode
rg -A 150 "def bench_trtllm_gen_fused_moe_autotuner_mxint4\(" benchmarks/bench_trtllm_gen_fused_moe_autotuner.py | head -200
```

Repository: flashinfer-ai/flashinfer
Length of output: 4806
Rename the unused `quant_mode` parameter to `_quant_mode` to satisfy Ruff (ARG001). The parameter is declared but never referenced in the function body.
🔧 Proposed fix

```diff
 def bench_trtllm_gen_fused_moe_autotuner_mxint4(
     tune_max_num_tokens: Optional[int],
-    quant_mode: Literal["MxInt4xBf16"],
+    _quant_mode: Literal["MxInt4xBf16"],
     num_tokens: int,
```

🧰 Tools
🪛 Ruff (0.14.14)
[warning] 356-356: Unused function argument: quant_mode (ARG001)
🤖 Prompt for AI Agents
In `@benchmarks/bench_trtllm_gen_fused_moe_autotuner.py` around lines 354-364, the function `bench_trtllm_gen_fused_moe_autotuner_mxint4` declares an unused parameter `quant_mode`, which triggers an ARG001 lint; rename `quant_mode` to `_quant_mode` in the function signature so the parameter intent is clear and Ruff treats it as intentionally unused, and update any callers or references if they exist to match the new parameter name.
bkryu left a comment:
Hi @IwakuraRein, today the flashinfer_benchmark.py supports benchmarking mm_fp4. Example commands are:

```shell
flashinfer_benchmark.py --routine mm_fp4 --m 256 --n 1024 --k 7168 --out_dtype bfloat16 --backends cudnn cutlass trtllm --use_128x4_sf_layout --use_nvfp4 --refcheck -vv --use_cupti
flashinfer_benchmark.py --routine mm_fp4 --m 64 --n 8192 --k 2048 --out_dtype bfloat16 --backends cudnn cutlass trtllm --use_128x4_sf_layout --use_nvfp4 --autotune --refcheck -vv --use_cupti
```

It seems like bench_mm_fp4.py is a recreation of the functionality without much being added. Is there a reason to create a separate script for it?
Thanks. I didn't realize there is a benchmark for fp4. Let me remove the `bench_mm_fp4.py` script.
/bot run

[CANCELING] Pipeline #43286865: canceled
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)
1-5: ⚠️ Potential issue | 🟡 Minor

Pipeline failures: pre-commit hooks need to be re-run.

The CI shows two pre-commit hook failures:

- `end-of-file-fixer` modified files (likely missing newline at EOF)
- `ruff-format` reformatted 1 file

Please run `pre-commit run --all-files` and commit the changes to fix these issues.
🤖 Fix all issues with AI agents
In `@benchmarks/bench_trtllm_gen_fused_moe_autotuner.py`:
- Around line 374-375: The code creates routing_bias with a hardcoded
device="cuda" which is inconsistent with the local device variable (e.g.,
torch.device("cuda:0")) and can break multi-GPU runs; update the routing_bias
creation to use the same device variable (routing_bias =
torch.randn(num_experts, device=device, dtype=...) ) so both routing_logits and
routing_bias use the unified device, ensuring consistent behaviour across GPUs
and matching dtype handling with routing_logits where needed.
🧹 Nitpick comments (1)

benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)

500-506: Consider dict-based dispatch for extensibility.

The nested ternary works but becomes harder to maintain as more quant modes are added. A dict mapping would be cleaner.

♻️ Optional refactor using dict dispatch

```python
bench_functions = {
    "Fp8-Per-Tensor": bench_trtllm_gen_fused_moe_autotuner_fp8,
    "Fp8-Block": bench_trtllm_gen_fused_moe_autotuner_fp8,
    "MxInt4xBf16": bench_trtllm_gen_fused_moe_autotuner_mxint4,
}
fn = bench_functions.get(args.quant_mode, bench_trtllm_gen_fused_moe_autotuner_fp4)
```
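A minimal runnable version of the same `.get()`-with-default dispatch pattern; the stub bench functions stand in for the real ones:

```python
def bench_fp8():
    return "fp8"

def bench_mxint4():
    return "mxint4"

def bench_fp4():
    return "fp4"

# Explicit modes map to their bench function; FP4 variants fall
# through to the .get() default, matching the nested-ternary behavior.
bench_functions = {
    "Fp8-Per-Tensor": bench_fp8,
    "Fp8-Block": bench_fp8,
    "MxInt4xBf16": bench_mxint4,
}

assert bench_functions.get("Fp8-Block", bench_fp4)() == "fp8"
assert bench_functions.get("MxInt4xBf16", bench_fp4)() == "mxint4"
assert bench_functions.get("NvFP4xNvFP4", bench_fp4)() == "fp4"
```

One trade-off: the `.get()` default silently accepts unknown modes, whereas an explicit dict lookup (`dispatch[args.quant_mode]`) raises `KeyError` on a typo.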
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@benchmarks/bench_trtllm_gen_fused_moe_autotuner.py`:
- Around line 28-51: The mxint4_quantize function lacks validation for the
input's last-dimension divisibility by sf_vec_size and can produce inf/NaN when
blocks are zero; add an input check at the start of mxint4_quantize that raises
a clear ValueError if x.shape[-1] % sf_vec_size != 0, then after computing amax
ensure you clamp amax to a small epsilon (e.g., torch.finfo(x.dtype).tiny or
1e-8) before computing scales = amax / 8.0 so scales.reciprocal() cannot produce
infinities; keep the rest of the existing reshaping/packing logic and return
types as-is, referencing mxint4_quantize, x_reshaped, amax, scales, and x_int4
to locate the changes.
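A sketch of the two guards the prompt asks for; `checked_block_scales` is a hypothetical helper operating on plain floats, not the torch implementation:

```python
def checked_block_scales(x_last_dim, block_amaxes, sf_vec_size=32, eps=1e-8):
    """Sketch of the requested guards for mxint4_quantize:
    reject non-divisible shapes up front, and clamp each block amax
    so the reciprocal scale can never be infinite."""
    if x_last_dim % sf_vec_size != 0:
        raise ValueError(
            f"last dim {x_last_dim} must be divisible by sf_vec_size={sf_vec_size}"
        )
    return [max(a, eps) / 8.0 for a in block_amaxes]

scales = checked_block_scales(64, [0.0, 4.0])
assert all(s > 0.0 for s in scales)                 # zero block no longer yields scale 0
assert all(1.0 / s != float("inf") for s in scales)  # reciprocal stays finite

try:
    checked_block_scales(65, [1.0])
    raised = False
except ValueError:
    raised = True
assert raised  # non-divisible last dim is rejected with a clear message
```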
📌 Description
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used my preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed (e.g., `unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit
New Features
Documentation