
tests: add bias testing to nvfp4 moe #2585

Merged
jimmyzho merged 5 commits into flashinfer-ai:main from jimmyzho:bias
Feb 20, 2026

Conversation

@jimmyzho
Contributor

@jimmyzho jimmyzho commented Feb 18, 2026

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Tests

    • Added a test exercising GEMM bias combinations for low-precision (FP4) mixture-of-experts (bias on GEMM1, GEMM2, or both).
    • Extended the test harness to validate GEMM biases across production, CUDA-graph, and reference/dequant paths.
  • Refactor

    • Extended argument/config surfaces to accept and propagate GEMM biases through all runtime and reference paths, ensuring biases are applied where relevant.

@gemini-code-assist
Contributor

Summary of Changes

Hello @jimmyzho, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the NVFP4 Mixture of Experts (MoE) implementation by adding comprehensive bias support. The changes involve modifying the argument structures to accommodate bias terms for both GEMM operations within the MoE layer and updating the dequantization and reference computation logic to correctly apply these biases. The primary impact is the introduction of new test cases that verify the accuracy of the NVFP4 MoE kernel when biases are present, ensuring the robustness and correctness of the MoE functionality under these conditions.

Highlights

  • Bias Parameter Integration: Added gemm1_bias and gemm2_bias parameters to the moe_args and moe_args_dequant classes to support bias terms in Mixture of Experts (MoE) operations.
  • Bias Application Logic: Implemented conditional logic within the run_moe_dequant function to apply gemm1_bias and gemm2_bias during the respective GEMM operations if they are provided.
  • Reference Function Updates: Updated several reference functions (run_moe_reference_fp4, dequant_reference_dsfp8, run_moe_reference_per_tensor_scale_fp8, run_moe_reference_bf16, dequantize) to correctly pass the new bias parameters to the dequantization process.
  • New Bias Test Cases: Introduced a new helper function _run_fp4_moe_with_bias and three new parameterized test functions (test_nvfp4_moe_gemm2_bias, test_nvfp4_moe_gemm1_bias, test_nvfp4_moe_both_biases) to thoroughly validate the bias functionality for NVFP4 MoE layers.
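The conditional bias application described in the highlights above can be sketched as follows. This is a minimal NumPy stand-in, not the actual run_moe_dequant API: the function name, shapes, and call pattern here are illustrative assumptions only.

```python
import numpy as np

# Illustrative sketch: a reference GEMM that applies an optional bias only
# when one is provided, mirroring the conditional logic described above.
def gemm_with_optional_bias(x, weight, bias=None):
    # x: [tokens, in_dim], weight: [out_dim, in_dim], bias: [out_dim] or None
    out = x @ weight.T
    if bias is not None:
        out = out + bias  # broadcast the per-channel bias over all tokens
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((6, 8))
b = rng.standard_normal(6)

baseline = gemm_with_optional_bias(x, w)        # bias omitted -> plain GEMM
biased = gemm_with_optional_bias(x, w, bias=b)  # bias applied conditionally
assert np.allclose(biased - baseline, np.broadcast_to(b, (4, 6)))
```

The same pattern extends naturally to per-expert biases by indexing the bias tensor with the expert id before the add.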


Changelog
  • tests/moe/test_trtllm_gen_fused_moe.py
    • Added gemm1_bias and gemm2_bias parameters to moe_args and moe_args_dequant constructors.
    • Modified run_moe_dequant to apply gemm1_bias and gemm2_bias if provided.
    • Updated run_moe_reference_fp4, dequant_reference_dsfp8, run_moe_reference_per_tensor_scale_fp8, run_moe_reference_bf16, and dequantize to pass bias parameters.
    • Introduced _run_fp4_moe_with_bias helper function for bias testing.
    • Added test_nvfp4_moe_gemm2_bias, test_nvfp4_moe_gemm1_bias, and test_nvfp4_moe_both_biases for comprehensive bias validation.

@coderabbitai
Contributor

coderabbitai bot commented Feb 18, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Walkthrough

Adds runtime-configurable GEMM bias support for FP4 MoE by introducing gemm1_bias and gemm2_bias, wiring them through argument containers, runtime/config propagation, CUDA-graph invocation, and reference/dequant paths; includes tests exercising bias combinations.

Changes

Cohort / File(s): FP4 MoE bias wiring & tests — tests/moe/test_trtllm_gen_fused_moe.py
Summary: added public fields gemm1_bias and gemm2_bias to moe_args and moe_args_dequant; propagated biases through the runtime config dict, CUDA-graph call sites, and all reference/dequant paths; applied biases after GEMM1 and GEMM2 when provided; added test_nvfp4_moe_gemm_bias to exercise gemm1/gemm2/both combos.
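The argument-container change summarized above can be sketched with a plain dataclass. The field names gemm1_bias and gemm2_bias follow the PR description; the container name, the other fields, and the commented shapes are simplified assumptions, not the real moe_args definition.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Illustrative stand-in for moe_args / moe_args_dequant with the new
# optional bias fields; biases default to None so existing call sites
# that never pass a bias keep working unchanged.
@dataclass
class MoeArgsSketch:
    hidden_states: Any = None
    gemm1_weights: Any = None
    gemm2_weights: Any = None
    gemm1_bias: Optional[Any] = None  # e.g. [num_experts, intermediate_size]
    gemm2_bias: Optional[Any] = None  # e.g. [num_experts, hidden_size]

args = MoeArgsSketch()
assert args.gemm1_bias is None and args.gemm2_bias is None  # biases off by default
```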

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

op: moe

Suggested reviewers

  • aleozlx
  • bkryu
  • cyx-6
  • yzh119
  • jiahanc

Poem

🐰 I nudged two biases into the flow,
GEMM1, GEMM2 — now onward they go,
Tests hop past gates with joyful cheer,
Tiny changes, big results near! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 73.33%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Description check — ❓ Inconclusive: the PR description is mostly a template with empty sections; the Description and Reviewer Notes sections lack actual content about the changes. Resolution: fill in the Description with details about what bias testing was added and why, and optionally add reviewer notes or concerns.

✅ Passed checks (1 passed)
  • Title check — ✅ Passed: the title accurately describes the main change, adding bias testing to the NVFP4 MoE code path.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds test coverage for bias support in the nvfp4 MoE implementation. The changes are well-contained within the test file and correctly add gemm1_bias and gemm2_bias to the reference implementations and new tests. The new tests for gemm1_bias, gemm2_bias, and both biases together are comprehensive. I have one suggestion to improve the maintainability of the new test code by refactoring duplicated logic.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/moe/test_trtllm_gen_fused_moe.py (2)

3195-3207: Set torch.random.manual_seed(0) before creating bias tensors for full reproducibility.

In all three test functions, torch.random.manual_seed(0) is called after the bias tensors are created via torch.randn(...). This means bias values depend on whatever RNG state was left by the prior parametrized test case. While the reference-vs-kernel comparison is still valid (both see the same bias), this makes individual test failures harder to reproduce in isolation.

Proposed fix (example for `test_nvfp4_moe_gemm2_bias`; apply analogously to the other two)
     num_experts, top_k = 8, 2
     device = "cuda"
 
+    torch.random.manual_seed(0)
     # gemm2_bias shape: [num_experts, hidden_size], dtype float32
     gemm2_bias = torch.randn(
         (num_experts, hidden_size), device=device, dtype=torch.float32
     )
 
-    torch.random.manual_seed(0)
     kernel_output, ref_output = _run_fp4_moe_with_bias(

Also applies to: 3234-3246, 3271-3287
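The reproducibility point above can be demonstrated with the stdlib random module standing in for torch.random (the helper name below is illustrative): values drawn after seeding are identical no matter what RNG state a prior test left behind.

```python
import random

# Seed *before* creating the bias values, as the review suggests; the
# result is then independent of any RNG state left by earlier tests.
def make_bias_seed_first():
    random.seed(0)  # seed first, then draw
    return [random.random() for _ in range(4)]

random.random()                   # simulate leftover state from a prior test
first = make_bias_seed_first()
random.random(); random.random()  # different leftover state this time
second = make_bias_seed_first()
assert first == second            # seeding first makes the bias reproducible
```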


3138-3138: Inconsistent weight_processing dict key: "shuffle" instead of "use_shuffled_weight".

All other call sites (e.g., run_moe_test at Line 2545, FP8BlockScaleMoe.prepare_static_weights_for_kernel at Line 946) use "use_shuffled_weight" as the key. FP4Moe.prepare_static_weights_for_kernel happens to ignore the weight_processing parameter entirely, so this doesn't cause a runtime failure today, but it would silently break if FP4 ever starts using that dict.

Proposed fix
-        {"shuffle": True, "layout": WeightLayout.MajorK},
+        {"use_shuffled_weight": True, "layout": WeightLayout.MajorK},
Actionable comment (inline)

tests/moe/test_trtllm_gen_fused_moe.py, around lines 3172-3174: the test passes inconsistent types for routing_method_type and activation_type. The target function signature expects ints (routing_method_type: int = 0, activation_type: int = ActivationType.Swiglu.value), but some call sites supply enum objects (e.g., self.config["activation_type"]) while others use .value. Update all callers to pass the enum's integer value consistently (use .value for both activation_type and routing_method_type) so every call matches the signature.

Collaborator

@aleozlx aleozlx left a comment


looks good to me

The bot comments are reasonable to address, please take a look.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
tests/moe/test_trtllm_gen_fused_moe.py (2)

571-572: Consider using .get() for backward-compatible kwargs access.

kwargs["gemm1_bias"] / kwargs["gemm2_bias"] will raise KeyError if any future caller of call_moe omits these. Using .get("gemm1_bias", None) is consistent with how enable_autotune is already handled in this same method.

♻️ Proposed fix
-        gemm1_bias = kwargs["gemm1_bias"]
-        gemm2_bias = kwargs["gemm2_bias"]
+        gemm1_bias = kwargs.get("gemm1_bias", None)
+        gemm2_bias = kwargs.get("gemm2_bias", None)

2186-2187: Bias propagation added to non-FP4 reference paths without corresponding production support.

run_moe_reference_dsfp8, run_moe_reference_bf16, run_moe_reference_per_tensor_scale_fp8, and run_moe_reference_mxint4 now forward gemm1_bias/gemm2_bias into run_moe_dequant, but their production counterparts (trtllm_fp8_block_scale_moe, trtllm_bf16_moe, etc.) do not accept or apply biases. Any future test that passes non-None biases with these quant modes will silently mismatch between reference and production outputs. Consider adding an assertion in those reference functions that biases are None if production doesn't support them, e.g.:

assert args.gemm1_bias is None and args.gemm2_bias is None, \
    "GEMM bias not supported for FP8/BF16/MxInt4 production kernels"

@jimmyzho jimmyzho requested a review from aleozlx February 19, 2026 23:02
@jimmyzho
Contributor Author

@aleozlx I just refactored the test so it now calls run_moe_test directly. Could you please take another look?

Collaborator

@aleozlx aleozlx left a comment


lgtm

@aleozlx
Collaborator

aleozlx commented Feb 20, 2026

/bot run

@aleozlx aleozlx added the run-ci label Feb 20, 2026
@flashinfer-bot
Collaborator

GitLab MR !334 has been created, and the CI pipeline #44471261 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #44471261: 14/20 passed

@jimmyzho jimmyzho removed the run-ci label Feb 20, 2026
@jimmyzho jimmyzho enabled auto-merge (squash) February 20, 2026 23:43
@jimmyzho jimmyzho merged commit 3000467 into flashinfer-ai:main Feb 20, 2026
18 checks passed
