chore/feat: Add do_finalize to trtllm-gen fp8/f16 MoE APIs#2548
yzh119 merged 19 commits into `flashinfer-ai:main`
Conversation
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Entry points for fused MoE now accept a `do_finalize` flag and return arrays/lists of tensors instead of single tensors; `do_finalize` is threaded through Python wrappers, FFI entry points, and the CUDA launcher, changing whether final outputs or intermediate tensors are returned across BF16/FP8/FP4/MXInt4 variants.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Py as Python wrapper
    participant FFI as C++ FFI entry
    participant Launcher as CUDA launcher
    participant GPU as GPU kernels
    Py->>FFI: call trtllm_*_moe(..., do_finalize)
    FFI->>Launcher: build args + per-tile launchers, pass do_finalize
    Launcher->>GPU: run kernels (produce final or intermediate tensors)
    GPU-->>Launcher: return Array<Tensor>
    Launcher-->>FFI: forward Array<Tensor>
    FFI-->>Py: return Array<Tensor> (caller may select [0] when finalized)
```
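To make the return contract in the diagram concrete, here is a minimal Python sketch of the caller-side behavior. `run_moe` and the plain-list values are illustrative stand-ins for the real flashinfer entry points and tensors, not the actual API:

```python
# Hypothetical stand-in for a trtllm_*_moe entry point; plain lists play the
# role of tensors so the arity contract is easy to see.
def run_moe(hidden_states, do_finalize=True):
    gemm2_output = [1.0, 2.0]              # un-finalized GEMM2 result (stand-in)
    expert_weights = [0.6, 0.4]            # per-token expert weights (stand-in)
    expanded_idx_to_permuted_idx = [0, 1]  # permutation indices (stand-in)
    if do_finalize:
        # Finalized path: a single-element list holding the final output.
        return [[w * x for w, x in zip(expert_weights, gemm2_output)]]
    # Non-finalized path: three intermediate tensors for custom post-processing.
    return [gemm2_output, expert_weights, expanded_idx_to_permuted_idx]

final = run_moe(None)[0]                       # caller selects [0] when finalized
intermediates = run_moe(None, do_finalize=False)
assert len(intermediates) == 3
```

Either way the wrapper returns a list; only its length depends on `do_finalize`.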
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (2 warnings)
Summary of Changes

Hello @IwakuraRein, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
Code Review
This pull request introduces a do_finalize parameter to the trtllm-gen fp8 and bf16 MoE APIs, aligning them with the existing fp4 API and enabling more flexible post-processing. The changes span both the C++ CUDA kernels and the Python interface, including updates to function signatures, return types, and docstrings.
While the changes for the do_finalize=True path seem correct and are reflected in the updated tests, I've identified several critical issues in the implementation for the do_finalize=False path that will prevent the new feature from working as intended for FP8 MoE variants. Specifically, the do_finalize parameter is ignored in the C++ implementation for FP8, and there's a return value mismatch between the C++ and Python layers for FP8 block scale MoE that will cause a runtime error. I've also suggested adding test cases for the do_finalize=False scenario to help catch such issues.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
csrc/trtllm_fused_moe_kernel_launcher.cu (3)
655-659: ⚠️ Potential issue | 🔴 Critical

`do_finalize` is hardcoded to `true`, overriding the caller's value.

`Fp8PerTensorLauncher::prepare_moe()` unconditionally sets `args->do_finalize = true` at line 659, which overwrites the `do_finalize` value that was set by the caller (e.g., at line 1618). This makes the newly added `do_finalize=False` path unreachable for FP8 per-tensor scale MoE, silently ignoring the user's request.

Proposed fix:

```diff
- args->do_finalize = true;  // FP8 per-tensor scale always finalizes
+ // args->do_finalize is set by the caller
```
926-930: ⚠️ Potential issue | 🔴 Critical

Same `do_finalize` override issue as in `Fp8PerTensorLauncher`.

`Fp8BlockScaleLauncher::prepare_moe()` hardcodes `args->do_finalize = true` at line 930, overriding the caller's value (set at line 1716). The new `do_finalize` parameter for FP8 block-scale MoE is silently ignored.

Proposed fix:

```diff
- args->do_finalize = true;
+ // args->do_finalize is set by the caller
```
984-988: ⚠️ Potential issue | 🔴 Critical

Return element count mismatch with Python caller when `do_finalize=False`.

`Fp8BlockScaleLauncher::run()` returns 2 elements `{gemm2_output, expanded_idx_to_permuted_idx}` when `!do_finalize`, but the Python caller at `core.py` lines 1721-1726 unpacks 3 elements:

```python
gemm2_output, _, expanded_idx_to_permuted_idx = intermediate_output
```

This will crash with a `ValueError` once the hardcoded `do_finalize=true` in `prepare_moe()` is removed. The return should include `expert_weights` to match the base class pattern and the Python side.

Proposed fix to align with base class and Python expectation:

```diff
 if (args->do_finalize) {
   return {output};
 }
-return {gemm2_output, expanded_idx_to_permuted_idx};
+return {gemm2_output, FusedMoeLauncher::expert_weights, expanded_idx_to_permuted_idx};
```

flashinfer/fused_moe/core.py (2)
1390-1414: ⚠️ Potential issue | 🟠 Major

Fake op always returns a single-element list, ignoring `do_finalize`.

When `do_finalize=False`, the real op returns a 3-element list. The fake op should mirror this behavior for `torch.compile` tracing to produce correct graph shapes. The same issue applies to `_fake_trtllm_fp8_per_tensor_scale_moe` (line 1569) and `_fake_trtllm_fp8_block_scale_moe` (line 1759).

Proposed fix for BF16 fake op (apply a similar pattern to the other fake ops):

```diff
 def _fake_trtllm_bf16_moe(
     ...
     do_finalize: bool = True,
     ...
 ) -> List[torch.Tensor]:
     seq_len = hidden_states.shape[0]
     hidden_size = hidden_states.shape[1]
-    return [hidden_states.new_empty([seq_len, hidden_size], dtype=torch.bfloat16)]
+    if do_finalize:
+        return [hidden_states.new_empty([seq_len, hidden_size], dtype=torch.bfloat16)]
+    else:
+        return [
+            hidden_states.new_empty([seq_len, hidden_size], dtype=torch.bfloat16),
+            hidden_states.new_empty([seq_len, top_k], dtype=torch.bfloat16),
+            hidden_states.new_empty([seq_len * top_k], dtype=torch.int32),
+        ]
```
2109-2113: ⚠️ Potential issue | 🟠 Major

`trtllm_mxint4_block_scale_moe_op` returns a bare `Tensor` instead of `List[Tensor]`.

The function signature declares `-> List[torch.Tensor]` (line 2017), and the fake op returns a list (line 2143), but the real implementation returns `output` (a bare tensor) at line 2113. This is inconsistent and could cause type errors downstream.

Proposed fix:

```diff
- return output
+ return [output]
```

tests/moe/test_trtllm_gen_routed_fused_moe.py (1)
339-341: ⚠️ Potential issue | 🔴 Critical

Missing `[0]` indexing on the `trtllm_fp8_block_scale_moe` reference output (likely runtime crash).

Line 341 calls `trtllm_fp8_block_scale_moe(...)` and directly chains `.to(torch.float)`, but the same function is now indexed with `[0]` in `test_trtllm_gen_fused_moe.py` (line 1039). If the return type was changed to `List[Tensor]`, calling `.to(torch.float)` on a list will raise an `AttributeError`.

🐛 Proposed fix:

```diff
- ).to(torch.float)
+ )[0].to(torch.float)
```
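The failure mode flagged here can be reproduced without the real kernels; a tiny stand-in class (not the real `torch.Tensor`) is enough to show why the `[0]` index is required once the API returns a list:

```python
class FakeTensor:
    """Minimal stand-in for torch.Tensor exposing only a .to() method."""
    def to(self, dtype):
        return self

result = [FakeTensor()]  # the API now returns a list of tensors

# Old call pattern: chaining .to(...) directly on the return value.
try:
    result.to("float")   # a Python list has no .to() method
    crashed = False
except AttributeError:
    crashed = True
assert crashed

# Fixed pattern: index the first element before converting.
converted = result[0].to("float")
assert isinstance(converted, FakeTensor)
```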
🤖 Fix all issues with AI agents
In `@flashinfer/fused_moe/core.py`:
- Around line 1718-1726: The FP8 block-scale path currently returns or unpacks
intermediate_output without initializing expert_weights and assumes
Fp8BlockScaleLauncher::run() yields three values; fix this by matching the C++
change so Fp8BlockScaleLauncher::run() returns (gemm2_output, expert_weights,
expanded_idx_to_permuted_idx) when !do_finalize, then in the Python path where
do_finalize is False unpack intermediate_output into gemm2_output,
expert_weights, expanded_idx_to_permuted_idx (instead of the current two/three
mismatch) and ensure expert_weights is the value coming from that unpack (not an
uninitialized variable) before returning the torch.from_dlpack(gemm2_output),
expert_weights, torch.from_dlpack(expanded_idx_to_permuted_idx).
- Around line 1532-1540: The FP8 per-tensor return path discards the
C++-produced expert_weights by unpacking intermediate_output as (gemm2_output,
_, expanded_idx_to_permuted_idx), leaving the Python-side expert_weights
uninitialized; update the unpack to capture the C++ expert_weights (e.g.,
extract intermediate_output[1]) and convert it to a PyTorch tensor (like the
other dlpack conversions) before returning so the returned expert_weights is the
initialized tensor from the C++ kernel; adjust the non-do_finalize branch
handling of intermediate_output, gemm2_output, expanded_idx_to_permuted_idx and
expert_weights accordingly in fused_moe.core (symbols: intermediate_output,
gemm2_output, expanded_idx_to_permuted_idx, expert_weights, do_finalize).
- Around line 1379-1388: The branch that handles do_finalize=False currently
discards the expert_weights returned by the native routing kernel and instead
returns the locally allocated empty expert_weights, causing callers to receive
uninitialized data; modify the unpacking of intermediate_output so it captures
the C++-returned expert_weights (i.e., change "gemm2_output, _,
expanded_idx_to_permuted_idx = intermediate_output" to unpack the second element
into expert_weights) and return that expert_weights (the one from
intermediate_output) along with torch.from_dlpack(gemm2_output) and
torch.from_dlpack(expanded_idx_to_permuted_idx), rather than the local
placeholder expert_weights.
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Line 1039: The implementation of trtllm_mxint4_block_scale_moe_op returns a
single torch.Tensor but is annotated as List[torch.Tensor], causing inconsistent
handling with other _op functions; fix by making the function's return shape
consistent with the others—either change the implementation to return [output]
(wrap the tensor in a list) so it matches trtllm_fp8_block_scale_moe_op,
trtllm_fp8_per_tensor_scale_moe_op, trtllm_fp4_block_scale_moe_op, and
trtllm_bf16_moe_op, or update the function's return type annotation to
torch.Tensor if you prefer single-tensor returns, and then update any
calling/tests to use the consistent access pattern.
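The unpacking mismatch described in the comments above is easy to demonstrate in isolation. The functions below are stand-ins for the C++ launcher's return and the Python-side unpack, not the real flashinfer code:

```python
def launcher_run(do_finalize):
    # Stand-in for Fp8BlockScaleLauncher::run(): only 2 elements when not
    # finalized (the reported bug); the Python side expects 3.
    if do_finalize:
        return ["output"]
    return ["gemm2_output", "expanded_idx_to_permuted_idx"]

def python_wrapper(do_finalize):
    result = launcher_run(do_finalize)
    if do_finalize:
        return result[0]
    # The Python caller unpacks 3 elements, matching the base-class contract.
    gemm2_output, expert_weights, expanded_idx_to_permuted_idx = result
    return gemm2_output, expert_weights, expanded_idx_to_permuted_idx

try:
    python_wrapper(do_finalize=False)
    mismatch = False
except ValueError:  # not enough values to unpack (expected 3, got 2)
    mismatch = True
assert mismatch
```

Fixing the C++ side to return `{gemm2_output, expert_weights, expanded_idx_to_permuted_idx}` makes the 3-way unpack succeed.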
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
csrc/trtllm_fused_moe_kernel_launcher.cu (1)
1444-1449: ⚠️ Potential issue | 🟡 Minor

Inconsistent return value when `do_finalize=true` across launchers.

`FP4BlockScaleLauncher::run` returns an empty array `{}` when `do_finalize=true` (line 1446), while the base class `FusedMoeLauncher::run` (line 370) and `Fp8BlockScaleLauncher::run` (line 983) return `{output}`. Since this PR's goal is to align BF16/FP8 APIs with FP4, the return contract should be consistent. Either all launchers should return the finalized output tensor, or all should return empty (with the caller holding a reference to the output).
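A small sketch of why the mismatched `do_finalize=true` contracts bite callers. The two functions are illustrative stand-ins for the FP4 and FP8 launchers, not the real C++ code:

```python
def fp4_run(do_finalize):
    # Stand-in: the FP4 launcher returns an empty array when finalized.
    return [] if do_finalize else ["gemm2", "weights", "idx"]

def fp8_run(do_finalize):
    # Stand-in: the FP8 launcher (and base class) return [output] when finalized.
    return ["output"] if do_finalize else ["gemm2", "weights", "idx"]

# A generic caller that uniformly takes result[0] works for FP8...
assert fp8_run(True)[0] == "output"

# ...but raises IndexError for FP4, so the contract should be unified.
try:
    fp4_run(True)[0]
    consistent = True
except IndexError:
    consistent = False
assert not consistent
```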
🤖 Fix all issues with AI agents
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 982-986: The derived class Fp8BlockScaleLauncher has a private
TensorView expert_weights that shadows the base
FusedMoeLauncher::expert_weights, so when prepare_routing() allocates/populates
the base-class expert_weights (via workspace.expert_weights) the current return
returns the stale derived member; fix by returning the base-class member
explicitly (e.g., qualify the return as FusedMoeLauncher::expert_weights) or
rename the derived member (e.g., expert_weights_input) and update all uses in
Fp8BlockScaleLauncher (constructor and any references) so the return and any
consumers receive the populated base-class Tensor; ensure prepare_routing() and
any routing/kernel code still write to the base-class expert_weights.
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
flashinfer/fused_moe/core.py (2)
1186-1209: ⚠️ Potential issue | 🟡 Minor

Pass `do_finalize` into the MXINT4 tuning run.

Other variants forward `do_finalize`, but the MXINT4 branch omits it, so tuning runs the finalized path even when callers request intermediate outputs.

🛠️ Suggested fix:

```diff
     kwargs["routed_scaling_factor"],
     kwargs["routing_method_type"],
+    kwargs["do_finalize"],
     kwargs["enable_pdl"],
     output,
     [-1, -1] if tactic == -1 else tactic,
```
1391-1414: ⚠️ Potential issue | 🟠 Major

Fake ops should mirror the `do_finalize` return arity.

All three fake ops ignore `do_finalize` and always return a single output tensor, which can break tracing/shape inference for non-finalized paths that now return 3 tensors.

🧪 Example fix (apply to all three fake ops):

```diff
-    return [hidden_states.new_empty([seq_len, hidden_size], dtype=torch.bfloat16)]
+    if do_finalize:
+        return [hidden_states.new_empty([seq_len, hidden_size], dtype=torch.bfloat16)]
+    else:
+        return [
+            hidden_states.new_empty([seq_len, hidden_size], dtype=torch.bfloat16),
+            hidden_states.new_empty([seq_len, top_k], dtype=routing_logits.dtype),
+            hidden_states.new_empty([seq_len * top_k], dtype=torch.int32),
+        ]
```

Also applies to: 1543-1569, 1729-1759
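One way to guard against this class of bug is a small arity check run over both branches. `real_op` and `fake_op` below are illustrative stand-ins for a registered custom op and its fake (meta) implementation, not the actual flashinfer ops:

```python
def real_op(do_finalize=True):
    # Stand-in for the real kernel wrapper.
    return ["out"] if do_finalize else ["gemm2", "weights", "idx"]

def fake_op(do_finalize=True):
    # Buggy fake: always returns one element regardless of do_finalize.
    return ["out"]

def fake_matches_real(real, fake):
    # torch.compile traces the fake op, so its output arity (and shapes)
    # must mirror the real op on every branch.
    return all(
        len(real(do_finalize=f)) == len(fake(do_finalize=f)) for f in (True, False)
    )

assert not fake_matches_real(real_op, fake_op)  # the bug is caught
```

A check like this could run in unit tests alongside the `do_finalize=False` test cases suggested above.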
🤖 Fix all issues with AI agents
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 1974-1979: The free function currently captures result =
selected_launcher->run(config, enable_pdl) then tries to return variables
(gemm2_output, expanded_idx_to_permuted_idx, output) that are protected members
of FusedMoeLauncher and out of scope; change the function to simply return
result directly (i.e. return the value from selected_launcher->run) for both
do_finalize branches so behavior matches
trtllm_bf16_moe/trtllm_fp8_*/trtllm_fp4_block_scale_moe; if the mxint4 path
truly needs to return {output} when finalized, move that special-case logic into
MxInt4BlockScaleLauncher::run() instead of here.
- Around line 1480-1489: Add a missing output TensorView parameter to
trtllm_bf16_moe (and likewise to trtllm_fp8_per_tensor_scale_moe) so the
function can return the final MoE result when do_finalize=true; update the
function signatures to accept TensorView output, pass that output into the
launcher by setting args->output = output (instead of only allocating inside
Bf16MoeLauncher::prepare_moe), and ensure the caller-visible return uses this
provided output (matching the trtllm_fp4_block_scale_moe /
trtllm_mxint4_block_scale_moe pattern).
In `@flashinfer/fused_moe/core.py`:
- Around line 2116-2124: The non-finalized branch returns a locally allocated,
uninitialized expert_weights; instead unpack and use the kernel-produced
expert_weights from intermediate_output. Replace the unpacking "gemm2_output,
expanded_idx_to_permuted_idx = intermediate_output" with "gemm2_output,
expert_weights, expanded_idx_to_permuted_idx = intermediate_output" and return
that expert_weights (not the local variable) alongside
torch.from_dlpack(gemm2_output) and
torch.from_dlpack(expanded_idx_to_permuted_idx) in the else branch.
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
```diff
     enable_pdl: Optional[bool] = None,
     tune_max_num_tokens: int = 8192,
- ) -> torch.Tensor:
+ ) -> List[torch.Tensor]:
```
So I'm a little worried about changing the output type. I wonder if there is a way we could conditionally preserve the old output for compatibility?
@aleozlx We could make it return the single tensor when do_finalize is False, but that would make fp8/bf16 inconsistent with the fp4.
aleozlx left a comment:

looks great overall, posted one comment
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
/bot run

[FAILED] Pipeline #44336875: 9/20 passed
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
/bot run

[FAILED] Pipeline #44402035: 9/20 passed
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
tests are clean, g/b300 timed out

/bot run

[FAILED] Pipeline #44572775: 14/20 passed
Fix unit test failures caused by change in #2548 to the API that now returns a list of tensors.
📌 Description
This PR adds the `do_finalize` parameter to the trtllm-gen fp8/f16 MoE APIs to align them with the fp4 APIs. This also allows flexible post-processing of MoE. Additionally, it fixes a bug where the output tensors were allocated twice for bf16/fp8 MoE.
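In caller terms, the change can be sketched like this. `moe_api_after` is an illustrative stand-in using plain lists instead of tensors, not the real signature:

```python
def moe_api_after(hidden_states, do_finalize=True):
    # Stand-in for the updated API: the return value is always a list now.
    if do_finalize:
        return [[3.0]]              # [final_output]
    return [[1.0], [0.5], [0]]      # [gemm2_output, expert_weights, permuted_idx]

# Before this PR, callers wrote:  output = moe_api(...)
# After this PR, callers write:
output = moe_api_after(None)[0]

# And the new non-finalized path enables custom post-processing:
gemm2_output, expert_weights, permuted_idx = moe_api_after(None, do_finalize=False)
assert output == [3.0]
```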
The API changes include:

- Added a `do_finalize` parameter.
- Return type changed from `torch.Tensor` to `List[torch.Tensor]`.

🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [ ] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit

- New Features
- Refactor
- Tests
- Documentation