
chore/feat: Add do_finalize to trtllm-gen fp8/f16 MoE APIs #2548

Merged
yzh119 merged 19 commits into flashinfer-ai:main from IwakuraRein:trtllm-moe-route
Feb 23, 2026

Conversation


IwakuraRein (Collaborator) commented Feb 12, 2026

📌 Description

This PR adds do_finalize to the trtllm-gen fp8/f16 MoE APIs to align them with the fp4 APIs. This also allows flexible post-processing of the MoE output.

Additionally, it fixes a bug where the output tensors were allocated twice for bf16/fp8 MoE.

The API changes include:

  • A new argument do_finalize
  • Return type: torch.Tensor -> List[torch.Tensor]
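To illustrate the caller-side impact of the return-type change, here is a minimal pure-Python sketch. The stub is a hypothetical stand-in for the real FlashInfer op (the string placeholders are not real tensors), but it mirrors the return contract described above:

```python
from typing import List

def trtllm_moe_stub(do_finalize: bool = True) -> List[str]:
    """Hypothetical stand-in for the updated MoE op: it now always returns a list."""
    if do_finalize:
        # Finalized run: a single-element list holding the final output.
        return ["final_output"]
    # Non-finalized run: intermediate components for custom post-processing.
    return ["gemm2_output", "expert_weights", "expanded_idx_to_permuted_idx"]

# Before this PR: output = trtllm_moe(...)       (a bare tensor)
# After this PR:  output = trtllm_moe(...)[0]    (index the returned list)
final = trtllm_moe_stub()[0]
intermediates = trtllm_moe_stub(do_finalize=False)
```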

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Added do_finalize (default: true) to MoE entry points to toggle final vs intermediate outputs.
  • Refactor

    • MoE operations now return lists/arrays of tensors; finalized calls return a single final tensor, non-finalized calls return intermediate components.
  • Tests

    • Tests updated to handle list/array returns (extract finalized tensor where applicable).
  • Documentation

    • Docstrings updated to reflect do_finalize behavior and new return formats.

Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
@coderabbitai

coderabbitai bot commented Feb 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Entry points for fused MoE now accept a do_finalize flag and return arrays/lists of tensors instead of single tensors; do_finalize is threaded through Python wrappers, FFI entry points, and the CUDA launcher, changing whether final outputs or intermediate tensors are returned across BF16/FP8/FP4/MXInt4 variants.

Changes

  • C++ CUDA kernel launcher (csrc/trtllm_fused_moe_kernel_launcher.cu): Entry-point signatures changed to return Array<Tensor> and accept do_finalize, which is propagated into the MoE arg builders and per-tile launcher config. The launcher run now returns arrays (a single-element final output, or multi-element intermediate tensors). Minor NotImplementedError message formatting.
  • Python binding / operator layer (flashinfer/fused_moe/core.py): Public op wrappers (BF16/FP8/FP4/MXInt4 and routed variants) updated to accept do_finalize: bool = True and return List[Tensor]. Autotuning/forward paths and fake-op stubs thread do_finalize through tactics and launcher calls; return semantics adjusted to either a single final tensor (in a list) or multiple intermediate tensors.
  • Tests (tests/moe/test_trtllm_gen_fused_moe.py, tests/moe/test_trtllm_gen_routed_fused_moe.py): Call sites adjusted to index the first element ([0]) of the returned Array/List before casting, to match the new return structure.
  • FFI / config path (csrc/... FFI surface, trtllm_get_valid_moe_configs): Error message formatting tweaked (signature unchanged).

Sequence Diagram(s)

sequenceDiagram
    participant Py as Python wrapper
    participant FFI as C++ FFI entry
    participant Launcher as CUDA launcher
    participant GPU as GPU kernels

    Py->>FFI: call trtllm_*_moe(..., do_finalize)
    FFI->>Launcher: build args + per-tile launchers, pass do_finalize
    Launcher->>GPU: run kernels (produce final or intermediate tensors)
    GPU-->>Launcher: return Array<Tensor>
    Launcher-->>FFI: forward Array<Tensor>
    FFI-->>Py: return Array<Tensor> (caller may select [0] when finalized)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • djmmoss
  • yzh119
  • cyx-6
  • jimmyzho
  • jiahanc
  • nv-yunzheq

Poem

🐇 I threaded a flag through tunnel and burrow,
Lists now hop out where single carrots did borrow.
Final or staged, each tensor I store,
I shuffled the bundles and pushed through the door. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 41.03%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Merge Conflict Detection: ⚠️ Warning. ❌ Merge conflicts detected (21 files):

⚔️ benchmarks/routines/flashinfer_benchmark_utils.py (content)
⚔️ benchmarks/routines/gemm.py (content)
⚔️ csrc/flashinfer_sampling_binding.cu (content)
⚔️ csrc/sampling.cu (content)
⚔️ csrc/trtllm_fused_moe_kernel_launcher.cu (content)
⚔️ flashinfer/__init__.py (content)
⚔️ flashinfer/aot.py (content)
⚔️ flashinfer/fused_moe/core.py (content)
⚔️ flashinfer/gemm/__init__.py (content)
⚔️ flashinfer/gemm/gemm_base.py (content)
⚔️ flashinfer/jit/gemm/__init__.py (content)
⚔️ flashinfer/jit/gemm/core.py (content)
⚔️ flashinfer/sampling.py (content)
⚔️ flashinfer/triton/__init__.py (content)
⚔️ flashinfer/utils.py (content)
⚔️ include/flashinfer/sampling.cuh (content)
⚔️ scripts/task_run_unit_tests.sh (content)
⚔️ scripts/test_utils.sh (content)
⚔️ tests/gemm/test_bmm_fp8.py (content)
⚔️ tests/moe/test_trtllm_gen_fused_moe.py (content)
⚔️ tests/moe/test_trtllm_gen_routed_fused_moe.py (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title accurately captures the main change: adding a do_finalize parameter to the trtllm-gen fp8/f16 MoE APIs to align with the fp4 APIs.
  • Description check: ✅ Passed. The PR description clearly explains the changes: adding the do_finalize parameter to align APIs, changing return types from Tensor to List[Tensor], and fixing the output buffer allocation bug. Pre-commit checks are marked complete and tests updated.


@gemini-code-assist

Summary of Changes

Hello @IwakuraRein, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the trtllm-gen fp8/f16 Mixture of Experts (MoE) APIs by introducing a do_finalize argument, aligning their behavior with the existing fp4 APIs. This change provides greater control over the MoE operation's output, allowing users to either receive the fully processed, finalized result or access intermediate tensors for custom post-processing. The API signatures and return types have been updated across both the C++ backend and Python frontend to support this new functionality.

Highlights

  • API Enhancement: Introduced a do_finalize boolean parameter to the trtllm-gen fp8/f16 MoE APIs to control the finalization step of Mixture of Experts (MoE) outputs.
  • Return Type Change: Modified the return type of the affected MoE APIs from a single torch.Tensor to a List[torch.Tensor], allowing for the return of intermediate results when do_finalize is False.
  • Flexible Post-processing: Enabled more flexible post-processing workflows for MoE operations by exposing intermediate computation results, aligning the behavior with existing fp4 APIs.


Changelog
  • csrc/trtllm_fused_moe_kernel_launcher.cu
    • Modified trtllm_bf16_moe, trtllm_fp8_per_tensor_scale_moe, and trtllm_fp8_block_scale_moe function signatures to accept a do_finalize boolean parameter and return Array<Tensor>.
    • Updated the internal calls to selected_launcher->run to return the full array of tensors instead of just the first element.
    • Added the do_finalize argument to the launcher configuration structures.
    • Adjusted formatting in TVM_FFI_LOG_AND_THROW message for trtllm_get_valid_moe_configs.
  • flashinfer/fused_moe/core.py
    • Added do_finalize with a default value of True to the Python wrapper functions trtllm_bf16_moe_op, trtllm_fp8_per_tensor_scale_moe_op, trtllm_fp8_block_scale_moe_op, trtllm_fp8_block_scale_routed_moe, and their corresponding fake operations.
    • Changed the return type annotations of these functions from torch.Tensor to List[torch.Tensor].
    • Implemented conditional return logic: if do_finalize is True, the first element of the intermediate output list is returned; otherwise, a list containing gemm2_output, expert_weights, and expanded_idx_to_permuted_idx is returned.
    • Updated docstrings for the affected functions to describe the new do_finalize parameter and the changed return behavior.
    • Removed deprecated activation types from docstrings for trtllm_fp8_per_tensor_scale_moe and trtllm_fp4_block_scale_moe.
  • tests/moe/test_trtllm_gen_fused_moe.py
    • Updated calls to call_moe to correctly extract the final output tensor by indexing the returned list (output[0]).
  • tests/moe/test_trtllm_gen_routed_fused_moe.py
    • Updated calls to trtllm_gen_fp8_routed_fused_moe to correctly extract the final output tensor by indexing the returned list (output[0]).
Activity
  • The pull request was opened by IwakuraRein.
  • Pre-commit checks were executed and passed successfully.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a do_finalize parameter to the trtllm-gen fp8 and bf16 MoE APIs, aligning them with the existing fp4 API and enabling more flexible post-processing. The changes span both the C++ CUDA kernels and the Python interface, including updates to function signatures, return types, and docstrings.

While the changes for the do_finalize=True path seem correct and are reflected in the updated tests, I've identified several critical issues in the implementation for the do_finalize=False path that will prevent the new feature from working as intended for FP8 MoE variants. Specifically, the do_finalize parameter is ignored in the C++ implementation for FP8, and there's a return value mismatch between the C++ and Python layers for FP8 block scale MoE that will cause a runtime error. I've also suggested adding test cases for the do_finalize=False scenario to help catch such issues.

Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
csrc/trtllm_fused_moe_kernel_launcher.cu (3)

655-659: ⚠️ Potential issue | 🔴 Critical

do_finalize is hardcoded to true, overriding the caller's value.

Fp8PerTensorLauncher::prepare_moe() unconditionally sets args->do_finalize = true at line 659, which overwrites the do_finalize value that was set by the caller (e.g., at line 1618). This makes the newly added do_finalize=False path unreachable for FP8 per-tensor scale MoE, silently ignoring the user's request.

Proposed fix
-    args->do_finalize = true;  // FP8 per-tensor scale always finalizes
+    // args->do_finalize is set by the caller

926-930: ⚠️ Potential issue | 🔴 Critical

Same do_finalize override issue as in Fp8PerTensorLauncher.

Fp8BlockScaleLauncher::prepare_moe() hardcodes args->do_finalize = true at line 930, overriding the caller's value (set at line 1716). The new do_finalize parameter for FP8 block-scale MoE is silently ignored.

Proposed fix
-    args->do_finalize = true;
+    // args->do_finalize is set by the caller

984-988: ⚠️ Potential issue | 🔴 Critical

Return element count mismatch with Python caller when do_finalize=False.

Fp8BlockScaleLauncher::run() returns 2 elements {gemm2_output, expanded_idx_to_permuted_idx} when !do_finalize, but the Python caller at core.py lines 1721–1726 unpacks 3 elements:

gemm2_output, _, expanded_idx_to_permuted_idx = intermediate_output

This will crash with a ValueError once the hardcoded do_finalize=true in prepare_moe() is removed. The return should include expert_weights to match the base class pattern and the Python side.

Proposed fix to align with base class and Python expectation
     if (args->do_finalize) {
       return {output};
     }
-    return {gemm2_output, expanded_idx_to_permuted_idx};
+    return {gemm2_output, FusedMoeLauncher::expert_weights, expanded_idx_to_permuted_idx};
flashinfer/fused_moe/core.py (2)

1390-1414: ⚠️ Potential issue | 🟠 Major

Fake op always returns a single-element list, ignoring do_finalize.

When do_finalize=False, the real op returns a 3-element list. The fake op should mirror this behavior for torch.compile tracing to produce correct graph shapes. The same issue applies to _fake_trtllm_fp8_per_tensor_scale_moe (line 1569) and _fake_trtllm_fp8_block_scale_moe (line 1759).

Proposed fix for BF16 fake op (apply similar pattern to other fake ops)
     def _fake_trtllm_bf16_moe(
         ...
         do_finalize: bool = True,
         ...
     ) -> List[torch.Tensor]:
         seq_len = hidden_states.shape[0]
         hidden_size = hidden_states.shape[1]
 
-        return [hidden_states.new_empty([seq_len, hidden_size], dtype=torch.bfloat16)]
+        if do_finalize:
+            return [hidden_states.new_empty([seq_len, hidden_size], dtype=torch.bfloat16)]
+        else:
+            return [
+                hidden_states.new_empty([seq_len, hidden_size], dtype=torch.bfloat16),
+                hidden_states.new_empty([seq_len, top_k], dtype=torch.bfloat16),
+                hidden_states.new_empty([seq_len * top_k], dtype=torch.int32),
+            ]

2109-2113: ⚠️ Potential issue | 🟠 Major

trtllm_mxint4_block_scale_moe_op returns a bare Tensor instead of List[Tensor].

The function signature declares -> List[torch.Tensor] (line 2017) and the fake op returns a list (line 2143), but the real implementation returns output (a bare tensor) at line 2113. This is inconsistent and could cause type errors downstream.

Proposed fix
-        return output
+        return [output]
tests/moe/test_trtllm_gen_routed_fused_moe.py (1)

339-341: ⚠️ Potential issue | 🔴 Critical

Missing [0] indexing on trtllm_fp8_block_scale_moe reference output — likely runtime crash.

Line 341 calls trtllm_fp8_block_scale_moe(...) and directly chains .to(torch.float), but the same function is now indexed with [0] in test_trtllm_gen_fused_moe.py (line 1039). If the return type was changed to List[Tensor], calling .to(torch.float) on a list will raise an AttributeError.

🐛 Proposed fix
-    ).to(torch.float)
+    )[0].to(torch.float)
🤖 Fix all issues with AI agents
In `@flashinfer/fused_moe/core.py`:
- Around line 1718-1726: The FP8 block-scale path currently returns or unpacks
intermediate_output without initializing expert_weights and assumes
Fp8BlockScaleLauncher::run() yields three values; fix this by matching the C++
change so Fp8BlockScaleLauncher::run() returns (gemm2_output, expert_weights,
expanded_idx_to_permuted_idx) when !do_finalize, then in the Python path where
do_finalize is False unpack intermediate_output into gemm2_output,
expert_weights, expanded_idx_to_permuted_idx (instead of the current two/three
mismatch) and ensure expert_weights is the value coming from that unpack (not an
uninitialized variable) before returning the torch.from_dlpack(gemm2_output),
expert_weights, torch.from_dlpack(expanded_idx_to_permuted_idx).
- Around line 1532-1540: The FP8 per-tensor return path discards the
C++-produced expert_weights by unpacking intermediate_output as (gemm2_output,
_, expanded_idx_to_permuted_idx), leaving the Python-side expert_weights
uninitialized; update the unpack to capture the C++ expert_weights (e.g.,
extract intermediate_output[1]) and convert it to a PyTorch tensor (like the
other dlpack conversions) before returning so the returned expert_weights is the
initialized tensor from the C++ kernel; adjust the non-do_finalize branch
handling of intermediate_output, gemm2_output, expanded_idx_to_permuted_idx and
expert_weights accordingly in fused_moe.core (symbols: intermediate_output,
gemm2_output, expanded_idx_to_permuted_idx, expert_weights, do_finalize).
- Around line 1379-1388: The branch that handles do_finalize=False currently
discards the expert_weights returned by the native routing kernel and instead
returns the locally allocated empty expert_weights, causing callers to receive
uninitialized data; modify the unpacking of intermediate_output so it captures
the C++-returned expert_weights (i.e., change "gemm2_output, _,
expanded_idx_to_permuted_idx = intermediate_output" to unpack the second element
into expert_weights) and return that expert_weights (the one from
intermediate_output) along with torch.from_dlpack(gemm2_output) and
torch.from_dlpack(expanded_idx_to_permuted_idx), rather than the local
placeholder expert_weights.

In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Line 1039: The implementation of trtllm_mxint4_block_scale_moe_op returns a
single torch.Tensor but is annotated as List[torch.Tensor], causing inconsistent
handling with other _op functions; fix by making the function's return shape
consistent with the others—either change the implementation to return [output]
(wrap the tensor in a list) so it matches trtllm_fp8_block_scale_moe_op,
trtllm_fp8_per_tensor_scale_moe_op, trtllm_fp4_block_scale_moe_op, and
trtllm_bf16_moe_op, or update the function's return type annotation to
torch.Tensor if you prefer single-tensor returns, and then update any
calling/tests to use the consistent access pattern.

Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
csrc/trtllm_fused_moe_kernel_launcher.cu (1)

1444-1449: ⚠️ Potential issue | 🟡 Minor

Inconsistent return value when do_finalize=true across launchers.

FP4BlockScaleLauncher::run returns an empty array {} when do_finalize=true (line 1446), while the base class FusedMoeLauncher::run (line 370) and Fp8BlockScaleLauncher::run (line 983) return {output}. Since this PR's goal is to align BF16/FP8 APIs with FP4, the return contract should be consistent. Either all launchers should return the finalized output tensor, or all should return empty (with the caller holding a reference to the output).

🤖 Fix all issues with AI agents
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 982-986: The derived class Fp8BlockScaleLauncher has a private
TensorView expert_weights that shadows the base
FusedMoeLauncher::expert_weights, so when prepare_routing() allocates/populates
the base-class expert_weights (via workspace.expert_weights) the current return
returns the stale derived member; fix by returning the base-class member
explicitly (e.g., qualify the return as FusedMoeLauncher::expert_weights) or
rename the derived member (e.g., expert_weights_input) and update all uses in
Fp8BlockScaleLauncher (constructor and any references) so the return and any
consumers receive the populated base-class Tensor; ensure prepare_routing() and
any routing/kernel code still write to the base-class expert_weights.

Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
@IwakuraRein IwakuraRein marked this pull request as draft February 13, 2026 18:25

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
flashinfer/fused_moe/core.py (2)

1186-1209: ⚠️ Potential issue | 🟡 Minor

Pass do_finalize into the MXINT4 tuning run.

Other variants forward do_finalize, but the MXINT4 branch omits it, so tuning runs the finalized path even when callers request intermediate outputs.

🛠️ Suggested fix
                 kwargs["routed_scaling_factor"],
                 kwargs["routing_method_type"],
+                kwargs["do_finalize"],
                 kwargs["enable_pdl"],
                 output,
                 [-1, -1] if tactic == -1 else tactic,

1391-1414: ⚠️ Potential issue | 🟠 Major

Fake ops should mirror do_finalize return arity.

All three fake ops ignore do_finalize and always return a single output tensor, which can break tracing/shape inference for non-finalized paths that now return 3 tensors.

🧪 Example fix (apply to all three fake ops)
-        return [hidden_states.new_empty([seq_len, hidden_size], dtype=torch.bfloat16)]
+        if do_finalize:
+            return [hidden_states.new_empty([seq_len, hidden_size], dtype=torch.bfloat16)]
+        else:
+            return [
+                hidden_states.new_empty([seq_len, hidden_size], dtype=torch.bfloat16),
+                hidden_states.new_empty([seq_len, top_k], dtype=routing_logits.dtype),
+                hidden_states.new_empty([seq_len * top_k], dtype=torch.int32),
+            ]

Also applies to: 1543-1569, 1729-1759

🤖 Fix all issues with AI agents
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 1974-1979: The free function currently captures result =
selected_launcher->run(config, enable_pdl) then tries to return variables
(gemm2_output, expanded_idx_to_permuted_idx, output) that are protected members
of FusedMoeLauncher and out of scope; change the function to simply return
result directly (i.e. return the value from selected_launcher->run) for both
do_finalize branches so behavior matches
trtllm_bf16_moe/trtllm_fp8_*/trtllm_fp4_block_scale_moe; if the mxint4 path
truly needs to return {output} when finalized, move that special-case logic into
MxInt4BlockScaleLauncher::run() instead of here.
- Around line 1480-1489: Add a missing output TensorView parameter to
trtllm_bf16_moe (and likewise to trtllm_fp8_per_tensor_scale_moe) so the
function can return the final MoE result when do_finalize=true; update the
function signatures to accept TensorView output, pass that output into the
launcher by setting args->output = output (instead of only allocating inside
Bf16MoeLauncher::prepare_moe), and ensure the caller-visible return uses this
provided output (matching the trtllm_fp4_block_scale_moe /
trtllm_mxint4_block_scale_moe pattern).

In `@flashinfer/fused_moe/core.py`:
- Around line 2116-2124: The non-finalized branch returns a locally allocated,
uninitialized expert_weights; instead unpack and use the kernel-produced
expert_weights from intermediate_output. Replace the unpacking "gemm2_output,
expanded_idx_to_permuted_idx = intermediate_output" with "gemm2_output,
expert_weights, expanded_idx_to_permuted_idx = intermediate_output" and return
that expert_weights (not the local variable) alongside
torch.from_dlpack(gemm2_output) and
torch.from_dlpack(expanded_idx_to_permuted_idx) in the else branch.

Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
@IwakuraRein IwakuraRein marked this pull request as ready for review February 13, 2026 20:58
@IwakuraRein IwakuraRein enabled auto-merge (squash) February 17, 2026 18:17
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
     enable_pdl: Optional[bool] = None,
     tune_max_num_tokens: int = 8192,
-) -> torch.Tensor:
+) -> List[torch.Tensor]:
Collaborator


so i'm a little worried about changing the output type
i wonder if there is a way we could conditionally preserve the old output for compatibility?

Collaborator Author


@aleozlx We could make it return the single tensor when do_finalize is False, but that would make fp8/bf16 inconsistent with the fp4.


@aleozlx aleozlx left a comment


looks great overall

posted one comment

Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>

yzh119 commented Feb 19, 2026

/bot run

@flashinfer-bot

GitLab MR !328 has been created, and the CI pipeline #44336875 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot

[FAILED] Pipeline #44336875: 9/20 passed

@IwakuraRein
Collaborator Author

/bot run

@flashinfer-bot

GitLab MR !328 has been updated with latest changes, and the CI pipeline #44402035 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot

[FAILED] Pipeline #44402035: 9/20 passed

Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>

@aleozlx aleozlx left a comment


lgtm

@aleozlx aleozlx added the run-ci label Feb 21, 2026
@aleozlx

aleozlx commented Feb 21, 2026

tests are clean, g/b300 timed out

@aleozlx aleozlx added the ready label Feb 21, 2026
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
@IwakuraRein IwakuraRein disabled auto-merge February 21, 2026 01:24
@IwakuraRein
Collaborator Author

/bot run

@flashinfer-bot

GitLab MR !328 has been updated with latest changes, and the CI pipeline #44572775 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot

[FAILED] Pipeline #44572775: 14/20 passed

@yzh119 yzh119 merged commit ec5f250 into flashinfer-ai:main Feb 23, 2026
30 of 37 checks passed
aleozlx pushed a commit that referenced this pull request Feb 27, 2026


## 📌 Description

Fix unit test failures caused by change in
#2548 to the API that
now returns a list of tensors.

## 🔍 Related Issues


## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes



## Summary by CodeRabbit

* **Tests**
* Updated MoE test implementations to correctly extract return values
from method calls.
