
[Bug] Fix spark unit test failures for test_add_rmsnorm_fp4_quant_cute_dsl #2573

Merged
kahyunnam merged 2 commits into flashinfer-ai:main from kahyunnam:knam/spark-unit-test-fix
Feb 18, 2026

Conversation

@kahyunnam
Collaborator

@kahyunnam kahyunnam commented Feb 17, 2026

📌 Description

Fixes failing test_add_rmsnorm_fp4_quant_cute_dsl unit tests on Spark:

Module import fix for:

... etc ... 
FAILED tests/norm/test_add_rmsnorm_fp4_quant_cute_dsl.py::TestOutputBothSFLayouts::test_both_sf_layouts_consistency[dtype0-512-16] - ModuleNotFoundError: No module named 'flashinfer.cute_dsl.gated_delta_rule'
FAILED tests/norm/test_add_rmsnorm_fp4_quant_cute_dsl.py::TestOutputBothSFLayouts::test_both_sf_layouts_consistency[dtype0-512-128] - ModuleNotFoundError: No module named 'flashinfer.cute_dsl.gated_delta_rule'
FAILED tests/norm/test_add_rmsnorm_fp4_quant_cute_dsl.py::TestOutputBothSFLayouts::test_both_sf_layouts_consistency[dtype0-1024-1] - ModuleNotFoundError: No module named 'flashinfer.cute_dsl.gated_delta_rule'
FAILED tests/norm/test_add_rmsnorm_fp4_quant_cute_dsl.py::TestOutputBothSFLayouts::test_both_sf_layouts_consistency[dtype0-1024-16] - ModuleNotFoundError: No module named 'flashinfer.cute_dsl.gated_delta_rule'
FAILED tests/norm/test_add_rmsnorm_fp4_quant_cute_dsl.py::TestOutputBothSFLayouts::test_both_sf_layouts_consistency[dtype0-1024-128] - ModuleNotFoundError: No module named 'flashinfer.cute_dsl.gated_delta_rule'
FAILED tests/norm/test_add_rmsnorm_fp4_quant_cute_dsl.py::TestOutputBothSFLayouts::test_both_sf_layouts_consistency[dtype0-4096-1] - ModuleNotFoundError: No module named 'flashinfer.cute_dsl.gated_delta_rule'
... etc ... 
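For context on the import half of the fix: `flashinfer/cute_dsl/__init__.py` imported and exported `gated_delta_rule` even on builds where that module file is absent, so importing the package at all raised `ModuleNotFoundError` during test collection. A hypothetical sketch of the failure mode (not FlashInfer's actual code; `available` is a plain set standing in for whatever availability check a package might use):

```python
def build_all(available):
    """Return an __all__ list containing only symbols whose module exists."""
    exports = ["add_rmsnorm_fp4_quant"]  # module known to be present
    # Before the fix, the gated_delta_rule symbols were exported
    # unconditionally even when the backing module file was missing.
    if "gated_delta_rule" in available:
        exports += ["gated_delta_rule", "GatedDeltaRuleKernel"]
    return exports
```

The PR takes the simpler route of deleting the stale import and `__all__` entries outright rather than guarding them.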

Race condition fix for:

E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 490 / 163840 (0.3%)
E       Greatest absolute difference: 4.59375 at index (2, 1201) (up to 0.5 allowed)
E       Greatest relative difference: 299.93060302734375 at index (9, 342) (up to 0.3 allowed)

tests/norm/test_add_rmsnorm_fp4_quant_cute_dsl.py:813: AssertionError
================================================================================================================ warnings summary ================================================================================================================
../opt/conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:435
  /opt/conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:435: UserWarning:
      Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
      Minimum and Maximum cuda capability supported by this version of PyTorch is
      (8.0) - (12.0)

    queued_call()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================================================ short test summary info =============================================================================================================
FAILED tests/norm/test_add_rmsnorm_fp4_quant_cute_dsl.py::TestLargeHiddenSize::test_large_hidden_nvfp4[dtype0-16384-1024] - AssertionError: Tensor-likes are not close!
=================================================================================================== 1 failed, 1187 passed, 1 warning in 28.85s ===================================================================================================

New test results after this PR applied:

(py312) root@94210fbca355:/workspace# pytest  tests/norm/test_add_rmsnorm_fp4_quant_cute_dsl.py
============================================================================================================== test session starts ===============================================================================================================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0
rootdir: /workspace
configfile: pytest.ini
collected 1188 items

tests/norm/test_add_rmsnorm_fp4_quant_cute_dsl.py ........................................................................................................................................................................................ [ 15%]
.......................................................................................................................................................................................................................................... [ 35%]
.......................................................................................................................................................................................................................................... [ 54%]
.......................................................................................................................................................................................................................................... [ 74%]
.......................................................................................................................................................................................................................................... [ 94%]
....................................................................                                                                                                                                                                       [100%]

================================================================================================================ warnings summary ================================================================================================================
../opt/conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:435
  /opt/conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:435: UserWarning:
      Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
      Minimum and Maximum cuda capability supported by this version of PyTorch is
      (8.0) - (12.0)

    queued_call()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================================== 1188 passed, 1 warning in 23.02s ========================================================================================================
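For context on the race-condition half of the fix: in cluster mode, Phase 3 of the kernel reads values that other CTAs wrote to global memory during the residual update, so without a fence plus a cluster-wide arrive/wait a CTA can observe stale data. A rough CPU-side analogy using a thread barrier (illustrative Python only, not the actual CUDA/CuTe-DSL code; the phase names mirror the PR description):

```python
import threading

def phased_kernel_sim(num_ctas, data):
    """Simulate CTAs that must see each other's writes before Phase 3."""
    barrier = threading.Barrier(num_ctas)
    results = [0] * num_ctas

    def cta(rank):
        # Phase 2 analogue: each "CTA" updates its slice of shared memory.
        data[rank] += rank
        # Analogue of fence + cluster_arrive_relaxed + cluster_wait:
        # no CTA proceeds until every CTA has finished its update.
        barrier.wait()
        # Phase 3 analogue: safely read the other CTAs' updated values.
        results[rank] = sum(data)

    threads = [threading.Thread(target=cta, args=(r,)) for r in range(num_ctas)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Without the `barrier.wait()` (the fence/arrive/wait in the real kernel), each rank could sum a mix of updated and stale entries, which is exactly the kind of small, nondeterministic mismatch the failing assertion above reports.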

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced synchronization in cluster mode to ensure all compute units' global-memory updates are visible before proceeding to the next computation phase, improving correctness under multi-node execution.
  • Chores

    • Removed a deprecated feature from the public API, simplifying exported interfaces and public surface area.

@gemini-code-assist
Contributor

Summary of Changes

Hello @kahyunnam, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses critical failures in Spark unit tests related to add_rmsnorm_fp4_quant_cute_dsl. It rectifies a ModuleNotFoundError by cleaning up unnecessary imports and resolves a data race condition by introducing proper memory synchronization, thereby enhancing the stability and correctness of the FP4 quantization and RMSNorm operations in a distributed environment.

Highlights

  • Unit Test Fixes: Resolved failing test_add_rmsnorm_fp4_quant_cute_dsl unit tests on Spark, addressing both ModuleNotFoundError and AssertionError due to a race condition.
  • Module Import Cleanup: Removed unused imports and __all__ entries for gated_delta_rule from flashinfer/cute_dsl/__init__.py, which was causing ModuleNotFoundError.
  • Race Condition Resolution: Implemented memory synchronization mechanisms (fence and cluster synchronization) in add_rmsnorm_fp4quant.py to prevent race conditions when operating in cluster mode, ensuring data consistency across CTAs.


Changelog
  • flashinfer/cute_dsl/__init__.py
    • Removed import statements for gated_delta_rule and GatedDeltaRuleKernel.
    • Removed gated_delta_rule and GatedDeltaRuleKernel from the __all__ export list.
  • flashinfer/cute_dsl/add_rmsnorm_fp4quant.py
    • Added cute.arch.fence_acq_rel_gpu(), cute.arch.cluster_arrive_relaxed(), and cute.arch.cluster_wait() calls within the kernel.
    • Introduced conditional execution of synchronization primitives for cluster_n > 1 to ensure global memory writes are visible before proceeding to Phase 3 in cluster mode.
Activity
  • The author, kahyunnam, identified and fixed two distinct issues causing unit test failures: a ModuleNotFoundError and an AssertionError indicative of a race condition.
  • The pull request description includes detailed logs of the failing tests before the fix and successful test runs after the changes were applied, demonstrating the effectiveness of the proposed solution.

@kahyunnam kahyunnam changed the title fix spark unit test failure for test_add_rmsnorm_fp4_quant_cute_dsl [Bug] Fix spark unit test failures for test_add_rmsnorm_fp4_quant_cute_dsl Feb 17, 2026
@coderabbitai
Contributor

coderabbitai bot commented Feb 17, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Walkthrough

Removes gated_delta_rule and GatedDeltaRuleKernel from the public API and adds cluster-mode GPU synchronization (fence + arrive + wait) in the add_rmsnorm_fp4quant kernel after the residual update and before Phase 3 to ensure cross-CTA memory visibility.

Changes

Cohort / File(s) Summary
Gated Delta Rule removal
flashinfer/cute_dsl/__init__.py
Removed imports and exports of gated_delta_rule and GatedDeltaRuleKernel from the module's conditional is_cute_dsl_available() block and __all__.
GPU cluster synchronization
flashinfer/cute_dsl/add_rmsnorm_fp4quant.py
Added cluster-mode synchronization after computing h = input + residual and before Phase 3: fence_acq_rel_cluster(), cluster_arrive_relaxed(), and cluster_wait() when cluster_n > 1.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

op: comm

Suggested reviewers

  • kaixih
  • aleozlx
  • yzh119
  • bkryu

Poem

🐰 I hop through code both bright and spry,
A gate removed, a fence raised high,
CTAs pause, then all align,
Memory shared, the kernels shine,
Hooray — a synced and happy sky!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the bug fix focus: addressing failing Spark unit tests in a specific test module.
Description check ✅ Passed The description includes root causes (module import error and race condition), detailed test failure examples, and passing test results after the fix, demonstrating thorough documentation of the changes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.



@kahyunnam
Collaborator Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !319 has been created, and the CI pipeline #44230550 is currently running. I'll report back once the pipeline job completes.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively resolves two issues: a ModuleNotFoundError by removing an obsolete import in flashinfer/cute_dsl/__init__.py, and a race condition in flashinfer/cute_dsl/add_rmsnorm_fp4quant.py. The fix for the race condition, which involves adding memory fencing and cluster synchronization, is a crucial correction. The changes are well-commented and directly address the test failures described in the pull request. I've added one suggestion for a potential performance optimization regarding the memory fence scope. Overall, this is a good fix.

@kahyunnam kahyunnam self-assigned this Feb 17, 2026
@kahyunnam
Collaborator Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !319 has been updated with latest changes, and the CI pipeline #44234479 is currently running. I'll report back once the pipeline job completes.

@kahyunnam
Collaborator Author

Benchmarking results: python3 /workspace/benchmarks/bench_cute_dsl_add_rmsnorm_fp4quant.py

Before fix:

================================================================================
Fused Add + RMSNorm + FP4 Quantization Benchmark
================================================================================
GPU Compute Capability: SM121

Running sanity check...
  OK: (128, 256) - FP4 match 99.9%
  OK: (512, 1024) - FP4 match 99.9%
  OK: (1024, 2048) - FP4 match 99.9%
✓ Confirmed: CuTe-DSL output is equivalent to torch.add + RMSNorm + fp4_quantize


Batch    Hidden   Fused (µs)   BW (GB/s)  Unfused (µs)   Speedup
-------------------------------------------------------------------
1000     1536     61.9         163.0      85.0           1.37x
1000     2048     77.3         174.0      104.4          1.35x
1000     4096     136.4        197.1      183.0          1.34x
1000     8192     244.5        219.9      368.4          1.51x
1000     16384    466.3        230.6      873.5          1.87x
1000     32768    910.8        236.2      1704.0         1.87x
1024     1536     62.6         164.9      86.9           1.39x
1024     2048     79.2         173.8      108.0          1.36x
1024     4096     139.4        197.5      187.9          1.35x
1024     8192     248.6        221.5      374.2          1.51x
1024     16384    478.5        230.2      890.6          1.86x
1024     32768    929.3        237.0      1788.7         1.92x
2048     1536     109.4        188.7      145.1          1.33x
2048     2048     138.4        198.8      187.1          1.35x
2048     4096     249.3        220.9      372.9          1.50x
2048     8192     476.3        231.2      882.7          1.85x
2048     16384    922.7        238.7      1776.4         1.93x
2048     32768    1816.2       242.5      3518.4         1.94x
3000     1536     147.9        204.5      203.9          1.38x
3000     2048     188.1        214.4      262.1          1.39x
3000     4096     355.6        226.8      630.0          1.77x
3000     8192     682.9        236.2      1314.4         1.92x
3000     16384    1336.1       241.4      2590.9         1.94x
3000     32768    2642.5       244.2      5136.4         1.94x
4096     1536     192.8        214.2      269.2          1.40x
4096     2048     248.7        221.4      374.7          1.51x
4096     4096     477.0        230.9      890.3          1.87x
4096     8192     918.9        239.7      1778.3         1.94x
4096     16384    1813.5       242.9      3518.2         1.94x
4096     32768    3592.6       245.2      7007.1         1.95x
5000     1536     231.2        218.0      333.5          1.44x
5000     2048     298.0        225.5      506.2          1.70x
5000     4096     574.3        234.0      1082.8         1.89x
5000     8192     1113.5       241.4      2162.5         1.94x
5000     16384    2215.5       242.7      4288.0         1.94x
5000     32768    4381.1       245.4      8536.3         1.95x
8192     1536     363.6        227.1      653.0          1.80x
8192     2048     476.2        231.2      883.0          1.85x
8192     4096     919.6        239.5      1771.0         1.93x
8192     8192     1808.1       243.6      3507.7         1.94x
8192     16384    3615.0       243.7      7001.8         1.94x
8192     32768    7151.4       246.3      13985.2        1.96x
10000    1536     438.1        230.1      812.7          1.85x
10000    2048     574.4        234.0      1085.9         1.89x
10000    4096     1114.5       241.2      2155.0         1.93x
10000    8192     2211.0       243.2      4271.8         1.93x
10000    16384    4383.9       245.3      8543.4         1.95x
10000    32768    8731.6       246.3      17083.9        1.96x
15000    1536     647.4        233.5      1222.5         1.89x
15000    2048     844.4        238.7      1624.1         1.92x
15000    4096     1658.1       243.2      3213.1         1.94x
15000    8192     3308.3       243.8      6386.1         1.93x
15000    16384    6617.7       243.7      12801.8        1.93x
15000    32768    13095.8      246.3      25608.7        1.96x
16384    1536     704.2        234.5      1334.6         1.90x
16384    2048     919.3        239.5      1770.6         1.93x
16384    4096     1808.9       243.5      3503.9         1.94x
16384    8192     3614.8       243.7      6973.8         1.93x
16384    16384    7226.4       243.8      13986.4        1.94x
16384    32768    14297.8      246.4      27987.6        1.96x
25000    1536     1052.4       239.5      2018.4         1.92x
25000    2048     1386.4       242.4      2684.0         1.94x
25000    4096     2762.9       243.2      5330.0         1.93x
25000    8192     5536.6       242.7      10631.8        1.92x
25000    16384    10945.7      245.6      21319.0        1.95x
25000    32768    21833.3      246.2      42743.6        1.96x
32768    1536     1370.3       241.0      2633.0         1.92x
32768    2048     1808.4       243.5      3504.4         1.94x
32768    4096     3618.5       243.4      6969.7         1.93x
32768    8192     7241.6       243.3      13918.3        1.92x
32768    16384    14338.8      245.7      27938.3        1.95x
32768    32768    28589.0      246.5      56149.8        1.96x
60000    1536     2501.4       241.8      4794.2         1.92x
60000    2048     3319.5       242.9      6394.4         1.93x
60000    4096     6638.5       242.9      12742.4        1.92x
60000    8192     13147.8      245.3      25469.5        1.94x
60000    16384    26240.6      245.8      51303.3        1.96x
60000    32768    52520.3      245.7      104389.8       1.99x
65536    1536     2728.9       242.1      5239.1         1.92x
65536    2048     3629.8       242.7      6982.8         1.92x
65536    4096     7251.4       242.9      13921.3        1.92x
65536    8192     14374.9      245.1      27819.0        1.94x
65536    16384    28672.4      245.8      56087.0        1.96x
65536    32768    58463.5      241.1      112817.2       1.93x

================================================================================
Geomean speedup vs Unfused (add + rmsnorm + fp4_quantize): 1.81x
================================================================================
Benchmark Complete
================================================================================

After fix:

================================================================================
Fused Add + RMSNorm + FP4 Quantization Benchmark
================================================================================
GPU Compute Capability: SM121

Running sanity check...
  OK: (128, 256) - FP4 match 99.8%
  OK: (512, 1024) - FP4 match 99.8%
  OK: (1024, 2048) - FP4 match 99.9%
✓ Confirmed: CuTe-DSL output is equivalent to torch.add + RMSNorm + fp4_quantize


Batch    Hidden   Fused (µs)   BW (GB/s)  Unfused (µs)   Speedup
-------------------------------------------------------------------
1000     1536     62.9         160.4      85.6           1.36x
1000     2048     80.8         166.3      104.5          1.29x
1000     4096     138.0        194.9      186.2          1.35x
1000     8192     245.5        219.0      390.5          1.59x
1000     16384    470.0        228.8      881.9          1.88x
1000     32768    915.2        235.0      1716.3         1.88x
1024     1536     64.2         160.8      89.7           1.40x
1024     2048     83.2         165.4      107.5          1.29x
1024     4096     139.7        197.1      189.6          1.36x
1024     8192     252.1        218.4      386.7          1.53x
1024     16384    486.5        226.4      908.2          1.87x
1024     32768    935.4        235.5      1797.7         1.92x
2048     1536     110.0        187.6      149.1          1.36x
2048     2048     139.5        197.4      190.0          1.36x
2048     4096     252.4        218.1      385.2          1.53x
2048     8192     480.5        229.2      892.8          1.86x
2048     16384    929.6        236.9      1788.1         1.92x
2048     32768    1814.9       242.7      3545.0         1.95x
3000     1536     149.9        201.7      206.3          1.38x
3000     2048     190.0        212.2      267.7          1.41x
3000     4096     362.2        222.6      637.6          1.76x
3000     8192     693.0        232.8      1334.8         1.93x
3000     16384    1341.1       240.5      2595.8         1.94x
3000     32768    2658.8       242.7      5161.1         1.94x
4096     1536     195.6        211.1      274.0          1.40x
4096     2048     251.8        218.6      386.8          1.54x
4096     4096     482.1        228.4      894.5          1.86x
4096     8192     924.9        238.1      1787.3         1.93x
4096     16384    1824.8       241.4      3532.8         1.94x
4096     32768    3602.2       244.5      7035.5         1.95x
5000     1536     233.3        216.1      344.9          1.48x
5000     2048     303.0        221.8      527.9          1.74x
5000     4096     584.7        229.9      1102.7         1.89x
5000     8192     1123.2       239.3      2165.0         1.93x
5000     16384    2221.9       242.0      4300.3         1.94x
5000     32768    4403.5       244.2      8572.8         1.95x
8192     1536     372.1        221.9      662.4          1.78x
8192     2048     481.9        228.5      901.8          1.87x
8192     4096     925.7        237.9      1779.4         1.92x
8192     8192     1822.0       241.7      3519.7         1.93x
8192     16384    3626.7       242.9      7008.1         1.93x
8192     32768    7159.7       246.1      14022.7        1.96x
10000    1536     445.7        226.2      819.9          1.84x
10000    2048     584.4        230.0      1101.6         1.89x
10000    4096     1124.0       239.1      2160.8         1.92x
10000    8192     2218.4       242.3      4269.1         1.92x
10000    16384    4395.3       244.6      8524.9         1.94x
10000    32768    8748.4       245.8      17103.9        1.96x
15000    1536     659.9        229.1      1238.7         1.88x
15000    2048     850.7        237.0      1630.9         1.92x
15000    4096     1665.7       242.1      3211.8         1.93x
15000    8192     3315.9       243.2      6382.4         1.92x
15000    16384    6643.9       242.8      12823.4        1.93x
15000    32768    13109.2      246.1      25621.6        1.95x
16384    1536     719.1        229.7      1351.4         1.88x
16384    2048     926.6        237.6      1779.6         1.92x
16384    4096     1822.2       241.7      3513.1         1.93x
16384    8192     3619.4       243.4      6974.8         1.93x
16384    16384    7251.6       242.9      14006.8        1.93x
16384    32768    14325.1      246.0      28026.6        1.96x
25000    1536     1062.5       237.2      2018.6         1.90x
25000    2048     1389.3       241.8      2680.0         1.93x
25000    4096     2759.3       243.5      5307.9         1.92x
25000    8192     5547.2       242.3      10649.2        1.92x
25000    16384    10941.5      245.7      21337.7        1.95x
25000    32768    21873.0      245.8      42821.0        1.96x
32768    1536     1375.9       240.1      2633.8         1.91x
32768    2048     1823.4       241.5      3510.1         1.93x
32768    4096     3621.2       243.2      6965.6         1.92x
32768    8192     7256.3       242.8      13949.6        1.92x
32768    16384    14367.3      245.2      27951.5        1.95x
32768    32768    28648.4      246.0      56170.2        1.96x
60000    1536     2512.5       240.7      4793.6         1.91x
60000    2048     3324.2       242.6      6384.4         1.92x
60000    4096     6638.6       242.9      12763.3        1.92x
60000    8192     13150.1      245.3      25480.6        1.94x
60000    16384    26252.8      245.7      51313.5        1.95x
60000    32768    52637.1      245.1      104030.7       1.98x
65536    1536     2747.8       240.4      5235.4         1.91x
65536    2048     3627.7       242.8      6972.2         1.92x
65536    4096     7248.8       243.0      13936.3        1.92x
65536    8192     14379.9      245.0      27839.8        1.94x
65536    16384    28683.6      245.7      56139.5        1.96x
65536    32768    58784.9      239.7      113206.8       1.93x

================================================================================
Geomean speedup vs Unfused (add + rmsnorm + fp4_quantize): 1.81x
================================================================================
Benchmark Complete
================================================================================

Collaborator

@bkryu bkryu left a comment


LGTM. Let's wait to merge until the internal results come back, although I don't expect any failures.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #44234479: 16/20 passed

@kahyunnam
Collaborator Author

[FAILED] Pipeline #44234479: 16/20 passed

B300 only failed due to exceeding the time limit, so I'm considering this passing.

@kahyunnam kahyunnam merged commit 5c48408 into flashinfer-ai:main Feb 18, 2026
27 of 40 checks passed
@kahyunnam kahyunnam deleted the knam/spark-unit-test-fix branch February 18, 2026 17:53
