Skip to content

fix: include fp8_blockscale_gemm_90 in AOT jit-cache#2533

Merged
yongwww merged 1 commit intoflashinfer-ai:mainfrom
Edward-lyz:main
Feb 13, 2026
Merged

fix: include fp8_blockscale_gemm_90 in AOT jit-cache#2533
yongwww merged 1 commit intoflashinfer-ai:mainfrom
Edward-lyz:main

Conversation

@Edward-lyz
Copy link
Contributor

@Edward-lyz Edward-lyz commented Feb 10, 2026

Summary

  • Add fp8_blockscale_gemm_90 (gen_fp8_blockscale_gemm_sm90_module) to the AOT build list when SM90 is enabled.
  • Avoid runtime JIT compilation for fp8_blockscale_gemm_sm90 in environments without CUDA dev headers, which can fail with cublasLt.h not found.

Changes

  • flashinfer/aot.py: append gen_fp8_blockscale_gemm_sm90_module() under add_moe + has_sm90 gating.

Related Issues

Tests

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • [] All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features
    • Added support for FP8 blockscale matrix multiplication operations on SM90 GPU architecture.

## Summary
- Add fp8_blockscale_gemm_90 (gen_fp8_blockscale_gemm_sm90_module) to the AOT build list when SM90 is enabled.
- Avoid runtime JIT compilation for fp8_blockscale_gemm_sm90 in environments without CUDA dev headers, which can fail with cublasLt.h not found.

## Changes
- flashinfer/aot.py: append gen_fp8_blockscale_gemm_sm90_module() under add_moe + has_sm90 gating.

## Related Issues
- Fixes flashinfer-ai#2527
- flashinfer-ai#2527

## Tests
- Not run locally (no SM90 GPU in this environment).
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @Edward-lyz, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the robustness of the FlashInfer library by ensuring that the fp8_blockscale_gemm_sm90 module is pre-compiled for SM90-enabled systems. This change eliminates the need for runtime JIT compilation of this specific module, thereby mitigating issues related to missing CUDA development headers and improving overall stability and deployment predictability.

Highlights

  • AOT Compilation: Included the fp8_blockscale_gemm_90 module in the Ahead-Of-Time (AOT) build list for SM90 architectures.
  • Runtime Stability: Prevented potential runtime JIT compilation failures for fp8_blockscale_gemm_sm90 in environments lacking CUDA development headers, specifically addressing cublasLt.h not found errors.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • flashinfer/aot.py
    • Imported gen_fp8_blockscale_gemm_sm90_module from flashinfer.jit.gemm.
    • Appended gen_fp8_blockscale_gemm_sm90_module() to the jit_specs list within the gen_all_modules function, specifically under the has_sm90 conditional block.
Activity
  • Pre-commit checks were installed and run successfully.
  • Tests have been added or updated as needed, and all tests are passing.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough

Walkthrough

Added an import and invocation of gen_fp8_blockscale_gemm_sm90_module() to the AOT module generation pipeline in the MOE build path. This ensures FP8 blockscale GEMM kernels for SM90 are included in the precompiled binary cache.

Changes

Cohort / File(s) Summary
FP8 Blockscale GEMM SM90 AOT Integration
flashinfer/aot.py
Added import of gen_fp8_blockscale_gemm_sm90_module and appended its invocation to the JIT specs generation under the MOE/SM90 branch to ensure precompiled kernels are available.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

Suggested reviewers

  • yzh119
  • cyx-6
  • bkryu
  • nvmbreughe

Poem

🐰✨ A blockscale gem, so FP8 and fine,
SM90's kernel joins the cache divine,
No more JIT when binaries align,
The rabbit hops—another build refine! 🔧

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: including fp8_blockscale_gemm_90 in the AOT jit-cache.
Description check ✅ Passed The PR description addresses the core issue with good summary and rationale, though the template structure is partially duplicated.
Linked Issues check ✅ Passed The code change successfully addresses the primary objective from issue #2527 by adding fp8_blockscale_gemm_sm90_module to the AOT build list.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the stated objective: adding fp8_blockscale_gemm_sm90_module to AOT compilation in flashinfer/aot.py.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

No actionable comments were generated in the recent review. 🎉


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly adds the fp8_blockscale_gemm_90 module to the Ahead-of-Time (AOT) compilation list. This change will prevent runtime JIT compilation failures in environments lacking CUDA development headers, which is a valuable improvement. The implementation is straightforward and correctly places the new module under the add_moe and has_sm90 flags, which is consistent with how other GEMM kernels are handled in the project. The changes look good and address the intended issue effectively.

@yongwww
Copy link
Member

yongwww commented Feb 12, 2026

@flashinfer-bot run

Copy link
Collaborator

@yzh119 yzh119 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this fix!

@aleozlx aleozlx self-assigned this Feb 12, 2026
@aleozlx aleozlx added the v0.6.4 release blocker label for v0.6.4 label Feb 12, 2026
@aleozlx
Copy link
Collaborator

aleozlx commented Feb 12, 2026

public ci seems still not started

@aleozlx
Copy link
Collaborator

aleozlx commented Feb 12, 2026

@flashinfer-bot run

@aleozlx aleozlx removed their assignment Feb 12, 2026
@yongwww yongwww merged commit 292f9be into flashinfer-ai:main Feb 13, 2026
34 of 50 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Feb 27, 2026
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-ci v0.6.4 release blocker label for v0.6.4

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug][cubin] Binaries for fp8_blockscale_gemm_sm90 not present in flashinfer-cubin and flashinfer-jit-cache

5 participants