
refactor: simplify fp4 rmsnorm#2421

Merged
yzh119 merged 1 commit into flashinfer-ai:main from yzh119:refactor-fp4-norm
Jan 27, 2026

Conversation

@yzh119
Collaborator

@yzh119 yzh119 commented Jan 27, 2026

📌 Description

Remove repetitive patterns in the CuTe-DSL-based FP4 RMSNorm code.

More specifically:

  • Use cute.make_rmem_tensor to create a register array instead of explicitly creating one register per element, and use a loop with cutlass.range_constexpr for the elementwise operations.
  • Put common utility functions in fp4_common.py.

Benchmarks show no performance degradation.
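The refactor described above swaps per-element named registers for a single register array plus an unrolled loop. A plain-Python analogue of the pattern (function names here are hypothetical; the real kernels use cute.make_rmem_tensor for the array and cutlass.range_constexpr for the compile-time-unrolled loop):

```python
# "Before" style: one explicitly named register per element (the
# repetition this PR removes).
def scale_8_explicit(x, s):
    r0 = x[0] * s
    r1 = x[1] * s
    r2 = x[2] * s
    r3 = x[3] * s
    r4 = x[4] * s
    r5 = x[5] * s
    r6 = x[6] * s
    r7 = x[7] * s
    return [r0, r1, r2, r3, r4, r5, r6, r7]

# "After" style: a register array plus an unrolled loop, standing in
# for cute.make_rmem_tensor + a cutlass.range_constexpr loop.
def scale_8_loop(x, s):
    r = [0.0] * 8
    for i in range(8):
        r[i] = x[i] * s
    return r
```

Both forms produce identical results; the loop form is what lets the kernel body scale to wider tiles without copy-paste.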

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

cc @bkryu

Summary by CodeRabbit

Refactor

  • Reorganized internal quantization utilities into a shared module to improve code maintainability and reduce duplication. All public APIs remain unchanged and fully compatible.


@coderabbitai
Contributor

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough


Consolidates FP4 quantization utilities and CuTe-DSL intrinsics into a new shared fp4_common.py module, then refactors add_rmsnorm_fp4quant.py and rmsnorm_fp4quant.py to import and reuse these utilities instead of maintaining duplicated inline definitions, reducing code duplication while preserving public APIs.

Changes

  • New FP4 utilities module — flashinfer/cute_dsl/fp4_common.py: introduces roughly 40 new public functions: architecture utilities (get_sm_version), PTX intrinsics (set_block_rank, store_shared_remote, elem_pointer), global memory ops (ld_global_v4_u32, st_global_u64), math intrinsics (rcp_approx_ftz, fmin_f32, fmax_f32), half2/bfloat2 SIMD ops, FP8/E4M3/UE8M0 conversions, reduction utilities (warp, block, cluster, row-level), and SF-block processing helpers for quantization workflows.
  • Kernel refactoring to use fp4_common — flashinfer/cute_dsl/add_rmsnorm_fp4quant.py, flashinfer/cute_dsl/rmsnorm_fp4quant.py: both files replace local in-file PTX/DSL definitions and helper implementations with imports from .fp4_common, delegating memory loads, SIMD operations, reductions, and quantization packing to shared utilities. Public APIs (AddRMSNormFP4QuantKernel/RMSNormFP4QuantKernel, add_rmsnorm_fp4quant/rmsnorm_fp4quant, get_sm_version) remain unchanged; internal control flow is restructured around the shared helpers.
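As a rough model of the row/warp-level max-abs reductions listed among those utilities (plain Python standing in for shuffle-based PTX intrinsics; the lane count and offsets here are illustrative, not taken from the actual kernels):

```python
def tree_reduce_max_abs(vals):
    """Model of a warp-style tree reduction: at each step a lane
    combines with the lane `offset` positions away, halving the active
    width until lane 0 holds the row's max-abs value.
    `vals` length must be a power of two."""
    lanes = [abs(v) for v in vals]
    offset = len(lanes) // 2
    while offset >= 1:
        for i in range(offset):
            lanes[i] = max(lanes[i], lanes[i + offset])
        offset //= 2
    return lanes[0]
```

In the actual kernels the per-lane partial maxima travel via warp shuffle intrinsics rather than a Python list, and the block/cluster-level reductions layer shared-memory exchanges on top of this warp-level step.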

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested reviewers

  • kaixih
  • aleozlx
  • bkryu
  • jimmyzho


🚥 Pre-merge checks — ✅ 3 passed

  • Title check — ✅ Passed. The title accurately summarizes the main refactoring objective: simplifying the FP4 RMSNorm implementation by extracting common utilities and reducing code repetition.
  • Description check — ✅ Passed. The description is mostly complete, covering the changes made (moving utilities to fp4_common.py, using cute.make_rmem_tensor), including performance benchmark notes, and following the template structure. Minor: the Related Issues section is not filled in, but this is non-critical.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


@gemini-code-assist
Contributor

Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the maintainability and readability of the FP4 RMSNorm implementation by consolidating shared code into a new utility file and abstracting repetitive patterns within the main kernel logic. The refactoring aims to simplify the codebase without affecting the existing performance characteristics.

Highlights

  • Code Centralization: Common utility functions, PTX intrinsics, and reduction logic for FP4 quantization kernels have been moved from add_rmsnorm_fp4quant.py and rmsnorm_fp4quant.py into a new shared module, fp4_common.py.
  • Reduced Repetition: The kernel functions in both add_rmsnorm_fp4quant.py and rmsnorm_fp4quant.py have been refactored to use new helper functions (e.g., load_8_half2, half2_mul_8, quantize_and_pack_16) instead of explicit, repetitive code for element-wise operations and register array creation.
  • Performance Preservation: Benchmarks confirm that these refactoring changes do not introduce any performance degradation.
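To give a feel for what a quantize_and_pack_16-style helper produces, here is a hedged Python sketch of packing sixteen 4-bit codes into one 64-bit word (the layout is hypothetical; the real helper emits FP4-encoded values via CuTe-DSL intrinsics, and its bit order may differ):

```python
def pack_16_nibbles(codes):
    """Pack sixteen 4-bit codes (each 0..15) into a 64-bit integer,
    with code i occupying bits [4*i, 4*i + 3]."""
    packed = 0
    for i, c in enumerate(codes):
        packed |= (c & 0xF) << (4 * i)
    return packed

def unpack_16_nibbles(packed):
    """Inverse of pack_16_nibbles, for checking the round trip."""
    return [(packed >> (4 * i)) & 0xF for i in range(16)]
```

Packing 16 quantized values per store is what lets the kernel write its FP4 output with a single wide global-memory transaction instead of 16 narrow ones.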





@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the FP4 RMSNorm implementation by extracting common utility functions and repetitive code patterns into a new fp4_common.py module. The changes simplify add_rmsnorm_fp4quant.py and rmsnorm_fp4quant.py, making them more readable and maintainable. Helper functions such as load_8_half2, compute_y_and_max_abs_f32, and quantize_and_pack_16 reduce code duplication and improve modularity, in line with the stated objective of simplifying the code. Replacing explicit register assignments and manual element-wise operations with these helpers is a clear improvement, and benchmarks confirm no performance degradation, which is crucial for such low-level kernels.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/cute_dsl/add_rmsnorm_fp4quant.py (1)

1134-1141: The docstring promises in-place modification of residual, but this is not guaranteed for 3D inputs.

When input.dim() == 3 and the residual tensor is not contiguous in row-major layout, residual.view(B * S, H).contiguous() creates a copy. The kernel modifies this copy, not the original tensor. Since the function returns only (y_fp4, block_scale) and not the residual, the caller has no way to access the modified value.

To fix:

  1. Update the docstring to clarify that in-place modification only works for 2D inputs or pre-contiguous 3D inputs
  2. Or, reshape without calling .contiguous() (e.g., residual.reshape(B * S, H) when possible), then handle contiguity at the kernel call site
  3. Or, for 3D inputs, copy the result back: residual.copy_(residual_2d.view(B, S, H)) after kernel execution
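The view-versus-copy behavior this comment describes can be modeled in plain Python. Here maybe_contiguous is a hypothetical stand-in for residual.view(B * S, H).contiguous(), which returns the same storage when the view is already row-major contiguous and a fresh copy otherwise:

```python
def maybe_contiguous(t, is_contiguous):
    """Hypothetical stand-in for `view(...).contiguous()`: returns the
    same object when already contiguous, otherwise a fresh copy."""
    return t if is_contiguous else list(t)

residual = [1.0, 2.0, 3.0]

# Contiguous case: the "kernel" writes through to the caller's tensor.
r2d = maybe_contiguous(residual, is_contiguous=True)
r2d[0] = 9.0
assert residual[0] == 9.0

# Non-contiguous 3D case: writes land on the copy and are silently
# lost, which is the bug this review comment describes.
residual = [1.0, 2.0, 3.0]
r2d = maybe_contiguous(residual, is_contiguous=False)
r2d[0] = 9.0
assert residual[0] == 1.0
```

This is why option 3 above (copying the kernel's result back into the original tensor) restores the documented in-place semantics for non-contiguous inputs.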

@yzh119
Collaborator Author

yzh119 commented Jan 27, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !268 has been created, and the CI pipeline #42635293 is currently running. I'll report back once the pipeline job completes.

Collaborator

@bkryu bkryu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit tests are all passing for relevant Blackwell GPUs on SM100/103/120

Thanks @yzh119, this was a much needed cleanup of the initial version of the kernel

@yzh119 yzh119 merged commit 67fc0a1 into flashinfer-ai:main Jan 27, 2026
24 checks passed