[API change] Allow using torch.Tensor for scales for trtllm-gen attention#2084
Conversation
Signed-off-by: Siyuan Fu <[email protected]>
Walkthrough

Adds tensor-or-scalar support for attention scaling across the Python APIs and C++ FMHA launchers using `tvm::ffi::Variant`; accepts device-resident scale tensors (passed as float pointers) or host scalars, applies on-device log2e scaling for tensor inputs, and updates the FMHA cubin artifact path and checksum constants.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Py as Python API
    participant Bind as FFI Binder (C++)
    participant Launcher as trtllm_*_launcher
    participant Runner as FMHA Runner
    Py->>Bind: call API with bmm1_scale, bmm2_scale (float or Tensor)
    alt Tensor inputs
        note right of Bind: assert dtype float32<br/>compute tensor * log2e on device
        Bind->>Launcher: Variant(tensor) + provide bmm1_scale_log2_ptr & bmm2_scale_ptr
    else Scalar inputs
        note right of Bind: keep/convert as double
        Bind->>Launcher: Variant(double) + pass nullptr for scale pointers
    end
    Launcher->>Runner: set runner_params.scaleSoftmaxLog2Ptr & outputScalePtr
    Runner->>Runner: execute FMHA using pointer scales if present else scalar scales
```
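The tensor-vs-scalar dispatch in the diagram can be sketched in Python. This is an illustrative reconstruction, not the actual binding code; the helper name `resolve_bmm1` and its return convention are invented for this sketch.

```python
import math

import torch

log2e = math.log2(math.e)


def resolve_bmm1(bmm1_scale):
    """Illustrative sketch of the tensor-vs-scalar dispatch above.

    Returns (device_tensor_or_None, host_scalar_or_None); exactly one side
    is populated, mirroring the Variant(tensor)/Variant(double) split.
    """
    if isinstance(bmm1_scale, torch.Tensor):
        # Tensor path: must be float32; apply log2e out of place.
        assert bmm1_scale.dtype == torch.float32
        return bmm1_scale * log2e, None
    # Scalar path: keep as a host double; the C++ side would receive
    # nullptr for the pointer-backed scale.
    return None, float(bmm1_scale)
```

For a scalar input the pointer side stays empty; for a float32 tensor the host-scalar side stays empty and the tensor is returned log2e-scaled.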
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ❌ failed checks (1 warning), ✅ passed checks (2 passed).
Signed-off-by: Siyuan Fu <[email protected]>
/bot run

@IwakuraRein is not authorized to trigger this CI job. cc: @yzh119, @sricketts, @yongwww

/bot run

Signed-off-by: Siyuan Fu <[email protected]>

[CANCELING] Pipeline #38436074: canceled
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
flashinfer/decode.py (1)
1883-1901: In-place scale multiply causes drift

Same issue here: `bmm1_scale *= log2e` updates the caller's tensor. If the caller caches that buffer (common in decode loops), the scaling compounds every step. Please switch to a non-in-place multiply or clone first. (docs.pytorch.org)
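The drift is easy to demonstrate; the loop below stands in for repeated decode steps reusing a cached scale buffer (the variable names are illustrative):

```python
import math

import torch

log2e = math.log2(math.e)

# Caller's cached scale buffer, reused across decode steps.
cached_scale = torch.tensor([0.125], dtype=torch.float32)

# In-place multiply inside the API compounds on every call:
for _ in range(3):
    cached_scale *= log2e
# cached_scale now holds 0.125 * log2e**3 instead of the intended 0.125.

# Out-of-place multiply leaves the caller's buffer untouched:
fresh = torch.tensor([0.125], dtype=torch.float32)
for _ in range(3):
    scaled = fresh * log2e  # new tensor each call; `fresh` stays 0.125
```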
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- csrc/trtllm_fmha_kernel_launcher.cu (13 hunks)
- flashinfer/artifacts.py (2 hunks)
- flashinfer/decode.py (13 hunks)
- flashinfer/prefill.py (9 hunks)
- include/flashinfer/trtllm/fmha/kernelParams.h (0 hunks)
💤 Files with no reviewable changes (1)
- include/flashinfer/trtllm/fmha/kernelParams.h
```cpp
  auto maybe_bmm2_scale_value = bmm2_scale.as<double>();
  auto maybe_bmm1_scale_log2_tensor = bmm1_scale.as<ffi::Tensor>();
  auto maybe_bmm2_scale_tensor = bmm2_scale.as<ffi::Tensor>();
  TVM_FFI_CHECK(maybe_bmm1_scale_value.has_value() || maybe_bmm1_scale_log2_tensor.has_value(),
                "bmm1_scale must be either a double or a tensor");
  TVM_FFI_CHECK(maybe_bmm2_scale_value.has_value() || maybe_bmm2_scale_tensor.has_value(),
                "bmm2_scale must be either a double or a tensor");
  double bmm1_scale_value =
      maybe_bmm1_scale_value.has_value() ? maybe_bmm1_scale_value.value() : 1.0;
  double bmm2_scale_value =
      maybe_bmm2_scale_value.has_value() ? maybe_bmm2_scale_value.value() : 1.0;
  float* bmm1_scale_log2_ptr =
      maybe_bmm1_scale_log2_tensor.has_value()
          ? static_cast<float*>(maybe_bmm1_scale_log2_tensor.value().data_ptr())
          : nullptr;
  float* bmm2_scale_ptr = maybe_bmm2_scale_tensor.has_value()
                              ? static_cast<float*>(maybe_bmm2_scale_tensor.value().data_ptr())
                              : nullptr;
```
Guard tensor-based scales with dtype checks
When bmm*_scale comes in as a tensor, we immediately reinterpret the storage as float*. Callers can legally hand us torch.Float16/torch.BFloat16 today, so this reinterpret cast will read garbage and corrupt the softmax/output scales. Please gate the tensor branch with a dtype == dl_float32 check (and emit a clear error otherwise) before taking the pointer, and apply the same fix in the context and ragged code paths.
```diff
@@
-  float* bmm1_scale_log2_ptr =
-      maybe_bmm1_scale_log2_tensor.has_value()
-          ? static_cast<float*>(maybe_bmm1_scale_log2_tensor.value().data_ptr())
-          : nullptr;
-  float* bmm2_scale_ptr = maybe_bmm2_scale_tensor.has_value()
-                              ? static_cast<float*>(maybe_bmm2_scale_tensor.value().data_ptr())
-                              : nullptr;
+  float* bmm1_scale_log2_ptr = nullptr;
+  if (maybe_bmm1_scale_log2_tensor.has_value()) {
+    TVM_FFI_ICHECK_EQ(maybe_bmm1_scale_log2_tensor.value().dtype(), dl_float32)
+        << "bmm1_scale tensor must be float32";
+    bmm1_scale_log2_ptr =
+        static_cast<float*>(maybe_bmm1_scale_log2_tensor.value().data_ptr());
+  }
+  float* bmm2_scale_ptr = nullptr;
+  if (maybe_bmm2_scale_tensor.has_value()) {
+    TVM_FFI_ICHECK_EQ(maybe_bmm2_scale_tensor.value().dtype(), dl_float32)
+        << "bmm2_scale tensor must be float32";
+    bmm2_scale_ptr =
+        static_cast<float*>(maybe_bmm2_scale_tensor.value().data_ptr());
+  }
```

Please mirror this guard in trtllm_paged_attention_context and trtllm_ragged_attention.
Also applies to: 338-356, 503-521
🤖 Prompt for AI Agents
csrc/trtllm_fmha_kernel_launcher.cu lines 260-277: when bmm1_scale or bmm2_scale
is a tensor the code currently reinterpret_casts data_ptr() to float* without
checking dtype which will misread half/bfloat tensors; modify the tensor branch
to first check the tensor dtype is float32 (dl_float32) and TVM_FFI_CHECK/throw
a clear error if not, then take the data_ptr() as float*; apply the identical
dtype-guard and error message to the similar blocks at lines 338-356 and 503-521
and also mirror these dtype guards in the corresponding
trtllm_paged_attention_context and trtllm_ragged_attention code paths.
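Why the unguarded reinterpret cast reads garbage can be shown on the host with NumPy; the values here are arbitrary examples, not taken from the kernel:

```python
import numpy as np

# Two float16 scales occupy the same 4 bytes as a single float32.
scales_fp16 = np.array([1.5, 2.0], dtype=np.float16)

# Reinterpreting the raw bytes as float32 (what the unguarded
# static_cast<float*> effectively does) yields an unrelated value.
reinterpreted = scales_fp16.view(np.float32)

# reinterpreted[0] is nowhere near 1.5, so the softmax/output scale
# would be silently corrupted without the dtype check.
```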
```python
        if isinstance(bmm1_scale, torch.Tensor):
            assert bmm1_scale.dtype == torch.float32
            bmm1_scale *= log2e
        if isinstance(bmm2_scale, torch.Tensor):
            assert bmm2_scale.dtype == torch.float32
```
Don’t mutate caller tensors when applying log2e
Same issue here: `bmm1_scale *= log2e` alters the input tensor in place, so repeated invocations accumulate the scaling and yield incorrect scales. Use an out-of-place multiply before launching the kernel.
```diff
-        if isinstance(bmm1_scale, torch.Tensor):
-            assert bmm1_scale.dtype == torch.float32
-            bmm1_scale *= log2e
+        if isinstance(bmm1_scale, torch.Tensor):
+            assert bmm1_scale.dtype == torch.float32
+            bmm1_scale = bmm1_scale * log2e
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        if isinstance(bmm1_scale, torch.Tensor):
            assert bmm1_scale.dtype == torch.float32
            bmm1_scale = bmm1_scale * log2e
        if isinstance(bmm2_scale, torch.Tensor):
            assert bmm2_scale.dtype == torch.float32
```
🤖 Prompt for AI Agents
In flashinfer/decode.py around lines 2296 to 2301, the code currently does an
in-place scale (bmm1_scale *= log2e) which mutates the caller's tensor; change
this to an out-of-place multiplication and reassign the result to bmm1_scale
(for example use torch.mul or the * operator) so a new tensor is produced on the
same dtype/device and the original caller tensor is not modified; ensure the
result remains float32 and mirrored onto the correct device; also review nearby
bmm2_scale handling and apply the same out-of-place pattern if it will be scaled
later.
/bot run

Signed-off-by: Siyuan Fu <[email protected]>

[CANCELING] Pipeline #38436713: canceled

/bot run

@IwakuraRein is not authorized to trigger this CI job. cc: @yzh119, @sricketts, @yongwww

/bot run

Signed-off-by: Siyuan Fu <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/attention/test_trtllm_gen_attention.py (1)
490-499: Scale conversion logic is correct.

The conditional conversion between tensor and scalar forms based on `device_scale` correctly handles all cases:

- Tensor → scalar when `device_scale=False`
- Scalar → tensor when `device_scale=True`
- Preserves the existing type when already in the desired form

This ensures comprehensive test coverage across both FP8 (where scales may be tensors) and FP16/BF16 (where scales are scalars) data types.

Consider extracting this conversion logic to a helper function to reduce duplication with the identical logic at lines 798-807:

```python
def convert_scale_for_test(scale, device_scale, device):
    """Convert scale between tensor and scalar based on test parameter."""
    if isinstance(scale, torch.Tensor) and not device_scale:
        return scale.item()
    elif not isinstance(scale, torch.Tensor) and device_scale:
        return torch.tensor(scale, device=device, dtype=torch.float32)
    return scale
```

Then use:

```python
bmm1_scale = convert_scale_for_test(q_scale * k_scale * sm_scale, device_scale, GPU_DEVICE)
bmm2_scale = convert_scale_for_test(v_scale / o_scale, device_scale, GPU_DEVICE)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/attention/test_trtllm_gen_attention.py (13 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/attention/test_trtllm_gen_attention.py (1)
flashinfer/decode.py (1)
trtllm_batch_decode_with_kv_cache (2061-2335)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (3)
tests/attention/test_trtllm_gen_attention.py (3)
374-374: Excellent test coverage expansion.

Adding `device_scale` as a test parameter ensures both device-resident tensor scales and host scalar scales are validated across all test configurations. This comprehensive approach aligns well with the PR's goal of supporting `Union[float, torch.Tensor]` for scale parameters.
1128-1128: Forward-looking test design.

Setting `device_scale=True` in this expected-to-fail test ensures that when the head_dim=256 issue is resolved, the test will immediately validate tensor scale support for that configuration. This is good defensive test engineering.
969-969: LGTM: Consistent parameter threading.

The `device_scale` parameter is correctly added to the test matrix and properly propagated through the function call hierarchy. The default value of `False` in `_test_trtllm_batch_decode` (line 650) provides a sensible default for focused test cases.

Also applies to: 650-650, 986-986, 1005-1005
/bot run

[FAILED] Pipeline #38539973: 6/17 passed
Signed-off-by: Siyuan Fu <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/attention/test_trtllm_gen_attention.py (1)
830-839: Same conversion logic as the prefill path (see comment on lines 457-466).

This conversion logic is identical to the prefill path. Consider using the helper function suggested in the previous comment to eliminate this duplication.
🧹 Nitpick comments (2)
tests/attention/test_trtllm_gen_attention.py (2)
457-466: Consider extracting the scale conversion logic into a helper function.

The conversion logic between tensor and scalar forms for `bmm1_scale` and `bmm2_scale` is duplicated in both the prefill path (lines 457-466) and the decode path (lines 830-839). Extracting this into a helper function would reduce code duplication and improve maintainability.

Example helper function:

```python
def convert_scale_form(scale, device_scale: bool, device):
    """Convert scale between tensor and scalar forms based on device_scale flag."""
    if isinstance(scale, torch.Tensor) and not device_scale:
        return scale.item()
    elif not isinstance(scale, torch.Tensor) and device_scale:
        return torch.tensor(scale, device=device, dtype=torch.float32)
    return scale
```

Then use it as:

```python
bmm1_scale = convert_scale_form(q_scale * k_scale * sm_scale, device_scale, GPU_DEVICE)
bmm2_scale = convert_scale_form(v_scale / o_scale, device_scale, GPU_DEVICE)
```
611-612: Consider explicitly parametrizing device_scale for more comprehensive test coverage.

The test passes `kv_dtype == "fp8"` as the `device_scale` argument, which means:

- fp8 tests always use device-side scales
- non-fp8 tests always use host-side scales

While this aligns with the natural scale types (fp8 quantization produces tensor scales), it limits test coverage. The specialized tests (`test_trtllm_batch_decode_bs1`, `test_trtllm_batch_decode_head_dim_256`, `test_trtllm_batch_decode_long_sequence_length`) parametrize `device_scale` explicitly with `[True, False]`, providing better coverage of both code paths across all dtypes.

For consistency and completeness, consider whether this test should also parametrize `device_scale` explicitly, or document why the current approach is intentional.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/attention/test_trtllm_gen_attention.py (19 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/attention/test_trtllm_gen_attention.py (1)
flashinfer/decode.py (1)
trtllm_batch_decode_with_kv_cache (2061-2335)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (2)
tests/attention/test_trtllm_gen_attention.py (2)
1035-1035: Note: xqa backend converts device scales to host scalars internally.

When `backend="xqa"` and `kv_dtype="fp8"`, this test creates device-side tensor scales (`device_scale=True`), but the xqa backend immediately converts them back to host scalars (see flashinfer/decode.py lines 2158-2162). While this doesn't cause incorrect behavior, it does result in unnecessary tensor creation and conversion.

This is expected behavior since device-side scale support for xqa was removed in PR #2033 (as noted in the PR description's TODO). The test still validates that the conversion works correctly.
1057-1057: Good test coverage with explicit device_scale parametrization.

These specialized tests explicitly parametrize `device_scale` with `[True, False]`, providing comprehensive coverage of both device-side and host-side scale paths across different scenarios (bs1, head_dim=256, and long sequences). This is more thorough than the general test approach and ensures both code paths work correctly for all dtype combinations.

Since these tests use the trtllm-gen backend exclusively (which supports device scales), they avoid the unnecessary tensor-to-scalar conversions that would occur with the xqa backend.
Also applies to: 1126-1126, 1188-1188
/bot run

[SUCCESS] Pipeline #38646833: 10/18 passed
…tion (flashinfer-ai#2084)

- change `bmm1_scale` and `bmm2_scale` to `Union[float, torch.Tensor]`; notice that when using a tensor, log2e must be applied
- remove the `bmm1_scale_log2_tensor` and `bmm2_scale_tensor` in the `xqa_batch_decode_with_kv_cache_mla`
- update trtllm-gen FMHA kernels

TODO: do the same refactor for the xqa kernels. Support for device-side scales was removed in flashinfer-ai#2033

Summary by CodeRabbit:

- New Features: attention scale parameters now accept either floats or 1-element tensors across prefill, decode and runtime; tensor scales are validated and applied on-device, and pointer-backed scale paths are supported.
- Chores: updated FMHA artifact path and checksum constants; added a public utility import and removed an obsolete inline comment.
- Tests: updated tests to exercise device/tensor-or-scalar scale flows, removed legacy per-tensor call-site args, and added device-scale parametrization for several test variants.

Signed-off-by: Siyuan Fu <[email protected]>
📌 Description
- Change `bmm1_scale` and `bmm2_scale` to `Union[float, torch.Tensor]`. Note that when using a tensor, log2e must be applied.
- Remove `bmm1_scale_log2_tensor` and `bmm2_scale_tensor` in `xqa_batch_decode_with_kv_cache_mla`.

TODO: do the same refactor for the xqa kernels. Support for device-side scales was removed in #2033.
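A minimal sketch of how a caller might construct either accepted form of the scales. The helper name `make_scales` is invented for illustration; the per-tensor factors follow the combinations used in the tests (`q_scale * k_scale * sm_scale` and `v_scale / o_scale`):

```python
import torch


def make_scales(q_scale, k_scale, sm_scale, v_scale, o_scale, device=None):
    """Hypothetical helper: build bmm1/bmm2 scales as host floats, or as
    device-resident float32 tensors when a device is given."""
    bmm1 = q_scale * k_scale * sm_scale
    bmm2 = v_scale / o_scale
    if device is None:
        return bmm1, bmm2  # host-scalar path
    # Device-tensor path: must be float32 per the FFI-side dtype guard.
    return (
        torch.tensor(bmm1, dtype=torch.float32, device=device),
        torch.tensor(bmm2, dtype=torch.float32, device=device),
    )
```

Either return form can then be passed as `bmm1_scale`/`bmm2_scale` to the attention APIs.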
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Reviewer Notes