[Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" leading to gibberish #37054

Merged
MatthewBonanni merged 12 commits into vllm-project:main from andylolu2:andy/fi-bf16kv-bugfix
Mar 18, 2026

Conversation

@andylolu2 (Contributor) commented Mar 14, 2026

Purpose

This PR fixes the following issues:

  • FlashInfer + kv_cache_dtype "auto" generates gibberish when layer._[qkv]_scale != 1.0.
    • The bug: FlashInfer applies layer._[qkv]_scale unconditionally, even when the QKV values are unscaled bf16.
    • This affects both the normal and MLA attention paths.
  • KV cache scales are not properly handled when using MLA + fp8.
    • In MLA, the KV latents must use the same quantization scale for K and V, so only one of layer._k_scale or layer._v_scale should be used, not both. The current implementation sometimes assumes layer._k_scale is used, other times layer._v_scale or layer._k_scale * layer._v_scale, which is inconsistent and leads to bad generations.
    • In this PR I chose to use only layer._k_scale; layer._v_scale is completely ignored when using MLA.
  • The CUTLASS_MLA backend claims fp8 KV cache support but has no logic to handle the quantization scales, so this PR disables its fp8 support until that is implemented.
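The first fix boils down to gating scale application on the cache dtype. A minimal sketch of the idea (function and parameter names are hypothetical, not vLLM's actual API):

```python
def select_scales(
    q_scale: float, k_scale: float, v_scale: float, kv_cache_dtype: str
) -> tuple[float, float, float]:
    """Sketch of the fix: only forward the checkpoint's scales when the
    KV cache is actually fp8-quantized. Under kv_cache_dtype="auto" the
    cached values are unscaled bf16, so applying non-unit scales would
    rescale already-unscaled values and produce gibberish."""
    if kv_cache_dtype.startswith("fp8"):
        return q_scale, k_scale, v_scale
    # "auto": identity scales, regardless of what the checkpoint carries.
    return 1.0, 1.0, 1.0
```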

To summarize the situation of fp8 MLA scales:

  • q_scale -> Meant for quantizing q_mqa, not q_mha. q_mha currently has no corresponding scale and is naively cast to fp8 if use_fp8_prefill (code reference).
  • k_scale -> Meant for quantizing kv_latents, not k_mha or v_mha. k_mha and v_mha currently have no corresponding scales and are naively cast to fp8 if use_fp8_prefill (code reference).
  • v_scale -> Completely unused.
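To illustrate the single-scale convention the PR adopts, here is a toy roundtrip through the latent cache (scalars stand in for the fp8 tensors; fp8 rounding is elided and the names are hypothetical):

```python
def quantize_latent(x: float, k_scale: float) -> float:
    # Write path: the joint KV latent is divided by the single k_scale
    # before being stored in the fp8 cache.
    return x / k_scale

def dequantize_latent(cached: float, k_scale: float) -> float:
    # Read path: the SAME single scale is applied on the way out;
    # v_scale never enters the computation under this convention.
    return cached * k_scale
```

Because K and V share one latent cache, using two different scales on the write and read paths (e.g. k_scale in, v_scale out) would silently distort every cached value.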

Test Plan

The current tests pass only because the [qkv]_scales are mocked to 1.0, which silently masks this bug. Updated the tests to remove the assumption that [qkv]_scales are 1.0.

To assert that layer._v_scale is not used in MLA, I set it to NaN in the MLA tests, ensuring wrong results if it is ever used.
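The NaN-poisoning trick works because NaN propagates through arithmetic: any code path that multiplies by the poisoned v_scale taints the output. A small self-contained illustration (the attention functions are hypothetical stand-ins, not the real kernels):

```python
import math

def v_scale_is_unused(attention_fn, k_scale: float) -> bool:
    # Poison v_scale with NaN: if the implementation ever multiplies
    # by it, NaN propagates to the output and this check fails.
    out = attention_fn(k_scale=k_scale, v_scale=math.nan)
    return not math.isnan(out)

# A correct MLA path uses only k_scale; a buggy one touches v_scale.
correct = lambda k_scale, v_scale: 2.0 * k_scale
buggy = lambda k_scale, v_scale: 2.0 * k_scale * v_scale
```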

Test Result

Updated tests are passing.

@mergify mergify bot added nvidia v1 bug Something isn't working labels Mar 14, 2026
@gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical bug where FlashInfer attention backends incorrectly applied quantization scales even when the KV cache was not FP8-quantized, resulting in corrupted output. The fix correctly restricts the application of these scales to cases where the KV cache data type is indeed FP8. The accompanying test modifications, which set mock layer scales to non-unity values, are appropriate for verifying the fix. The changes appear correct and effectively resolve the issue.

Signed-off-by: Andy Lo <andy@mistral.ai>
@andylolu2 andylolu2 force-pushed the andy/fi-bf16kv-bugfix branch from 223a57d to 6b89f72 Compare March 14, 2026 16:16
@andylolu2 andylolu2 force-pushed the andy/fi-bf16kv-bugfix branch from 0da3fd3 to 4787ef3 Compare March 14, 2026 16:20
@andylolu2 andylolu2 changed the title [Bugfix] FlashInfer kv_cache_dtype "auto" generates giberrish when layer._[qkv]_scale != 1.0 [Bugfix] Fix KV scales in fp8 MLA & FlashInfer kv_cache_dtype "auto" Mar 14, 2026
mergify bot commented Mar 14, 2026

Documentation preview: https://vllm--37054.org.readthedocs.build/en/37054/

@andylolu2 andylolu2 changed the title [Bugfix] Fix KV scales in fp8 MLA & FlashInfer kv_cache_dtype "auto" [Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" Mar 14, 2026
@mergify mergify bot added the documentation Improvements or additions to documentation label Mar 14, 2026
@andylolu2 andylolu2 changed the title [Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" [Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" leading to gibberish Mar 14, 2026
@andylolu2 andylolu2 marked this pull request as ready for review March 14, 2026 16:33
@andylolu2 (Contributor, Author) commented:

@gemini review

@gemini-code-assist bot left a comment


Code Review

This pull request addresses several correctness issues related to FP8 quantization scales in FlashInfer and MLA backends. The changes correctly handle scales for decode paths and disable FP8 support for the broken CUTLASS MLA backend. However, the fix is incomplete as the MLA prefill path for FlashInfer and Triton backends still lacks proper FP8 scale handling, which is a critical issue. I've added a comment with details on the missing fix.

mergify bot commented Mar 16, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @andylolu2.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 16, 2026
@mergify mergify bot removed the needs-rebase label Mar 17, 2026
@MatthewBonanni (Collaborator) left a comment


Thanks for the fix! Just some small comments

@andylolu2 andylolu2 force-pushed the andy/fi-bf16kv-bugfix branch from 22131c1 to c5a76dd Compare March 18, 2026 18:09
@MatthewBonanni (Collaborator) left a comment


LGTM, thanks for the fix and thanks for improving the test coverage!

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 18, 2026
@MatthewBonanni MatthewBonanni added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 18, 2026
@MatthewBonanni MatthewBonanni enabled auto-merge (squash) March 18, 2026 20:14
@MatthewBonanni MatthewBonanni merged commit 577df69 into vllm-project:main Mar 18, 2026
58 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 18, 2026
This PR was subsequently picked into forks (ikaadil, fxdawnn, SouthWest7, khairulkabir1661, Monishver11, JiantaoXu, vrdn-23, EricccYang, liuchenbing2026) between Mar 19 and Apr 4, 2026, each commit referencing vllm-project#37054.
