[Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" leading to gibberish#37054
Conversation
Code Review
This pull request addresses a critical bug where FlashInfer attention backends incorrectly applied quantization scales even when the KV cache was not FP8-quantized, resulting in corrupted output. The fix correctly restricts the application of these scales to cases where the KV cache data type is indeed FP8. The accompanying test modifications, which set mock layer scales to non-unity values, are appropriate for verifying the fix. The changes appear correct and effectively resolve the issue.
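The core of the fix can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual vLLM code: calibrated quantization scales are only meaningful when the KV cache is actually stored in fp8; with `kv_cache_dtype="auto"` (unquantized) the kernel must see unit scales, otherwise output is corrupted.

```python
def select_kv_scales(kv_cache_dtype: str, k_scale: float, v_scale: float):
    """Hypothetical helper: pick the scales to hand to the attention kernel.

    Real layer scales apply only when the KV cache is fp8-quantized;
    for an unquantized ("auto") cache, unit scales must be used.
    """
    if kv_cache_dtype.startswith("fp8"):
        return k_scale, v_scale
    return 1.0, 1.0

# With an unquantized ("auto") cache, non-unity layer scales are ignored.
assert select_kv_scales("auto", 0.5, 2.0) == (1.0, 1.0)
# With an fp8 cache, the layer's calibrated scales are applied.
assert select_kv_scales("fp8_e4m3", 0.5, 2.0) == (0.5, 2.0)
```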
Signed-off-by: Andy Lo <andy@mistral.ai>
Documentation preview: https://vllm--37054.org.readthedocs.build/en/37054/
@gemini review
Code Review
This pull request addresses several correctness issues related to FP8 quantization scales in FlashInfer and MLA backends. The changes correctly handle scales for decode paths and disable FP8 support for the broken CUTLASS MLA backend. However, the fix is incomplete as the MLA prefill path for FlashInfer and Triton backends still lacks proper FP8 scale handling, which is a critical issue. I've added a comment with details on the missing fix.
This pull request has merge conflicts that must be resolved before it can be merged.
MatthewBonanni
left a comment
Thanks for the fix! Just some small comments
22131c1 to
c5a76dd
Compare
MatthewBonanni
left a comment
LGTM, thanks for the fix and thanks for improving the test coverage!
…_dtype "auto" leading to gibberish (vllm-project#37054) Signed-off-by: Andy Lo <andy@mistral.ai> Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
Purpose
This PR fixes the following issues:
- Either `layer._k_scale` or `layer._v_scale` should be used, not both. The current implementation sometimes assumes `layer._k_scale` is used, other times assumes `layer._v_scale` or `layer._k_scale * layer._v_scale` is used, which is inconsistent and leads to bad generations.
- In fp8 MLA, the cached latents are quantized with `layer._k_scale`; `layer._v_scale` is completely ignored when using MLA.

To summarize the situation of fp8 MLA scales:
- `q_scale` -> Meant for quantizing `q_mqa`, not `q_mha`. `q_mha` currently has no corresponding scale and is naively cast to fp8 if `use_fp8_prefill` (code reference).
- `k_scale` -> Meant for quantizing `kv_latents`, not `k_mha` or `v_mha`. `k_mha` and `v_mha` currently have no corresponding scales and are naively cast to fp8 if `use_fp8_prefill` (code reference).
- `v_scale` -> Completely unused.

Test Plan
The current tests are passing because the qkv scales are mocked to be 1.0, which silently avoids this bug. Updated the tests to remove the assumption that `[qkv]_scale`s are 1.0.

To assert that `layer._v_scale` is not used in MLA, I set it to NaN in the MLA tests, ensuring wrong results if it is ever used.

Test Result
Updated tests are passing.
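The NaN-poisoning trick from the test plan can be illustrated with a toy sketch (hypothetical names, not the actual test code): because NaN propagates through arithmetic, setting an unused scale to NaN makes any accidental use of it immediately visible in the output.

```python
import math

def apply_scales(score: float, k_scale: float, v_scale: float,
                 use_v_scale: bool) -> float:
    """Toy stand-in for an attention kernel that multiplies by the
    scales it "uses". Any read of a NaN scale poisons the result."""
    out = score * k_scale
    if use_v_scale:
        out *= v_scale
    return out

v_scale = math.nan  # poison the scale that must never be read in MLA
ok = apply_scales(1.0, 0.5, v_scale, use_v_scale=False)
bad = apply_scales(1.0, 0.5, v_scale, use_v_scale=True)
assert not math.isnan(ok)   # correct path never touches v_scale
assert math.isnan(bad)      # an accidental use fails loudly
```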