Contributor

@MatthewBonanni MatthewBonanni commented Oct 22, 2025

Purpose

As of #26541, FlashMLA supports q_len > 1 in the decode pipeline. However, the get_mla_metadata call was not updated to match, which leads to poor performance (and potentially crashes) in these cases. This PR is a simple bug fix that yields a substantial speedup, especially at small batch sizes.

Note: uses the benchmarks in #26835 (not yet merged)

cc @LucasWilkinson

Test Plan

python benchmarks/attention_benchmarks/benchmark.py --config benchmarks/attention_benchmarks/configs/flashmla_bugfix_demo.yaml

Test Result

Batch Size = 1
Query Len  | Before (s)   | After (s)    | Speedup 
------------------------------------------------------------
1          |     0.000051 |     0.000050 |    1.01x
2          |     0.000051 |     0.000050 |    1.02x
4          |     0.000052 |     0.000048 |    1.10x
8          |     0.000098 |     0.000053 |    1.87x
16         |     0.000192 |     0.000057 |    3.39x
32         |     0.000359 |     0.000067 |    5.35x
64         |     0.000702 |     0.000067 |   10.52x
128        |     0.001350 |     0.000131 |   10.27x
256        |     0.002630 |     0.000257 |   10.23x
512        |     0.005094 |     0.000471 |   10.82x

Batch Size = 2
Query Len  | Before (s)   | After (s)    | Speedup 
------------------------------------------------------------
1          |     0.000050 |     0.000050 |    1.01x
2          |     0.000057 |     0.000047 |    1.23x
4          |     0.000101 |     0.000052 |    1.94x
8          |     0.000190 |     0.000056 |    3.38x
16         |     0.000348 |     0.000091 |    3.81x
32         |     0.000680 |     0.000065 |   10.39x
64         |     0.001325 |     0.000128 |   10.36x
128        |     0.002601 |     0.000256 |   10.17x
256        |     0.005099 |     0.000487 |   10.48x
512        |     0.009949 |     0.000895 |   11.12x

Batch Size = 4
Query Len  | Before (s)   | After (s)    | Speedup 
------------------------------------------------------------
1          |     0.000047 |     0.000046 |    1.02x
2          |     0.000053 |     0.000051 |    1.04x
4          |     0.000098 |     0.000054 |    1.81x
8          |     0.000185 |     0.000091 |    2.03x
16         |     0.000360 |     0.000065 |    5.54x
32         |     0.000702 |     0.000126 |    5.56x
64         |     0.001369 |     0.000248 |    5.51x
128        |     0.002692 |     0.000498 |    5.41x
256        |     0.005233 |     0.000955 |    5.48x
512        |     0.010128 |     0.001747 |    5.80x

Batch Size = 8
Query Len  | Before (s)   | After (s)    | Speedup 
------------------------------------------------------------
1          |     0.000051 |     0.000050 |    1.01x
2          |     0.000090 |     0.000053 |    1.69x
4          |     0.000146 |     0.000090 |    1.62x
8          |     0.000278 |     0.000065 |    4.28x
16         |     0.000547 |     0.000126 |    4.33x
32         |     0.001079 |     0.000247 |    4.36x
64         |     0.002115 |     0.000486 |    4.35x
128        |     0.004116 |     0.000973 |    4.23x
256        |     0.008027 |     0.001878 |    4.27x
512        |     0.015573 |     0.003449 |    4.51x

Batch Size = 16
Query Len  | Before (s)   | After (s)    | Speedup 
------------------------------------------------------------
1          |     0.000056 |     0.000055 |    1.02x
2          |     0.000101 |     0.000092 |    1.10x
4          |     0.000160 |     0.000065 |    2.45x
8          |     0.000309 |     0.000125 |    2.47x
16         |     0.000601 |     0.000245 |    2.45x
32         |     0.001190 |     0.000481 |    2.47x
64         |     0.002328 |     0.000962 |    2.42x
128        |     0.004619 |     0.001903 |    2.43x
256        |     0.008915 |     0.003644 |    2.45x
512        |     0.016957 |     0.006763 |    2.51x

Batch Size = 32
Query Len  | Before (s)   | After (s)    | Speedup 
------------------------------------------------------------
1          |     0.000094 |     0.000092 |    1.02x
2          |     0.000181 |     0.000064 |    2.82x
4          |     0.000312 |     0.000123 |    2.53x
8          |     0.000579 |     0.000242 |    2.40x
16         |     0.001082 |     0.000484 |    2.24x
32         |     0.002111 |     0.000965 |    2.19x
64         |     0.004150 |     0.001909 |    2.17x
128        |     0.008164 |     0.003767 |    2.17x
256        |     0.015964 |     0.007230 |    2.21x
512        |        CRASH |     0.013545 |      N/A

Batch Size = 64
Query Len  | Before (s)   | After (s)    | Speedup 
------------------------------------------------------------
1          |     0.000071 |     0.000069 |    1.02x
2          |     0.000129 |     0.000124 |    1.04x
4          |     0.000257 |     0.000244 |    1.05x
8          |     0.000499 |     0.000476 |    1.05x
16         |     0.000973 |     0.000948 |    1.03x
32         |     0.001925 |     0.001906 |    1.01x
64         |     0.003774 |     0.003772 |    1.00x
128        |     0.007667 |     0.007483 |    1.02x
256        |     0.014950 |     0.014518 |    1.03x
512        |        CRASH |     0.026947 |      N/A

Batch Size = 128
Query Len  | Before (s)   | After (s)    | Speedup 
------------------------------------------------------------
1          |     0.000133 |     0.000130 |    1.03x
2          |     0.000257 |     0.000247 |    1.04x
4          |     0.000494 |     0.000477 |    1.04x
8          |     0.000964 |     0.000945 |    1.02x
16         |     0.001877 |     0.001894 |    0.99x
32         |     0.003683 |     0.003770 |    0.98x
64         |     0.007270 |     0.007508 |    0.97x
128        |     0.014790 |     0.015020 |    0.98x
256        |     0.028690 |     0.029102 |    0.99x
512        |        CRASH |     0.054098 |      N/A

Signed-off-by: Matthew Bonanni <[email protected]>
@mergify mergify bot added the v1 label Oct 22, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly fixes a bug in the FlashMLA metadata builder for decode scenarios with a query length greater than one. The change properly calculates num_q_tokens_per_head_k and passes it to get_mla_metadata, which resolves the performance degradation and crashes noted in the description. The provided benchmarks clearly demonstrate the significant speedup achieved by this fix. The implementation is correct and well-targeted. Overall, this is an excellent and important bug fix.
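For readers following the thread, here is a minimal sketch of the quantity this review refers to. It assumes the convention shown in FlashMLA's usage example, where the second argument to get_mla_metadata is the query length times the number of query heads, divided by the number of KV heads. The helper below and the head counts in the example are illustrative assumptions, not code from this PR's diff.

```python
def num_q_tokens_per_head_k(
    max_query_len: int, num_q_heads: int, num_kv_heads: int = 1
) -> int:
    """Number of query tokens that attend through each KV head.

    Hypothetical helper for illustration only; the default of one KV head
    is an assumption made for this sketch.
    """
    if max_query_len < 1:
        raise ValueError("max_query_len must be at least 1")
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    # With q_len == 1 this reduces to num_q_heads // num_kv_heads, which is
    # the value a q_len-unaware call site would keep passing even when
    # q_len > 1.
    return max_query_len * num_q_heads // num_kv_heads


# Illustrative head count (128 query heads is an assumption, not from the PR).
single_token_decode = num_q_tokens_per_head_k(1, 128)
four_token_decode = num_q_tokens_per_head_k(4, 128)
print(single_token_decode, four_token_decode)
```

If the scheduler metadata is built from the single-token value while the kernel actually processes several query tokens per request, the work split no longer matches the real workload, which is consistent with the slowdowns in the benchmark tables above.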

Member

@mgoin mgoin left a comment

Is there an eval we can run to validate this? I assume we could run DeepSeek with MTP enabled.

@mgoin mgoin added the bug, ready, and deepseek labels Oct 22, 2025
Collaborator

@LucasWilkinson LucasWilkinson left a comment

LGTM; thanks for tracking this down!

nit: can you add a short note that we use the max, but all the query lens should be the same?
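A rough sketch of what such a note describes, assuming per-request query lengths are derived from cumulative start offsets. The name query_start_loc and the helper itself are illustrative assumptions, not the PR's actual code.

```python
def uniform_max_query_len(query_start_loc: list[int]) -> int:
    """Return the max query length, checking that all requests share it.

    query_start_loc holds cumulative token offsets, one more entry than
    there are requests (hypothetical layout for this sketch).
    """
    query_lens = [
        end - start for start, end in zip(query_start_loc, query_start_loc[1:])
    ]
    if not query_lens:
        raise ValueError("need at least one request")
    # We use the max, but all query lens should be the same.
    max_query_len = max(query_lens)
    if any(q != max_query_len for q in query_lens):
        raise ValueError(f"non-uniform query lengths: {query_lens}")
    return max_query_len


# Three requests with four query tokens each.
print(uniform_max_query_len([0, 4, 8, 12]))
```

Taking the max keeps the call site to a single scalar, while the explicit check makes the uniform-length assumption visible if it is ever violated.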

Signed-off-by: Matthew Bonanni <[email protected]>
Contributor Author

@mgoin will do!
@LucasWilkinson done, thanks!

Collaborator

@LucasWilkinson LucasWilkinson left a comment

LGTM (assuming evals pass; don't merge till then, but I don't see any reason they won't)

@LucasWilkinson LucasWilkinson merged commit dbfbf9f into vllm-project:main Oct 23, 2025
47 checks passed
Contributor Author

@mgoin @LucasWilkinson confirmed evals look good:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.953|±  |0.0058|
|     |       |strict-match    |     5|exact_match|↑  |0.950|±  |0.0060|

@MatthewBonanni MatthewBonanni deleted the fix_fmla_metadata branch October 23, 2025 20:35