[WIP] [Speculative Decoding] Use MQA kernel for target model verification #5691

LiuXiaoxuanPKU · 2024-06-19T19:01:10Z

This PR is a first attempt to use an MQA kernel for target model verification, so that we can remove the overhead of batch expansion. It currently uses flash attention's flash_attn_varlen_func for verification; flashinfer is a possible next step. The tricky part is adding CUDA graph support.
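To make the idea concrete, here is a minimal pure-Python sketch. It assumes "MQA" here means scoring several query tokens of one sequence in a single pass (the wording used in the title of #6185), and it is not the kernel this PR calls. All function names are hypothetical. Under those assumptions, it compares batch expansion (one single-query attention call per token, over a growing prefix) with one multi-query pass that uses a causal mask.

```python
import math
import random


def attend(query, keys, values, mask=None):
    """Softmax attention for one query vector; mask[j] == False hides key j."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(a * b for a, b in zip(query, key)) for key in keys]
    if mask is not None:
        scores = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # exp(-inf) is 0.0
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def score_with_batch_expansion(queries, keys, values, context_len):
    """One expanded sequence per query token, each seeing only its own prefix."""
    return [attend(q, keys[:context_len + i + 1], values[:context_len + i + 1])
            for i, q in enumerate(queries)]


def score_with_multi_query(queries, keys, values, context_len):
    """All query tokens scored against the same KV entries with a causal mask."""
    outputs = []
    for i, q in enumerate(queries):
        mask = [j <= context_len + i for j in range(len(keys))]
        outputs.append(attend(q, keys, values, mask))
    return outputs


# Toy data: 3 context tokens followed by 3 query tokens, head dimension 4.
rng = random.Random(0)
dim, context_len, num_query_tokens = 4, 3, 3
total_len = context_len + num_query_tokens
keys = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(total_len)]
values = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(total_len)]
queries = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_query_tokens)]

expanded = score_with_batch_expansion(queries, keys, values, context_len)
fused = score_with_multi_query(queries, keys, values, context_len)
```

Both paths give the same per-token outputs up to floating-point rounding. The multi-query path avoids materializing one expanded sequence per proposed token.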

Some difficulties:

We are modifying _num_computed_tokens, a field that is also used by chunked prefill, so this change needs extra care.
The current implementation does not handle speculative and non-speculative requests within the same batch; it assumes every request in a batch performs speculative decoding.
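For the second difficulty, one conceivable approach is to partition the batch by proposal length before scoring. The following is only a sketch under that assumption: split_by_proposal_len is a hypothetical helper written for illustration, not code from this PR.

```python
def split_by_proposal_len(proposal_lens):
    """Return (spec_indices, non_spec_indices) for one batch.

    A request counts as speculative when it has at least one proposed token.
    """
    if any(n < 0 for n in proposal_lens):
        raise ValueError("proposal lengths must be non-negative")
    spec = [i for i, n in enumerate(proposal_lens) if n > 0]
    non_spec = [i for i, n in enumerate(proposal_lens) if n == 0]
    return spec, non_spec


# Example: requests 0 and 2 carry 5 proposals each, requests 1 and 3 carry none.
spec_indices, non_spec_indices = split_by_proposal_len([5, 0, 5, 0])
```

The two index lists could then be scored separately and the results merged back in the original order.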

TODO:

Pass simple TP=1 tests.
Add cuda graph support.

RFC for this PR

LiuXiaoxuanPKU · 2024-07-09T22:20:06Z

Some preliminary benchmark results on reducing the scoring time.
All numbers are scoring times in ms, measured with CUDA graph support for a llama-7B model on a single A100.

| Batch size | num_speculative_token | MQAScorer (ms) | BatchExpansionTop1Scorer (ms) |
| --- | --- | --- | --- |
| 4 | 5 | 13.7 | 15.4 |
| 8 | 5 | 14.8 | 17.7 |
| 16 | 5 | 18.4 | 23.9 |
| 32 | 5 | 25.5 | 36.4 |
| 64 | 5 | 35.1 | 54.5 |
| 128 | 5 | 70.9 | 96.2 |
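To read the table above as relative speedups, the ratio of BatchExpansionTop1Scorer time to MQAScorer time can be computed per batch size. This snippet uses only the numbers reported above; the variable names are illustrative.

```python
# (batch size, MQAScorer ms, BatchExpansionTop1Scorer ms), copied from the table.
rows = [
    (4, 13.7, 15.4),
    (8, 14.8, 17.7),
    (16, 18.4, 23.9),
    (32, 25.5, 36.4),
    (64, 35.1, 54.5),
    (128, 70.9, 96.2),
]

# Speedup of MQAScorer over batch expansion, rounded to two decimals.
speedups = {batch: round(expansion_ms / mqa_ms, 2)
            for batch, mqa_ms, expansion_ms in rows}

for batch, ratio in speedups.items():
    print(f"batch={batch:>3}  speedup={ratio:.2f}x")
```

By these numbers the gap grows with batch size up to 64, where it is about 1.55x, and narrows somewhat at 128.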

LiuXiaoxuanPKU · 2024-07-09T22:23:29Z

This PR is based on #6052; without it, adding CUDA graph support is hard.

jjjjohnson · 2024-07-18T08:23:25Z

The MQAScorer cannot handle the case where proposals.proposal_lens contains a 0 entry. This happens when NGramWorker fails to match any token.
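Varlen kernels such as flash_attn_varlen_func describe a ragged batch through cumulative sequence offsets (cu_seqlens_q) and a maximum query length. The sketch below shows how such a layout could be derived from proposal lengths. It assumes each sequence contributes proposal_len + 1 query tokens (the last accepted token plus the proposals). That is an assumption made for illustration, and query_layout is a hypothetical helper, not the fix adopted for this issue.

```python
from itertools import accumulate


def build_cu_seqlens(lengths):
    """Cumulative offsets: entry i is where sequence i starts in the packed batch."""
    return [0] + list(accumulate(lengths))


def query_layout(proposal_lens):
    """Return (query_lens, cu_seqlens_q, max_seqlen_q) for a batch of proposals."""
    # Assumed convention: one query token for the last accepted token,
    # plus one per proposed token.
    query_lens = [n + 1 for n in proposal_lens]
    return query_lens, build_cu_seqlens(query_lens), max(query_lens)


# The middle sequence has no proposals, as when an n-gram lookup finds no match.
query_lens, cu_seqlens_q, max_seqlen_q = query_layout([5, 0, 3])
```

Under this convention a sequence with zero proposals still occupies one query slot, so the offsets stay strictly increasing and no sequence has an empty query range.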

LiuXiaoxuanPKU · 2024-09-26T07:36:26Z

Closed; this work has moved to #8839.

JaviS-Rei · 2025-07-30T08:53:45Z

Hello, I wonder whether the MQA in 'MQA kernel' is related to the MQA in 'MHA, GQA, MQA'? The term MQA confuses me a lot. Thanks.

LiuXiaoxuanPKU added 4 commits June 19, 2024 00:21

wip

ed0ed8f

wip

c1b1e45

sampler

d161898

pass time tests

3f6fbf3

LiuXiaoxuanPKU marked this pull request as draft June 19, 2024 19:04

LiuXiaoxuanPKU added 7 commits June 20, 2024 23:27

minor

2c41396

merge with main

b3ba398

merge

fd6efb7

Merge branch 'flashinfer-sd' of https://github.com/LiuXiaoxuanPKU/vllm into flashinfer-sd

de91de3

merge

e5934a0

Merge branch 'main' into flashinfer-sd

52d5633

merge fix

195ea15

LiuXiaoxuanPKU mentioned this pull request Jun 30, 2024

[Feature]: Request for SmartSpec Method Support #5886

Closed

unify flash attention backend kernel

4a5f7b6

LiuXiaoxuanPKU mentioned this pull request Jul 2, 2024

[Kernel] Unify the kernel used in flash attention backend #6052

Closed

mpjlu mentioned this pull request Jul 8, 2024

[Core][Speculative Decoding] Add multi-query verifier for speculative decoding without batch expansion #6185

Closed

LiuXiaoxuanPKU added 5 commits July 9, 2024 03:53

Merge branch 'main' into flash-attn-unify

0830ea2

fix

e449f00

fix xformer backend

037c634

fix cudagraph

66d3347

Merge branch 'flash-attn-unify' into flashinfer-sd

dda48ae

LiuXiaoxuanPKU added 3 commits July 9, 2024 22:21

cuda graph

ba7e349

disable log

7bf45ef

minor

1650723

LiuXiaoxuanPKU added 3 commits July 12, 2024 15:33

fix ci

b1d2b5c

Merge branch 'flash-attn-unify' into flashinfer-sd

a64f870

Merge branch 'main' into flashinfer-sd

c30e49d

jon-chuang mentioned this pull request Aug 6, 2024

[Feature]: Tree attention about Speculative Decoding #3960

Closed

LiuXiaoxuanPKU mentioned this pull request Sep 26, 2024

[Spec Decode] (1/2) Remove batch expansion #8839

Merged

LiuXiaoxuanPKU closed this Sep 26, 2024

3 participants
