Quantized SDPA #1515

Draft
barronalex wants to merge 5 commits into main from q-sdpa
Conversation

@barronalex
Contributor

First pass at adapting @angeloskath's flash attention to support quantized keys and values.

Still needs some optimization work, since running the separate quantized_matmuls is currently faster than this fused version.

E.g. 4 bit on M2 Ultra for L=32768:

```
Timing sdpa ... 2.51938 msec
Timing quant_sdpa ... 0.97137 msec
Timing attention ... 1.31419 msec
Timing quant_attention ... 0.92342 msec
```
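For reference, the unfused baseline being compared against can be sketched in NumPy: dequantize the per-group affine-quantized keys and values, then run standard softmax attention. The `quantize`/`dequantize` helpers below are simplified stand-ins for illustration, not MLX's actual kernels or API.

```python
import numpy as np

def quantize(x, bits=8, group_size=32):
    """Simplified per-group affine quantization along the last axis
    (a stand-in for MLX-style K/V quantization, not the real kernel)."""
    *lead, d = x.shape
    g = x.reshape(*lead, d // group_size, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = np.maximum((hi - lo) / (2**bits - 1), 1e-12)
    q = np.round((g - lo) / scale)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Invert quantize(); restores the original last-axis shape."""
    g = q * scale + lo
    return g.reshape(*g.shape[:-2], -1)

def quant_attention(queries, q_keys, q_values, scale):
    """Unfused reference path: dequantize K/V, then plain SDPA."""
    k = dequantize(*q_keys)
    v = dequantize(*q_values)
    scores = scale * (queries @ k.T)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With 8-bit groups of 32 the output stays close to full-precision attention; the point of the fused kernel in this PR is to get the same result without materializing the dequantized K/V or the intermediate score matrix.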

@awni mentioned this pull request Apr 28, 2025
@bghira

bghira commented Sep 18, 2025

jfyi i have working int8 and int4 quantised attn, MIT licensed.

@CC-Yeh mentioned this pull request Jan 20, 2026

2 participants