Quantized SDPA #1515

Draft
barronalex wants to merge 5 commits into main from q-sdpa
Conversation

@barronalex
Contributor

First pass at adapting @angeloskath's flash attention to support quantized keys and values.

Still needs some optimization work, since running the separate quantized_matmuls is currently faster than this fused version.

E.g. 4 bit on M2 Ultra for L=32768:

```
Timing sdpa ... 2.51938 msec
Timing quant_sdpa ... 0.97137 msec
Timing attention ... 1.31419 msec
Timing quant_attention ... 0.92342 msec
```
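For reference, the unfused baseline being compared against can be sketched in NumPy: dequantize the per-group affine-quantized keys and values, then run standard softmax attention. The `quantize`/`dequantize` helpers below are simplified stand-ins for illustration, not MLX's actual kernels or API.

```python
import numpy as np

def quantize(x, bits=8, group_size=32):
    """Simplified per-group affine quantization along the last axis
    (a stand-in for MLX-style K/V quantization, not the real kernel)."""
    *lead, d = x.shape
    g = x.reshape(*lead, d // group_size, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = np.maximum((hi - lo) / (2**bits - 1), 1e-12)
    q = np.round((g - lo) / scale)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Invert quantize(); restores the original last-axis shape."""
    g = q * scale + lo
    return g.reshape(*g.shape[:-2], -1)

def quant_attention(queries, q_keys, q_values, scale):
    """Unfused reference path: dequantize K/V, then plain SDPA."""
    k = dequantize(*q_keys)
    v = dequantize(*q_values)
    scores = scale * (queries @ k.T)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With 8-bit groups of 32 the output stays close to full-precision attention; the point of the fused kernel in this PR is to get the same result without materializing the dequantized K/V or the intermediate score matrix.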

@awni mentioned this pull request Apr 28, 2025
@bghira

bghira commented Sep 18, 2025

jfyi i have working int8 and int4 quantised attn, MIT licensed.

@CC-Yeh mentioned this pull request Jan 20, 2026

2 participants