Commit 3c5f872
More Flash Attention improvements (#173)
* FA: slightly faster V*softmax(K*Q) on Zen4
* FA: it is also faster on AVX2 and ARM_NEON
* Deleted forgotten commented out code
* FA: slightly faster V*softmax(K*Q) also for fp16 K-cache
* FA: slightly faster V*softmax(K*Q) on Zen4
We now get 130.9 t/s for a context of 32k tokens.
* FA: don't store sum scaling factor in SIMD registers
* FA: timing
* FA: faster q8_0 cache via run-time repacking
On Zen4 q8_0 KV-cache now slightly outperforms BF16.
We get 134 t/s for 32k tokens, which is ~30% better than
the main branch, and ~18% better than the last commit.
We simply repack the K-cache to q8_0_r4 before the K*Q
multiplication and then use the q8_0_r4 x q8_0_x4 matrix
multiplication template.
* FA: Fix AVX2
* FA: fix ARM_NEON
* FA: vectorize q8_0 -> q8_0_r4 repacking also on NEON
* FA: dedicated mat mul for D = 128 also for ARM_NEON
* FA: turn off performance timer
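The run-time repacking idea above can be sketched as follows. This is an illustrative simplification, not the repository's actual code: the struct layouts, the float (rather than fp16) scale, the 4-byte interleaving scheme, and the function name `repack_q8_0_to_r4` are all assumptions made for the example. The point it demonstrates is taking 4 rows of ordinary q8_0 blocks (one scale plus 32 int8 quants per block) and interleaving them so that a SIMD kernel can load quants from all 4 rows with contiguous reads.

```cpp
#include <cstdint>

// Illustrative, simplified layouts; the real q8_0 stores its scale as fp16.
constexpr int QK = 32; // quants per block

struct block_q8_0 {
    float  d;       // per-block scale
    int8_t qs[QK];  // 32 quantized values
};

// q8_0_r4: blocks from 4 rows interleaved so one contiguous load
// fetches corresponding quants from all 4 rows.
struct block_q8_0_r4 {
    float  d[4];        // one scale per row
    int8_t qs[4 * QK];  // quants interleaved in 4-byte groups per row
};

// Repack the same block index from 4 rows of q8_0 into one q8_0_r4 block.
// Interleaving scheme (an assumption for this sketch): for each group of
// 4 quant positions, store 4 bytes of row 0, then row 1, row 2, row 3.
void repack_q8_0_to_r4(const block_q8_0* rows[4], int nblocks,
                       block_q8_0_r4* out) {
    for (int ib = 0; ib < nblocks; ++ib) {
        for (int r = 0; r < 4; ++r) out[ib].d[r] = rows[r][ib].d;
        for (int i = 0; i < QK; i += 4) {
            for (int r = 0; r < 4; ++r)
                for (int j = 0; j < 4; ++j)
                    out[ib].qs[4*i + 4*r + j] = rows[r][ib].qs[i + j];
        }
    }
}
```

Because the repacking touches each K-cache byte once before the K*Q multiplication, its cost is amortized over the whole matrix product, which is how the q8_0 cache can end up slightly ahead of BF16 on Zen4.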
---------
Co-authored-by: Iwan Kawrakow <[email protected]>
Parent: 0b74397
2 files changed: 841 additions, 203 deletions