
mmap backed KV cache #290

Closed

saood06 wants to merge 3617 commits into main from s6/numa_KV

Conversation

Collaborator

saood06 commented Mar 25, 2025

Port of ggml-org/llama.cpp#11580

I no longer use this since the old KV cache is no longer allocated (it helped when both caches were allocated, because the pages of the old KV cache were never actually touched, which let me avoid paging out to disk), but it still doesn't hurt my performance.

I finally decided to grab the code from my very old local branch and put it here in case it ends up being useful to anyone.

This PR always uses the new buffer type for the KV cache, as no toggle is implemented. A toggle can be added if this turns out to be useful in some situations but a loss in others; so far, though, I haven't found a situation where it causes a performance loss.

In theory this should be better for NUMA, as I do remember it causing a more even split of memory usage across the two nodes on my machine.

This also might have the benefit of letting you allocate the full context size of a model and only see a performance loss when you actually go over that limit, since it avoids paging until then.
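The mechanism behind the last two points is just how anonymous mmap behaves on Linux; below is a minimal, self-contained sketch of that idea (the `kv_cache_alloc` / `kv_cache_free` helpers are hypothetical and this is not the buffer-type code from this PR):

```c
// Minimal sketch of an mmap-backed KV cache allocation.
// Not the actual buffer-type implementation from this PR.
#include <stdio.h>
#include <sys/mman.h>

static void * kv_cache_alloc(size_t size) {
    // Anonymous private mapping: virtual address space is reserved,
    // but physical pages are only committed when first touched.
    void * ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return ptr == MAP_FAILED ? NULL : ptr;
}

static void kv_cache_free(void * ptr, size_t size) {
    if (ptr) munmap(ptr, size);
}

int main(void) {
    const size_t size = (size_t) 8 << 30;   // e.g. a full-context KV cache of 8 GiB (assumes a 64-bit build)
    void * kv = kv_cache_alloc(size);
    if (!kv) { perror("mmap"); return 1; }
    ((char *) kv)[0] = 1;                   // only now does the first page get committed
    kv_cache_free(kv, size);
    return 0;
}
```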

Iwan Kawrakow and others added 30 commits August 1, 2024 09:38
Slightly slower than iq3_s - 132 t/s vs 138 t/s for
LLaMA-3.1-8B.
138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.
We get PP-512 = 180 t/s, TG-128(4 threads) = 16.35 on the Ryzen-7950X
for LLaMA-3.1-8B.
In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with
iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp
We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
It is slow: 45.4 t/s for a 7B model vs 50 t/s for iq2_xs,
or 63.3 t/s for q2_K_S.
Quite slow: 43 t/s for a 7B model
PP-512 goes to 473 t/s up from 452 t/s.
Just use the same trick as iq4_k
There have been a few minor improvements here and there, so I updated the AVX2 Bitnet performance values to the current main branch.
* iq2_tn: TriLM specific 2.0625 bpw quantization

Quantize/dequantize/scale dot product.

I get 46 t/s for the TriLM-3.9B without any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation.

* iq2_tn: AVX512

Just reusing the k-quants template gets us to PP-512 = 376 t/s,
TG-128 = 47.6 t/s for TriLM-3.9B.

* iq2_tn: AVX512

With this tweak we get to PP-512 = 431 t/s.

* iq2_tn: AVX512

With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
At 4 threads we saturate at 48.41 t/s, and then performance slowly
degrades with increasing number of threads.

* iq2_tn: AVX2

PP512 = 440 t/s on the Ryzen-5975WX.
We should be able to do better.

* iq2_tn: initial NEON version

* iq2_tn: NEON

For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
TG-128 = 75.5 t/s. This is in line with what we have for
iq2_bn and the 3.3B Bitnet.

* iq2_tn: Metal

For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
TG-128 = 98.5 t/s.

* iq2_tn: CUDA

For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
TG-128 = 299.2 t/s.

* iq2_tn: AVX2 PP improvement

We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn.
Bitnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would
expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something
that is not quite optimal in iq2_tn.

* iq2_tn: small NEON improvement

For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K
(0.14% vs 0.26% quantization error).
90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.
We need to do 4 shuffles to get the non-uniform values, so this
makes it slower than other iqX_k quants.

And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies
with 16-bit block scales. But this can overflow for iq4_k, iq5_k, and
iq6_k. I guess I did not notice with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot be packing
the _mm512_dpbusds_epi32 result into 16 bits for 4-, 5-, and 6-bit iqX_k
quants.

For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
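For a rough sense of why the 16-bit packing breaks down for the higher-bit quants (the ranges below are assumptions made for the sake of the example, not values taken from the kernels):

```c
// Rough illustration of the 16-bit packing problem described above.
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // 32 six-bit quant values (up to 63) dotted with int8 activations (up to 127):
    int32_t block_sum = 32 * 63 * 127;       // = 256032, fine as a 32-bit integer
    int16_t packed    = (int16_t) block_sum; // wraps around: the value no longer fits in 16 bits
    printf("32-bit sum = %d, packed into 16 bits = %d\n", block_sum, (int) packed);
    return 0;
}
```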
We now arrive at pp-512 = 147 t/s for LLaMA-3.1-8B.
TG-128 is 9.5 t/s. This is better than last commit,
but still kind of slow compared to Q6_K.

My last commit message is wrong: also iq3_k needs a fix
for overflow.
Respectable performance, only slightly slower than Q6_K.
About 4% slower than Q6_K for PP-512, but 10% faster for TG-128.
Someone has screwed up Q6_K TG performance on Metal? With the
cobntinuous "improvements" in ggml I wouldn't be surprised.
Need to look into it later.
I always use cmake, so had forgotten to pay attention to the
Makefile.
* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
GGML_OP_RESHAPE, GGML_OP_VIEW, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE,
along with GGML_OP_NONE, are all noops. I.e., nothing happens.
But ggml still has a barrier after them, which wastes time.
The waste is not too bad for large models where computations are
long compared to the time taken for thread synchronization.
But for small models skipping those unnecessary waits makes
a significant difference. E.g., for the 99M TriLM model,
TG-500 goes up to 1426 t/s from 1240 t/s.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
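A minimal, self-contained sketch of the barrier-skipping idea described above (the enum and loop are simplified stand-ins, not the actual ggml scheduler code):

```c
// Simplified stand-in for the ggml graph loop: ops that only change tensor
// metadata do no work, so neither the computation nor the barrier after
// them is needed.
#include <stdbool.h>
#include <stdio.h>

enum op { OP_NONE, OP_RESHAPE, OP_VIEW, OP_PERMUTE, OP_TRANSPOSE, OP_MUL_MAT };

static bool op_is_noop(enum op o) {
    return o == OP_NONE || o == OP_RESHAPE || o == OP_VIEW ||
           o == OP_PERMUTE || o == OP_TRANSPOSE;
}

int main(void) {
    enum op graph[] = { OP_VIEW, OP_MUL_MAT, OP_PERMUTE, OP_MUL_MAT };
    for (size_t i = 0; i < sizeof(graph) / sizeof(graph[0]); ++i) {
        if (op_is_noop(graph[i])) {
            continue;                        // skip both the (empty) work and the barrier
        }
        printf("compute node %zu, then synchronize threads\n", i);
        // in the real scheduler a thread barrier would go here
    }
    return 0;
}
```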
ikawrakow and others added 24 commits March 18, 2025 15:41
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
as it is not supported.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
I broke it with PR #265. I was testing with a model where
the wk_b and wk_v tensors were present, so they didn't need to be computed,
and so I didn't notice that the change I made to ggml_compute_forward_dup_q
breaks that computation.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Repack a model with the quantize tool

* WIP

* Fixed various issues

As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.

* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved

* Fix GCC 13.3 compilation error

* Another one

* Add missing include

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Repack a model with the quantize tool

* WIP

* Fixed various issues

As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.

* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved

* Fix GCC 13.3 compilation error

* Another one

* Add missing include

* FlashMLA-3: the best of both worlds - CPU only

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This results in GGGGGGGGGGGGG when generating with
mla = 3, fa = 0.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP Gemma3: not working

* gemma3: build_gemma3 seems to be working now

* Revert changes to convert_hf_to_gguf.py

It wasn't working, so I guess it is better to leave the
conversion up to upstream.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding ability to use THP on Linux

* Use the actual page size used for mmap also in munmap

* Add -thp to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
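A rough sketch of the THP allocation path described above, including the munmap size fix from the second bullet (the exact flags and fallback handling in the actual commits may differ):

```c
// Rough sketch, not the code from the commits: round the allocation up to the
// huge page size, map it, and ask the kernel to back it with transparent
// huge pages. munmap must then use the rounded-up size, not the original request.
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    const size_t huge_page_size = (size_t) 2 << 20;     // assume 2 MiB huge pages
    const size_t requested      = (size_t) 701 << 20;   // 701 MiB requested
    const size_t mapped         = (requested + huge_page_size - 1) & ~(huge_page_size - 1);

    void * ptr = mmap(NULL, mapped, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED) { perror("mmap"); return 1; }

    madvise(ptr, mapped, MADV_HUGEPAGE);                 // request THP backing
    munmap(ptr, mapped);                                 // use the mapped size, not `requested`
    return 0;
}
```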
* Fix it for nth > rk2

* Handle rk2%nth_k != 0

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Improve DeepSeek batched processing speed

* Revert the commented out section in iqk_mul_mat.cpp

It does have some benefit at long contexts.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Make fused MoE reproducible

As a bonus, peak performance at pp2048 with u_batch = 2048 is
~8% better.

* Slightly better

* Also do it for non-fused mul_mat_id

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Update sweep bench (deprecating .jsonl support)

* Fix README.md
…284)

* llama-bench: enable having different number of threads for tg and pp

* Add -tgb to usage

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
@ikawrakow
Owner

I think it needs to be ifdef'ed so the code will still build on Windows.

I wouldn't make it the default unconditionally; we should be able to turn it on/off via a command line parameter. It would also be useful if @ubergarm tested the performance implications.

Concerning the NUMA advantage: yes, it will spread the KV cache more evenly between NUMA nodes. But aren't we concerned that it may result in each NUMA node having to fetch KV cache data from another node? The KV cache grows as generation progresses, so in each new evaluation the threads access different portions of it; evenly spreading the cache across NUMA nodes is only meaningful if we also have something in place that makes threads always process the same portions of the KV cache.
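Regarding the ifdef and the command-line toggle, something along the following lines is presumably what is needed (all names here are placeholders, not code from this PR):

```c
// Placeholder sketch of the suggested guard + toggle; none of these names
// come from the PR.
#include <stdio.h>

#if defined(_WIN32)
    #define KV_CACHE_MMAP_SUPPORTED 0   // no POSIX mmap here; keep the default buffer type
#else
    #define KV_CACHE_MMAP_SUPPORTED 1
#endif

int main(void) {
    // In the real code this would come from a command line option
    // (e.g. a hypothetical --kv-mmap flag), not a hard-coded value.
    int requested_mmap_kv = 1;

    if (requested_mmap_kv && KV_CACHE_MMAP_SUPPORTED) {
        printf("using the mmap-backed KV cache buffer type\n");
    } else {
        printf("using the default KV cache buffer type\n");
    }
    return 0;
}
```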

@saood06
Collaborator Author

saood06 commented Mar 27, 2025

I think it needs to be ifdef'ed so the code will still build on Windows.

I wouldn't make it the default unconditionally; we should be able to turn it on/off via a command line parameter.

Yes, I agree on the needed changes if this is to be merged; I mainly just remembered that I had done this, and made a draft PR in case anyone finds it useful.

It would also be useful if @ubergarm tested the performance implications.

I'd be interested to know if it affects performance for him, since it doesn't hurt or help my performance anymore.

Concerning the NUMA advantage: yes, it will spread the KV cache more evenly between NUMA nodes. But aren't we concerned that it may result in each NUMA node having to fetch KV cache data from another node? The KV cache grows as generation progresses, so in each new evaluation the threads access different portions of it; evenly spreading the cache across NUMA nodes is only meaningful if we also have something in place that makes threads always process the same portions of the KV cache.

The distribution of the KV cache never resulted in a performance uplift for me (and based on comments in the original PR from both the author and others, it didn't affect them either). From what I remember, it may have allowed me to turn off numa_balancing on my system without a negative impact (though my memory and notes aren't very clear on this). The main reason I used it was that it avoided paging to disk, because the old MLA implementation still had the large, unneeded KV cache.

I do think your concern is valid, but in practice this PR doesn't seem to impact performance, and I'm not really sure why it is performance-neutral.

