
mmap backed KV cache #290

Closed

saood06 wants to merge 3617 commits into main from s6/numa_KV

Conversation

Collaborator

saood06 commented Mar 25, 2025

Port of ggml-org/llama.cpp#11580

I no longer use this since the old KV cache is no longer allocated (it helped when both caches were allocated, because the pages of the old KV cache were never actually touched, which let me avoid paging out to disk), but it still doesn't hurt my performance.

I finally decided to grab the code from my very old local branch and put it here in case it ends up being useful to anyone.

This PR always uses the new buffer type for the KV cache, as no toggle is implemented. A toggle can be added if this turns out to be useful in some situations but a loss in others; so far, though, I haven't found a situation where it causes a performance loss.

In theory this should be better for NUMA, as I do remember it causing a more even split of memory usage across the two nodes on my machine.

This also might have the benefit of letting you allocate the full context size of a model and only see a performance loss when you actually go over that limit, since it avoids paging until then.
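The mechanism behind the last two points is just how anonymous mmap behaves on Linux; below is a minimal, self-contained sketch of that idea (the `kv_cache_alloc` / `kv_cache_free` helpers are hypothetical and this is not the buffer-type code from this PR):

```c
// Minimal sketch of an mmap-backed KV cache allocation.
// Not the actual buffer-type implementation from this PR.
#include <stdio.h>
#include <sys/mman.h>

static void * kv_cache_alloc(size_t size) {
    // Anonymous private mapping: virtual address space is reserved,
    // but physical pages are only committed when first touched.
    void * ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return ptr == MAP_FAILED ? NULL : ptr;
}

static void kv_cache_free(void * ptr, size_t size) {
    if (ptr) munmap(ptr, size);
}

int main(void) {
    const size_t size = (size_t) 8 << 30;   // e.g. a full-context KV cache of 8 GiB (assumes a 64-bit build)
    void * kv = kv_cache_alloc(size);
    if (!kv) { perror("mmap"); return 1; }
    ((char *) kv)[0] = 1;                   // only now does the first page get committed
    kv_cache_free(kv, size);
    return 0;
}
```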

Iwan Kawrakow and others added 30 commits August 1, 2024 09:38
Slightly slower than iq3_s - 132 t/s vs 138 t/s for
LLaMA-3.1-8B.
138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.
We get PP-512 = 180 t/s, TG-128(4 threads) = 16.35 on the Ryzen-7950X
for LLaMA-3.1-8B.
In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with
iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp
We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
It is slow: 45.4 t/s for a 7B model vs 50 t/s for iq2_xs,
or 63.3 t/s for q2_K_S.
Quite slow: 43 t/s for a 7B model
PP-512 goes to 473 t/s up from 452 t/s.
Just use the same trick as iq4_k
There have been a few minor improvements here and there, so I updated the AVX2 Bitnet performance values to the current main branch.
* iq2_tn: TriLM specific 2.0625 bpw quantization

Quantize/dequantize/scale dot product.

I get 46 t/s for the TriLM-3.9B without any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation.

* iq2_tn: AVX512

Just reusing the k-quants template gets us to PP-512 = 376 t/s,
TG-128 = 47.6 t/s for TriLM-3.9B.

* iq2_tn: AVX512

With this tweak we get to PP-512 = 431 t/s.

* iq2_tn: AVX512

With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
At 4 threads we saturate at 48.41 t/s, and then performance slowly
degrades with increasing number of threads.

* iq2_tn: AVX2

PP512 = 440 t/s on the Ryzen-5975WX.
We should be able to do better.

* iq2_tn: initial NEON version

* iq2_tn: NEON

For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
TG-128 = 75.5 t/s. This is in line with what we have for
iq2_bn and the 3.3B Bitnet.

* iq2_tn: Metal

For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
TG-128 = 98.5 t/s.

* iq2_tn: CUDA

For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
TG-128 = 299.2 t/s.

* iq2_tn: AVX2 PP improvement

We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn.
Bitnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would
expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something
that is not quite optimal in iq2_tn.

* iq2_tn: small NEON improvement

For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K
(0.14% vs 0.26% quantization error).
90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.
We need to do 4 shuffles to get the non-uniform values, so this
makes it slower than other iqX_k quants.

And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies
with 16-bit block scales. But this can overflow for iq4_k, iq5_k, and
iq6_k. I guess I did not notice with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot be packing
the _mm512_dpbusds_epi32 result into 16 bits for 4-, 5-, and 6-bit iqX_k
quants.

For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
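For a rough sense of why the 16-bit packing breaks down for the higher-bit quants (the ranges below are assumptions made for the sake of the example, not values taken from the kernels):

```c
// Rough illustration of the 16-bit packing problem described above.
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // 32 six-bit quant values (up to 63) dotted with int8 activations (up to 127):
    int32_t block_sum = 32 * 63 * 127;       // = 256032, fine as a 32-bit integer
    int16_t packed    = (int16_t) block_sum; // wraps around: the value no longer fits in 16 bits
    printf("32-bit sum = %d, packed into 16 bits = %d\n", block_sum, (int) packed);
    return 0;
}
```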
We now arrive at pp-512 = 147 t/s for LLaMA-3.1-8B.
TG-128 is 9.5 t/s. This is better than last commit,
but still kind of slow compared to Q6_K.

My last commit message is wrong: also iq3_k needs a fix
for overflow.
Respectable performance, only slightly slower than Q6_K.
About 4% slower than Q6_K for PP-512, but 10% faster for TG-128.
Someone has screwed up Q6_K TG performance on Metal? With the
cobntinuous "improvements" in ggml I wouldn't be surprised.
Need to look into it later.
I always use cmake, so had forgotten to pay attention to the
Makefile.
* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
GGML_OP_RESHAPE, GGML_OP_VIEW, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE,
along with GGML_OP_NONE, are all noops. I.e., nothing happens.
But ggml still has a barrier after them, which wastes time.
The waste is not too bad for large models where computations are
long compared to the time taken for thread synchronization.
But for small models skipping those unnecessary waits makes
a significant difference. E.g., for the 99M TriLM model,
TG-500 goes up to 1426 t/s from 1240 t/s.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
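A minimal, self-contained sketch of the barrier-skipping idea described above (the enum and loop are simplified stand-ins, not the actual ggml scheduler code):

```c
// Simplified stand-in for the ggml graph loop: ops that only change tensor
// metadata do no work, so neither the computation nor the barrier after
// them is needed.
#include <stdbool.h>
#include <stdio.h>

enum op { OP_NONE, OP_RESHAPE, OP_VIEW, OP_PERMUTE, OP_TRANSPOSE, OP_MUL_MAT };

static bool op_is_noop(enum op o) {
    return o == OP_NONE || o == OP_RESHAPE || o == OP_VIEW ||
           o == OP_PERMUTE || o == OP_TRANSPOSE;
}

int main(void) {
    enum op graph[] = { OP_VIEW, OP_MUL_MAT, OP_PERMUTE, OP_MUL_MAT };
    for (size_t i = 0; i < sizeof(graph) / sizeof(graph[0]); ++i) {
        if (op_is_noop(graph[i])) {
            continue;                        // skip both the (empty) work and the barrier
        }
        printf("compute node %zu, then synchronize threads\n", i);
        // in the real scheduler a thread barrier would go here
    }
    return 0;
}
```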
ikawrakow and others added 24 commits March 18, 2025 15:41
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
as it is not supported.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
I broke it with PR #265. I was testing with a model where
the wk_b and wk_v tensors were present, so they didn't need to be computed,
and so I didn't notice that the change I made to ggml_compute_forward_dup_q
breaks that computation.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Repack a model with the quantize tool

* WIP

* Fixed various issues

As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.

* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved

* Fix GCC 13.3 compilation error

* Another one

* Add missing include

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Repack a model with the quantize tool

* WIP

* Fixed various issues

As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.

* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved

* Fix GCC 13.3 compilation error

* Another one

* Add missing include

* FlashMLA-3: the best of both worlds - CPU only

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This results in GGGGGGGGGGGGG when generating with
mla = 3, fa = 0.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP Gemma3: not working

* gemma3: build_gemma3 seems to be working now

* Revert changes to convert_hf_to_gguf.py

It wasn't working, so I guess it is better to leave the
conversion up to upstream.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding ability to use THP on Linux

* Use the actual page size used for mmap also in munmap

* Add -thp to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
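A rough sketch of the THP allocation path described above, including the munmap size fix from the second bullet (the exact flags and fallback handling in the actual commits may differ):

```c
// Rough sketch, not the code from the commits: round the allocation up to the
// huge page size, map it, and ask the kernel to back it with transparent
// huge pages. munmap must then use the rounded-up size, not the original request.
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    const size_t huge_page_size = (size_t) 2 << 20;     // assume 2 MiB huge pages
    const size_t requested      = (size_t) 701 << 20;   // 701 MiB requested
    const size_t mapped         = (requested + huge_page_size - 1) & ~(huge_page_size - 1);

    void * ptr = mmap(NULL, mapped, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED) { perror("mmap"); return 1; }

    madvise(ptr, mapped, MADV_HUGEPAGE);                 // request THP backing
    munmap(ptr, mapped);                                 // use the mapped size, not `requested`
    return 0;
}
```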
* Fix it for nth > rk2

* Handle rk2%nth_k != 0

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Improve DeepSeek batched processing speed

* Revert the commented out section in iqk_mul_mat.cpp

It does have some benefit at long contexts.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Make fused MoE reproducible

As a bonus, peak performance at pp2048 with u_batch = 2048 is
~8% better.

* Slightly better

* Also do it for non-fused mul_mat_id

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Update sweep bench (deprecating .jsonl support)

* Fix README.md
…284)

* llama-bench: enable having different number of threads for tg and pp

* Add -tgb to usage

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
@ikawrakow
Owner

I think it needs to be ifdef'ed so the code will still build on Windows.

I wouldn't make it the default unconditionally; we should be able to turn it on/off via a command line parameter. It would also be useful if @ubergarm tested the performance implications.

Concerning the NUMA advantage: yes, it will spread the KV cache more evenly between NUMA nodes. But aren't we concerned that it may result in each NUMA node having to fetch KV cache data from another node? The KV cache grows as generation progresses, so in each new evaluation the threads access different portions of it; evenly spreading the cache across NUMA nodes is only meaningful if we also have something in place that makes threads always process the same portions of the KV cache.
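Regarding the ifdef and the command-line toggle, something along the following lines is presumably what is needed (all names here are placeholders, not code from this PR):

```c
// Placeholder sketch of the suggested guard + toggle; none of these names
// come from the PR.
#include <stdio.h>

#if defined(_WIN32)
    #define KV_CACHE_MMAP_SUPPORTED 0   // no POSIX mmap here; keep the default buffer type
#else
    #define KV_CACHE_MMAP_SUPPORTED 1
#endif

int main(void) {
    // In the real code this would come from a command line option
    // (e.g. a hypothetical --kv-mmap flag), not a hard-coded value.
    int requested_mmap_kv = 1;

    if (requested_mmap_kv && KV_CACHE_MMAP_SUPPORTED) {
        printf("using the mmap-backed KV cache buffer type\n");
    } else {
        printf("using the default KV cache buffer type\n");
    }
    return 0;
}
```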

@saood06
Collaborator Author

saood06 commented Mar 27, 2025

I think it needs to be ifdef'ed so the code will still build on Windows.

I wouldn't make it the default unconditionally; we should be able to turn it on/off via a command line parameter.

Yes, I agree on the needed changes if this is to be merged; I mainly just remembered that I had done this, and made a draft PR in case anyone finds it useful.

It would also be useful if @ubergarm tested the performance implications.

I'd be interested to know if it affects performance for him, since it doesn't hurt or help my performance anymore.

Concerning the NUMA advantage: yes, it will spread the KV cache more evenly between NUMA nodes. But aren't we concerned that it may result in each NUMA node having to fetch KV cache data from another node? The KV cache grows as generation progresses, so in each new evaluation the threads access different portions of it; evenly spreading the cache across NUMA nodes is only meaningful if we also have something in place that makes threads always process the same portions of the KV cache.

The distribution of the KV cache never resulted in a performance uplift for me (and based on comments in the original PR from both the author and others, it didn't affect them either). From what I remember, it may have allowed me to turn off numa_balancing on my system without a negative impact (though my memory and notes aren't very clear on this). The main reason I used it was that it avoided paging to disk, because the old MLA implementation still had the large, unneeded KV cache.

I do think your concern is valid, but in practice this PR doesn't seem to impact performance, and I'm not really sure why it is performance-neutral.

