Conversation
Slightly slower than iq3_s - 132 t/s vs 138 t/s for LLaMA-3.1-8B.
138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.
We get PP-512 = 180 t/s, TG-128(4 threads) = 16.35 on the Ryzen-7950X for LLaMA-3.1-8B. In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp
We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
It is slow: 45.4 t/s for a 7B model vs 50 t/s for iq2_xs, or 63.3 t/s for q2_K_S.
Quite slow: 43 t/s for a 7B model
PP-512 goes to 473 t/s up from 452 t/s.
Just use the same trick as iq4_k
There have been a few minor improvements here and there, so updated the AVX2 Bitnet performance values to current main branch.
* iq2_tn: TriLM-specific 2.0625 bpw quantization
  Quantize/dequantize/scale dot product. I get 46 t/s for TriLM-3.9B without any SIMD! Finally a compiler doing a decent job auto-vectorizing the scalar implementation.
* iq2_tn: AVX512
  Just reusing the k-quants template gets us to PP-512 = 376 t/s, TG-128 = 47.6 t/s for TriLM-3.9B.
* iq2_tn: AVX512
  With this tweak we get to PP-512 = 431 t/s.
* iq2_tn: AVX512
  With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads. At 4 threads we saturate at 48.41 t/s, and then performance slowly degrades with increasing number of threads.
* iq2_tn: AVX2
  PP-512 = 440 t/s on the Ryzen-5975WX. We should be able to do better.
* iq2_tn: initial NEON version
* iq2_tn: NEON
  For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s, TG-128 = 75.5 t/s. This is in line with what we have for iq2_bn and the 3.3B Bitnet.
* iq2_tn: Metal
  For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s, TG-128 = 98.5 t/s.
* iq2_tn: CUDA
  For TriLM-3.9B running on an RTX-4080 we get PP-512 = 9936 t/s, TG-128 = 299.2 t/s.
* iq2_tn: AVX2 PP improvement
  We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX. We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn. Bitnet-3B is actually 3.4B and TriLM-3.9B is 3.99B, so we would expect 3.43/3.99 * 636 = 546 t/s; it seems we still have something that is not quite optimal in iq2_tn.
* iq2_tn: small NEON improvement
  For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
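The 2.0625 bpw figure can be sanity-checked with a bit of arithmetic. The layout assumed below (a super-block of 256 ternary weights packed at 2 bits each, plus one fp16 scale) is an illustration chosen to make the number come out, not the exact iq2_tn storage format:

```python
# Illustrative arithmetic for the 2.0625 bpw figure. The super-block
# size and single fp16 scale are assumptions, not the actual iq2_tn
# layout.
BLOCK_SIZE = 256          # weights per super-block (assumed)
BITS_PER_WEIGHT = 2       # packed ternary values
SCALE_BITS = 16           # one fp16 scale per super-block (assumed)

total_bits = BLOCK_SIZE * BITS_PER_WEIGHT + SCALE_BITS
bpw = total_bits / BLOCK_SIZE
print(bpw)  # 2.0625
```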
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K (0.14% vs 0.26% quantization error).
90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.
We need to do 4 shuffles to get the non-uniform values, so this makes it slower than other iqX_k quants. And then I realized that I was using the standard Zen4 template for all iqX_k quants. The standard template converts the 32-bit integers obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies with the 16-bit block scales. But this can overflow for iq4_k, iq5_k, and iq6_k. I guess I did not notice with iq4_k and iq5_k because the PPL difference to CUDA was relatively small, and I attributed it to Q8_K not being accurate enough for the activations. But for iq6_k the PPL difference was much too big to be attributable to Q8_K inaccuracies, so that's when I realized that I cannot be packing the _mm512_dpbusds_epi32 result into 16 bits for the 4-, 5-, and 6-bit iqX_k quants. For now I have fixed it for iq6_k, but the outcome is that it is significantly slower than Q6_K: I get PP-512 = 125 t/s for LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better approach.
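A back-of-the-envelope check makes the overflow plain. The sub-block size (32) and the value ranges below are illustrative assumptions rather than the exact iqX_k details; non-uniform grids can have larger magnitudes than 2^bits - 1, so real cutoffs may differ:

```python
# Worst-case partial sums when packing int32 accumulators back into
# int16, assuming 32-element sub-blocks, unsigned quant values up to
# 2^bits - 1, and int8 activations up to 127. These ranges are
# illustrative assumptions, not the exact iqX_k value grids.
INT16_MAX = 32767

def worst_case_sum(bits, block=32):
    qmax = (1 << bits) - 1   # max unsigned quant value (assumed)
    amax = 127               # max |int8 activation|
    return qmax * amax * block

for bits in (3, 4, 5, 6):
    s = worst_case_sum(bits)
    print(bits, s, "overflows int16" if s > INT16_MAX else "fits")
```

Under these assumptions the 4-, 5-, and 6-bit sums all exceed the int16 range, while the 3-bit case just fits; a non-uniform 3-bit grid with values beyond 7 could still push it over.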
We now arrive at PP-512 = 147 t/s for LLaMA-3.1-8B. TG-128 is 9.5 t/s. This is better than the last commit, but still kind of slow compared to Q6_K. My last commit message is wrong: iq3_k also needs a fix for overflow.
Respectable performance, only slightly slower than Q6_K.
About 4% slower than Q6_K for PP-512, but 10% faster for TG-128. Someone has screwed up Q6_K TG performance on Metal? With the continuous "improvements" in ggml I wouldn't be surprised. Need to look into it later.
See comments in f3a823c
I always use cmake, so had forgotten to pay attention to the Makefile.
* Merge mainline
* Fix after merge
* Remove CI check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
GGML_OP_RESHAPE, GGML_OP_VIEW, GGML_OP_PERMUTE, and GGML_OP_TRANSPOSE, along with GGML_OP_NONE, are all noops, i.e., nothing happens. But ggml still has a barrier after them, which wastes time. The waste is not too bad for large models, where computations are long compared to the time taken for thread synchronization. But for small models, skipping those unnecessary waits makes a significant difference. E.g., for the 99M TriLM model, TG-500 goes up to 1426 t/s from 1240 t/s. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
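The effect can be sketched with a toy version of the compute loop. Op names are borrowed from ggml, but the loop itself is a simplification, not the real ggml scheduler:

```python
# Toy model of a graph-compute loop that inserts a thread barrier
# after every node. Skipping the barrier for noop ops (reshape, view,
# permute, transpose, none) removes synchronization that buys nothing,
# since those ops do no work. Simplified; not the real ggml code.
NOOPS = {"NONE", "RESHAPE", "VIEW", "PERMUTE", "TRANSPOSE"}

def barriers_needed(graph, skip_noops=True):
    count = 0
    for op in graph:
        if skip_noops and op in NOOPS:
            continue  # nothing was computed, so nothing to wait for
        count += 1    # real work: all threads must sync here
    return count

graph = ["MUL_MAT", "VIEW", "SOFT_MAX", "PERMUTE", "RESHAPE", "ADD"]
print(barriers_needed(graph, skip_noops=False))  # 6 barriers before
print(barriers_needed(graph, skip_noops=True))   # 3 barriers after
```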
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
as it is not supported. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
I broke it with PR #265. I was testing with a model where the wk_b and wk_v tensors were present, so they didn't need to be computed, and hence I didn't notice that the change I made to ggml_compute_forward_dup_q breaks that computation. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Repack a model with the quantize tool
* WIP
* Fixed various issues
  As we don't have a way to tell if a repacked quant has been modified, I had to remove the modification at the expense of a slight decrease in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and q4_0_r8 on ARM.
* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved
* Fix GCC 13.3 compilation error
* Another one
* Add missing include
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Repack a model with the quantize tool
* WIP
* Fixed various issues
  As we don't have a way to tell if a repacked quant has been modified, I had to remove the modification at the expense of a slight decrease in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and q4_0_r8 on ARM.
* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved
* Fix GCC 13.3 compilation error
* Another one
* Add missing include
* FlashMLA-3: the best of both worlds - CPU only
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
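The general idea behind the interleaved `_r8` layouts can be sketched as follows. This shows only the 8-row interleaving of quant blocks; the real q8_0_r8 repacking also rearranges bytes within each block, which is not reproduced here:

```python
# Sketch of 8-row interleaving: take 8 consecutive rows, each a list
# of quant blocks, and emit their blocks column-by-column, so a GEMM
# kernel can load one block from each of 8 rows contiguously.
# Illustrative only; not the actual _r8 byte layout.
def repack_r8(rows):
    assert len(rows) % 8 == 0
    packed = []
    for r0 in range(0, len(rows), 8):
        group = rows[r0:r0 + 8]
        n_blocks = len(group[0])
        for b in range(n_blocks):   # block index along the row
            for row in group:       # 8 rows interleaved per block
                packed.append(row[b])
    return packed

rows = [[f"r{r}b{b}" for b in range(2)] for r in range(8)]
print(repack_r8(rows)[:8])  # block 0 of rows 0..7, back to back
```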
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This results in GGGGGGGGGGGGG when generating with mla = 3, fa = 0. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP Gemma3: not working
* gemma3: build_gemma3 seems to be working now
* Revert changes to convert_hf_to_gguf.py
  It wasn't working, so I guess it is better to leave the conversion up to upstream.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding ability to use THP on Linux
* Use the actual page size used for mmap also in munmap
* Add -thp to llama-bench
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
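The munmap fix amounts to rounding the mapping length with the same (possibly huge) page size in both directions. A minimal sketch, assuming a typical 2 MiB x86-64 huge page; the real code queries the system for the actual page size:

```python
# Round an mmap length up to the page size actually used for the
# mapping. With transparent/explicit huge pages this may be 2 MiB
# rather than 4 KiB, and munmap must use the same rounded length as
# mmap did, or part of the mapping is left behind. Sizes here are
# assumptions for illustration.
def round_up(size, page_size):
    return ((size + page_size - 1) // page_size) * page_size

BASE_PAGE = 4 * 1024          # 4 KiB base page
HUGE_PAGE = 2 * 1024 * 1024   # 2 MiB huge page (typical x86-64)

size = 5 * 1024 * 1024 + 123  # some odd mapping size
print(round_up(size, BASE_PAGE))  # 5246976
print(round_up(size, HUGE_PAGE))  # 6291456
```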
* Fix it for nth > rk2
* Handle rk2%nth_k != 0
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Improve DeepSeek batched processing speed
* Revert the commented out section in iqk_mul_mat.cpp
  It does have some benefit at long contexts.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Make fused MoE reproducible
  As a bonus, peak performance at pp2048 with u_batch = 2048 is ~8% better.
* Slightly better
* Also do it for non-fused mul_mat_id
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Update sweep bench (deprecating .jsonl support)
* Fix README.md
…284)
* llama-bench: enable having different number of threads for tg and pp
* Add -tgb to usage
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
I think it needs to be ifdef'ed so the code will still build on Windows. I wouldn't make it the default unconditionally; we should be able to turn it on/off via a command line parameter. It would also be useful if @ubergarm tested the performance implications. Concerning the NUMA advantage: yes, it will spread the KV cache more evenly between NUMA nodes. But aren't we concerned it may result in each NUMA node having to fetch KV cache data from another NUMA node? The KV cache grows as generation progresses, so in each new evaluation threads access different portions of the KV cache. The strategy of evenly spreading the cache across NUMA nodes is therefore only meaningful if we also had something in place that would make threads always process the same portions of the KV cache.
Yes I agree on the needed changes if this is to be merged in, I mainly just remembered I did this, and made a draft PR in case anyone finds it useful.
I'd be interested to know if it affects performance for him, since it doesn't hurt or help my performance anymore.
The distribution of the KV cache never resulted in a performance uplift for me (and based on comments in the original PR from both the author and others, it didn't affect them either). From what I remember, it may have allowed me to turn off numa_balancing on my system without a negative impact (though my memory and notes aren't very clear on this). The main reason I used it was that it avoided paging to disk, because the old MLA implementation still had the large unneeded KV cache. I do think your concern is valid, but in practice this PR doesn't seem to impact performance, and I'm not really sure why it is performance neutral.
Port of ggml-org/llama.cpp#11580
I have not used this since the old KV cache is no longer allocated (it helped when both were allocated, as this buffer type would never actually touch the pages of the old KV cache, which allowed me to avoid paging out to disk), but it still doesn't hurt my performance.
Finally deciding to grab the code from my very old local branch and put it here in case it ends up being beneficial to anyone.
This PR always uses the new buffer type for the KV cache, as there is no toggle implemented. A toggle can be added if this ends up being useful in some situations but a loss in others; so far I haven't found a situation where it causes a performance loss, though.
In theory this should be better for NUMA as I do remember noting it caused a more even split of memory usage across the two nodes on my machine.
This also might have the benefit of letting you allocate the full context size of a model, only seeing a performance loss when you actually go over the physical memory limit, as it will avoid paging until then.
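The even split across nodes mentioned above is what page-interleaved placement provides. A minimal sketch of round-robin page-to-node mapping, in the spirit of the kernel's MPOL_INTERLEAVE policy (an illustration only; real placement also depends on which thread first touches each page):

```python
# Sketch of interleaved placement for a growing KV cache, assuming a
# simple round-robin page->node mapping like the kernel's
# MPOL_INTERLEAVE policy. Illustrative; not the actual buffer-type
# implementation discussed in this PR.
def interleaved_node(page_index, n_nodes):
    return page_index % n_nodes

n_nodes = 2
pages = 8  # pages touched so far as the cache grows
placement = [interleaved_node(p, n_nodes) for p in range(pages)]
print(placement)  # [0, 1, 0, 1, 0, 1, 0, 1]
per_node = [placement.count(n) for n in range(n_nodes)]
print(per_node)   # [4, 4] -- memory use stays even across nodes
```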