~25% faster than dequantize+cuBLAS, ~10% slower than Q4_0 MMQ.
Interesting. I've always dealt with it by either comparing the second row (it is generally more stable between runs anyway) or just running a very low-context sweep-bench as a warm-up.
It does not affect CPU performance. But on CUDA the time it takes to find and load the pre-compiled kernels is not negligible compared to the time needed to compute a batch (at least for the 8B model I used here). I had noticed this peculiar behavior, but as I have been testing mostly MoE models lately, I thought it was somehow related to that (we know MoE models do better with larger u-batches). I'll make the PP warm-up pass optional via a command-line argument, as for very large models on the CPU it does take some time to process a batch of 512 tokens.
I just looked back at my notes/logs: on the CPU it is the first TG run that varies, and the cause is different, as there is corresponding disk activity that is almost certainly to blame (very little, but in my experience even a single HDD seek can sometimes be seen in the numbers). I have done GPU speed testing, but I generally don't look at the PP results, especially not at low contexts, so I never reran to see it go away.
Thanks, I was going to suggest that, as it is very true for some of my testing.


`IQX_K` quants offer better quantization quality for the same number of bits spent compared to k- and i-quants. But on CUDA they are slower for prompt processing (PP) because matrix multiplications are done via dequantize->cuBLAS, so I thought it was time to fix this.

This PR adds quantized matrix multiplications, also known as MMQ, for `IQ4_KS`.

The following graph shows PP performance as a function of the number of tokens in the KV cache `N_KV` for the main branch (black) and the PR (red). Model is LLaMA-3.1-8B-Instruct, GPU is RTX-4080. We see a very nice performance improvement in the range of 25%.

Main branch

PR
Are you wondering why PP performance for `N_KV = 0` is significantly lower? I did as well, so I checked `llama-sweep-bench`, the tool with which the data for this graph is generated. Warm-up is done via a single TG run. I checked that if I add another warm-up run with `n_ubatch` tokens, performance for `N_KV = 0` becomes higher than for `N_KV = 512`, as expected. I guess I will submit a separate PR for that.

TG performance is not affected at all by this PR, so no graph for that.