Conversation
I remember you mentioned that the Hadamard transform can achieve more precise results. Why is it not enabled by default? What are the trade-offs?
It's slightly slower at inference. |
The main reason it is not on by default is that the code that makes all the necessary checks has not been written. The attention head size must be a power of two, the cache must be quantized, the model cannot use MLA attention, etc. My thinking is that a user who turns on this feature on purpose has understood the use cases and usage constraints (although, judging by some of the issues being posted, the sad reality is that the average user just copy-pastes commands they have found around the Internet). Also, turning on Hadamard transforms comes with a small performance penalty, so it may not be what users obsessed with performance want. The good news is that now, with all the TurboQuant hype, many people will learn that one can apply a rotation to the KV cache data, so perhaps usage of this feature will become more common and better informed.
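The rotation being discussed can be sketched in a few lines. This is a hypothetical illustration, not the ik_llama.cpp implementation: a normalized fast Walsh-Hadamard transform applied in place to one attention head's values. The transform is orthogonal, so it preserves the vector's norm (and the attention dot products) while spreading any outlier channel's energy across all dimensions, which is what makes the subsequent per-block quantization less lossy. The power-of-two head-size constraint mentioned above falls directly out of the butterfly structure.

```python
import math

def fwht(v):
    """Normalized fast Walsh-Hadamard transform; len(v) must be a power of 2.

    The normalized transform is its own inverse, so applying it twice
    recovers the original vector.
    """
    n = len(v)
    assert n & (n - 1) == 0, "head size must be a power of two"
    h = 1
    while h < n:
        # Standard butterfly stage: combine elements h apart.
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in v]

# Toy "head" with one outlier channel, as often seen in K/V activations.
head = [10.0, -0.1, 0.2, 0.05, -0.3, 0.1, 0.0, 0.2]
rotated = fwht(head[:])
# Same norm, but the outlier's energy is now spread over all components,
# so a simple scale-based quantizer wastes less of its range on it.
print(max(abs(x) for x in head), max(abs(x) for x in rotated))
```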
Well, yes, it needs 2 extra operations. That's why it is a separate command-line argument, so that the better-informed user has the option not to use it, given the small performance penalty and very minor quality benefit.
For those big dense models, the penalty has basically been nothing, so I just turn it on, even with a Q8 cache. Oh, and someone is trying to use TurboQuant to quantize entire models. Think there is any merit to it?
It will be just like any other hype. In a week or two nobody will remember it, and everybody will be chasing the next hype wave.
No real performance loss compared to normal Q8, but better PPL than f16. With a RAM speed bottleneck, at least. PPL at 65536 ctx:
With KHAD + VHAD:
No KHAD, no VHAD:
Seems not that much of an improvement?
Well, that's just PPL. When it's really high you can use it to tell that something is wrong, and when it's lower in a certain configuration you can guesstimate that that configuration is probably better for your use case. But how much better isn't something you can estimate from a PPL delta. At the end of the day you're predicting Wikipedia text, without even an instruct format, which has nothing to do with 99% of use cases. And it's probably far more bottlenecked by my Q5 MoE weights than by the KV cache precision. That 0.002 might still make or break some use cases.
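For context on why such small deltas are hard to interpret: perplexity is just the exponential of the mean per-token negative log-likelihood over the evaluation text. A tiny sketch (names and numbers are illustrative, not from any actual run) shows how small per-token probability shifts compress into a small PPL delta:

```python
import math

def perplexity(token_probs):
    """token_probs: the probability the model assigned to each correct
    next token. PPL = exp(mean negative log-likelihood)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Two hypothetical cache configurations, differing slightly per token.
base = [0.25, 0.40, 0.10, 0.55]
slightly_worse = [0.249, 0.398, 0.0995, 0.548]
print(perplexity(base), perplexity(slightly_worse))
```

A uniform coin-flip predictor over two tokens gives PPL exactly 2, which is a handy sanity check for any implementation.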
I wonder if the KLD from F16 can also be calculated. That way you don't need the KLD of the full model.
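The idea above can be sketched as follows. This is hypothetical code, not an existing tool: given per-position logits from an f16-cache reference run and from a quantized-cache run over the same text, compute the mean KL divergence of the quantized distributions from the reference ones, with no need for a higher-precision model.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kld(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy per-position logits; real runs would dump these per token.
f16_logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0]]
quant_logits = [[1.9, 1.05, 0.1], [0.5, 2.4, 0.05]]
mean_kld = sum(kld(softmax(a), softmax(b))
               for a, b in zip(f16_logits, quant_logits)) / len(f16_logits)
print(mean_kld)
```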
This PR adds the ability to also use Hadamard transforms for the V cache. As with the K-cache, the V-head size must be 64, 128, or 256 (so it is not applicable to MLA models such as Kimi or GLM-5).
Unlike the K-cache, it brings very modest accuracy gains as measured by PPL. Hence, comparisons are left to the curious user.
But, as mainline is "raising the bar" 3.5 months after ik_llama.cpp (see #1033, #1034), I decided to add the ability to use Hadamard-transformed V-cache to avoid any potential claims that mainline's approach is somehow better.
To use a Hadamard-transformed V-cache, add -vhad (or --v-cache-hadamard) to the command line. The V cache must be quantized for this to make sense, which requires flash attention to be turned on.
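A plausible full invocation might look like the following. This is a sketch, not a verified command: only -vhad / --v-cache-hadamard is confirmed by this PR; the binary name and the other flags (-fa for flash attention, -ctk/-ctv for quantized cache types, in the usual llama.cpp style) are assumptions and should be checked against your build's --help output.

```
# Hypothetical: quantized K/V cache with flash attention, plus the
# Hadamard-transformed V-cache added by this PR.
./llama-server -m model.gguf -fa -ctk q8_0 -ctv q8_0 -vhad
```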