This PR is a follow-up to #1033 and adds support for K-cache Hadamard transforms on CUDA.
There are various CUDA Hadamard transform implementations on the Internet, but me being me and not liking the addition of external dependencies (or copy-pasting a pile of code that I don't understand), I rolled my own. It is possibly not the fastest implementation, but based on the performance benchmarks below it cannot be totally bad either.
With PPL calculations now running much faster on the GPU, here are some results for a bunch of models. In all cases the V-cache is left as `f16`.

Qwen3-30B-A3B, `IQ2_XXS` quantization

The `Q6_0` result is somewhat peculiar, but Qwen3-30B-A3B does show some strange behavior when it comes to PPL (see #359).

Ling-Mini-2.0, `Q4_K_M` quantization
This model clearly does not like K-cache quantization with less than 8 bpw.
Ministral3-8B-Instruct, Q8_0 quantization
This model does not mind quantized K-cache.
TheDrummer_Tiger-Gemma-12B-v3, IQ4_NL
GLM-4.6, 5.5 bpw Thireus mix
It takes about 17 minutes to run one PPL calculation on my 2x3090 box, so there are fewer results here.
Performance
Here are `sweep-bench` results for the `Q8_0`-quantized Ministral3-8B-Instruct with `Q4_0` K-cache:

No Hadamard transform
With Hadamard transform
I.e., basically negligible performance impact due to the added Hadamard transform op.
Here are `sweep-bench` results for Ling-Mini-2.0, also with `Q4_0` K-cache:

No Hadamard transform
With Hadamard transform
Here we see basically negligible impact for PP, but a nearly 3% drop in TG performance at zero context. My guess is that this is not because of the extra computation, but because of the two additional kernel launches per layer, which are not negligible at nearly 500 t/s. This particular model benefited massively from fused operations, increasing performance from about 400 t/s without fusion to 480 t/s with fusion. It is of course possible to think about fused kernels that include the Hadamard transformation, but I leave that for another day.