TurboQuant: KV cache quantization with Hadamard transform (TBQ3_0 / TBQ4_0)#115
Draft
jesusmb1995 wants to merge 20 commits into tetherto:temp-7248 from
Conversation
Implements the MSE-optimal stage of TurboQuant (Zandieh et al., ICLR 2026) for CPU. Compresses KV cache vectors to 3.25 or 4.25 bits per value using randomized rotation + Lloyd-Max scalar quantization on each coordinate. New GGML types: GGML_TYPE_TQ3_0 (8 centroids, 3-bit indices) and GGML_TYPE_TQ4_0 (16 centroids, 4-bit indices), block size 128 (= head_dim). Includes block structs, quantize/dequantize, vec_dot via dequantize path, type registration, CLI argument support, and head_dim != 128 validation.
Replace per-type if/else chains with a data-driven table mapping each GGML type to its expected quantization and dot product error thresholds. Makes adding new quantization types cleaner.
Verify that TQ4 error < TQ3 error (more bits = less error) and that TQ3 error < TQ1/TQ2 error (TurboQuant beats ternary quantization), catching codebook or implementation regressions.
Shell script that runs perplexity across cache types (tq3_0, tq4_0, q4_0, q8_0, f16) on a real model and checks for PPL regressions. Auto-downloads model and wikitext-2 dataset. Validates tq4_0 PPL <= tq3_0 PPL.
Replace the O(d²) Gram-Schmidt rotation matrix with a randomized Hadamard transform: (1/√d) · H · D, where H is the Walsh-Hadamard matrix applied via Fast Hadamard Transform and D is a random ±1 diagonal. ~18x fewer ops per block (896 vs 16,384 for d=128). Also fixes the Lloyd-Max codebooks to match the correct Beta(d=128) distribution computed by scripts/compute_tq_codebooks.py. Removes the block=64 variant API (types not yet fully wired up) and adapts the quantize/dequantize helpers to use ggml_half for norm storage.
Add block=64 variants (TQ3_0_64, TQ4_0_64) for models with head_dim=64 (Llama-3.2-1B/3B). The user specifies tq3_0/tq4_0 on the CLI and the KV cache init auto-selects the d=64 internal type when head_dim=64. Includes d=64 Lloyd-Max codebooks computed from the exact Beta(d=64) PDF, separate Hadamard sign arrays (seed 43), and full CPU wiring: type_traits, quantize/dequantize functions, vec_dot, ops.cpp dispatch lists, and the test-quantize-fns thresholds. Note: d=64 quality is significantly worse than d=128 (+31% TQ3 / +7% TQ4 PPL regression vs f16) due to the wider coordinate distribution. QJL Stage 2 would help but is not yet implemented.
Add Vulkan compute shaders for TQ3_0/TQ4_0 dequantization, quantization (copy-to-quant), and SET_ROWS. The dequant shaders use 128 threads per workgroup with shared-memory Fast Hadamard Transform (7 butterfly stages). The quantize shaders (copy_to_quant.comp) implement the forward path: norm computation, sign application, FHT, codebook nearest-centroid search, and bit-packing — all in a single thread per block. Registers dequant, CPY f32→TQ, and SET_ROWS pipelines in ggml-vulkan.cpp. Updates supports_op for MUL_MAT, CPY, and SET_ROWS. Block type definitions added to types.glsl and codebook/sign constants to tq_utils.comp. V cache auto-downgrades to f16 on GPU since Flash Attention shaders cannot perform inline FHT dequantization for TQ types.
Replace the two-pass dequant→fp16→matmul fallback with fused mul_mat_vec shaders that dequantize and compute the dot product in a single GPU dispatch. Each workgroup (128 threads) processes one output row, iterating over TQ blocks with shared-memory FHT + dot accumulation. Uses subgroup shuffle (GL_KHR_shader_subgroup_shuffle) for the first log2(subgroup_size) FHT stages (no barriers), falling back to shared memory only for the remaining 2 stages (subgroup_size=32). This reduces per-block barrier count from 8 to 3. GPU utilization is improved but still ~50% vs 100% for native types (q4_0, q8_0). The remaining gap is due to the FHT synchronization overhead and 128-thread workgroup size. Further optimization would require fused Flash Attention with pre-rotated queries (rotate Q once, dot with raw codebook values — no FHT in the hot loop).
Move the Hadamard transform from inside the quantize/dequantize kernels to the graph level as an explicit MUL_MAT on Q, K, V. Since Q and K are rotated by the same orthogonal matrix H, the rotations cancel in the attention dot product: (H·q)·(H·k) = q·k for every query/key pair. This allows the quantize/dequantize kernels to become pure codebook operations with no FHT, no sign flips, no shared memory barriers:
- llama-graph.cpp: add ggml_rotate_hadamard() applied to Q/K/V when the KV cache uses a quantized type
- copy_to_quant.comp (SET_ROWS): remove tq_fht_local(), sign flips, inv_sqrt_d. Add binary-search quantize (3 comparisons for TQ3, 4 for TQ4) and packed 3-bit writes (8 indices per uint32)
- dequant_tq3_0.comp, dequant_tq4_0.comp: remove inverse FHT, shared memory, sign flips. Now just codebook lookup + scale
- mul_mat_vec_tq3_0.comp, mul_mat_vec_tq4_0.comp: remove subgroup shuffle FHT and shared memory FHT stages
- ggml-quants.c: remove tq_forward_inplace() from quantize, tq_inverse_inplace() from dequantize

SET_ROWS: 88 us -> 19 us (4.6x faster). Graph-level rotation adds <2% overhead.
Add missing Vulkan pipeline support for TQ3_0/TQ4_0 batch operations:
- dequant_funcs.glsl: add TQ3_0/TQ4_0 dequant() and dequant4() for the mul_mm batch matmul path (3-bit unpack + codebook lookup for TQ3, nibble unpack + codebook lookup for TQ4)
- vulkan-shaders-gen.cpp: remove the tq3_0/tq4_0 skip in matmul_shaders() and copy_from_quant generation
- ggml-vulkan.cpp: add pipeline_dequant_mul_mat_mat registrations, pipeline_cpy_quant_f32 registrations (TQ3/TQ4 -> f32 GPU dequant), and dispatch routing in the ggml_vk_get_cpy_pipeline switch statements

Prevents CPU fallback for copy/dequant operations when reading from the TQ3/TQ4 KV cache on GPU.
Add inline codebook dequantization to the Vulkan Flash Attention shader, enabling FA for TurboQuant KV cache types. Since optRot moved the Hadamard rotation to the graph level, the FA shader only needs a simple codebook lookup (same complexity as Q4_0's linear dequant).
- flash_attn_base.glsl: add dequantize4() for TQ3_0 (3-bit unpack + TQ3_CB lookup) and TQ4_0 (nibble unpack + TQ4_CB lookup) with separate K/V buffer bindings
- vulkan-shaders-gen.cpp: remove the tq3_0/tq4_0 skip in FA shader generation, add to the scalar FA path
- ggml-vulkan.cpp: add CREATE_FA for TQ3_0/TQ4_0, add to the supports_op FLASH_ATTN_EXT switch
- llama-context.cpp: remove the stale V-cache f16 downgrade workaround (was: "TurboQuant V cache not supported with GPU Flash Attention")

Result: graph splits drop from 66 to 2, prompt eval goes from 29 tok/s to 127 tok/s (4.4x speedup). TQ3 is now within 5% of f16 performance.
Useful for models with head dimensions != 128.
Extend test-quantize-fns to run quantization round-trip (RMSE) and dot product (mul_mat) checks on any ggml backend, not just CPU. A new `-b <backend>` flag (e.g. `-b vulkan`) selects the backend; without it the test runs the original CPU-only path. The backend path builds small ggml compute graphs (ggml_cpy for round-trip, ggml_mul_mat for dot product) and executes them through ggml_backend_sched, with a CPU fallback backend for ops the primary backend does not support. Gracefully skips types whose cpy or mul_mat ops are unsupported on the chosen backend, and adds verbose debug output (`-v`) showing tensor shapes, backend support decisions, and raw GPU vs reference results to aid debugging backend-specific numerical issues.
The TQ3_0 and TQ4_0 mul_mat_vec pipelines require BLOCK_SIZE=128 (matching QUANT_K), but were compiled with SHADER_REDUCTION_MODE_SUBGROUP which only reduces within a single subgroup via subgroupAdd — with no cross-subgroup shared-memory reduction step. On GPUs where subgroup_size < 128 (e.g. NVIDIA with warp size 32), this caused only 1/N of the partial sums to be accumulated, where N = BLOCK_SIZE / subgroup_size. For a typical NVIDIA GPU the dot product result was exactly 1/4 of the correct value, producing ~0.82 normalised error in test-quantize-fns. Fix by selecting SHADER_REDUCTION_MODE_HYBRID for TQ3/TQ4 pipelines in both f32 and f16 variants, which performs subgroupAdd within each subgroup then sums across subgroups through shared memory.
Add a -b BACKEND flag to test-quantize-perf.cpp for running quantization benchmarks on GPU backends (e.g. vulkan, cuda) via the ggml scheduler. When -b is specified, the tool builds ggml compute graphs and benchmarks three operations through the backend scheduler:
- quantize: cpy f32 -> quant (SET_ROWS equivalent)
- dequantize: cpy f32 -> quant -> f32 (roundtrip)
- mul_mat: quantized matmul (vec_dot equivalent)

Each reports min/avg time and throughput. Ops not supported by the backend are gracefully skipped. Without -b, behavior is unchanged (direct CPU function pointer calls with cycle counting).
- Split the cache quantization taxonomy into PQ (stage-1 only) and TBQ (stage-1 + QJL stage-2 correction), including pq4_0/tbq4_0 coverage.
- Implemented the TBQ4 stage-2 QJL encode/correction in the CPU and Vulkan paths, fixed scaling/sign-quantization behavior, and aligned struct/shader packing for block data.
- Fixed Vulkan/Flash Attention regressions (symbol lookup, BLOCK_BYTE_SIZE alignment, shader generation, and flash-attention QJL correction wiring) so TBQ variants run end-to-end.
- Updated validation and PR draft docs with correction-phase TBQ vs PQ results and notes.
Enable Flash Attention to use different quantization types for the K and V caches (e.g. TBQ3_0 keys with PQ4_0 values), allowing finer memory/quality trade-offs per cache.
- Refactored the FA shader (flash_attn_base.glsl) to declare separate K and V buffer bindings with independent block sizes and dequantize functions (dequantize4_k / dequantize4_v), replacing the single shared dequantize4 that dispatched on a binding index.
- Added backward-compat defines that map legacy DATA_A_* to DATA_K_*/DATA_V_* so existing same-type pipelines keep working unchanged.
- Extended vk_fa_pipeline_state to carry v_type, enabling the pipeline cache to distinguish same-K-type entries that differ only in V type.
- Added a CREATE_FA_MIXED macro and ~30 mixed-type pipeline registrations covering all TBQ/PQ/Q8_0/F16 cross-pairs (scalar path).
- Updated supports_op to accept mixed K/V pairs when at least one side is TBQ/PQ, falling back to FA_SCALAR automatically.
- Added mixed-type shader generation in vulkan-shaders-gen.cpp for all relevant K/V combinations.
…erage
- Extended test_flash_attn_ext to track distinct type_K and type_V (backward-compatible: the single-type constructor still works).
- Added mixed-type cases for TBQ/PQ plus Q8_0/F16 pairs in both MODE_TEST and MODE_PERF.
- Updated vars() so test filtering can target type_K and type_V independently.

Commands:
./build/bin/test-backend-ops -o FLASH_ATTN_EXT -p "tbq|pq" test
./build/bin/test-backend-ops -o FLASH_ATTN_EXT -p "tbq|pq" perf
Extend test-kv-cache-quantization.sh to run perplexity benchmarks with different quantization types for the K and V caches, exercising the mixed-type Flash Attention paths added in the previous commit.
- Added 6 mixed K/V pairs (e.g. K=tbq3_0/V=pq3_0, K=tbq4_0/V=q8_0) that run after the existing same-type tests.
- Print a separate "Mixed K/V Results Summary" table with PPL, vs-f16 regression, and timing columns.
- Check each mixed pair against the MAX_PPL_REGRESSION_PCT threshold.
Are you planning to merge this before the rebase to the latest version of llama.cpp?
Summary
Implements TurboQuant KV cache quantization (Zandieh et al., ICLR 2026) for CPU and Vulkan backends with full Flash Attention support. Compresses KV cache to 3.25-4.25 bits per value, enabling ~4-5x larger context windows on the same hardware.
Paper: https://arxiv.org/pdf/2504.19874
Community discussion: ggml-org#20969
Related upstream PR: ggml-org#21038 (graph-level rotation for existing quant types)
How it works
The rotation decorrelates coordinates, making each follow a known Beta distribution. A fixed codebook (no per-block calibration) achieves near-optimal MSE. The rotation cancels in the attention dot product:
Since H (with the sign diagonal folded in) is orthogonal, HᵀH = I, so for every query/key pair (H·q)·(H·k) = qᵀHᵀHk = q·k — no inverse rotation is needed at attention time.

What's new vs community state
The upstream discussion and community implementations so far have:
This PR adds:
- New GGML types (`pq3_0`, `pq4_0`, `TBQ3_0_64`, `TBQ4_0_64`), block structs, type traits, CPU backend registration, KV cache support via `--cache-type-k`/`--cache-type-v`
- Lloyd-Max codebooks generated by `scripts/compute_tq_codebooks.py`
- The user specifies `pq3_0`/`pq4_0` on the CLI; KV cache init automatically selects the d=64 internal type when head_dim=64

Usage
Works transparently with both head_dim=128 (Llama-3.1, Qwen, Mistral) and head_dim=64 (Llama-3.2-1B/3B) — the right block size is auto-selected.
Perplexity results
Vulkan GPU: Mistral-7B (Q4_K_S weights, head_dim=128), wikitext-2, 4 chunks × 128 ctx
With Flash Attention enabled (GPU fully saturated; caveat: these numbers are a bit noisy and would need longer runs to be conclusive):
q4_0, pq3_0, and pq4_0 remain in the same throughput tier as f16 under Flash Attention. Here, PQ = stage-1 quantization only (Lloyd-Max codebook + random rotation + dequant); TBQ = stage-1 + stage-2 QJL correction (the full TurboQuant path), so `tbq3_0`/`tbq4_0` are the QJL-corrected variants.

In this snapshot, the TBQ variants have higher wall time than PQ/q4_0, with `tbq4_0` close in quality to `pq4_0` after QJL correction. `tbq3_0` typically gains more over its PQ counterpart than `tbq4_0` does, because when the residual is meaningful, 3-bit leaves more coarse structure for QJL to recover; at 4-bit, the codebook already preserves most of the structure, so `tbq4_0` has much less correction headroom and measured gains are often near noise.

Mixed type support (tested on AMD integrated GPU)
Vulkan per-kernel microbenchmarks (GTX 1080 Ti, 524K values, 1000 iterations)
Isolated kernel benchmarks via `test-quantize-perf -b vulkan`:

GPU performance comparison (GTX 1080 Ti, 512 tokens prompt eval)
CPU: Mistral-7B (Q4_K_S weights, head_dim=128), wikitext-2, 4 chunks × 128 ctx
CPU: Llama-3.2-1B-Instruct (Q4_0 weights, head_dim=64), wikitext-2, 4 chunks × 128 ctx
Note: d=64 support is preliminary. QJL Stage 2 (not yet implemented) is critical for d=64 quality.
Vulkan performance optimization details
SET_ROWS (K-cache write quantize kernel)
Flash Attention (the critical optimization)
Without FA, TBQ3 was 4.6x slower than f16 due to 66 graph splits (each a GPU sync point) and explicit dequant-then-matmul fallback. Adding inline codebook dequant to the FA shader brought this to 5% overhead:
flash_attn_base.glsl: inline dequantize4() for TBQ3 (3-bit unpack + codebook lookup) and TBQ4 (nibble unpack + codebook lookup)

Limitations
TODOs
- `mul_mat_vec_*.comp` (currently Stage 1 only; FA path is complete)
- MSE-only Top 1%
- Bias and Variance

QJL Stage 2 implementation status
- `cpy_f32_quant`
- `mul_mat_vec`

Test plan
- `test-quantize-fns` — round-trip RMSE, dot product error, cross-type invariants (CI)
- `test-kv-cache-quantization.sh` — perplexity regression check on a real model (`N_CHUNKS=20 N_CTX=2048`)
- Flash Attention correctness and perf (`test-backend-ops -o FLASH_ATTN_EXT`)