TurboQuant: KV cache quantization with Hadamard transform (TBQ3_0 / TBQ4_0)#115

Draft
jesusmb1995 wants to merge 20 commits into tetherto:temp-7248 from jesusmb1995:turboquant

Conversation


@jesusmb1995 jesusmb1995 commented Mar 27, 2026

Summary

Implements TurboQuant KV cache quantization (Zandieh et al., ICLR 2026) for CPU and Vulkan backends with full Flash Attention support. Compresses KV cache to 3.25-4.25 bits per value, enabling ~4-5x larger context windows on the same hardware.

Paper: https://arxiv.org/pdf/2504.19874
Community discussion: ggml-org#20969
Related upstream PR: ggml-org#21038 (graph-level rotation for existing quant types)

How it works

  1. Apply a Hadamard rotation at the graph level to Q, K, V before the KV cache (same approach as upstream #21038)
  2. Compute the L2 norm of each head vector (128 floats) and store it
  3. Quantize each coordinate independently using pre-computed Lloyd-Max codebook centroids via binary search (3 comparisons for TBQ3, 4 for TBQ4)
  4. Bit-pack the indices (3-bit for TBQ3, 4-bit nibbles for TBQ4)

The rotation decorrelates coordinates, making each follow a known Beta distribution, so a fixed codebook (no per-block calibration) achieves near-optimal MSE. The rotation cancels in the attention dot product: (H·q)·(H·k) = qᵀHᵀH·k = q·k, since H is orthogonal.
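
The cancellation can be checked numerically. A minimal pure-Python sketch (toy d = 8; the real kernels use a Fast Hadamard Transform rather than a dense matrix, and `signs` stands in for the random ±1 diagonal D):

```python
import math, random

def sylvester_hadamard(d):
    # Sylvester construction; d must be a power of two
    H = [[1.0]]
    while len(H) < d:
        n = len(H)
        H = [[H[i % n][j % n] * (-1.0 if i >= n and j >= n else 1.0)
              for j in range(2 * n)] for i in range(2 * n)]
    s = 1.0 / math.sqrt(d)
    return [[s * x for x in row] for row in H]

def rotate(H, signs, v):
    # (1/sqrt(d)) * H * D * v, with D the random +/-1 diagonal
    u = [s * x for s, x in zip(signs, v)]
    return [sum(row[j] * u[j] for j in range(len(u))) for row in H]

random.seed(0)
d = 8
H = sylvester_hadamard(d)
signs = [random.choice((-1.0, 1.0)) for _ in range(d)]
q = [random.gauss(0, 1) for _ in range(d)]
k = [random.gauss(0, 1) for _ in range(d)]

dot  = sum(a * b for a, b in zip(q, k))
rdot = sum(a * b for a, b in zip(rotate(H, signs, q), rotate(H, signs, k)))
assert abs(dot - rdot) < 1e-9   # rotation cancels in the dot product
```
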

  Type     Bits/val   Block size    Compression vs FP16   Codebook
  ----     --------   ----------    -------------------   --------
  pq3_0    3.25       52 B (48+4)   4.9x                  8 centroids
  pq4_0    4.25       68 B (64+4)   3.8x                  16 centroids
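
The bits-per-value and compression figures follow directly from the block layout (quick arithmetic check; the 4 extra bytes per block are assumed to hold the stored norm field):

```python
# Sanity-check the table arithmetic for d = 128 values per block.
d = 128
layout = {"pq3_0": 48, "pq4_0": 64}      # packed-index payload bytes
stats = {}
for name, payload in layout.items():
    block_bytes = payload + 4            # payload + norm field
    bits_per_val = block_bytes * 8 / d
    vs_fp16 = d * 2 / block_bytes        # fp16 stores 2 bytes per value
    stats[name] = (bits_per_val, round(vs_fp16, 1))

assert stats["pq3_0"] == (3.25, 4.9)
assert stats["pq4_0"] == (4.25, 3.8)
```
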

What's new vs community state

The upstream discussion and community implementations so far have:

  • Pseudocode-level implementations or standalone Python scripts
  • Dense O(d²) random rotation via Gram-Schmidt (slow, 64KB matrix for d=128)
  • No integration into the llama.cpp type system or KV cache path

This PR adds:

  • Full llama.cpp integration — new GGML types (pq3_0, pq4_0, TBQ3_0_64, TBQ4_0_64), block structs, type traits, CPU backend registration, KV cache support via --cache-type-k/--cache-type-v
  • Graph-level Hadamard rotation (optRot) — rotation is a graph-level MUL_MAT on Q/K/V, keeping quantize/dequantize kernels as pure codebook operations. Compatible with all backend optimizations (FA, mmq, fused ops). <2% overhead.
  • Correct Lloyd-Max codebooks computed from the exact Beta PDF for d=128 and d=64 via scripts/compute_tq_codebooks.py
  • head_dim=64 auto-detection — user specifies pq3_0/pq4_0 on the CLI; KV cache init automatically selects the d=64 internal type when head_dim=64
  • Vulkan GPU backend with Flash Attention — inline codebook dequant in FA shader, fused mul_mat_vec, SET_ROWS quantize with binary-search + packed writes, batch matmul (mul_mm) support
  • GPU fully saturated — TBQ3 runs within 5% of f16 speed with FA enabled (was 4.6x slower without FA)
  • Tests — unit tests with cross-type invariant checks + perplexity regression test script

Usage

llama-cli -m model.gguf --cache-type-k pq3_0 --cache-type-v tbq3_0
llama-cli -m model.gguf --cache-type-k pq4_0 --cache-type-v tbq4_0
llama-server -m model.gguf --cache-type-k pq4_0 --cache-type-v tbq4_0

Works transparently with both head_dim=128 (Llama-3.1, Qwen, Mistral) and head_dim=64 (Llama-3.2-1B/3B) — the right block size is auto-selected.

Perplexity results

Vulkan GPU: Mistral-7B (Q4_K_S weights, head_dim=128), wikitext-2, 4 chunks × 128 ctx

With Flash Attention enabled (GPU fully saturated; note these numbers are somewhat noisy and need a longer run to be conclusive):

==========================================
 KV Cache Quantization Perplexity Test
==========================================
 Model:    Mistral-7B-Instruct-v0.3-Q4_K_S.gguf
 Dataset:  wikitext-2-raw/wiki.test.raw
 Chunks:   4
 Context:  128
==========================================

==========================================
 Results Summary
==========================================
  Type              PPL       vs f16       Time
  ----              ---       ------       ----
  tbq3_0         7.0183        1.82%    20.74s (1.2x)
  q4_0           6.8399       -0.77%    17.48s (1.0x)
  pq3_0          7.0780        2.69%    17.31s (1.0x)
  tbq4_0         6.8013       -1.33%    20.24s (1.2x)
  f16            6.8928   (baseline)     17.29s
  q8_0           6.8920       -0.01%    17.27s (1.0x)
  pq4_0          6.8001       -1.34%    17.59s (1.0x)
==========================================
WARNING: tbq4_0 PPL (6.8013) should be <= pq4_0 PPL (6.8001)

q4_0, pq3_0, and pq4_0 remain in the same throughput tier as f16 under Flash Attention. Here, PQ = phase-1 / polar-style quantization (Lloyd-Max codebook + random rotation + dequant); TBQ = phase-1 + phase-2 QJL correction (the full TurboQuant path), so tbq3_0/tbq4_0 are the QJL-corrected variants.
In this snapshot, TBQ variants have higher wall time than PQ/Q4_0, with tbq4_0 close in quality to pq4_0 after QJL correction.

  • tbq3_0 typically benefits more from QJL than tbq4_0 relative to its PQ counterpart when the residual is meaningful: 3-bit quantization leaves more coarse structure for QJL to recover, whereas at 4-bit the codebook already preserves most structure, so tbq4_0 has much less correction headroom and its measured gains are often within noise.
  • All quantized types benefit from the Hadamard rotation

Mixed type support (tested on AMD integrated GPU)

==========================================
 Mixed K/V Results Summary
==========================================
  K type       V type              PPL       vs f16       Time
  ------       ------              ---       ------       ----
  tbq3_0       pq3_0            7.0303        1.91%    56.88s (1.8x)
  tbq4_0       pq4_0            6.8159       -1.19%    53.49s (1.7x)
  tbq3_0       q8_0             7.0515        2.22%    52.63s (1.6x)
  tbq4_0       f16              6.7608       -1.99%    58.00s (1.8x)
  q8_0         pq3_0            6.9426        0.64%    31.20s (1.0x)
  f16          pq4_0            6.8901       -0.12%    31.41s (1.0x)
==========================================

Vulkan per-kernel microbenchmarks (GTX 1080 Ti, 524K values, 1000 iterations)

Isolated kernel benchmarks via test-quantize-perf -b vulkan:

  Kernel          q4_0   pq3_0   pq4_0
  ------          ----   -----   -----
  quantize (us)     48      73      73
  mul_mat (us)     685     389     389
  • Quantize: TQ is ~1.5x slower than Q4_0 due to codebook binary search + norm computation vs simple min/max scaling. Acceptable — SET_ROWS runs once per token per layer.
  • Mul_mat: TQ is ~1.8x faster — lower bits-per-value (3.25-4.25 vs 4.5) means less memory traffic on bandwidth-bound 1-row matmuls.
  • TQ dequant is not a standalone GPU op — it runs inline inside FA and mul_mat_vec shaders.

GPU performance comparison (GTX 1080 Ti, 512 tokens prompt eval)

  Config                 FA         Splits   tok/s   vs f16
  ------                 --         ------   -----   ------
  f16 baseline           enabled    2        133     1.00x
  Q4_0 + optRot          enabled    2        131     0.98x
  TBQ3 + optRot          enabled    2        127     0.95x
  TBQ3 (before FA fix)   disabled   66       29      0.22x

CPU: Mistral-7B (Q4_K_S weights, head_dim=128), wikitext-2, 4 chunks × 128 ctx

  Type              PPL       vs f16
  q4_0           6.8381       -0.66%
  pq4_0          6.8124       -1.03%
  f16            6.8834   (baseline)
  q8_0           6.9026        0.28%
  pq3_0          6.9917        1.57%

CPU: Llama-3.2-1B-Instruct (Q4_0 weights, head_dim=64), wikitext-2, 4 chunks × 128 ctx

Note: d=64 support is preliminary. QJL Stage 2 (not yet implemented) is critical for d=64 quality.

  Type              PPL       vs f16
  pq3_0         16.9446       31.35%
  q4_0          14.5194       12.55%
  pq4_0         13.7621        6.68%
  f16           12.9001   (baseline)
  q8_0          12.9042        0.03%

Vulkan performance optimization details

SET_ROWS (K-cache write quantize kernel)

  Stage                                      SET_ROWS (us)   vs Q4_0
  -----                                      -------------   -------
  Original (FHT inside kernel)               88              9.6x slower
  + optRot (Hadamard removed from kernel)    44              4.8x slower
  + binary search + packed 3-bit writes      19              2.1x slower
  Q4_0 baseline                              9               1.0x

Flash Attention (the critical optimization)

Without FA, TBQ3 was 4.6x slower than f16 due to 66 graph splits (each a GPU sync point) and explicit dequant-then-matmul fallback. Adding inline codebook dequant to the FA shader brought this to 5% overhead:

  • flash_attn_base.glsl: inline dequantize4() for TBQ3 (3-bit unpack + codebook lookup) and TBQ4 (nibble unpack + codebook lookup)
  • Since optRot moved the Hadamard to graph level, FA dequant is just a codebook lookup — same complexity as Q4_0's linear dequant
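
What the inline FA dequant amounts to can be sketched as unpack + table lookup (pure-Python illustration; the centroid values here are placeholders, not the actual codebook from scripts/compute_tq_codebooks.py, and the per-block norm scale is omitted):

```python
TQ3_CB = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]  # placeholder centroids

def pack8(idx):
    # pack eight 3-bit indices into one 32-bit word (24 bits used)
    w = 0
    for i, v in enumerate(idx):
        w |= (v & 0x7) << (3 * i)
    return w

def dequant8(w):
    # unpack + codebook lookup: all the FA shader needs once optRot
    # has moved the Hadamard rotation to the graph level
    return [TQ3_CB[(w >> (3 * i)) & 0x7] for i in range(8)]

idx = [3, 0, 7, 4, 1, 6, 2, 5]
assert dequant8(pack8(idx)) == [TQ3_CB[i] for i in idx]
```
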

Limitations

  • head_dim must be 64 or 128. Codebooks and Hadamard transform are pre-computed for these dimensions.
  • Stage 1 only (MSE quantizer) — the paper's quality neutrality at 3.5 bpw requires Stage 2 (QJL), not yet implemented
  • d=64 quality is poor — +31% TBQ3 / +7% TBQ4 regression without QJL Stage 2
  • Metal shaders not yet implemented

TODOs

  • QJL Stage 2 — 1-bit residual correction for unbiased inner products (critical for d=64, would improve d=128 from ~2.7% to ~0%)
  • head_dim=64 support — for Llama-3.2-1B/3B (auto-detected, codebooks computed)
  • Vulkan GPU backend — dequant, quantize, fused mul_mat_vec, batch matmul, cpy_quant_f30
  • Flash Attention — inline codebook dequant in FA shader, GPU fully saturated, TBQ3 within 5% of f16
  • Flash Attention — for mismatched quantized types
  • Graph-level rotation (optRot) — Hadamard as graph-level MUL_MAT, pure codebook kernels
  • Binary-search quantize — 3 comparisons for TBQ3 (was 8 distance calculations), packed 3-bit writes
  • GPU mul_mat_vec QJL correction — add QJL dot correction to the mul_mat_vec_*.comp shaders (currently Stage 1 only; FA path is complete)
  • SIMD optimization — AVX2/NEON for CPU quantize/dequantize
  • Metal shaders — Apple GPU backend support
  • Coopmat1/Coopmat2 FA — currently scalar FA only
  • Longer perplexity validation — LongBench, Needle-in-a-Haystack
  • 4-BIT
  • 3-BIT
  • 2-BIT
  • MSE-only Top1%
  • Bias and Variance

QJL Stage 2 implementation status

  Path                          QJL Encode/Correction   Status
  ----                          ---------------------   ------
  CPU quantize (d=128 & d=64)   YES                     Complete
  CPU vec_dot (d=128 & d=64)    YES                     Complete
  CPU dequantize                — (by design)           Complete
  GPU cpy_f32_quant             YES                     Complete
  GPU flash attention           YES                     Complete
  GPU dequant                   — (by design)           Complete
  GPU mul_mat_vec               NO                      TODO

Note: GPU mul_mat_vec does not apply QJL correction — it uses Stage 1 only.
This path is only hit for non-attention mat-muls or when flash attention is disabled.
During normal inference with FA enabled, all K·Q attention scores go through the
flash attention shader which applies the full QJL correction.

Test plan

  • test-quantize-fns — round-trip RMSE, dot product error, cross-type invariants (CI)
  • test-kv-cache-quantization.sh — perplexity regression check on real model
  • Vulkan perplexity matches CPU (correct dequant shaders)
  • Vulkan FA enabled with TBQ3/TBQ4 — 2 graph splits, full GPU saturation
  • Extended perplexity run with N_CHUNKS=20 N_CTX=2048
  • Flash attention mixed K/V type tests (test-backend-ops -o FLASH_ATTN_EXT)

Implements the MSE-optimal stage of TurboQuant (Zandieh et al., ICLR 2026)
for CPU. Compresses KV cache vectors to 3.25 or 4.25 bits per value using
randomized rotation + Lloyd-Max scalar quantization on each coordinate.

New GGML types: GGML_TYPE_TQ3_0 (8 centroids, 3-bit indices) and
GGML_TYPE_TQ4_0 (16 centroids, 4-bit indices), block size 128 (= head_dim).
Includes block structs, quantize/dequantize, vec_dot via dequantize path,
type registration, CLI argument support, and head_dim != 128 validation.

Replace per-type if/else chains with a data-driven table mapping each
GGML type to its expected quantization and dot product error thresholds.
Makes adding new quantization types cleaner.

Verify that TQ4 error < TQ3 error (more bits = less error) and that
TQ3 error < TQ1/TQ2 error (TurboQuant beats ternary quantization),
catching codebook or implementation regressions.
@jesusmb1995 jesusmb1995 changed the title Draft: TurboQuant TurboQuant: KV cache quantization with Hadamard transform (TQ3_0 / TQ4_0) Mar 27, 2026
@jesusmb1995 (Author)

Shell script that runs perplexity across cache types (tq3_0, tq4_0, q4_0,
q8_0, f16) on a real model and checks for PPL regressions. Auto-downloads
model and wikitext-2 dataset. Validates tq4_0 PPL <= tq3_0 PPL.

Replace the O(d²) Gram-Schmidt rotation matrix with a randomized
Hadamard transform: (1/√d) · H · D, where H is the Walsh-Hadamard
matrix applied via Fast Hadamard Transform and D is a random ±1
diagonal. ~18x fewer ops per block (896 vs 16,384 for d=128).
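
A sketch of the in-place FWHT the commit describes (unnormalized butterflies; the real kernels additionally apply the ±1 sign diagonal and the 1/√d scale):

```python
def fwht_inplace(v):
    # unnormalized fast Walsh-Hadamard transform:
    # d*log2(d) butterflies (896 at d=128) vs d*d = 16,384 dense multiplies
    d = len(v)
    h = 1
    while h < d:
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2

v = [1.0] + [0.0] * 7
fwht_inplace(v)
assert v == [1.0] * 8   # first basis vector maps to the all-ones Hadamard row
```
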

Also fixes the Lloyd-Max codebooks to match the correct Beta(d=128)
distribution computed by scripts/compute_tq_codebooks.py. Removes
the block=64 variant API (types not yet fully wired up) and adapts
the quantize/dequantize helpers to use ggml_half for norm storage.
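
For intuition, the Lloyd-Max iteration that the codebook script is assumed to implement looks like this (illustrative sketch trained on Gaussian samples, not the exact Beta(d) PDF used by scripts/compute_tq_codebooks.py):

```python
import random

def lloyd_max(samples, k, iters=25):
    # alternate two steps: assign each sample to its nearest centroid,
    # then move each centroid to the mean of its cell
    cb = sorted(random.sample(samples, k))
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for x in samples:
            j = min(range(k), key=lambda c: (x - cb[c]) ** 2)
            cells[j].append(x)
        cb = sorted(sum(cell) / len(cell) if cell else cb[j]
                    for j, cell in enumerate(cells))
    return cb

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(4000)]
cb8 = lloyd_max(data, 8)        # 8 centroids, as for the 3-bit type
assert len(cb8) == 8 and cb8 == sorted(cb8)
assert cb8[0] < 0 < cb8[-1]     # roughly symmetric for a symmetric density
```
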
Add block=64 variants (TQ3_0_64, TQ4_0_64) for models with head_dim=64
(Llama-3.2-1B/3B). The user specifies tq3_0/tq4_0 on the CLI and the
KV cache init auto-selects the d=64 internal type when head_dim=64.

Includes d=64 Lloyd-Max codebooks computed from the exact Beta(d=64) PDF,
separate Hadamard sign arrays (seed 43), and full CPU wiring: type_traits,
quantize/dequantize functions, vec_dot, ops.cpp dispatch lists, and the
test-quantize-fns thresholds.

Note: d=64 quality is significantly worse than d=128 (+31% TQ3 / +7% TQ4
PPL regression vs f16) due to the wider coordinate distribution. QJL
Stage 2 would help but is not yet implemented.

Add Vulkan compute shaders for TQ3_0/TQ4_0 dequantization, quantization
(copy-to-quant), and SET_ROWS. The dequant shaders use 128 threads per
workgroup with shared-memory Fast Hadamard Transform (7 butterfly stages).

The quantize shaders (copy_to_quant.comp) implement the forward path:
norm computation, sign application, FHT, codebook nearest-centroid
search, and bit-packing — all in a single thread per block.

Registers dequant, CPY f32→TQ, and SET_ROWS pipelines in ggml-vulkan.cpp.
Updates supports_op for MUL_MAT, CPY, and SET_ROWS. Block type definitions
added to types.glsl and codebook/sign constants to tq_utils.comp.

V cache auto-downgrades to f16 on GPU since Flash Attention shaders
cannot perform inline FHT dequantization for TQ types.

Replace the two-pass dequant→fp16→matmul fallback with fused
mul_mat_vec shaders that dequantize and compute the dot product in a
single GPU dispatch. Each workgroup (128 threads) processes one output
row, iterating over TQ blocks with shared-memory FHT + dot accumulation.

Uses subgroup shuffle (GL_KHR_shader_subgroup_shuffle) for the first
log2(subgroup_size) FHT stages (no barriers), falling back to shared
memory only for the remaining 2 stages (subgroup_size=32). This reduces
per-block barrier count from 8 to 3.

GPU utilization is improved but still ~50% vs 100% for native types
(q4_0, q8_0). The remaining gap is due to the FHT synchronization
overhead and 128-thread workgroup size. Further optimization would
require fused Flash Attention with pre-rotated queries (rotate Q once,
dot with raw codebook values — no FHT in the hot loop).

Move the Hadamard transform from inside the quantize/dequantize kernels
to the graph level as an explicit MUL_MAT on Q, K, V. Since both Q and K
are rotated by the same orthogonal matrix H, the rotations cancel in the
attention dot product: (H·q)·(H·k) = qᵀHᵀH·k = q·k.

This allows the quantize/dequantize kernels to become pure codebook
operations with no FHT, no sign flips, no shared memory barriers:

- llama-graph.cpp: add ggml_rotate_hadamard() applied to Q/K/V when KV
  cache uses a quantized type
- copy_to_quant.comp (SET_ROWS): remove tq_fht_local(), sign flips,
  inv_sqrt_d. Add binary-search quantize (3 comparisons for TQ3, 4 for
  TQ4) and packed 3-bit writes (8 indices per uint32)
- dequant_tq3_0.comp, dequant_tq4_0.comp: remove inverse FHT, shared
  memory, sign flips. Now just codebook lookup + scale
- mul_mat_vec_tq3_0.comp, mul_mat_vec_tq4_0.comp: remove subgroup
  shuffle FHT and shared memory FHT stages
- ggml-quants.c: remove tq_forward_inplace() from quantize,
  tq_inverse_inplace() from dequantize

SET_ROWS: 88 us -> 19 us (4.6x faster). Graph-level rotation adds <2%
overhead.
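
The binary-search quantize mentioned above can be sketched as follows: with 8 sorted centroids, the index is found in log2(8) = 3 midpoint comparisons instead of 8 distance computations (Python illustration; centroid values are placeholders):

```python
TQ3_CB = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]  # placeholder centroids

def quantize_idx(x, cb):
    # binary search against midpoints of sorted centroids:
    # log2(len(cb)) comparisons instead of len(cb) distance computations
    lo, hi = 0, len(cb) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if x > (cb[mid] + cb[mid + 1]) / 2:   # decision boundary
            lo = mid + 1
        else:
            hi = mid
    return lo

assert quantize_idx(-2.0, TQ3_CB) == 0
assert quantize_idx(0.3, TQ3_CB) == 4
# every centroid quantizes to its own index
assert all(quantize_idx(c, TQ3_CB) == i for i, c in enumerate(TQ3_CB))
```
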
Add missing Vulkan pipeline support for TQ3_0/TQ4_0 batch operations:

- dequant_funcs.glsl: add TQ3_0/TQ4_0 dequant() and dequant4() for
  mul_mm batch matmul path (3-bit unpack + codebook lookup for TQ3,
  nibble unpack + codebook lookup for TQ4)
- vulkan-shaders-gen.cpp: remove tq3_0/tq4_0 skip in matmul_shaders()
  and copy_from_quant generation
- ggml-vulkan.cpp: add pipeline_dequant_mul_mat_mat registrations,
  pipeline_cpy_quant_f32 registrations (TQ3/TQ4 -> f32 GPU dequant),
  and dispatch routing in ggml_vk_get_cpy_pipeline switch statements

Prevents CPU fallback for copy/dequant operations when reading from
TQ3/TQ4 KV cache on GPU.

Add inline codebook dequantization to the Vulkan Flash Attention shader,
enabling FA for TurboQuant KV cache types. Since optRot moved the
Hadamard rotation to the graph level, the FA shader only needs a simple
codebook lookup (same complexity as Q4_0's linear dequant).

- flash_attn_base.glsl: add dequantize4() for TQ3_0 (3-bit unpack +
  TQ3_CB lookup) and TQ4_0 (nibble unpack + TQ4_CB lookup) with
  separate K/V buffer bindings
- vulkan-shaders-gen.cpp: remove tq3_0/tq4_0 skip in FA shader
  generation, add to scalar FA path
- ggml-vulkan.cpp: add CREATE_FA for TQ3_0/TQ4_0, add to supports_op
  FLASH_ATTN_EXT switch
- llama-context.cpp: remove stale V-cache f16 downgrade workaround
  (was: "TurboQuant V cache not supported with GPU Flash Attention")

Result: graph splits drop from 66 to 2, prompt eval goes from 29 tok/s
to 127 tok/s (4.4x speedup). TQ3 is now within 5% of f16 performance.

Useful for models with head dimensions != 128.

Extend test-quantize-fns to run quantization round-trip (RMSE) and dot
product (mul_mat) checks on any ggml backend, not just CPU.

A new `-b <backend>` flag (e.g. `-b vulkan`) selects the backend;
without it the test runs the original CPU-only path.  The backend path
builds small ggml compute graphs (ggml_cpy for round-trip, ggml_mul_mat
for dot product) and executes them through ggml_backend_sched, with a
CPU fallback backend for ops the primary backend does not support.

Gracefully skips types whose cpy or mul_mat ops are unsupported on the
chosen backend, and adds verbose debug output (`-v`) showing tensor
shapes, backend support decisions, and raw GPU vs reference results to
aid debugging backend-specific numerical issues.

The TQ3_0 and TQ4_0 mul_mat_vec pipelines require BLOCK_SIZE=128
(matching QUANT_K), but were compiled with SHADER_REDUCTION_MODE_SUBGROUP
which only reduces within a single subgroup via subgroupAdd — with no
cross-subgroup shared-memory reduction step.

On GPUs where subgroup_size < 128 (e.g. NVIDIA with warp size 32), this
caused only 1/N of the partial sums to be accumulated, where
N = BLOCK_SIZE / subgroup_size.  For a typical NVIDIA GPU the dot product
result was exactly 1/4 of the correct value, producing ~0.82 normalised
error in test-quantize-fns.

Fix by selecting SHADER_REDUCTION_MODE_HYBRID for TQ3/TQ4 pipelines in
both f32 and f16 variants, which performs subgroupAdd within each
subgroup then sums across subgroups through shared memory.
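
A tiny model of the bug (illustrative Python, not shader code): with 128 partial sums and subgroup_size = 32, a subgroup-only reduction accumulates just the first 32, giving exactly 1/4 of the correct dot product:

```python
# 128 partial sums, one per thread; pretend every thread produced 1.0
partials = [1.0] * 128
subgroup_size = 32

# SUBGROUP mode: subgroupAdd reduces only within one subgroup,
# with no cross-subgroup shared-memory step afterwards
buggy = sum(partials[:subgroup_size])

# HYBRID mode: subgroupAdd per subgroup, then sum the per-subgroup results
fixed = sum(sum(partials[i:i + subgroup_size])
            for i in range(0, len(partials), subgroup_size))

assert fixed == 128.0
assert buggy == fixed / 4   # exactly 1/N, with N = 128 / 32 = 4
```
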
@jesusmb1995 (Author)
Add -b BACKEND flag to test-quantize-perf.cpp for running quantization
benchmarks on GPU backends (e.g. vulkan, cuda) via the ggml scheduler.

When -b is specified, the tool builds ggml compute graphs and benchmarks
three operations through the backend scheduler:
- quantize: cpy f32 -> quant (SET_ROWS equivalent)
- dequantize: cpy f32 -> quant -> f32 (roundtrip)
- mul_mat: quantized matmul (vec_dot equivalent)

Each reports min/avg time and throughput. Ops not supported by the
backend are gracefully skipped. Without -b, behavior is unchanged
(direct CPU function pointer calls with cycle counting).

- Split the cache quantization taxonomy into PQ (stage-1 only) and TBQ (stage-1 + QJL stage-2 correction), including pq4_0/tbq4_0 coverage.
- Implemented TBQ4 stage-2 QJL encode/correction in CPU and Vulkan paths, fixed scaling/sign-quantization behavior, and aligned struct/shader packing for block data.
- Fixed Vulkan/Flash Attention regressions (symbol lookup, BLOCK_BYTE_SIZE alignment, shader generation, and flash-attention QJL correction wiring) so TBQ variants run end-to-end.
- Updated validation and PR draft docs with correction-phase TBQ vs PQ results and notes.

Enable Flash Attention to use different quantization types for K and V
caches (e.g. TBQ3_0 keys with PQ4_0 values), allowing finer memory/quality
trade-offs per cache.

- Refactored FA shader (flash_attn_base.glsl) to declare separate K and V
  buffer bindings with independent block sizes and dequantize functions
  (dequantize4_k / dequantize4_v), replacing the single shared dequantize4
  that dispatched on a binding index.
- Added backward-compat defines that map legacy DATA_A_* to DATA_K_*/DATA_V_*
  so existing same-type pipelines keep working unchanged.
- Extended vk_fa_pipeline_state to carry v_type, enabling the pipeline cache
  to distinguish same-K-type entries that differ only in V type.
- Added CREATE_FA_MIXED macro and ~30 mixed-type pipeline registrations
  covering all TBQ/PQ/Q8_0/F16 cross-pairs (scalar path).
- Updated supports_op to accept mixed K/V pairs when at least one side is
  TBQ/PQ, falling back to FA_SCALAR automatically.
- Added mixed-type shader generation in vulkan-shaders-gen.cpp for all
  relevant K/V combinations.

…erage

- Extended test_flash_attn_ext to track distinct type_K and type_V (backward-compatible: single-type constructor still works).
- Added mixed-type cases for TBQ/PQ plus Q8_0/F16 pairs in both MODE_TEST and MODE_PERF.
- Updated vars() so test filtering can target type_K and type_V independently.

Commands:
  ./build/bin/test-backend-ops -o FLASH_ATTN_EXT -p "tbq|pq" test
  ./build/bin/test-backend-ops -o FLASH_ATTN_EXT -p "tbq|pq" perf

Extend test-kv-cache-quantization.sh to run perplexity benchmarks with
different quantization types for K and V caches, exercising the mixed-type
Flash Attention paths added in the previous commit.

- Added 6 mixed K/V pairs (e.g. K=tbq3_0/V=pq3_0, K=tbq4_0/V=q8_0) that
  run after the existing same-type tests.
- Print a separate "Mixed K/V Results Summary" table with PPL, vs-f16
  regression, and timing columns.
- Check each mixed pair against the MAX_PPL_REGRESSION_PCT threshold.
@jesusmb1995 jesusmb1995 changed the title TurboQuant: KV cache quantization with Hadamard transform (TQ3_0 / TQ4_0) TurboQuant: KV cache quantization with Hadamard transform (TBQ3_0 / TBQ4_0) Mar 31, 2026
@zoq

zoq commented Apr 1, 2026

Are you planning to merge this before the rebase to the latest version of llama.cpp?
