kv-cache : do not quantize SWA KV cache#21277
Conversation
Gemma 4 SWA KV cache is hilariously large, so this commit really made it unusable.
Hmm, shouldn't it only be using 1024 for the sliding window? |
@ggerganov would it be feasible to have yet another flag to control how the SWA cache is quantized? On my CDNA2 card, I'm developing an optimized q8 KV path (compute small batches directly without dequantizing), as it is much faster on this card, for many reasons. |
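For context on what an "optimized q8 KV path" trades off: ggml's Q8_0 format stores values in blocks of 32 int8 codes sharing one fp16 scale, roughly halving memory versus F16. Below is a minimal Python sketch of that scheme (the block structure matches ggml's Q8_0, but the layout and function names here are illustrative, not ggml's actual API):

```python
import numpy as np

QK8_0 = 32  # Q8_0 block size: 32 values share one scale

def quantize_q8_0(x):
    """Quantize a float array (length a multiple of 32) into Q8_0-style blocks.
    Returns (fp16 scales, int8 codes). A sketch of the scheme, not ggml's layout."""
    blocks = x.reshape(-1, QK8_0)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = amax / 127.0
    safe = np.where(scale == 0, 1.0, scale)          # avoid division by zero
    q = np.where(scale > 0, np.round(blocks / safe), 0).astype(np.int8)
    return scale.astype(np.float16), q

def dequantize_q8_0(scale, q):
    # Each code is rescaled by its block's shared scale.
    return (scale.astype(np.float32) * q.astype(np.float32)).reshape(-1)

x = np.random.randn(128).astype(np.float32)
s, q = quantize_q8_0(x)
err = np.abs(dequantize_q8_0(s, q) - x).max()  # worst case ~ half a scale step
```

The per-value error is bounded by about half a scale step, which is why an 8-bit KV cache is usually acceptable; the speed argument above is about skipping the dequantize step entirely for small batches.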
I think it's best to revert the change from this PR. Will do it now. |
I still think this might be worth taking a look at - the SWA cache appears larger than the sliding window size would imply. Unless there is a reason for this? Maybe I misunderstand something, but I thought I would mention it here.
This is expected - we need extra space beyond the sliding window. See llama.cpp/src/llama-kv-cache-iswa.cpp, lines 47 to 51 at 57ace0d.
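The sizing intuition behind the referenced lines can be sketched as follows: during a micro-batch, tokens beyond the window must still be addressable until the batch completes, so the SWA cache is sized for the window per sequence plus the micro-batch, rounded up to a padding multiple. The parameter names below mirror llama.cpp conventions, but the exact formula is an illustration of the idea, not a copy of the source:

```python
def swa_cache_size(n_swa, n_seq_max, n_ubatch, n_pad=256):
    """Sketch of iSWA cache sizing: window per sequence plus room for the
    current micro-batch, padded up. Illustrative, not the actual formula."""
    need = n_swa * n_seq_max + n_ubatch
    return ((need + n_pad - 1) // n_pad) * n_pad  # round up to n_pad multiple

# e.g. a 1024-token window, one sequence, 512-token micro-batch
size = swa_cache_size(1024, 1, 512)
```

This is why the cache ends up larger than the bare window size a user might expect from "sliding window = 1024".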
Merge 59 upstream commits including:
- model: support gemma 4 (vision + moe, no audio) (ggml-org#21309)
- kv-cache: do not quantize SWA KV cache (ggml-org#21277)
- Preserve RotorQuant exclusion from Hadamard rotation
Includes:
- server: Fix undefined timing measurement errors (ggml-org#21201)
- server: save and clear idle slots on new task --clear-idle (ggml-org#20993)
- common: fix tool call type detection for nullable/enum schemas (ggml-org#21327)
- CUDA: fix FA kernel selection logic (ggml-org#21271)
- kv-cache: do not quantize SWA KV cache (ggml-org#21277) + revert (ggml-org#21332)
- common/parser: fix call ID detection + atomicity (ggml-org#21230)
- jinja: coerce input for string-specific filters (ggml-org#21370)
- Various CI, HIP, WebGPU, and documentation fixes



Overview
cont #21038
We don't need to quantize the SWA part of the cache for iSWA models because it is relatively small, so we keep it in F16.
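The "relatively small" claim can be made concrete with back-of-the-envelope arithmetic: the sliding-window layers cache only a window of tokens, while full-attention layers cache the entire context, so quantizing the SWA part saves comparatively little memory. The shape numbers below are purely illustrative (not any specific model's):

```python
def kv_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elt):
    # K and V each hold n_tokens * n_kv_heads * head_dim elements per layer
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elt

# Hypothetical iSWA model: a few full-context layers, many windowed layers
full_f16 = kv_bytes(32768, 8, 4, 128, 2)   # full-attention layers at F16
swa_f16  = kv_bytes(1024, 40, 4, 128, 2)   # SWA layers kept at F16
swa_q8   = kv_bytes(1024, 40, 4, 128, 1)   # same layers at ~8 bits/value
saving = swa_f16 - swa_q8                  # what quantizing SWA would buy
```

With these illustrative numbers the full-context part dominates (~512 MiB vs ~80 MiB), so keeping the SWA part in F16 costs only tens of MiB while avoiding the quantization overhead.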
Requirements