
memory: respect unified KV cache in hybrid memory for eval tasks#21224

Merged
ggerganov merged 1 commit into ggml-org:master from mudler:fix/hybrid-memory-unified-kv on Apr 1, 2026

Conversation

Contributor

@mudler mudler commented Mar 31, 2026

Overview

The hybrid memory paths (llama-memory-hybrid.cpp and llama-memory-hybrid-iswa.cpp) always used sequential equal split, ignoring the unified KV cache flag. This caused hellaswag, winogrande, and multiple-choice evaluations to fail on hybrid models (models with both attention and recurrent/SSM layers, such as Qwen3.5-35B-A3B) with:

  split_equal: sequential split is not supported when there are
  coupled sequences in the input batch (you may need to use the
  -kvu flag)

PR #19954 fixed this for llama-kv-cache-iswa.cpp by automatically enabling unified KV mode and setting n_parallel >= 4 for multi-choice eval tasks. However, the hybrid memory paths were not updated.

This commit mirrors the iswa fix: use non-sequential split when KV cache is unified (n_stream == 1), which is automatically set by llama-perplexity for hellaswag/winogrande/multiple-choice since #19954.

Tested on Qwen3.5-35B-A3B (hybrid attention+SSM MoE model):

  • HellaSwag: 83.0% (400 tasks)
  • Winogrande: 74.5% (400 tasks)
  • MMLU: 41.2%
  • ARC-Challenge: 56.2%
  • TruthfulQA: 37.7%

All of these previously failed with a llama_decode() error.

Additional information

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Claude was used during a benchmarking session for my quants; I ran into this bug because I could not run evals. I checked the related PRs and guided Claude to make the needed changes.

@mudler mudler requested a review from ggerganov as a code owner March 31, 2026 11:05
@ggerganov ggerganov added the "merge ready" label ("A maintainer can use this label to indicate that they consider the changes final and ready to merge.") on Mar 31, 2026
@ggerganov ggerganov merged commit e1cb817 into ggml-org:master Apr 1, 2026
45 of 46 checks passed
icex added a commit to icex/llama.cpp that referenced this pull request on Apr 5, 2026:
Includes:
- fix: handle non-capturing groups (?:...) in JSON schema pattern converter (ggml-org#21124)
- memory: respect unified KV cache in hybrid memory for eval tasks (ggml-org#21224)
- fix: CUDA FA kernel selection, head dimension 512 support
- rotate activations for better quantization (ggml-org#21038)
- Various parser, jinja, webui, and CI fixes

Conflicts resolved:
- llama-kv-cache.cpp: keep TurboQuant InnerQ stubs + upstream Hadamard helpers
- llama-graph.cpp: keep TurboQuant V-padding + upstream self_v_rot
- fattn-tile.cu: add upstream D=512 before TurboQuant HIP guard
- fattn.cu: combine D=512 (upstream) + D=640 (TurboQuant) exclusions