memory: respect unified KV cache in hybrid memory for eval tasks #21224
Merged
ggerganov merged 1 commit into ggml-org:master on Apr 1, 2026
Conversation
The hybrid memory paths (`llama-memory-hybrid.cpp` and `llama-memory-hybrid-iswa.cpp`) always used a sequential equal split, ignoring the unified KV cache flag. This caused HellaSwag, Winogrande, and multiple-choice evaluations to fail on hybrid models (models with both attention and recurrent/SSM layers, such as Qwen3.5-35B-A3B) with:

```
split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)
```

PR ggml-org#19954 fixed this for `llama-kv-cache-iswa.cpp` by automatically enabling unified KV mode and setting `n_parallel >= 4` for multiple-choice eval tasks; however, the hybrid memory paths were not updated. This commit mirrors the iswa fix: use a non-sequential split when the KV cache is unified (`n_stream == 1`), which llama-perplexity sets automatically for hellaswag/winogrande/multiple-choice since ggml-org#19954 (see the sketch after the results below).

Tested on Qwen3.5-35B-A3B (hybrid attention+SSM MoE model):

- HellaSwag: 83.0% (400 tasks)
- Winogrande: 74.5% (400 tasks)
- MMLU: 41.2%
- ARC-Challenge: 56.2%
- TruthfulQA: 37.7%

All of these previously failed with a llama_decode() error.
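For reference, a minimal sketch of what the split-mode switch in the hybrid `init_batch` path could look like. The `llama_batch_allocr::split_equal(n_ubatch, sequential)` helper and the `n_stream` value mirror the upstream KV-cache code, but the exact wiring below is illustrative, not the literal diff:

```cpp
// Illustrative sketch only -- not the literal patch. Assumes the
// batch allocator's split_equal(n_ubatch, sequential) helper and an
// n_stream value exposed by the attention KV cache, as upstream.
std::vector<llama_ubatch> ubatches;

while (true) {
    // a unified KV cache (n_stream == 1) can hold coupled sequences in
    // one input batch; the sequential equal split rejects those, so
    // only request a sequential split when the cache is not unified
    const bool sequential = (n_stream != 1);

    llama_ubatch ubatch = balloc.split_equal(n_ubatch, sequential);
    if (ubatch.n_tokens == 0) {
        break; // no tokens left to split into ubatches
    }

    ubatches.push_back(std::move(ubatch));
}
```

With `-kvu` (or the automatic unified mode from ggml-org#19954), `n_stream` is 1, so the hybrid path falls back to the non-sequential split and the coupled multiple-choice sequences decode as expected.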
ggerganov approved these changes on Mar 31, 2026
icex added a commit to icex/llama.cpp that referenced this pull request on Apr 5, 2026
Includes:
- fix: handle non-capturing groups (?:...) in JSON schema pattern converter (ggml-org#21124)
- memory: respect unified KV cache in hybrid memory for eval tasks (ggml-org#21224)
- fix: CUDA FA kernel selection, head dimension 512 support
- rotate activations for better quantization (ggml-org#21038)
- Various parser, jinja, webui, and CI fixes

Conflicts resolved:
- llama-kv-cache.cpp: keep TurboQuant InnerQ stubs + upstream Hadamard helpers
- llama-graph.cpp: keep TurboQuant V-padding + upstream self_v_rot
- fattn-tile.cu: add upstream D=512 before TurboQuant HIP guard
- fattn.cu: combine D=512 (upstream) + D=640 (TurboQuant) exclusions