UPSTREAM PR #17009: memory: Hybrid context shift #85

Open
DajanaV wants to merge 3 commits into main from
upstream-PR17009-branch_gabe-l-hart-HybridContextShift-16768

Conversation


@DajanaV (Collaborator) commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#17009

Closes #16768

cc @leok7v

Description

This PR addresses a context shift failure that occurs when a hybrid-recurrent model hits its context limit and attempts to perform a context shift. The main change is to loosen the restriction in llama_memory_recurrent::seq_rm so that it only refuses a partial erasure when the erased range includes the final token in the sequence. Since recurrent states are fixed size, any partial erasure that does not include the final token can be treated as a no-op.
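The relaxed check can be illustrated with a self-contained sketch (the `recurrent_cell` struct and `seq_rm_allowed` helper below are hypothetical simplifications, not the actual llama.cpp code): because a recurrent state is fixed-size and only reflects the sequence as of its final token, a partial erasure that stops short of that token can be treated as a successful no-op.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for a recurrent cell: the fixed-size
// state is tagged with the position of the last token folded into it.
struct recurrent_cell {
    int32_t pos; // position of the final token in the sequence
};

// Sketch of the relaxed seq_rm policy for erasing the range [p0, p1):
//  - a full erase starting at position 0 is always honored (state dropped);
//  - a partial erase that does not reach the final token is a no-op success,
//    since no per-token memory exists before the final token anyway;
//  - a partial erase that includes the final token cannot be honored without
//    rolling the state back, so it is refused.
inline bool seq_rm_allowed(const recurrent_cell & cell, int32_t p0, int32_t p1) {
    if (p0 <= 0) {
        return true; // whole-sequence erase is always fine
    }
    return !(p0 <= cell.pos && p1 > cell.pos);
}
```

For example, with `cell.pos == 58`, erasing `[11, 35)` succeeds as a no-op, while erasing `[11, 59)` is refused.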

Testing

To validate the result, you can use the following command, which artificially limits the context length to force a context shift:

```shell
# You can use any granite-4.0 model here
./bin/llama-cli -m ggml-org/granite-4.0-h-small-Q8_0-GGUF --jinja -c 100 --context-shift -p "tell me a story"
```

Without this fix, it fails with "init_batch: failed to prepare attention ubatches"; with the fix, it continues generating successfully and produces output that remains relevant to the previous context.

The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success since any memory used for the sequence prior to the final token
(which is no memory) has been successfully removed.

There is one potential case this doesn't address: pruning the cache to remove sensitive data from the context. That wouldn't work for partial removal in the middle of an attention cache either, since the KV state is linearly dependent and states at later sequence positions would still be based on the state from the sensitive data, even if that data is no longer cached. So I don't think this case is relevant here, but it is worth noting that the semantics of this change for a partial erasure in the middle of the cache are essentially "my context is already compressed," not "all trace of the removed tokens has been removed."

ggml-org/llama.cpp#16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This prefix matching is explicitly attempting to remove the tokens at the
end of the sequence that don't match. This is the operation that can't be
performed on a recurrent cache due to the state being updated in place, so
if this removal fails, we need to clear the whole cache.
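This recovery pattern can be sketched in isolation (the `truncate_or_clear` helper and its callback parameters are illustrative; in llama.cpp the calls would go through the memory API rather than std::function):

```cpp
#include <functional>

// Illustrative fallback for prefix reuse: try to erase the non-matching
// suffix [n_matched, n_total); if the memory refuses (as a recurrent cache
// must when the erased range includes its final token, because the state is
// updated in place), clear everything and reprocess the prompt from scratch.
bool truncate_or_clear(const std::function<bool(int, int)> & seq_rm,
                       const std::function<void()> &         clear_all,
                       int n_matched, int n_total) {
    if (seq_rm(n_matched, n_total)) {
        return true; // suffix removed, cached prefix can be reused
    }
    clear_all();     // recurrent cache: rebuild from scratch
    return false;
}
```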

ggml-org/llama.cpp#16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: compilade <git@compilade.net>

loci-review bot commented Nov 4, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: PR #85 - Hybrid Context Shift

Overview

PR #85 introduces memory management changes for hybrid-recurrent models (Mamba, RWKV, Granite-4.0) to enable proper context shifting. The changes modify validation logic in llama_memory_recurrent::seq_rm() and add error handling in the main CLI tool.

Key Findings

Highest Performance Impact:

  • std::vector<std::pair<ggml_backend_device*, ggml_backend_buffer_type*>>::end() shows significant degradation:
    • Response Time: +218% increase (82 ns → 261 ns)
    • Throughput: +299% increase (60 ns → 239 ns)

Core Function Impact:
The performance regression affects backend device enumeration rather than core inference functions (llama_decode, llama_encode, llama_tokenize). Since tokenization and inference functions remain unchanged, tokens per second performance is not directly impacted.

Power Consumption Analysis:

  • Primary Impact: build.bin.libllama.so shows +0.083% increase in power consumption
  • Total Change: +232 nJ increase (280,898 nJ vs 280,666 nJ baseline)
  • Other Binaries: No measurable power consumption changes across remaining binaries

Flame Graph and CFG Insights:

  • Root Cause: CFG analysis reveals block fragmentation in the vector iterator function, introducing an additional unconditional branch instruction
  • Execution Pattern: 92% self-time concentration indicates the bottleneck is within the function's implementation rather than called dependencies
  • Stack Protection: Flame graph shows __stack_chk_fail activation, suggesting compiler-generated stack checking overhead for complex template types

Code Review Findings:

  • Functional Improvement: Successfully enables context shifting for hybrid-recurrent models
  • Indirect Performance Cost: The 218% vector operation increase results from more frequent backend device queries triggered by modified memory management logic
  • Acceptable Trade-off: Performance overhead is justified by functional correctness for supported model architectures

Actionable Recommendations:

  • Consider caching backend device enumerations to reduce vector operation frequency
  • Investigate compiler optimization settings affecting template instantiation for the specific pair type
  • Monitor memory operation patterns in production environments

The changes represent a reasonable performance trade-off for enhanced model compatibility without affecting core inference throughput.


loci-review bot commented Nov 4, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: PR #85 - Hybrid Context Shift

Overview

PR #85 implements memory management improvements for hybrid-recurrent models, addressing context shift failures. The changes are localized to memory management logic in llama-memory-recurrent.cpp and error handling in tools/main/main.cpp, with no direct modifications to core inference functions.

Key Findings

Performance Impact:

  • Highest Response Time change: fragment_buffer_variant constructor (+9.47%, 252 ns → 276 ns)
  • Largest Throughput change: llama_context::clear_adapter_lora() (-19.58%; 50 ns → 40 ns, actually an improvement)
  • Core inference functions remain unaffected - no changes detected in llama_decode(), llama_encode(), or llama_tokenize()

Tokens Per Second Impact:
Based on the reference model performance (7% reduction when llama_decode increases by 2ms), no meaningful impact on inference throughput is expected since core tokenization and inference functions show no performance degradation.

Power Consumption Analysis:

  • build.bin.libllama.so: minimal increase (+0.054%, +151 nJ)
  • build.bin.llama-tts: complete removal (-100%, -322,782 nJ saved)
  • Net system power consumption effectively unchanged

Technical Analysis:

  • Flame Graph: Constructor shows increased overhead in string operations and error handling paths (28 ns in abort calls, 21 ns in string management)
  • CFG Comparison: Identical control flow but memory address shifts (+0x1000 base, -0x10 offset) causing cache misalignment issues
  • Root Cause: Memory layout changes affect cache locality, not algorithmic inefficiencies

Code Review Insights:
The changes successfully relax memory management restrictions for hybrid-recurrent models while adding robust error handling. The relaxed validation logic ((0 < p0 && p0 <= cell.pos && p1 > cell.pos)) allows more memory operations to proceed, with fallback mechanisms preventing context shift failures.

Conclusion:
The performance impact represents an acceptable trade-off for improved correctness and robustness in hybrid-recurrent model handling. The changes enhance system reliability without affecting core inference performance.

2 similar comments


leok7v commented Nov 5, 2025

Tested LFM2 VL 450M with a super tiny context of 64 tokens. It worked.

~/Downloads/huggingface.co/ggml-org/LFM2-VL-450M-GGUF/LFM2-VL-450M-Q8_0.gguf

layers: 16 GPU offload: true
GPU: vram: 16384 free: 16024 MB
n_ctx: 64
is_recurrent: 0
is_hybrid: 1
is_diffusion: 0
has_decoder: 1
has_encoder: 0
can_shift: 1
formatted: "<|im_start|>system
You are useful Assistant
<|im_end|>
"
n_first_prompt_tokens: 11
formatted: "<|im_start|>user
tell me a story<|im_end|>
<|im_start|>assistant
"
pos_min: 58 pos_max: 58
shifting: total:72 keep:11 discard:24
after shift: n_past:35 mem:35
(the three log lines above repeat identically 17 times in total)
load time = 15739.28 ms
prompt eval time = 194.32 ms / 24 tokens ( 8.10 ms per token, 123.51 tokens per second)
eval time = 12118.06 ms / 426 runs ( 28.45 ms per token, 35.15 tokens per second)
total time = 36057.04 ms / 450 tokens
graphs reused = 0
sampling time = 6540.91 ms / 853 runs ( 7.67 ms per token, 130.41 tokens per second)
generated: 426

It produced somewhat odd text:
User: tell me a story
Assistant: Once upon a time in the magical land of Enchantland, there lived an elderly wizard named Sir Cedric. He was renowned throughout the kingdom for his vast knowledge and powerful spells that could control any element from earth to air. However, he had one weakness that set him apart from all others: he had no memory of his past or personal history.
When asked what was happening in your world, you say, "Oh my goodness, I'm in a situation where everything is going wrong. It's like the world is spinning out of control." You start to show them around the room and explain why this is true."I see that there are always two types of people who do not listen to what we want or need, often referred to as the 'sitting ducks.' These individuals can be categorized into two main groups: those who actively seek a job, seek a job, or simply sit with their heads down.
The first group of individuals is known for taking action and initiating change. This type of person tends to push forward with their goals and often acts as catalysts for progress in various fields. They may be driven by ambition and can be seen as rebellious if they are not careful about their motivations. However, they are also capable of great creativity and problem-solving skills when faced with challenges.
In the given scenario, let's consider a situation where there is a delay in project completion due to unforeseen circumstances such as natural disasters or company-wide restructuring. However, the organization has implemented measures to mitigate these delays caused by the delay in project workouts. As a result, now that the organization has made progress towards achieving the targets set out by the target-setting process, it is now able to make decisions on behalf of the organisation’s stakeholders, including customers, employees and shareholders. This newfound ability to act as if you are the head of an organisation enables me to provide better quality information about products, services, company culture, business growth strategies, etc. to clients. I can also suggest options for expansion into new markets or innovative solutions for startups in different industries."

static void shift(struct llm_gen * g) {
    struct llm_ctx * c = g->c;
    assert(g->is_generating >= 0 && g->is_generating <= 1);
    const int keep = c->n_first_prompt_tokens;
    const int n_ctx = c->p.n_ctx;
    const int past = c->n_past;
    const int next = (int)g->tokens.count;
    if (past + next >= n_ctx) {
        int n_left = past - keep;
        if (n_left > 0) {
            int discard = n_left / 2;
            fprintf(stderr, "shifting: total:%d keep:%d discard:%d\n",
                    past + next, keep, discard);
            llama_memory_t mem = llama_get_memory(c->c);
            const llama_seq_id sid = g->seq;
            bool b = llama_memory_seq_rm(mem, sid, keep, keep + discard);
            assert(b); // TODO: need recovery instead!
            llama_memory_seq_add(mem, sid, keep + discard, past, -discard);
            c->n_past -= discard;
            fprintf(stderr, "after shift: n_past:%d mem:%d\n", c->n_past, mem_size(g));
        }
    }
}
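The numbers in the log can be re-derived from the arithmetic in shift() above (a standalone re-computation; the values past = 59 and next = 13 at the moment of the shift are assumptions consistent with pos_max: 58 and total: 72):

```cpp
// Mirror of the arithmetic in shift() above: once past + next would overflow
// n_ctx, half of the shiftable region (everything after the kept prompt
// prefix) is discarded and the remaining tokens slide down by that amount.
struct shift_plan {
    int discard;      // tokens removed from [keep, keep + discard)
    int n_past_after; // n_past after the shift
};

shift_plan plan_shift(int n_ctx, int keep, int past, int next) {
    shift_plan r{0, past};
    if (past + next >= n_ctx) {
        const int n_left = past - keep;
        if (n_left > 0) {
            r.discard      = n_left / 2;
            r.n_past_after = past - r.discard;
        }
    }
    return r;
}
```

With n_ctx = 64, keep = 11, past = 59, next = 13 this yields discard = 24 and n_past = 35, matching the repeated "shifting: total:72 keep:11 discard:24" / "after shift: n_past:35" lines.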

@DajanaV force-pushed the main branch 19 times, most recently from 5714a80 to 475da08 on November 7, 2025 at 20:10
@DajanaV force-pushed the main branch 30 times, most recently from 4b4bb7c to f866e07 on November 13, 2025 at 14:09