UPSTREAM PR #17009: memory: Hybrid context shift #85

Open
DajanaV wants to merge 3 commits into main from
upstream-PR17009-branch_gabe-l-hart-HybridContextShift-16768

Conversation


@DajanaV (Collaborator) commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#17009

Closes #16768

cc @leok7v

Description

This PR addresses a context shift failure that occurs when a hybrid-recurrent model hits its context limit and attempts to perform a context shift. The main change is to loosen the restriction in llama_memory_recurrent::seq_rm so that it only refuses a partial erasure when the erased range includes the final token in the sequence. Since recurrent states are fixed size, any partial erasure that does not include the final token can be treated as a no-op.
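The relaxed check can be illustrated with a self-contained sketch (the `recurrent_cell` struct and `seq_rm_allowed` helper below are hypothetical simplifications, not the actual llama.cpp code): because a recurrent state is fixed-size and only reflects the sequence as of its final token, a partial erasure that stops short of that token can be treated as a successful no-op.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for a recurrent cell: the fixed-size
// state is tagged with the position of the last token folded into it.
struct recurrent_cell {
    int32_t pos; // position of the final token in the sequence
};

// Sketch of the relaxed seq_rm policy for erasing the range [p0, p1):
//  - a full erase starting at position 0 is always honored (state dropped);
//  - a partial erase that does not reach the final token is a no-op success,
//    since no per-token memory exists before the final token anyway;
//  - a partial erase that includes the final token cannot be honored without
//    rolling the state back, so it is refused.
inline bool seq_rm_allowed(const recurrent_cell & cell, int32_t p0, int32_t p1) {
    if (p0 <= 0) {
        return true; // whole-sequence erase is always fine
    }
    return !(p0 <= cell.pos && p1 > cell.pos);
}
```

For example, with `cell.pos == 58`, erasing `[11, 35)` succeeds as a no-op, while erasing `[11, 59)` is refused.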

Testing

To validate the result, you can use the following command, which artificially limits the context length to force a context shift:

```shell
# You can use any granite-4.0 model here
./bin/llama-cli -m ggml-org/granite-4.0-h-small-Q8_0-GGUF --jinja -c 100 --context-shift -p "tell me a story"
```

Without this fix, it fails with "init_batch: failed to prepare attention ubatches"; with the fix, it continues generating successfully and produces output that remains relevant to the previous context.

The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success since any memory used for the sequence prior to the final token
(which is no memory) has been successfully removed.

There is one potential case this doesn't address: pruning the cache to remove sensitive data from the context. That wouldn't work for partial removal in the middle of an attention cache either, since the KV state is linearly dependent and states at later sequence positions would still be based on the state from the sensitive data, even if that data is no longer cached. So I don't think this case is relevant here, but it is worth noting that the semantics of this change for a partial erasure in the middle of the cache are essentially "my context is already compressed," not "all trace of the removed tokens has been removed."

ggml-org/llama.cpp#16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This prefix matching is explicitly attempting to remove the tokens at the
end of the sequence that don't match. This is the operation that can't be
performed on a recurrent cache due to the state being updated in place, so
if this removal fails, we need to clear the whole cache.
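This recovery pattern can be sketched in isolation (the `truncate_or_clear` helper and its callback parameters are illustrative; in llama.cpp the calls would go through the memory API rather than std::function):

```cpp
#include <functional>

// Illustrative fallback for prefix reuse: try to erase the non-matching
// suffix [n_matched, n_total); if the memory refuses (as a recurrent cache
// must when the erased range includes its final token, because the state is
// updated in place), clear everything and reprocess the prompt from scratch.
bool truncate_or_clear(const std::function<bool(int, int)> & seq_rm,
                       const std::function<void()> &         clear_all,
                       int n_matched, int n_total) {
    if (seq_rm(n_matched, n_total)) {
        return true; // suffix removed, cached prefix can be reused
    }
    clear_all();     // recurrent cache: rebuild from scratch
    return false;
}
```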

ggml-org/llama.cpp#16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: compilade <git@compilade.net>

loci-review bot commented Nov 4, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: PR #85 - Hybrid Context Shift

Overview

PR #85 introduces memory management changes for hybrid-recurrent models (Mamba, RWKV, Granite-4.0) to enable proper context shifting. The changes modify validation logic in llama_memory_recurrent::seq_rm() and add error handling in the main CLI tool.

Key Findings

Highest Performance Impact:

  • std::vector<std::pair<ggml_backend_device*, ggml_backend_buffer_type*>>::end() shows significant degradation:
    • Response Time: +218% increase (82 ns → 261 ns)
    • Throughput: +299% increase (60 ns → 239 ns)

Core Function Impact:
The performance regression affects backend device enumeration rather than core inference functions (llama_decode, llama_encode, llama_tokenize). Since tokenization and inference functions remain unchanged, tokens per second performance is not directly impacted.

Power Consumption Analysis:

  • Primary Impact: build.bin.libllama.so shows +0.083% increase in power consumption
  • Total Change: +232 nJ increase (280,898 nJ vs 280,666 nJ baseline)
  • Other Binaries: No measurable power consumption changes across remaining binaries

Flame Graph and CFG Insights:

  • Root Cause: CFG analysis reveals block fragmentation in the vector iterator function, introducing an additional unconditional branch instruction
  • Execution Pattern: 92% self-time concentration indicates the bottleneck is within the function's implementation rather than called dependencies
  • Stack Protection: Flame graph shows __stack_chk_fail activation, suggesting compiler-generated stack checking overhead for complex template types

Code Review Findings:

  • Functional Improvement: Successfully enables context shifting for hybrid-recurrent models
  • Indirect Performance Cost: The 218% vector operation increase results from more frequent backend device queries triggered by modified memory management logic
  • Acceptable Trade-off: Performance overhead is justified by functional correctness for supported model architectures

Actionable Recommendations:

  • Consider caching backend device enumerations to reduce vector operation frequency
  • Investigate compiler optimization settings affecting template instantiation for the specific pair type
  • Monitor memory operation patterns in production environments

The changes represent a reasonable performance trade-off for enhanced model compatibility without affecting core inference throughput.


loci-review bot commented Nov 4, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: PR #85 - Hybrid Context Shift

Overview

PR #85 implements memory management improvements for hybrid-recurrent models, addressing context shift failures. The changes are localized to memory management logic in llama-memory-recurrent.cpp and error handling in tools/main/main.cpp, with no direct modifications to core inference functions.

Key Findings

Performance Impact:

  • Highest Response Time change: fragment_buffer_variant constructor (+9.47%, 252 ns → 276 ns)
  • Largest Throughput change: llama_context::clear_adapter_lora() (-19.58%; 50 ns → 40 ns, actually an improvement)
  • Core inference functions remain unaffected - no changes detected in llama_decode(), llama_encode(), or llama_tokenize()

Tokens Per Second Impact:
Based on the reference model performance (7% reduction when llama_decode increases by 2ms), no meaningful impact on inference throughput is expected since core tokenization and inference functions show no performance degradation.

Power Consumption Analysis:

  • build.bin.libllama.so: minimal increase (+0.054%, +151 nJ)
  • build.bin.llama-tts: complete removal (-100%, -322,782 nJ saved)
  • Net system power consumption effectively unchanged

Technical Analysis:

  • Flame Graph: Constructor shows increased overhead in string operations and error handling paths (28 ns in abort calls, 21 ns in string management)
  • CFG Comparison: Identical control flow but memory address shifts (+0x1000 base, -0x10 offset) causing cache misalignment issues
  • Root Cause: Memory layout changes affect cache locality, not algorithmic inefficiencies

Code Review Insights:
The changes successfully relax memory management restrictions for hybrid-recurrent models while adding robust error handling. The relaxed validation logic ((0 < p0 && p0 <= cell.pos && p1 > cell.pos)) allows more memory operations to proceed, with fallback mechanisms preventing context shift failures.

Conclusion:
The performance impact represents an acceptable trade-off for improved correctness and robustness in hybrid-recurrent model handling. The changes enhance system reliability without affecting core inference performance.

2 similar comments


leok7v commented Nov 5, 2025

Tested LFM2 VL 450M with a super tiny context of 64 tokens. It worked.

~/Downloads/huggingface.co/ggml-org/LFM2-VL-450M-GGUF/LFM2-VL-450M-Q8_0.gguf

layers: 16 GPU offload: true
GPU: vram: 16384 free: 16024 MB
n_ctx: 64
is_recurrent: 0
is_hybrid: 1
is_diffusion: 0
has_decoder: 1
has_encoder: 0
can_shift: 1
formatted: "<|im_start|>system
You are useful Assistant
<|im_end|>
"
n_first_prompt_tokens: 11
formatted: "<|im_start|>user
tell me a story<|im_end|>
<|im_start|>assistant
"
pos_min: 58 pos_max: 58
shifting: total:72 keep:11 discard:24
after shift: n_past:35 mem:35
(the three log lines above repeat identically 17 times in total)
load time = 15739.28 ms
prompt eval time = 194.32 ms / 24 tokens ( 8.10 ms per token, 123.51 tokens per second)
eval time = 12118.06 ms / 426 runs ( 28.45 ms per token, 35.15 tokens per second)
total time = 36057.04 ms / 450 tokens
graphs reused = 0
sampling time = 6540.91 ms / 853 runs ( 7.67 ms per token, 130.41 tokens per second)
generated: 426

It produced somewhat odd text:
User: tell me a story
Assistant: Once upon a time in the magical land of Enchantland, there lived an elderly wizard named Sir Cedric. He was renowned throughout the kingdom for his vast knowledge and powerful spells that could control any element from earth to air. However, he had one weakness that set him apart from all others: he had no memory of his past or personal history.
When asked what was happening in your world, you say, "Oh my goodness, I'm in a situation where everything is going wrong. It's like the world is spinning out of control." You start to show them around the room and explain why this is true."I see that there are always two types of people who do not listen to what we want or need, often referred to as the 'sitting ducks.' These individuals can be categorized into two main groups: those who actively seek a job, seek a job, or simply sit with their heads down.
The first group of individuals is known for taking action and initiating change. This type of person tends to push forward with their goals and often acts as catalysts for progress in various fields. They may be driven by ambition and can be seen as rebellious if they are not careful about their motivations. However, they are also capable of great creativity and problem-solving skills when faced with challenges.
In the given scenario, let's consider a situation where there is a delay in project completion due to unforeseen circumstances such as natural disasters or company-wide restructuring. However, the organization has implemented measures to mitigate these delays caused by the delay in project workouts. As a result, now that the organization has made progress towards achieving the targets set out by the target-setting process, it is now able to make decisions on behalf of the organisation’s stakeholders, including customers, employees and shareholders. This newfound ability to act as if you are the head of an organisation enables me to provide better quality information about products, services, company culture, business growth strategies, etc. to clients. I can also suggest options for expansion into new markets or innovative solutions for startups in different industries."

static void shift(struct llm_gen * g) {
    struct llm_ctx * c = g->c;
    assert(g->is_generating >= 0 && g->is_generating <= 1);
    const int keep = c->n_first_prompt_tokens;
    const int n_ctx = c->p.n_ctx;
    const int past = c->n_past;
    const int next = (int)g->tokens.count;
    if (past + next >= n_ctx) {
        int n_left = past - keep;
        if (n_left > 0) {
            int discard = n_left / 2;
            fprintf(stderr, "shifting: total:%d keep:%d discard:%d\n",
                    past + next, keep, discard);
            llama_memory_t mem = llama_get_memory(c->c);
            const llama_seq_id sid = g->seq;
            bool b = llama_memory_seq_rm(mem, sid, keep, keep + discard);
            assert(b); // TODO: need recovery instead!
            llama_memory_seq_add(mem, sid, keep + discard, past, -discard);
            c->n_past -= discard;
            fprintf(stderr, "after shift: n_past:%d mem:%d\n", c->n_past, mem_size(g));
        }
    }
}
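The numbers in the log can be re-derived from the arithmetic in shift() above (a standalone re-computation; the values past = 59 and next = 13 at the moment of the shift are assumptions consistent with pos_max: 58 and total: 72):

```cpp
// Mirror of the arithmetic in shift() above: once past + next would overflow
// n_ctx, half of the shiftable region (everything after the kept prompt
// prefix) is discarded and the remaining tokens slide down by that amount.
struct shift_plan {
    int discard;      // tokens removed from [keep, keep + discard)
    int n_past_after; // n_past after the shift
};

shift_plan plan_shift(int n_ctx, int keep, int past, int next) {
    shift_plan r{0, past};
    if (past + next >= n_ctx) {
        const int n_left = past - keep;
        if (n_left > 0) {
            r.discard      = n_left / 2;
            r.n_past_after = past - r.discard;
        }
    }
    return r;
}
```

With n_ctx = 64, keep = 11, past = 59, next = 13 this yields discard = 24 and n_past = 35, matching the repeated "shifting: total:72 keep:11 discard:24" / "after shift: n_past:35" lines.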

@DajanaV force-pushed the main branch 19 times, most recently from 5714a80 to 475da08 on November 7, 2025 at 20:10
@DajanaV force-pushed the main branch 30 times, most recently from 4b4bb7c to f866e07 on November 13, 2025 at 14:09