UPSTREAM PR #17009: memory: Hybrid context shift (#85)
Conversation
The recurrent state is always assumed to be the state as of the last update from the final token in the sequence. When doing a partial erasure, if the range does not include the final token, the erasure can be considered a success, since any memory used for the sequence prior to the final token (which is no memory, because the recurrent state is a single fixed-size snapshot) has been successfully removed.

There is one potential case this doesn't address: pruning the cache to remove sensitive data from the context. That wouldn't work for a partial removal in the middle of an attention cache either, since the KV state is linearly dependent and states at later sequence positions would still be based on the state derived from the sensitive data, even if that data is no longer cached. So I don't think this is relevant, but it is worth noting that the semantics of this change for a partial erasure in the middle of the cache are essentially "my context is already compressed" and not "all trace of the removed tokens has been removed."

ggml-org/llama.cpp#16768
Branch: HybridContextShift-16768
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This prefix matching explicitly attempts to remove the tokens at the end of the sequence that don't match. This is the operation that can't be performed on a recurrent cache, because the state is updated in place, so if this removal fails we need to clear the whole cache.

ggml-org/llama.cpp#16768
Branch: HybridContextShift-16768
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: compilade <git@compilade.net>
Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: PR #85 - Hybrid Context Shift

Overview
PR #85 introduces memory management changes for hybrid-recurrent models (Mamba, RWKV, Granite-4.0) to enable proper context shifting. The changes modify validation logic in

Key Findings
Highest Performance Impact:
Core Function Impact:
Power Consumption Analysis:
Flame Graph and CFG Insights:
Code Review Findings:
Actionable Recommendations:

The changes represent a reasonable performance trade-off for enhanced model compatibility without affecting core inference throughput.
Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: PR #85 - Hybrid Context Shift

Overview
PR #85 implements memory management improvements for hybrid-recurrent models, addressing context shift failures. The changes are localized to memory management logic in

Key Findings
Performance Impact:
Tokens Per Second Impact:
Power Consumption Analysis:
Technical Analysis:
Code Review Insights:
Conclusion:
2 similar comments
Tested LFM2 VL 450M with a super tiny context of 64 tokens. It worked.

~/Downloads/huggingface.co/ggml-org/LFM2-VL-450M-GGUF/LFM2-VL-450M-Q8_0.gguf
layers: 16
GPU offload: true

Produced a bit funny text:
Force-pushed from 5714a80 to 475da08
Force-pushed from 4b4bb7c to f866e07
Mirrored from ggml-org/llama.cpp#17009
Closes #16768
cc @leok7v
Description
This PR addresses a context shift failure that occurs when a hybrid-recurrent model hits its context limit and attempts to perform context shifting. The main change is to loosen the restriction in llama_memory_recurrent::seq_rm to only refuse a partial erasure if the range being erased includes the final token in the sequence. Since recurrent states are fixed size, any partial erasure that does not include the final token can be considered a no-op.

Testing
To validate the result, you can use the following which artificially limits the context length to force a context shift:
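The exact command was not captured in this mirror. A representative invocation, with a placeholder model path and prompt, would cap the context window very low so a longer generation overruns it and forces a context shift:

```shell
# Hypothetical reproduction (model path and prompt are placeholders,
# not from the PR): -c 128 caps the context so -n 512 overruns it.
# Depending on the llama.cpp version, context shifting may also need
# to be enabled explicitly via a flag.
./llama-cli \
  -m ./models/granite-4.0-h-tiny-Q4_K_M.gguf \
  -c 128 -n 512 \
  -p "Tell me a long story about context windows."
```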
Without this fix, it will fail with init_batch: failed to prepare attention ubatches; with this fix, it will successfully continue generating and produce output that is relevant to the previous context.