memory : add GPU-to-GPU checkpoint save/restore for recurrent state #2
Open
petter-b wants to merge 1 commit into feature/checkpoint-fixes-refactored from
Add llama_memory_checkpoint_save/restore public C API for direct device-to-device checkpoint copy, eliminating the host serialization bottleneck (169-598 ms → <1 ms per operation).

- Allocate shadow tensors alongside recurrent r/s tensors
- Use ggml_backend_tensor_copy for device-to-device transfer
- Wire server checkpoint callbacks to GPU copy path
- Fall back to serialization when GPU copy is unavailable
- Add checkpoint timing instrumentation (save, restore, seq_rm)
- Add GPU checkpoint copy correctness tests (10 tests)
Continuation of #1; related upstream PR: ggml-org#19493.
Checkpoint-based speculative decoding serializes ~50 MiB of recurrent
state to host memory on every speculative cycle. On CUDA this costs
169-598 ms per save operation — 37% of total generation time.
This adds a device-to-device copy path using ggml_backend_tensor_copy, allocating shadow tensors alongside the recurrent r/s tensors.

A llama_memory_checkpoint_save/restore/delete/supported public C API is added on llama_memory_i and implemented for llama_memory_recurrent (shadow tensors, ~50 MiB, allocated lazily). Metal uses the same API but falls back to malloc→get→set→free on unified memory (~0.4 ms per op, still negligible).
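To illustrate the shadow-tensor scheme, here is a minimal host-side sketch. The struct and member names beyond the PR's llama_memory_checkpoint_save/restore are hypothetical, and std::memcpy stands in for the ggml_backend_tensor_copy device-to-device transfer; the real implementation operates on ggml tensors in device memory.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the recurrent memory's r/s state tensors.
// Host vectors model device tensors; memcpy models ggml_backend_tensor_copy.
struct recurrent_memory {
    std::vector<uint8_t> r, s;               // live recurrent state
    std::vector<uint8_t> r_shadow, s_shadow; // checkpoint shadows (lazy)
    bool has_checkpoint = false;

    // save: allocate shadows on first use, then device-to-device copy
    bool checkpoint_save() {
        if (r_shadow.size() != r.size()) r_shadow.resize(r.size());
        if (s_shadow.size() != s.size()) s_shadow.resize(s.size());
        std::memcpy(r_shadow.data(), r.data(), r.size());
        std::memcpy(s_shadow.data(), s.data(), s.size());
        has_checkpoint = true;
        return true;
    }

    // restore: copy shadows back over the live state
    bool checkpoint_restore() {
        if (!has_checkpoint) return false;
        std::memcpy(r.data(), r_shadow.data(), r.size());
        std::memcpy(s.data(), s_shadow.data(), s.size());
        return true;
    }
};

// round-trip check: save, clobber the state, restore, compare
bool checkpoint_roundtrip_ok() {
    recurrent_memory mem;
    mem.r.assign(64, 0xAB);
    mem.s.assign(32, 0xCD);
    mem.checkpoint_save();
    std::fill(mem.r.begin(), mem.r.end(), 0x00); // speculative decode mutates state
    std::fill(mem.s.begin(), mem.s.end(), 0x00);
    mem.checkpoint_restore();
    return mem.r == std::vector<uint8_t>(64, 0xAB) &&
           mem.s == std::vector<uint8_t>(32, 0xCD);
}
```

Because the shadows live in the same backend buffer as the live tensors, save and restore never cross the PCIe bus, which is where the 100-1000x latency reduction comes from.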
Per-operation checkpoint latency (RTX 3090, Qwen3.5-27B)
GPU copy eliminates the serialization bottleneck — per-operation cost
drops by 100-1000x.
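When a direct device copy is unavailable, the PR falls back to serialization through host memory. A sketch of that staging pattern follows; the tensor_get/tensor_set helpers are hypothetical stand-ins for backend get/set calls, and the numbers in the comments are the PR's own measurements.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical device tensor; get/set model backend calls that stage
// data through host memory.
struct device_tensor { std::vector<uint8_t> data; };

static void tensor_get(const device_tensor & t, void * dst, size_t n) {
    std::memcpy(dst, t.data.data(), n);
}
static void tensor_set(device_tensor & t, const void * src, size_t n) {
    std::memcpy(t.data.data(), src, n);
}

// Fallback checkpoint: malloc -> get -> (later) set -> free.
// On unified memory (Metal) this costs ~0.4 ms per op; over PCIe it is
// the 169-598 ms bottleneck that the GPU copy path removes.
std::vector<uint8_t> checkpoint_via_host(const device_tensor & t) {
    std::vector<uint8_t> host(t.data.size()); // malloc
    tensor_get(t, host.data(), host.size());  // device -> host
    return host;                              // held until restore
}

void restore_via_host(device_tensor & t, const std::vector<uint8_t> & host) {
    tensor_set(t, host.data(), host.size());  // host -> device
}                                             // host buffer freed by caller

bool fallback_roundtrip_ok() {
    device_tensor t;
    t.data.assign(16, 7);
    auto cp = checkpoint_via_host(t);
    t.data.assign(16, 0);
    restore_via_host(t, cp);
    return t.data == std::vector<uint8_t>(16, 7);
}
```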
Note: sustained throughput with ngram speculation is still limited by
draft acceptance rate, not checkpoint latency. See
petter-b/llama.cpp#1
for detailed throughput observations.
Tests
Existing checkpoint tests pass (10/10). New test:
test_spec_checkpoint_gpu_copy.py — verifies save/restore round-trip correctness and timing.
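The shape of such a correctness-plus-timing check can be sketched as follows. This is not the actual test harness (which is Python driving the server); a memcpy over a 50 MiB buffer stands in for the checkpoint save being timed.

```cpp
#include <chrono>
#include <cstdint>
#include <cstring>
#include <vector>

// Time one checkpoint save (modeled as a ~50 MiB host copy) in ms.
// The real test would time llama_memory_checkpoint_save instead.
double time_save_ms(std::vector<uint8_t> & state, std::vector<uint8_t> & shadow) {
    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    std::memcpy(shadow.data(), state.data(), state.size());
    auto t1 = clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

bool timing_check_ok() {
    const size_t n = 50u << 20; // ~50 MiB of recurrent state
    std::vector<uint8_t> state(n, 1), shadow(n, 0);
    double ms = time_save_ms(state, shadow);
    // correctness first, then a deliberately loose latency sanity bound
    return shadow == state && ms < 1000.0;
}
```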
Files changed (12 files, +530/-28)
- include/llama.h — public C API
- src/llama-memory.h — virtual checkpoint methods on llama_memory_i
- src/llama-memory-recurrent.h/cpp — shadow tensors, save/restore impl
- src/llama-context.cpp — API bridge
- tools/server/server-context.cpp — callback wiring, timing instrumentation

Disclosure
Claude Code was used extensively for research, debugging, test generation, code implementation, and writing this PR body. Validation was done through TDD — each feature had a failing test before its implementation — verified on a CUDA RTX 3090 and a Mac mini M4 (Metal).
cc @srogmann