
memory : add GPU-to-GPU checkpoint save/restore for recurrent state#2

Open
petter-b wants to merge 1 commit into feature/checkpoint-fixes-refactored from feature/gpu-to-gpu-checkpoint-copy-clean

Conversation


@petter-b petter-b commented Mar 24, 2026

Continuation of #1; related upstream: ggml-org#19493.

Checkpoint-based speculative decoding serializes ~50 MiB of recurrent
state to host memory on every speculative cycle. On CUDA this costs
169-598 ms per save operation — 37% of total generation time.

This adds a device-to-device copy path using ggml_backend_tensor_copy,
allocating shadow tensors alongside the recurrent r/s tensors.

  • llama_memory_checkpoint_save/restore/delete/supported public C API
    on llama_memory_i, implemented for llama_memory_recurrent
  • Shadow tensor allocation in llama_memory_recurrent (~50 MiB, allocated lazily)
  • Server callbacks wired to GPU copy path with serialization fallback
  • Checkpoint timing instrumentation (save, restore, seq_rm durations)
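The shadow-tensor mechanics in the list above can be sketched as follows. This is a hedged model, not the PR's code: the `Tensor` struct, member names (`shadow_r`, `shadow_s`), and host-side vectors are illustrative stand-ins; in the actual implementation the state lives in ggml_tensor objects and each copy is a ggml_backend_tensor_copy call, which stays on-device for CUDA backends.

```cpp
#include <vector>

// Illustrative stand-in for a backend tensor (real code: ggml_tensor).
struct Tensor { std::vector<float> data; };

struct RecurrentCheckpoint {
    Tensor shadow_r, shadow_s; // shadow tensors for the recurrent r/s state
    bool   allocated = false;

    void save(const Tensor & r, const Tensor & s) {
        if (!allocated) {      // lazy allocation on first save (~50 MiB in the PR)
            shadow_r.data.resize(r.data.size());
            shadow_s.data.resize(s.data.size());
            allocated = true;
        }
        shadow_r.data = r.data; // real code: ggml_backend_tensor_copy(r, shadow_r)
        shadow_s.data = s.data; // real code: ggml_backend_tensor_copy(s, shadow_s)
    }

    void restore(Tensor & r, Tensor & s) const {
        r.data = shadow_r.data; // real code: ggml_backend_tensor_copy(shadow_r, r)
        s.data = shadow_s.data;
    }
};
```

Because the shadows are allocated once and reused, every subsequent checkpoint is a pure device copy with no host round-trip or (de)serialization.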

Metal uses the same API but falls back to malloc→get→set→free on unified
memory (~0.4 ms per op — still negligible).
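The unified-memory fallback sequence can be sketched like this. Again a hedged model: the function name is illustrative and plain memcpy stands in for the real ggml_backend_tensor_get / ggml_backend_tensor_set calls; only the malloc→get→set→free shape matches the description above.

```cpp
#include <cstdlib>
#include <cstring>

// Fallback copy via a temporary host staging buffer (malloc -> get -> set -> free).
// Function name is illustrative; memcpy stands in for the ggml backend calls.
static void checkpoint_save_fallback(const void * src, void * dst, std::size_t nbytes) {
    void * tmp = std::malloc(nbytes); // host staging buffer
    std::memcpy(tmp, src, nbytes);    // real code: ggml_backend_tensor_get(src_tensor, tmp, 0, nbytes)
    std::memcpy(dst, tmp, nbytes);    // real code: ggml_backend_tensor_set(dst_tensor, tmp, 0, nbytes)
    std::free(tmp);                   // staging buffer lives only for the copy
}
```

On unified memory the extra hop is cheap (~0.4 ms per op, per the numbers above), which is why Metal can share the same public API without a dedicated device-copy path.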

Per-operation checkpoint latency (RTX 3090, Qwen3.5-27B)

Path                 | Save latency | Restore latency
Serialization (host) | 169-598 ms   | ~23 ms
GPU copy (device)    | <1 ms        | <1 ms

GPU copy eliminates the serialization bottleneck: per-operation save cost
drops by roughly two to three orders of magnitude, and restore cost by
more than 20x.

Note: sustained throughput with ngram speculation is still limited by
draft acceptance rate, not checkpoint latency. See
petter-b/llama.cpp#1
for detailed throughput observations.

Tests

Existing checkpoint tests pass (10/10). New test:
test_spec_checkpoint_gpu_copy.py — verifies save/restore round-trip
correctness and timing.

Files changed (12 files, +530/-28)

  • include/llama.h — public C API
  • src/llama-memory.h — virtual checkpoint methods on llama_memory_i
  • src/llama-memory-recurrent.h/cpp — shadow tensors, save/restore impl
  • src/llama-context.cpp — API bridge
  • tools/server/server-context.cpp — callback wiring, timing instrumentation
  • 5 test files updated/added

Disclosure

Claude Code was used extensively for research, debugging, test generation, code implementation, and writing this PR body. Validation follows TDD: each feature had a failing test before the implementation, verified on a CUDA RTX 3090 and a Mac mini M4 (Metal).

cc @srogmann

Add llama_memory_checkpoint_save/restore public C API for direct
device-to-device checkpoint copy, eliminating the host serialization
bottleneck (169-598 ms → <1 ms per operation).

- Allocate shadow tensors alongside recurrent r/s tensors
- Use ggml_backend_tensor_copy for device-to-device transfer
- Wire server checkpoint callbacks to GPU copy path
- Fall back to serialization when GPU copy is unavailable
- Add checkpoint timing instrumentation (save, restore, seq_rm)
- Add GPU checkpoint copy correctness tests (10 tests)
@petter-b petter-b changed the base branch from master to feature/checkpoint-fixes-refactored March 24, 2026 12:42
@petter-b petter-b marked this pull request as ready for review March 24, 2026 12:43