
memory : add GPU-to-GPU checkpoint save/restore for recurrent state#2

Open
petter-b wants to merge 1 commit into feature/checkpoint-fixes-refactored from feature/gpu-to-gpu-checkpoint-copy-clean

Conversation


@petter-b petter-b commented Mar 24, 2026

Continuation of #1; related upstream: ggml-org#19493.

Checkpoint-based speculative decoding serializes ~50 MiB of recurrent
state to host memory on every speculative cycle. On CUDA this costs
169-598 ms per save operation — 37% of total generation time.

This adds a device-to-device copy path using ggml_backend_tensor_copy,
allocating shadow tensors alongside the recurrent r/s tensors.

  • llama_memory_checkpoint_save/restore/delete/supported public C API
    on llama_memory_i, implemented for llama_memory_recurrent
  • Shadow tensor allocation in llama_memory_recurrent (~50 MiB, allocated lazily)
  • Server callbacks wired to GPU copy path with serialization fallback
  • Checkpoint timing instrumentation (save, restore, seq_rm durations)
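The shadow-tensor mechanics in the list above can be sketched as follows. This is a hedged model, not the PR's code: the `Tensor` struct, member names (`shadow_r`, `shadow_s`), and host-side vectors are illustrative stand-ins; in the actual implementation the state lives in ggml_tensor objects and each copy is a ggml_backend_tensor_copy call, which stays on-device for CUDA backends.

```cpp
#include <vector>

// Illustrative stand-in for a backend tensor (real code: ggml_tensor).
struct Tensor { std::vector<float> data; };

struct RecurrentCheckpoint {
    Tensor shadow_r, shadow_s; // shadow tensors for the recurrent r/s state
    bool   allocated = false;

    void save(const Tensor & r, const Tensor & s) {
        if (!allocated) {      // lazy allocation on first save (~50 MiB in the PR)
            shadow_r.data.resize(r.data.size());
            shadow_s.data.resize(s.data.size());
            allocated = true;
        }
        shadow_r.data = r.data; // real code: ggml_backend_tensor_copy(r, shadow_r)
        shadow_s.data = s.data; // real code: ggml_backend_tensor_copy(s, shadow_s)
    }

    void restore(Tensor & r, Tensor & s) const {
        r.data = shadow_r.data; // real code: ggml_backend_tensor_copy(shadow_r, r)
        s.data = shadow_s.data;
    }
};
```

Because the shadows are allocated once and reused, every subsequent checkpoint is a pure device copy with no host round-trip or (de)serialization.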

Metal uses the same API but falls back to malloc→get→set→free on unified
memory (~0.4 ms per op — still negligible).
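The unified-memory fallback sequence can be sketched like this. Again a hedged model: the function name is illustrative and plain memcpy stands in for the real ggml_backend_tensor_get / ggml_backend_tensor_set calls; only the malloc→get→set→free shape matches the description above.

```cpp
#include <cstdlib>
#include <cstring>

// Fallback copy via a temporary host staging buffer (malloc -> get -> set -> free).
// Function name is illustrative; memcpy stands in for the ggml backend calls.
static void checkpoint_save_fallback(const void * src, void * dst, std::size_t nbytes) {
    void * tmp = std::malloc(nbytes); // host staging buffer
    std::memcpy(tmp, src, nbytes);    // real code: ggml_backend_tensor_get(src_tensor, tmp, 0, nbytes)
    std::memcpy(dst, tmp, nbytes);    // real code: ggml_backend_tensor_set(dst_tensor, tmp, 0, nbytes)
    std::free(tmp);                   // staging buffer lives only for the copy
}
```

On unified memory the extra hop is cheap (~0.4 ms per op, per the numbers above), which is why Metal can share the same public API without a dedicated device-copy path.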

Per-operation checkpoint latency (RTX 3090, Qwen3.5-27B)

Path                 | Save latency | Restore latency
Serialization (host) | 169-598 ms   | ~23 ms
GPU copy (device)    | <1 ms        | <1 ms

GPU copy eliminates the serialization bottleneck: per-operation save cost
drops by roughly two to three orders of magnitude, and restore cost by
more than 20x.

Note: sustained throughput with ngram speculation is still limited by
draft acceptance rate, not checkpoint latency. See
petter-b/llama.cpp#1
for detailed throughput observations.

Tests

Existing checkpoint tests pass (10/10). New test:
test_spec_checkpoint_gpu_copy.py — verifies save/restore round-trip
correctness and timing.

Files changed (12 files, +530/-28)

  • include/llama.h — public C API
  • src/llama-memory.h — virtual checkpoint methods on llama_memory_i
  • src/llama-memory-recurrent.h/cpp — shadow tensors, save/restore impl
  • src/llama-context.cpp — API bridge
  • tools/server/server-context.cpp — callback wiring, timing instrumentation
  • 5 test files updated/added

Disclosure

Claude Code was used extensively for research, debugging, test generation, code implementation, and writing this PR body. Validation follows TDD: each feature had a failing test before the implementation, verified on a CUDA RTX 3090 and a Mac mini M4 (Metal).

cc @srogmann

Add llama_memory_checkpoint_save/restore public C API for direct
device-to-device checkpoint copy, eliminating the host serialization
bottleneck (169-598 ms → <1 ms per operation).

- Allocate shadow tensors alongside recurrent r/s tensors
- Use ggml_backend_tensor_copy for device-to-device transfer
- Wire server checkpoint callbacks to GPU copy path
- Fall back to serialization when GPU copy is unavailable
- Add checkpoint timing instrumentation (save, restore, seq_rm)
- Add GPU checkpoint copy correctness tests (10 tests)
@petter-b petter-b changed the base branch from master to feature/checkpoint-fixes-refactored March 24, 2026 12:42
@petter-b petter-b marked this pull request as ready for review March 24, 2026 12:43