
PAR-054: Implement CUDA Graph Capture/Replay for Decode Loop #36

@noahgift

Description

Problem

The current decode loop launches ~280 kernels per token, versus ~30 in llama.cpp. Each kernel launch carries ~20µs of overhead, so the loop pays ~5.6ms of launch overhead per token. This is the primary remaining bottleneck now that PAR-051 (the allocation fix) has delivered a 20x improvement.

Current Performance:

  • APR: 216-219 tok/s
  • Ollama: 293 tok/s
  • Target (2x llama.cpp): 778 tok/s
  • Gap to target: ~3.5x

Root Cause

From [Ruiz, 2026] "PyTorch and CPU-GPU Synchronizations":

"CPU-GPU synchronizations are a blocking operation that prevents the CPU from scheduling new work on the GPU... The CPU is said to run ahead of the GPU."

Reference: https://tomasruizt.github.io/posts/08_cpu_gpu_synchronization/

Solution: CUDA Graph Capture

CUDA graphs record a kernel sequence once and replay it with a single ~10µs launch, replacing ~280 separate launches at ~20µs each.

Infrastructure Added (PAR-054)

  1. CudaGraphExec + CaptureMode imports
  2. decode_graph: Option<CudaGraphExec> - cached graph executable
  3. position_buf: Option<GpuBuffer<u32>> - device-side position for graph replay
  4. graph_input_buf: Option<GpuBuffer<f32>> - stable input buffer for capture
  5. decode_token_count: usize - tracks first decode for capture
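
A sketch of how these fields could sit on the executor (the field names and types come from the list above; the CudaExecutor layout and the trueno-gpu import path are assumptions):

```rust
// Sketch only: field names/types are from this issue; the import path and
// the surrounding struct are assumptions, not the actual src/cuda.rs layout.
use trueno_gpu::{CaptureMode, CudaGraphExec, GpuBuffer};

pub struct CudaExecutor {
    // ... existing fields (streams, weights, KV cache, ...) ...

    /// Cached graph executable; None until the first decode is captured.
    decode_graph: Option<CudaGraphExec>,
    /// Device-side position, read indirectly by the scatter kernel.
    position_buf: Option<GpuBuffer<u32>>,
    /// Stable input buffer whose address is baked into the captured graph.
    graph_input_buf: Option<GpuBuffer<f32>>,
    /// Decode-step counter; the first decode triggers capture.
    decode_token_count: usize,
}
```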

Position Indirection Challenge

CUDA graphs capture kernel parameters at capture time. The KV cache scatter kernel uses the position to compute destination offsets, so with the position passed as a direct kernel parameter the graph cannot be replayed correctly: the captured parameter is frozen, but the position changes every token.

Solution: KvCacheScatterIndirectKernel in trueno-gpu reads the position from a device buffer instead. The buffer's address is constant (and is what the graph captures), while its contents are updated before each replay via an async memcpy.
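
At the call site the switch looks roughly like this (a before/after fragment; the launch signatures and the direct-kernel name are hypothetical, while KvCacheScatterIndirectKernel and the device-buffer indirection come from this issue):

```rust
// Before (direct): `position` is a host-side u32 frozen into the captured
// kernel parameters, so a replayed graph would always scatter into the
// capture-time slot. (Hypothetical signature.)
KvCacheScatterKernel::launch(stream, kv_cache, new_kv, position)?;

// After (indirect): only the buffer's address is captured; the kernel
// dereferences it at run time, reading the current position from device
// memory on every replay. (Hypothetical signature.)
KvCacheScatterIndirectKernel::launch(stream, kv_cache, new_kv, position_buf)?;
```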

Implementation Steps

  • Allocate position buffer on first decode
  • Switch to KvCacheScatterIndirect kernel when graph capture enabled
  • Use stream.begin_capture(CaptureMode::Global) to capture first decode
  • Use stream.end_capture() and graph.instantiate() to get executable
  • On subsequent decodes: update the position buffer, then stream.launch_graph(&exec) (see the sketch after this list)
  • Add benchmark comparing graph vs non-graph path
  • Handle edge cases (prefill vs decode, variable sequence lengths)
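
A sketch of the whole flow, using only the stream/graph calls named above (CudaError, the async memcpy helper, and launch_decode_kernels are hypothetical stand-ins). One subtlety: stream capture records kernels without executing them, so the first decode should also launch the freshly instantiated graph:

```rust
impl CudaExecutor {
    /// Capture on the first decode, replay on every decode after that.
    fn decode_step(&mut self, position: u32) -> Result<(), CudaError> {
        // Refresh the device-side position before the graph runs. The
        // buffer's address is stable; only its contents change per token.
        let pos_buf = self.position_buf.as_mut().expect("allocated on first decode");
        self.stream.memcpy_htod_async(pos_buf, &[position])?; // hypothetical async copy

        if self.decode_graph.is_none() {
            // First decode: record the ~280-kernel sequence once. Capture
            // only records; nothing executes yet.
            self.stream.begin_capture(CaptureMode::Global)?;
            self.launch_decode_kernels()?; // hypothetical: enqueues all decode kernels
            let graph = self.stream.end_capture()?;
            self.decode_graph = Some(graph.instantiate()?);
        }

        // Every decode, including the first, then runs as one ~10µs graph
        // launch instead of ~280 individual kernel launches.
        let exec = self.decode_graph.as_ref().expect("captured above");
        self.stream.launch_graph(exec)?;

        self.decode_token_count += 1;
        Ok(())
    }
}
```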

Expected Improvement

  • Current: ~280 kernel launches × ~20µs = 5.6ms overhead/token
  • With graphs: Single graph launch ~10µs
  • Improvement: ~560x reduction in launch overhead; since kernel execution time then dominates the per-token cost, the expected end-to-end speedup is 2-3x

Files to Modify

  • src/cuda.rs: CudaExecutor::forward_all_layers_gpu_to_logits - add graph capture/replay
  • src/cuda.rs: CudaExecutor::incremental_attention_into - use indirect scatter kernel
  • src/gguf.rs: OwnedQuantizedModel::generate_gpu_resident - wire up graph mode

Related

  • PAR-051: Fixed 28 GPU allocations per token (20x improvement)
  • PAR-052: KV cache scatter kernel (no perf gain; the D2D copies it replaced were already async)
  • PAR-053: FP16 kernel infrastructure
  • trueno-gpu: KvCacheScatterIndirectKernel added
