
PAR-054: Implement CUDA Graph Capture/Replay for Decode Loop #36

@noahgift

Description

Problem

The current decode loop launches ~280 kernels per token, versus ~30 in llama.cpp. Each kernel launch carries ~20µs of overhead, so the loop pays ~5.6ms of launch overhead per token. This is the primary remaining bottleneck now that PAR-051 (the allocation fix) has delivered a 20x improvement.

Current Performance:

  • APR: 216-219 tok/s
  • Ollama: 293 tok/s
  • Target (2x llama.cpp): 778 tok/s
  • Gap to target: ~3.5x

Root Cause

From [Ruiz, 2026] "PyTorch and CPU-GPU Synchronizations":

"CPU-GPU synchronizations are a blocking operation that prevents the CPU from scheduling new work on the GPU... The CPU is said to run ahead of the GPU."

Reference: https://tomasruizt.github.io/posts/08_cpu_gpu_synchronization/

Solution: CUDA Graph Capture

CUDA graphs record a kernel sequence once and replay it with a single ~10µs launch, replacing ~280 separate launches at ~20µs each.

Infrastructure Added (PAR-054)

  1. CudaGraphExec + CaptureMode imports
  2. decode_graph: Option<CudaGraphExec> - cached graph executable
  3. position_buf: Option<GpuBuffer<u32>> - device-side position for graph replay
  4. graph_input_buf: Option<GpuBuffer<f32>> - stable input buffer for capture
  5. decode_token_count: usize - tracks first decode for capture
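
A sketch of how these fields could sit on the executor (the field names and types come from the list above; the CudaExecutor layout and the trueno-gpu import path are assumptions):

```rust
// Sketch only: field names/types are from this issue; the import path and
// the surrounding struct are assumptions, not the actual src/cuda.rs layout.
use trueno_gpu::{CaptureMode, CudaGraphExec, GpuBuffer};

pub struct CudaExecutor {
    // ... existing fields (streams, weights, KV cache, ...) ...

    /// Cached graph executable; None until the first decode is captured.
    decode_graph: Option<CudaGraphExec>,
    /// Device-side position, read indirectly by the scatter kernel.
    position_buf: Option<GpuBuffer<u32>>,
    /// Stable input buffer whose address is baked into the captured graph.
    graph_input_buf: Option<GpuBuffer<f32>>,
    /// Decode-step counter; the first decode triggers capture.
    decode_token_count: usize,
}
```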

Position Indirection Challenge

CUDA graphs capture kernel parameters at capture time. The KV cache scatter kernel uses the position to compute destination offsets, so with the position passed as a direct kernel parameter the graph cannot be replayed correctly: the captured parameter is frozen, but the position changes every token.

Solution: KvCacheScatterIndirectKernel in trueno-gpu reads the position from a device buffer instead. The buffer's address is constant (and is what the graph captures), while its contents are updated before each replay via an async memcpy.
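
At the call site the switch looks roughly like this (a before/after fragment; the launch signatures and the direct-kernel name are hypothetical, while KvCacheScatterIndirectKernel and the device-buffer indirection come from this issue):

```rust
// Before (direct): `position` is a host-side u32 frozen into the captured
// kernel parameters, so a replayed graph would always scatter into the
// capture-time slot. (Hypothetical signature.)
KvCacheScatterKernel::launch(stream, kv_cache, new_kv, position)?;

// After (indirect): only the buffer's address is captured; the kernel
// dereferences it at run time, reading the current position from device
// memory on every replay. (Hypothetical signature.)
KvCacheScatterIndirectKernel::launch(stream, kv_cache, new_kv, position_buf)?;
```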

Implementation Steps

  • Allocate position buffer on first decode
  • Switch to KvCacheScatterIndirect kernel when graph capture enabled
  • Use stream.begin_capture(CaptureMode::Global) to capture first decode
  • Use stream.end_capture() and graph.instantiate() to get executable
  • On subsequent decodes: update the position buffer, then stream.launch_graph(&exec) (see the sketch after this list)
  • Add benchmark comparing graph vs non-graph path
  • Handle edge cases (prefill vs decode, variable sequence lengths)
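
A sketch of the whole flow, using only the stream/graph calls named above (CudaError, the async memcpy helper, and launch_decode_kernels are hypothetical stand-ins). One subtlety: stream capture records kernels without executing them, so the first decode should also launch the freshly instantiated graph:

```rust
impl CudaExecutor {
    /// Capture on the first decode, replay on every decode after that.
    fn decode_step(&mut self, position: u32) -> Result<(), CudaError> {
        // Refresh the device-side position before the graph runs. The
        // buffer's address is stable; only its contents change per token.
        let pos_buf = self.position_buf.as_mut().expect("allocated on first decode");
        self.stream.memcpy_htod_async(pos_buf, &[position])?; // hypothetical async copy

        if self.decode_graph.is_none() {
            // First decode: record the ~280-kernel sequence once. Capture
            // only records; nothing executes yet.
            self.stream.begin_capture(CaptureMode::Global)?;
            self.launch_decode_kernels()?; // hypothetical: enqueues all decode kernels
            let graph = self.stream.end_capture()?;
            self.decode_graph = Some(graph.instantiate()?);
        }

        // Every decode, including the first, then runs as one ~10µs graph
        // launch instead of ~280 individual kernel launches.
        let exec = self.decode_graph.as_ref().expect("captured above");
        self.stream.launch_graph(exec)?;

        self.decode_token_count += 1;
        Ok(())
    }
}
```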

Expected Improvement

  • Current: ~280 kernel launches × ~20µs = 5.6ms overhead/token
  • With graphs: Single graph launch ~10µs
  • Improvement: ~560x reduction in launch overhead; since kernel execution time then dominates the per-token cost, the expected end-to-end speedup is 2-3x

Files to Modify

  • src/cuda.rs: CudaExecutor::forward_all_layers_gpu_to_logits - add graph capture/replay
  • src/cuda.rs: CudaExecutor::incremental_attention_into - use indirect scatter kernel
  • src/gguf.rs: OwnedQuantizedModel::generate_gpu_resident - wire up graph mode

Related

  • PAR-051: Fixed 28 GPU allocations per token (20x improvement)
  • PAR-052: KV cache scatter kernel (no perf gain; the D2D copies it replaced were already async)
  • PAR-053: FP16 kernel infrastructure
  • trueno-gpu: KvCacheScatterIndirectKernel added
