Description
Problem
The current decode loop launches ~280 kernels per token, versus ~30 in llama.cpp. Each kernel launch has ~20µs of overhead, resulting in ~5.6ms of overhead per token. This is the primary remaining bottleneck after PAR-051 (allocation fix) achieved a 20x improvement.
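The arithmetic behind these figures can be checked directly. This is a back-of-envelope sketch; the per-launch (~20µs) and per-replay (~10µs) costs are the issue's estimates, not measurements:

```rust
fn main() {
    // Figures quoted in this issue; units are microseconds.
    let launches_per_token = 280u64;
    let launch_overhead_us = 20u64;

    // ~280 launches x ~20us each = ~5.6ms of launch overhead per token.
    let overhead_per_token_us = launches_per_token * launch_overhead_us;
    assert_eq!(overhead_per_token_us, 5_600);

    // A single CUDA graph replay costs ~10us, so launch overhead
    // shrinks by roughly 5600 / 10 = 560x.
    let graph_launch_us = 10u64;
    assert_eq!(overhead_per_token_us / graph_launch_us, 560);

    println!("overhead/token: {}us", overhead_per_token_us);
}
```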
Current Performance:
- APR: 216-219 tok/s
- Ollama: 293 tok/s
- Target (2x llama.cpp): 778 tok/s
- Gap: 3.5x
Root Cause
From [Ruiz, 2026] "PyTorch and CPU-GPU Synchronizations":
"CPU-GPU synchronizations are a blocking operation that prevents the CPU from scheduling new work on the GPU... The CPU is said to run ahead of the GPU."
Reference: https://tomasruizt.github.io/posts/08_cpu_gpu_synchronization/
Solution: CUDA Graph Capture
CUDA graphs capture a kernel sequence once and replay it with a single ~10µs launch, instead of ~280 launches at ~20µs each.
Infrastructure Added (PAR-054)
- `CudaGraphExec` + `CaptureMode` imports
- `decode_graph: Option<CudaGraphExec>` - cached graph executable
- `position_buf: Option<GpuBuffer<u32>>` - device-side position for graph replay
- `graph_input_buf: Option<GpuBuffer<f32>>` - stable input buffer for capture
- `decode_token_count: usize` - tracks the first decode for capture
Position Indirection Challenge
CUDA graphs capture kernel parameters at capture time. The KV cache scatter kernel uses the position to compute destination offsets, so with the position passed as a direct kernel parameter the graph cannot be replayed (the position changes every token).
Solution: `KvCacheScatterIndirectKernel` in trueno-gpu reads the position from a device buffer. The buffer address is constant (captured into the graph); the buffer contents are updated before each replay via an async memcpy.
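The indirection principle can be illustrated with a plain-Rust mock (no CUDA here: `Rc<Cell<u32>>` stands in for the device-side position buffer, and the captured closure stands in for the recorded kernel parameters):

```rust
use std::cell::Cell;
use std::rc::Rc;

fn main() {
    // Stand-in for the device-side position buffer: the graph captures
    // the buffer's *address*, not its contents.
    let position_buf: Rc<Cell<u32>> = Rc::new(Cell::new(0));

    // "Capture": the recorded work holds a reference to the buffer.
    // A direct-parameter kernel would instead bake in the value 0 here.
    let buf_ref = Rc::clone(&position_buf);
    let replay = move || -> u32 {
        // The indirect kernel reads the position from the buffer at run time.
        buf_ref.get()
    };

    // Before each replay, the host updates the buffer contents
    // (an async memcpy in the real implementation).
    for token_pos in [7u32, 8, 9] {
        position_buf.set(token_pos);
        assert_eq!(replay(), token_pos); // the replay sees the new position
    }
    println!("indirect replays observed updated positions");
}
```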
Implementation Steps
- Allocate position buffer on first decode
- Switch to `KvCacheScatterIndirect` kernel when graph capture enabled
- Use `stream.begin_capture(CaptureMode::Global)` to capture first decode
- Use `stream.end_capture()` and `graph.instantiate()` to get executable
- On subsequent decodes: update position buffer, then `stream.launch_graph(&exec)`
- Add benchmark comparing graph vs non-graph path
- Handle edge cases (prefill vs decode, variable sequence lengths)
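The capture-on-first-decode / replay-afterwards flow can be sketched with mock types. `Stream`, `CaptureMode`, and `GraphExec` below are stand-ins that only mirror the names used in this issue, not the real trueno-gpu API:

```rust
#[derive(Clone, Copy)]
enum CaptureMode { Global }

struct GraphExec { recorded_launches: usize }

struct Stream { capturing: bool, pending: usize, launches: usize }

impl Stream {
    fn new() -> Self { Stream { capturing: false, pending: 0, launches: 0 } }
    fn begin_capture(&mut self, _mode: CaptureMode) { self.capturing = true; self.pending = 0; }
    fn launch_kernel(&mut self) {
        // During capture, launches are recorded instead of executed.
        if self.capturing { self.pending += 1; } else { self.launches += 1; }
    }
    fn end_capture(&mut self) -> GraphExec {
        self.capturing = false;
        GraphExec { recorded_launches: self.pending }
    }
    fn launch_graph(&mut self, _exec: &GraphExec) { self.launches += 1; }
}

fn decode_one_token(stream: &mut Stream) {
    // Stand-in for the ~280 per-token kernel launches.
    for _ in 0..280 { stream.launch_kernel(); }
}

fn main() {
    let mut stream = Stream::new();
    let mut decode_graph: Option<GraphExec> = None;
    let mut position = 0u32; // would live in position_buf on-device

    for _token in 0..4 {
        position += 1; // host updates the device position buffer each token
        match &decode_graph {
            None => {
                // First decode: capture the full kernel sequence.
                stream.begin_capture(CaptureMode::Global);
                decode_one_token(&mut stream);
                decode_graph = Some(stream.end_capture());
            }
            // Subsequent decodes: single graph launch instead of 280.
            Some(exec) => stream.launch_graph(exec),
        }
    }

    // Only the 3 replays hit the stream as real launches.
    assert_eq!(stream.launches, 3);
    assert_eq!(decode_graph.unwrap().recorded_launches, 280);
    println!("position after 4 tokens: {}", position);
}
```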
Expected Improvement
- Current: ~280 kernel launches × ~20µs = 5.6ms overhead/token
- With graphs: Single graph launch ~10µs
- Improvement: ~560x reduction in launch overhead → 2-3x total speedup
Files to Modify
- `src/cuda.rs`: `CudaExecutor::forward_all_layers_gpu_to_logits` - add graph capture/replay
- `src/cuda.rs`: `CudaExecutor::incremental_attention_into` - use indirect scatter kernel
- `src/gguf.rs`: `OwnedQuantizedModel::generate_gpu_resident` - wire up graph mode
Related
- PAR-051: Fixed 28 GPU allocations per token (20x improvement)
- PAR-052: KV cache scatter kernel (no perf gain - D2D copies were async)
- PAR-053: FP16 kernel infrastructure
- trueno-gpu: KvCacheScatterIndirectKernel added