
@yuz207 yuz207 commented Oct 16, 2025

Summary

  • Implement safe, chunked KV writes to support CUDA graph capture
  • Introduce per-segment acceptance tracking and slicing helpers
  • Extend GPUModelRunner to return acceptance counts alongside mask

Changes

KV Cache

  • Add _slice_scale_segment to support per-segment scale slicing based on entry length
  • Extend DeferredWriteManager with:
    • _req_start_offsets to map draft tokens to request offsets
    • commit now accepting a sequence of accepted counts instead of a boolean mask
    • Per-entry segmentation logic to compute accepted segments across all draft requests
    • Optimized path for full-entry, single-segment acceptance
    • Per-segment slicing for keys, values, slots, and scales, with proper device handling
    • Robust exception handling that triggers fallback with metrics updates
    • Clearing of request offsets on flush/clear
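The per-entry segmentation above can be sketched roughly as follows. This is an illustrative stand-alone function, not the actual `DeferredWriteManager` internals; the name `build_accepted_segments` and the argument layout are assumptions, and the real code operates on tensors rather than plain sequences.

```python
from typing import Sequence


def build_accepted_segments(
    req_start_offsets: Sequence[int],  # start offset of each request's drafts
    draft_lens: Sequence[int],         # number of draft tokens per request
    accepted_counts: Sequence[int],    # accepted tokens per request (a prefix)
) -> list[tuple[int, int]]:
    """Return (start, end) ranges of accepted tokens in the flat draft buffer."""
    segments: list[tuple[int, int]] = []
    for start, n_draft, n_acc in zip(req_start_offsets, draft_lens, accepted_counts):
        if not 0 <= n_acc <= n_draft:
            raise ValueError(f"accepted count {n_acc} out of range for {n_draft} drafts")
        if n_acc:  # zero accepted tokens produce no segment
            segments.append((start, start + n_acc))
    return segments
```

Each resulting range can then be used to slice keys, values, slots, and scales for one segment; a single segment covering a full entry corresponds to the optimized fast path.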

GPU Model Runner

  • _build_nwor_acceptance_mask now returns a tuple (mask, counts)
  • Vectorized path updated to propagate and return acceptance counts
  • _scv_vectorized_mask now returns (mask_work, accepted_counts) when available
  • Commit path updated to pass accepted_counts to DeferredWriteManager.commit
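A minimal sketch of the `(mask, counts)` return shape, using plain lists in place of the tensors the real `_build_nwor_acceptance_mask` works with; the helper name here is illustrative only.

```python
from typing import Sequence


def build_acceptance_mask(
    draft_lens: Sequence[int],
    accepted_counts: Sequence[int],
) -> tuple[list[bool], list[int]]:
    """Flat per-token acceptance mask plus per-request accepted counts."""
    mask: list[bool] = []
    for n_draft, n_acc in zip(draft_lens, accepted_counts):
        # Each request accepts a prefix of its draft tokens.
        mask.extend([True] * n_acc + [False] * (n_draft - n_acc))
    return mask, list(accepted_counts)
```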

Tests

  • Update tests to reflect new API:
    • Use manager.commit([1]) instead of a boolean mask
    • Validate that _build_nwor_acceptance_mask returns (mask, counts) and that counts match expectations
    • Extend tests to verify per-segment acceptance counts in vectorized paths
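The updated test shape might look like the sketch below; `StubDeferredWriteManager` is a stand-in used only to show the counts-based `commit` contract, not the real vLLM class.

```python
class StubDeferredWriteManager:
    """Toy manager illustrating commit(accepted_counts) semantics."""

    def __init__(self) -> None:
        self.pending: list[list[str]] = []  # staged draft tokens per request
        self.committed: list[str] = []

    def stage(self, tokens: list[str]) -> None:
        self.pending.append(list(tokens))

    def commit(self, accepted_counts: list[int]) -> None:
        # New interface: one count per staged request, not a boolean mask.
        assert len(accepted_counts) == len(self.pending)
        for tokens, n_acc in zip(self.pending, accepted_counts):
            self.committed.extend(tokens[:n_acc])
        self.pending.clear()


def test_commit_with_counts() -> None:
    mgr = StubDeferredWriteManager()
    mgr.stage(["k0", "k1"])
    mgr.commit([1])  # accept only the first draft token
    assert mgr.committed == ["k0"]
    assert mgr.pending == []
```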

Why

  • Enables safe CUDA graph capture by supporting partial and segmented writes without risking incorrect slices or race conditions
  • Provides explicit acceptance tracking across multiple draft requests, increasing robustness in graph-captured execution

API Changes

  • DeferredWriteManager.commit signature changed from commit(mask: Tensor) to commit(accepted_counts: Sequence[int])
  • _build_nwor_acceptance_mask now returns a tuple (mask: Tensor, counts: list[int]) instead of the mask alone
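For callers migrating from the old mask-based signature, per-request counts can be derived from a flat boolean mask, assuming (as the new API does) that each request accepts a contiguous prefix of its drafts. This `mask_to_counts` helper is hypothetical, not part of the PR.

```python
from typing import Sequence


def mask_to_counts(mask: Sequence[bool], draft_lens: Sequence[int]) -> list[int]:
    """Convert a flat acceptance mask into per-request accepted counts."""
    counts, offset = [], 0
    for n in draft_lens:
        seg = list(mask[offset:offset + n])
        n_acc = sum(seg)
        # Prefix-acceptance invariant: all True entries precede all False ones.
        if not (all(seg[:n_acc]) and not any(seg[n_acc:])):
            raise ValueError("mask is not a per-request prefix")
        counts.append(n_acc)
        offset += n
    return counts
```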

Backward Compatibility

  • Internal refactors align with the CUDA graph capture goals, and all call sites within this repo have been updated accordingly. External callers that relied on the old boolean-mask commit signature will need to migrate to the new accepted_counts interface.

Testing Plan

  • Run unit tests for KV cache deferred writer: pytest tests/v1/test_deferred_writer.py
  • Validate that commit calls with [1] produce expected writes and that counts are correctly propagated
  • Validate _build_nwor_acceptance_mask returns both mask and counts and that counts align with draft token segmentation

Notes for reviewers

  • The new per-segment logic ensures safe writes even when CUDA graph capture requires partial acceptance across multiple draft requests
  • The implementation introduces additional branch paths (the optimized full-entry path versus per-segment slicing); behavior remains correct on every path, including the metrics-updating fallback on failure

🌿 Generated by Terry


ℹ️ Tag @terragon-labs to ask questions and address PR feedback

📎 Task: https://www.terragonlabs.com/task/7ed05cf2-ef51-40c2-b908-a395d47b0386

…or deferred KV writes

- Changed DeferredWriteManager.commit to accept counts per request rather than boolean masks.
- Updated commit logic to handle partial segment acceptance and slicing of scales accordingly.
- Modified GPUModelRunner to build and pass accepted counts alongside acceptance masks.
- Updated tests to reflect new commit interface and logic.

This change enables finer-grained control over the commit stage by tracking the count of accepted tokens per draft request, improving correctness and flexibility in the deferred key-value caching mechanism.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
@yuz207 yuz207 marked this pull request as ready for review October 16, 2025 02:03
@yuz207 yuz207 closed this Oct 20, 2025
yuz207 added a commit that referenced this pull request Oct 24, 2025
…dexing

- Replace commit_draft_layer with restore_rejected_drafts for CoW semantics
  * Accepted tokens already in cache from reshape_and_cache_flash (no extra work)
  * Rejected tokens restored from log buffers
  * Handle FP8 per-token scales in restoration
- Make torch.cuda.synchronize() conditional via VLLM_NWOR_DEBUG_SYNC (ISSUE #6)
- Fix fallback indexing bug (ISSUE #4):
  * Map mask indices to batch positions via _draft_positions
  * Prevents silent corruption when kernel fallback is triggered

This completes the Python-side CoW implementation. CUDA kernel restore_rejected_drafts will be added next.
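The copy-on-write restore described in this commit can be illustrated with a toy slot-indexed cache: accepted tokens are left in place (they were already written by the reshape-and-cache-style kernel), while rejected slots are rolled back from the log buffers captured before the speculative write. The dict-based structures and the exact signature here are illustrative assumptions, not the real kernel interface.

```python
def restore_rejected_drafts(
    cache: dict[int, str],
    log: dict[int, str],
    rejected_slots: list[int],
) -> None:
    """Undo speculative writes for rejected slots.

    cache: slot -> current value (post speculative write)
    log:   slot -> value saved before the draft write, if the slot was occupied
    """
    for slot in rejected_slots:
        if slot in log:
            cache[slot] = log[slot]  # restore the pre-draft value
        else:
            cache.pop(slot, None)    # slot was empty before the draft
```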
@yuz207 yuz207 deleted the terragon/optimize-scv-with-cudagraph-capture-zljdbl branch October 25, 2025 03:34