
@yuz207 yuz207 commented Oct 16, 2025

Summary

  • Implement safe, chunked KV writes to support CUDA graph capture
  • Introduce per-segment acceptance tracking and slicing helpers
  • Extend GPUModelRunner to return acceptance counts alongside mask

Changes

KV Cache

  • Add _slice_scale_segment to support per-segment scale slicing based on entry length
  • Extend DeferredWriteManager with:
    • _req_start_offsets to map draft tokens to request offsets
    • commit now accepting a sequence of accepted counts instead of a boolean mask
    • Per-entry segmentation logic to compute accepted segments across all draft requests
    • Optimized path for full-entry, single-segment acceptance
    • Per-segment slicing for keys, values, slots, and scales, with proper device handling
    • Robust exception handling that triggers fallback with metrics updates
    • Clearing of request offsets on flush/clear
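The per-entry segmentation above can be sketched roughly as follows. This is an illustrative stand-alone function, not the actual `DeferredWriteManager` internals; the name `build_accepted_segments` and the argument layout are assumptions, and the real code operates on tensors rather than plain sequences.

```python
from typing import Sequence


def build_accepted_segments(
    req_start_offsets: Sequence[int],  # start offset of each request's drafts
    draft_lens: Sequence[int],         # number of draft tokens per request
    accepted_counts: Sequence[int],    # accepted tokens per request (a prefix)
) -> list[tuple[int, int]]:
    """Return (start, end) ranges of accepted tokens in the flat draft buffer."""
    segments: list[tuple[int, int]] = []
    for start, n_draft, n_acc in zip(req_start_offsets, draft_lens, accepted_counts):
        if not 0 <= n_acc <= n_draft:
            raise ValueError(f"accepted count {n_acc} out of range for {n_draft} drafts")
        if n_acc:  # zero accepted tokens produce no segment
            segments.append((start, start + n_acc))
    return segments
```

Each resulting range can then be used to slice keys, values, slots, and scales for one segment; a single segment covering a full entry corresponds to the optimized fast path.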

GPU Model Runner

  • _build_nwor_acceptance_mask now returns a tuple (mask, counts)
  • Vectorized path updated to propagate and return acceptance counts
  • _scv_vectorized_mask now returns (mask_work, accepted_counts) when available
  • Commit path updated to pass accepted_counts to DeferredWriteManager.commit
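A minimal sketch of the `(mask, counts)` return shape, using plain lists in place of the tensors the real `_build_nwor_acceptance_mask` works with; the helper name here is illustrative only.

```python
from typing import Sequence


def build_acceptance_mask(
    draft_lens: Sequence[int],
    accepted_counts: Sequence[int],
) -> tuple[list[bool], list[int]]:
    """Flat per-token acceptance mask plus per-request accepted counts."""
    mask: list[bool] = []
    for n_draft, n_acc in zip(draft_lens, accepted_counts):
        # Each request accepts a prefix of its draft tokens.
        mask.extend([True] * n_acc + [False] * (n_draft - n_acc))
    return mask, list(accepted_counts)
```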

Tests

  • Update tests to reflect new API:
    • Use manager.commit([1]) instead of a boolean mask
    • Validate that _build_nwor_acceptance_mask returns (mask, counts) and that counts match expectations
    • Extend tests to verify per-segment acceptance counts in vectorized paths
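The updated test shape might look like the sketch below; `StubDeferredWriteManager` is a stand-in used only to show the counts-based `commit` contract, not the real vLLM class.

```python
class StubDeferredWriteManager:
    """Toy manager illustrating commit(accepted_counts) semantics."""

    def __init__(self) -> None:
        self.pending: list[list[str]] = []  # staged draft tokens per request
        self.committed: list[str] = []

    def stage(self, tokens: list[str]) -> None:
        self.pending.append(list(tokens))

    def commit(self, accepted_counts: list[int]) -> None:
        # New interface: one count per staged request, not a boolean mask.
        assert len(accepted_counts) == len(self.pending)
        for tokens, n_acc in zip(self.pending, accepted_counts):
            self.committed.extend(tokens[:n_acc])
        self.pending.clear()


def test_commit_with_counts() -> None:
    mgr = StubDeferredWriteManager()
    mgr.stage(["k0", "k1"])
    mgr.commit([1])  # accept only the first draft token
    assert mgr.committed == ["k0"]
    assert mgr.pending == []
```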

Why

  • Enables safe CUDA graph capture by supporting partial and segmented writes without risking incorrect slices or race conditions
  • Provides explicit acceptance tracking across multiple draft requests, increasing robustness in graph-captured execution

API Changes

  • DeferredWriteManager.commit signature changed from commit(mask: Tensor) to commit(accepted_counts: Sequence[int])
  • _build_nwor_acceptance_mask now returns a tuple (mask: Tensor, counts: list[int]) instead of the mask alone
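For callers migrating from the old mask-based signature, per-request counts can be derived from a flat boolean mask, assuming (as the new API does) that each request accepts a contiguous prefix of its drafts. This `mask_to_counts` helper is hypothetical, not part of the PR.

```python
from typing import Sequence


def mask_to_counts(mask: Sequence[bool], draft_lens: Sequence[int]) -> list[int]:
    """Convert a flat acceptance mask into per-request accepted counts."""
    counts, offset = [], 0
    for n in draft_lens:
        seg = list(mask[offset:offset + n])
        n_acc = sum(seg)
        # Prefix-acceptance invariant: all True entries precede all False ones.
        if not (all(seg[:n_acc]) and not any(seg[n_acc:])):
            raise ValueError("mask is not a per-request prefix")
        counts.append(n_acc)
        offset += n
    return counts
```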

Backward Compatibility

  • Internal refactors align with the CUDA graph capture goals, and all call sites within this repo have been updated accordingly. External callers that relied on the old boolean-mask commit signature will need to migrate to the new accepted_counts interface.

Testing Plan

  • Run unit tests for KV cache deferred writer: pytest tests/v1/test_deferred_writer.py
  • Validate that commit calls with [1] produce expected writes and that counts are correctly propagated
  • Validate _build_nwor_acceptance_mask returns both mask and counts and that counts align with draft token segmentation

Notes for reviewers

  • The new per-segment logic ensures safe writes even when CUDA graph capture requires partial acceptance across multiple draft requests
  • The implementation introduces additional branch paths (the optimized full-entry path versus per-segment slicing); behavior remains correct on every path, including the metrics-updating fallback on failure

🌿 Generated by Terry


ℹ️ Tag @terragon-labs to ask questions and address PR feedback

📎 Task: https://www.terragonlabs.com/task/7ed05cf2-ef51-40c2-b908-a395d47b0386

…or deferred KV writes

- Changed DeferredWriteManager.commit to accept counts per request rather than boolean masks.
- Updated commit logic to handle partial segment acceptance and slicing of scales accordingly.
- Modified GPUModelRunner to build and pass accepted counts alongside acceptance masks.
- Updated tests to reflect new commit interface and logic.

This change enables finer-grained control over the commit stage by tracking the count of accepted tokens per draft request, improving correctness and flexibility in the deferred key-value caching mechanism.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
@yuz207 yuz207 marked this pull request as ready for review October 16, 2025 02:03
@yuz207 yuz207 closed this Oct 20, 2025
yuz207 added a commit that referenced this pull request Oct 24, 2025
…dexing

- Replace commit_draft_layer with restore_rejected_drafts for CoW semantics
  * Accepted tokens already in cache from reshape_and_cache_flash (no extra work)
  * Rejected tokens restored from log buffers
  * Handle FP8 per-token scales in restoration
- Make torch.cuda.synchronize() conditional via VLLM_NWOR_DEBUG_SYNC (ISSUE #6)
- Fix fallback indexing bug (ISSUE #4):
  * Map mask indices to batch positions via _draft_positions
  * Prevents silent corruption when kernel fallback is triggered

This completes the Python-side CoW implementation. CUDA kernel restore_rejected_drafts will be added next.
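The copy-on-write restore described in this commit can be illustrated with a toy slot-indexed cache: accepted tokens are left in place (they were already written by the reshape-and-cache-style kernel), while rejected slots are rolled back from the log buffers captured before the speculative write. The dict-based structures and the exact signature here are illustrative assumptions, not the real kernel interface.

```python
def restore_rejected_drafts(
    cache: dict[int, str],
    log: dict[int, str],
    rejected_slots: list[int],
) -> None:
    """Undo speculative writes for rejected slots.

    cache: slot -> current value (post speculative write)
    log:   slot -> value saved before the draft write, if the slot was occupied
    """
    for slot in rejected_slots:
        if slot in log:
            cache[slot] = log[slot]  # restore the pre-draft value
        else:
            cache.pop(slot, None)    # slot was empty before the draft
```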
@yuz207 yuz207 deleted the terragon/optimize-scv-with-cudagraph-capture-zljdbl branch October 25, 2025 03:34