@yuz207 yuz207 commented Oct 20, 2025

Summary

  • Introduces per-layer staged token tracking to enable vectorized NWOR commit across multiple layers within a single window.
  • Replaces the single _staged_tokens state with a per-layer mapping, allowing independent offset tracking per layer during staging.
  • Adds tests to validate multi-layer staging and correct slot_mapping emission in a vectorized commit scenario.

Changes

Core

  • vllm/v1/kv_cache/deferred.py
    • Replace self._staged_tokens with self._layer_staged_tokens: dict[str, int] to track staged tokens per layer.
    • begin_window now resets only per-layer state and does not touch a global staged counter.
    • stage_layer now uses per-layer offsets to compute the start position for a given layer and updates the per-layer offset after staging.
    • The commit path still enforces token-count constraints, but now computes them per layer from _layer_staged_tokens rather than from a single global counter.
    • Deferred window reset clears _layer_staged_tokens instead of a global counter.
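The per-layer state change above can be sketched as follows. This is a minimal illustration of the described bookkeeping, not the actual `vllm/v1/kv_cache/deferred.py` implementation; apart from `_layer_staged_tokens`, `begin_window`, and `stage_layer`, the names and signatures here are hypothetical, and the real manager also performs the KV writes, commit validation, and metrics.

```python
class DeferredWriteManagerSketch:
    """Hypothetical sketch of the per-layer staging state described above."""

    def __init__(self) -> None:
        # Per-layer count of tokens staged in the current window,
        # replacing the old global self._staged_tokens counter.
        self._layer_staged_tokens: dict[str, int] = {}

    def begin_window(self) -> None:
        # Reset only per-layer state; no global staged counter remains.
        self._layer_staged_tokens.clear()

    def stage_layer(self, layer_name: str, num_tokens: int) -> int:
        # The start offset for this layer is however many tokens it has
        # already staged in this window; layers advance independently.
        start = self._layer_staged_tokens.get(layer_name, 0)
        self._layer_staged_tokens[layer_name] = start + num_tokens
        return start

    def staged_tokens(self, layer_name: str) -> int:
        return self._layer_staged_tokens.get(layer_name, 0)
```

Because each layer keeps its own offset, staging layer A never shifts layer B's start position, which is what allows all layers in a window to be committed in one vectorized pass.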

Tests

  • tests/v1/test_deferred_writer.py
    • Added test_deferred_manager_multiple_layers_full_window to verify multi-layer staging within a single window:
      • Two layers stage writes with a shared slot_mapping, ensuring each layer receives the correct start slots [0, 1].
      • Commits the staged tokens and returns metrics (committed, rejected, fallback) matching the expected counts for a multi-layer window.
    • Existing tests continue to exercise partial acceptance and cancel flows, now compatible with per-layer staging.
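The invariant the new multi-layer test exercises can be sketched in isolation. This is an illustrative stand-in, not the actual test body: `stage` is a hypothetical helper mirroring the per-layer offset logic, and the shared `slot_mapping` is reduced to a plain list.

```python
def stage(offsets: dict[str, int], layer: str, n: int) -> int:
    """Hypothetical per-layer staging helper: return this layer's start
    offset in the window, then advance that layer's counter by n."""
    start = offsets.get(layer, 0)
    offsets[layer] = start + n
    return start

slot_mapping = [0, 1]          # shared across both layers
offsets: dict[str, int] = {}

# Each layer stages one token at a time and reads its slot from the
# shared mapping at its own per-layer offset.
for layer in ("layer.0", "layer.1"):
    starts = [stage(offsets, layer, 1) for _ in slot_mapping]
    slots = [slot_mapping[s] for s in starts]
    # Both layers see start slots [0, 1] even though they share one
    # slot_mapping, because their offsets advance independently.
    assert slots == [0, 1]
```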

Why

  • Enables vectorized NWOR commits by allowing multiple layers to be staged and committed within the same window without conflating their offsets.
  • Improves throughput for multi-layer KV cache writes by maintaining separate per-layer progress, leading to more predictable and efficient batching.

Test plan

  • Run: pytest tests/v1/test_deferred_writer.py
  • Verify:
    • test_deferred_manager_multiple_layers_full_window passes and asserts correct per-layer slot mappings.
    • Existing tests for partial acceptance and cancel flows pass with the new per-layer state.

Impact

  • API surface remains unchanged for external callers; internal state management now supports per-layer vectorized commits.
  • Minor risk only if external code depends on the removed single staged-tokens counter; all internal usage has been updated to the per-layer mapping.

Documentation

  • No user-facing docs updated; internal behavior clarified by tests and in-code comments where applicable.

🌿 Generated by Terry


ℹ️ Tag @terragon-labs to ask questions and address PR feedback

📎 Task: https://www.terragonlabs.com/task/cfd77c28-930a-43b3-8a66-f1042d7c7eee

@yuz207 yuz207 marked this pull request as ready for review October 20, 2025 03:24
@yuz207 yuz207 merged commit e67d4cf into performance-fixes Oct 20, 2025
yuz207 added a commit that referenced this pull request Oct 22, 2025
- Implement commit_draft_kernel copying exact pattern from reshape_and_cache_flash
- Support both NHD and HND cache layouts (Flash/Paged)
- Full dtype dispatch: fp16/bf16/fp32 source, auto/fp8/fp8_e5m2 cache
- Proper quantization with CopyWithScaleOp template
- Per-token and scalar scale support
- Mask early-return optimization (Issue #3)
- TORCH_CHECK validation for all pointers (Issue #7)
- Add key_value_dtype to DraftEntry for source dtype tracking

This is Phase 3 (CUDA kernel) of the draft commit implementation.
Next: PyTorch bindings + integration hooks.
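The per-token copy the commit message describes can be approximated in NumPy for illustration. This is an assumption-laden host-side sketch, not the CUDA kernel: `commit_draft_sketch` and its parameters are hypothetical, layout handling is reduced to NHD = [block, token, head, dim] vs HND = [block, head, token, dim], and dtype dispatch/fp8 quantization are collapsed into a simple scale multiply.

```python
import numpy as np

def commit_draft_sketch(src, cache, slot_mapping, mask, scale,
                        block_size, layout="NHD"):
    """Rough sketch of the commit copy: for each unmasked token, scale its
    KV vector and write it into the cache block/offset given by its slot."""
    num_tokens = src.shape[0]
    for t in range(num_tokens):
        if not mask[t]:          # mask early-return optimization
            continue
        block, offset = divmod(slot_mapping[t], block_size)
        # Per-token scale (array) or scalar scale, as in the message above.
        s = scale[t] if np.ndim(scale) else scale
        if layout == "NHD":      # cache[block, token, head, dim]
            cache[block, offset] = src[t] * s
        else:                    # HND: cache[block, head, token, dim]
            cache[block, :, offset] = src[t] * s
```

The real kernel parallelizes this loop over tokens and vector elements and performs the scale as part of fp8 quantization; the sketch only shows the indexing and masking semantics.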
@yuz207 yuz207 deleted the terragon/optimize-nwor-commit-vectorization-godcn2 branch October 25, 2025 03:34