@yuz207 yuz207 commented Oct 20, 2025

Summary

  • Introduces per-layer staged token tracking to enable vectorized NWOR commit across multiple layers within a single window.
  • Replaces the single _staged_tokens state with a per-layer mapping, allowing independent offset tracking per layer during staging.
  • Adds tests to validate multi-layer staging and correct slot_mapping emission in a vectorized commit scenario.

Changes

Core

  • vllm/v1/kv_cache/deferred.py
    • Replace self._staged_tokens with self._layer_staged_tokens: dict[str, int] to track staged tokens per layer.
    • begin_window now resets only per-layer state and does not touch a global staged counter.
    • stage_layer now uses per-layer offsets to compute the start position for a given layer and updates the per-layer offset after staging.
    • The commit path still enforces token-count constraints, but now computes them per layer from _layer_staged_tokens rather than from a single global counter.
    • Deferred window reset clears _layer_staged_tokens instead of a global counter.
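The per-layer state change above can be sketched as follows. This is a minimal illustration of the described bookkeeping, not the actual `vllm/v1/kv_cache/deferred.py` implementation; apart from `_layer_staged_tokens`, `begin_window`, and `stage_layer`, the names and signatures here are hypothetical, and the real manager also performs the KV writes, commit validation, and metrics.

```python
class DeferredWriteManagerSketch:
    """Hypothetical sketch of the per-layer staging state described above."""

    def __init__(self) -> None:
        # Per-layer count of tokens staged in the current window,
        # replacing the old global self._staged_tokens counter.
        self._layer_staged_tokens: dict[str, int] = {}

    def begin_window(self) -> None:
        # Reset only per-layer state; no global staged counter remains.
        self._layer_staged_tokens.clear()

    def stage_layer(self, layer_name: str, num_tokens: int) -> int:
        # The start offset for this layer is however many tokens it has
        # already staged in this window; layers advance independently.
        start = self._layer_staged_tokens.get(layer_name, 0)
        self._layer_staged_tokens[layer_name] = start + num_tokens
        return start

    def staged_tokens(self, layer_name: str) -> int:
        return self._layer_staged_tokens.get(layer_name, 0)
```

Because each layer keeps its own offset, staging layer A never shifts layer B's start position, which is what allows all layers in a window to be committed in one vectorized pass.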

Tests

  • tests/v1/test_deferred_writer.py
    • Added test_deferred_manager_multiple_layers_full_window to verify multi-layer staging within a single window:
      • Two layers stage writes with a shared slot_mapping, ensuring each layer receives the correct start slots [0, 1].
      • Commits the staged tokens and returns metrics (committed, rejected, fallback) matching the expected counts for a multi-layer window.
    • Existing tests continue to exercise partial acceptance and cancel flows, now compatible with per-layer staging.
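The invariant the new multi-layer test exercises can be sketched in isolation. This is an illustrative stand-in, not the actual test body: `stage` is a hypothetical helper mirroring the per-layer offset logic, and the shared `slot_mapping` is reduced to a plain list.

```python
def stage(offsets: dict[str, int], layer: str, n: int) -> int:
    """Hypothetical per-layer staging helper: return this layer's start
    offset in the window, then advance that layer's counter by n."""
    start = offsets.get(layer, 0)
    offsets[layer] = start + n
    return start

slot_mapping = [0, 1]          # shared across both layers
offsets: dict[str, int] = {}

# Each layer stages one token at a time and reads its slot from the
# shared mapping at its own per-layer offset.
for layer in ("layer.0", "layer.1"):
    starts = [stage(offsets, layer, 1) for _ in slot_mapping]
    slots = [slot_mapping[s] for s in starts]
    # Both layers see start slots [0, 1] even though they share one
    # slot_mapping, because their offsets advance independently.
    assert slots == [0, 1]
```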

Why

  • Enables vectorized NWOR commits by allowing multiple layers to be staged and committed within the same window without conflating their offsets.
  • Improves throughput for multi-layer KV cache writes by maintaining separate per-layer progress, leading to more predictable and efficient batching.

Test plan

  • Run: pytest tests/v1/test_deferred_writer.py
  • Verify:
    • test_deferred_manager_multiple_layers_full_window passes and asserts correct per-layer slot mappings.
    • Existing tests for partial acceptance and cancel flows pass with the new per-layer state.

Impact

  • API surface remains unchanged for external callers; internal state management now supports per-layer vectorized commits.
  • Minor risk only if external code depends on the removed single staged-tokens counter; all internal usage has been updated to the per-layer mapping.

Documentation

  • No user-facing docs updated; internal behavior clarified by tests and in-code comments where applicable.

🌿 Generated by Terry


ℹ️ Tag @terragon-labs to ask questions and address PR feedback

📎 Task: https://www.terragonlabs.com/task/cfd77c28-930a-43b3-8a66-f1042d7c7eee

@yuz207 yuz207 marked this pull request as ready for review October 20, 2025 03:24
@yuz207 yuz207 merged commit e67d4cf into performance-fixes Oct 20, 2025
yuz207 added a commit that referenced this pull request Oct 22, 2025
- Implement commit_draft_kernel copying exact pattern from reshape_and_cache_flash
- Support both NHD and HND cache layouts (Flash/Paged)
- Full dtype dispatch: fp16/bf16/fp32 source, auto/fp8/fp8_e5m2 cache
- Proper quantization with CopyWithScaleOp template
- Per-token and scalar scale support
- Mask early-return optimization (Issue #3)
- TORCH_CHECK validation for all pointers (Issue #7)
- Add key_value_dtype to DraftEntry for source dtype tracking

This is Phase 3 (CUDA kernel) of the draft commit implementation.
Next: PyTorch bindings + integration hooks.
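The per-token copy the commit message describes can be approximated in NumPy for illustration. This is an assumption-laden host-side sketch, not the CUDA kernel: `commit_draft_sketch` and its parameters are hypothetical, layout handling is reduced to NHD = [block, token, head, dim] vs HND = [block, head, token, dim], and dtype dispatch/fp8 quantization are collapsed into a simple scale multiply.

```python
import numpy as np

def commit_draft_sketch(src, cache, slot_mapping, mask, scale,
                        block_size, layout="NHD"):
    """Rough sketch of the commit copy: for each unmasked token, scale its
    KV vector and write it into the cache block/offset given by its slot."""
    num_tokens = src.shape[0]
    for t in range(num_tokens):
        if not mask[t]:          # mask early-return optimization
            continue
        block, offset = divmod(slot_mapping[t], block_size)
        # Per-token scale (array) or scalar scale, as in the message above.
        s = scale[t] if np.ndim(scale) else scale
        if layout == "NHD":      # cache[block, token, head, dim]
            cache[block, offset] = src[t] * s
        else:                    # HND: cache[block, head, token, dim]
            cache[block, :, offset] = src[t] * s
```

The real kernel parallelizes this loop over tokens and vector elements and performs the scale as part of fp8 quantization; the sketch only shows the indexing and masking semantics.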
@yuz207 yuz207 deleted the terragon/optimize-nwor-commit-vectorization-godcn2 branch October 25, 2025 03:34