Motivation.
1 Overview
Refactor the KV cache transfer logic by extracting duplicated code from GPUARModelRunner and GPUDiffusionModelRunner into a unified OmniKVTransferManager class.
1.1 Motivation
The current codebase has duplicated KV cache transfer logic across two model runners:
| Component | Role | Duplicated Logic |
|---|---|---|
| GPUARModelRunner | Sender | Connector creation, config parsing, retry, KV extraction |
| GPUDiffusionModelRunner | Receiver | Connector creation, config parsing, polling |
Problems:
- Maintenance burden: bug fixes must be applied in multiple places, and rebases against new vLLM versions produce conflicts in both copies.
- Inconsistent behavior: Separate implementations may diverge over time.
- Poor separation of concerns: Model runners handle both execution AND transfer logic.
1.2 Target
In Scope:
- Create OmniKVTransferManager with connector lifecycle, KV extraction, send/receive operations
- Refactor GPUARModelRunner to delegate transfer operations to manager
- Refactor GPUDiffusionModelRunner to delegate receive operations to manager
Out of Scope:
- Changes to OmniConnector implementations
- Scheduler modifications
- Async transfer optimization
Accuracy: Data integrity preserved, backward compatible with existing configs
Performance: No regression
2 Design
2.1 Overview of Design
```mermaid
graph TB
    subgraph "Proposed Architecture"
        AR[GPUARModelRunner]
        DiT[GPUDiffusionModelRunner]
        KVM[OmniKVTransferManager]
        Conn[OmniConnector]
        AR -->|delegates| KVM
        DiT -->|delegates| KVM
        KVM --> Conn
    end
```
Component Responsibilities:
| Component | Responsibility |
|---|---|
| GPUARModelRunner | Model execution, delegates KV transfer to manager |
| GPUDiffusionModelRunner | Model execution, delegates KV receive to manager |
| OmniKVTransferManager | Connector lifecycle, KV extraction, send with retry, receive with timeout |
| OmniConnector | Low-level transport (Mooncake/SharedMemory) |
Transfer Flow:
```
AR Scheduler → GPUARModelRunner → OmniKVTransferManager → OmniConnector
                                                               │
GPUDiffusionModelRunner ← OmniKVTransferManager ←──────────────┘
```
2.2 API Design
OmniKVTransferManager provides:
- `get_connector()` - Lazy initialization of connector
- `extract_kv_cache()` - Extract KV from GPU blocks
- `handle_finished_requests_kv_transfer()` - Batch process finished requests
- `send_kv_cache()` - Send with retry logic
- `receive_kv_cache()` - Receive with timeout
- Factory methods: `from_model_config()`, `from_od_config()`
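The API above can be sketched as a small class. This is a minimal illustration, not the actual implementation: the connector protocol, constructor parameters, and backoff policy are all assumptions; only the method names come from the proposal.

```python
import time
from typing import Any, Callable, Optional


class OmniKVTransferManager:
    """Sketch of the unified KV transfer manager (parameters are assumptions).

    The connector is created lazily on first use; send retries a bounded
    number of times; receive polls until a timeout expires.
    """

    def __init__(self, connector_factory: Callable[[], Any],
                 max_retries: int = 3, recv_timeout_s: float = 5.0):
        self._connector_factory = connector_factory
        self._connector: Optional[Any] = None
        self.max_retries = max_retries
        self.recv_timeout_s = recv_timeout_s

    def get_connector(self):
        # Lazy initialization: build the connector on first access only.
        if self._connector is None:
            self._connector = self._connector_factory()
        return self._connector

    def send_kv_cache(self, request_id: str, kv_data) -> bool:
        # Retry the send a bounded number of times before giving up.
        for attempt in range(self.max_retries):
            try:
                self.get_connector().send(request_id, kv_data)
                return True
            except ConnectionError:
                time.sleep(0.01 * (attempt + 1))  # simple linear backoff
        return False

    def receive_kv_cache(self, request_id: str):
        # Poll the connector until data arrives or the timeout expires.
        deadline = time.monotonic() + self.recv_timeout_s
        while time.monotonic() < deadline:
            data = self.get_connector().poll(request_id)
            if data is not None:
                return data
            time.sleep(0.001)
        raise TimeoutError(f"KV cache for {request_id} not received")
```

Because both runners go through this one class, retry and timeout behavior cannot diverge between sender and receiver sides.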
Usage pattern:
- AR runner calls `handle_finished_requests_kv_transfer()` after model execution
- DiT runner calls `receive_kv_cache()` before model execution
3 Test Cases
| Category | Test Case | Purpose |
|---|---|---|
| Unit | test_connector_lazy_init | Verify lazy initialization |
| Unit | test_extract_kv_cache | Verify KV extraction |
| Unit | test_send_with_retry | Verify retry logic |
| Unit | test_receive_timeout | Verify timeout handling |
| Integration | test_ar_to_diffusion_transfer | End-to-end KV transfer |
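As one example of how `test_send_with_retry` could be structured, a flaky fake connector can exercise the retry contract; the helper and class names below are assumptions, not the actual test code:

```python
class FlakyConnector:
    """Fails the first N sends, then succeeds (for retry testing)."""
    def __init__(self, failures=2):
        self.failures = failures
        self.sent = {}

    def send(self, request_id, data):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("transient failure")
        self.sent[request_id] = data


def send_with_retry(connector, request_id, data, max_retries=3):
    # Minimal retry loop mirroring the send_kv_cache contract.
    for _ in range(max_retries):
        try:
            connector.send(request_id, data)
            return True
        except ConnectionError:
            continue
    return False


def test_send_with_retry():
    # Two transient failures within the retry budget: send succeeds.
    conn = FlakyConnector(failures=2)
    assert send_with_retry(conn, "r1", b"kv") is True
    assert conn.sent["r1"] == b"kv"
    # Failures exceed the retry budget: send reports failure.
    assert send_with_retry(FlakyConnector(failures=5), "r2", b"kv") is False
```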
4 Files Changed
| File | Change |
|---|---|
| vllm_omni/distributed/omni_connectors/omni_kv_cache_manager.py | NEW |
| vllm_omni/worker/gpu_ar_model_runner.py | MODIFY |
| vllm_omni/diffusion/worker/gpu_diffusion_model_runner.py | MODIFY |
Last Updated: Jan 25, 2026
Author: Wang Zhipeng
Proposed Change.
Feedback Period.
No response
CC List.
@hsliuustc0106 @ZJY0516 @tzhouam @natureofnature @Gaohan123
Any Other Things.
No response