Motivation.
1 Overview
Refactor the KV cache transfer logic by extracting duplicated code from GPUARModelRunner and GPUDiffusionModelRunner into a unified OmniKVTransferManager class.
1.1 Motivation
The current codebase has duplicated KV cache transfer logic across two model runners:
| Component | Role | Duplicated Logic |
|---|---|---|
| GPUARModelRunner | Sender | Connector creation, config parsing, retry, KV extraction |
| GPUDiffusionModelRunner | Receiver | Connector creation, config parsing, polling |
Problems:
- Maintenance burden: bug fixes must be applied in multiple places, and rebases against new vLLM versions produce conflicts in both copies.
- Inconsistent behavior: Separate implementations may diverge over time.
- Poor separation of concerns: Model runners handle both execution AND transfer logic.
1.2 Target
In Scope:
- Create OmniKVTransferManager with connector lifecycle, KV extraction, send/receive operations
- Refactor GPUARModelRunner to delegate transfer operations to manager
- Refactor GPUDiffusionModelRunner to delegate receive operations to manager
Out of Scope:
- Changes to OmniConnector implementations
- Scheduler modifications
- Async transfer optimization
Accuracy: Data integrity preserved, backward compatible with existing configs
Performance: No regression
2 Design
2.1 Overview of Design
```mermaid
graph TB
    subgraph "Proposed Architecture"
        AR[GPUARModelRunner]
        DiT[GPUDiffusionModelRunner]
        KVM[OmniKVTransferManager]
        Conn[OmniConnector]
        AR -->|delegates| KVM
        DiT -->|delegates| KVM
        KVM --> Conn
    end
```
Component Responsibilities:
| Component | Responsibility |
|---|---|
| GPUARModelRunner | Model execution, delegates KV transfer to manager |
| GPUDiffusionModelRunner | Model execution, delegates KV receive to manager |
| OmniKVTransferManager | Connector lifecycle, KV extraction, send with retry, receive with timeout |
| OmniConnector | Low-level transport (Mooncake/SharedMemory) |
Transfer Flow:
```
AR Scheduler → GPUARModelRunner → OmniKVTransferManager → OmniConnector
                                                               │
GPUDiffusionModelRunner ← OmniKVTransferManager ←──────────────┘
```
2.2 API Design
OmniKVTransferManager provides:
- `get_connector()` - Lazy initialization of connector
- `extract_kv_cache()` - Extract KV from GPU blocks
- `handle_finished_requests_kv_transfer()` - Batch process finished requests
- `send_kv_cache()` - Send with retry logic
- `receive_kv_cache()` - Receive with timeout
- Factory methods: `from_model_config()`, `from_od_config()`
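The API above can be sketched as a small class. This is a minimal illustration, not the actual implementation: the connector protocol, constructor parameters, and backoff policy are all assumptions; only the method names come from the proposal.

```python
import time
from typing import Any, Callable, Optional


class OmniKVTransferManager:
    """Sketch of the unified KV transfer manager (parameters are assumptions).

    The connector is created lazily on first use; send retries a bounded
    number of times; receive polls until a timeout expires.
    """

    def __init__(self, connector_factory: Callable[[], Any],
                 max_retries: int = 3, recv_timeout_s: float = 5.0):
        self._connector_factory = connector_factory
        self._connector: Optional[Any] = None
        self.max_retries = max_retries
        self.recv_timeout_s = recv_timeout_s

    def get_connector(self):
        # Lazy initialization: build the connector on first access only.
        if self._connector is None:
            self._connector = self._connector_factory()
        return self._connector

    def send_kv_cache(self, request_id: str, kv_data) -> bool:
        # Retry the send a bounded number of times before giving up.
        for attempt in range(self.max_retries):
            try:
                self.get_connector().send(request_id, kv_data)
                return True
            except ConnectionError:
                time.sleep(0.01 * (attempt + 1))  # simple linear backoff
        return False

    def receive_kv_cache(self, request_id: str):
        # Poll the connector until data arrives or the timeout expires.
        deadline = time.monotonic() + self.recv_timeout_s
        while time.monotonic() < deadline:
            data = self.get_connector().poll(request_id)
            if data is not None:
                return data
            time.sleep(0.001)
        raise TimeoutError(f"KV cache for {request_id} not received")
```

Because both runners go through this one class, retry and timeout behavior cannot diverge between sender and receiver sides.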
Usage pattern:
- AR runner calls `handle_finished_requests_kv_transfer()` after model execution
- DiT runner calls `receive_kv_cache()` before model execution
3 Test Cases
| Category | Test Case | Purpose |
|---|---|---|
| Unit | test_connector_lazy_init | Verify lazy initialization |
| Unit | test_extract_kv_cache | Verify KV extraction |
| Unit | test_send_with_retry | Verify retry logic |
| Unit | test_receive_timeout | Verify timeout handling |
| Integration | test_ar_to_diffusion_transfer | End-to-end KV transfer |
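As one example of how `test_send_with_retry` could be structured, a flaky fake connector can exercise the retry contract; the helper and class names below are assumptions, not the actual test code:

```python
class FlakyConnector:
    """Fails the first N sends, then succeeds (for retry testing)."""
    def __init__(self, failures=2):
        self.failures = failures
        self.sent = {}

    def send(self, request_id, data):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("transient failure")
        self.sent[request_id] = data


def send_with_retry(connector, request_id, data, max_retries=3):
    # Minimal retry loop mirroring the send_kv_cache contract.
    for _ in range(max_retries):
        try:
            connector.send(request_id, data)
            return True
        except ConnectionError:
            continue
    return False


def test_send_with_retry():
    # Two transient failures within the retry budget: send succeeds.
    conn = FlakyConnector(failures=2)
    assert send_with_retry(conn, "r1", b"kv") is True
    assert conn.sent["r1"] == b"kv"
    # Failures exceed the retry budget: send reports failure.
    assert send_with_retry(FlakyConnector(failures=5), "r2", b"kv") is False
```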
4 Files Changed
| File | Change |
|---|---|
| vllm_omni/distributed/omni_connectors/omni_kv_cache_manager.py | NEW |
| vllm_omni/worker/gpu_ar_model_runner.py | MODIFY |
| vllm_omni/diffusion/worker/gpu_diffusion_model_runner.py | MODIFY |
Last Updated: Jan 25, 2026
Author: Wang Zhipeng
Proposed Change.
Feedback Period.
No response
CC List.
@hsliuustc0106 @ZJY0516 @tzhouam @natureofnature @Gaohan123
Any Other Things.
No response