
[RFC]: OmniKVTransferManager - Centralized KV Cache Transfer Management #944

@princepride

Description


Motivation.

1 Overview

Refactor the KV cache transfer logic by extracting duplicated code from GPUARModelRunner and GPUDiffusionModelRunner into a unified OmniKVTransferManager class.

1.1 Motivation

The current codebase has duplicated KV cache transfer logic across two model runners:

| Component | Role | Duplicated Logic |
|---|---|---|
| GPUARModelRunner | Sender | Connector creation, config parsing, retry, KV extraction |
| GPUDiffusionModelRunner | Receiver | Connector creation, config parsing, polling |

Problems:

  1. Maintenance burden: Bug fixes must be applied in multiple places, and the duplication causes rebase conflicts when adapting to new vLLM versions.
  2. Inconsistent behavior: Separate implementations may diverge over time.
  3. Poor separation of concerns: Model runners handle both execution AND transfer logic.

1.2 Target

In Scope:

  • Create OmniKVTransferManager with connector lifecycle, KV extraction, send/receive operations
  • Refactor GPUARModelRunner to delegate transfer operations to manager
  • Refactor GPUDiffusionModelRunner to delegate receive operations to manager

Out of Scope:

  • Changes to OmniConnector implementations
  • Scheduler modifications
  • Async transfer optimization

Accuracy: Data integrity is preserved; backward compatible with existing configs.

Performance: No regression expected.


2 Design

2.1 Overview of Design

```mermaid
graph TB
    subgraph "Proposed Architecture"
        AR[GPUARModelRunner]
        DiT[GPUDiffusionModelRunner]
        KVM[OmniKVTransferManager]
        Conn[OmniConnector]

        AR -->|delegates| KVM
        DiT -->|delegates| KVM
        KVM --> Conn
    end
```

Component Responsibilities:

| Component | Responsibility |
|---|---|
| GPUARModelRunner | Model execution, delegates KV transfer to manager |
| GPUDiffusionModelRunner | Model execution, delegates KV receive to manager |
| OmniKVTransferManager | Connector lifecycle, KV extraction, send with retry, receive with timeout |
| OmniConnector | Low-level transport (Mooncake/SharedMemory) |

Transfer Flow:

AR Scheduler → GPUARModelRunner → OmniKVTransferManager → OmniConnector
                                                            ↓
GPUDiffusionModelRunner ← OmniKVTransferManager ← ─ ─ ─ ─ ─ ─ ─

2.2 API Design

OmniKVTransferManager provides:

  • get_connector() - Lazy initialization of connector
  • extract_kv_cache() - Extract KV from GPU blocks
  • handle_finished_requests_kv_transfer() - Batch process finished requests
  • send_kv_cache() - Send with retry logic
  • receive_kv_cache() - Receive with timeout
  • Factory methods: from_model_config(), from_od_config()
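The API above could be sketched as follows. This is a minimal illustration, not the proposed implementation: the constructor signature, the injected connector factory, and the retry/timeout parameters are assumptions; only the method names come from this RFC.

```python
import time
from typing import Any, Callable, Dict, Optional


class OmniKVTransferManager:
    """Sketch of the proposed manager (illustrative only)."""

    def __init__(self, connector_factory: Callable[[], Any],
                 max_retries: int = 3, recv_timeout_s: float = 5.0):
        # Hypothetical injection point for the OmniConnector backend.
        self._connector_factory = connector_factory
        self._connector: Optional[Any] = None
        self._max_retries = max_retries
        self._recv_timeout_s = recv_timeout_s

    def get_connector(self) -> Any:
        # Lazy initialization: the connector is built on first use only,
        # so runners that never transfer KV pay no startup cost.
        if self._connector is None:
            self._connector = self._connector_factory()
        return self._connector

    def send_kv_cache(self, request_id: str, kv_blocks: Dict[str, Any]) -> bool:
        # Send with bounded retry; a real implementation would likely
        # back off between attempts and log failures.
        for _attempt in range(self._max_retries):
            if self.get_connector().send(request_id, kv_blocks):
                return True
        return False

    def receive_kv_cache(self, request_id: str) -> Optional[Dict[str, Any]]:
        # Poll the connector until the KV arrives or the timeout elapses.
        deadline = time.monotonic() + self._recv_timeout_s
        while time.monotonic() < deadline:
            kv = self.get_connector().poll(request_id)
            if kv is not None:
                return kv
            time.sleep(0.01)
        return None
```

Keeping the connector behind `get_connector()` means both runners share one initialization path, which is the point of the refactor.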

Usage pattern:

  • AR runner calls handle_finished_requests_kv_transfer() after model execution
  • DiT runner calls receive_kv_cache() before model execution
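The delegation on the runner side might look like the sketch below. The runner class bodies and `execute_model` signatures are illustrative stand-ins, not the actual GPUARModelRunner/GPUDiffusionModelRunner interfaces.

```python
class GPUARModelRunnerSketch:
    """Sender side: runs the model, then hands finished KV to the manager."""

    def __init__(self, kv_manager):
        self.kv_manager = kv_manager

    def execute_model(self, finished_request_ids, kv_blocks_by_request):
        # ... autoregressive forward pass happens here ...
        # After execution, batch-process every finished request's KV.
        return self.kv_manager.handle_finished_requests_kv_transfer(
            finished_request_ids, kv_blocks_by_request)


class GPUDiffusionModelRunnerSketch:
    """Receiver side: blocks for KV before running the model."""

    def __init__(self, kv_manager):
        self.kv_manager = kv_manager

    def execute_model(self, request_id):
        # Before execution, wait until the AR side's KV has arrived.
        kv = self.kv_manager.receive_kv_cache(request_id)
        if kv is None:
            raise TimeoutError(f"KV for {request_id} not received in time")
        # ... diffusion forward pass consumes kv ...
        return kv
```

With this split, the runners keep only execution logic, which addresses the separation-of-concerns problem from section 1.1.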

3 Test Cases

| Category | Test Case | Purpose |
|---|---|---|
| Unit | test_connector_lazy_init | Verify lazy initialization |
| Unit | test_extract_kv_cache | Verify KV extraction |
| Unit | test_send_with_retry | Verify retry logic |
| Unit | test_receive_timeout | Verify timeout handling |
| Integration | test_ar_to_diffusion_transfer | End-to-end KV transfer |
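As one example, test_send_with_retry could be written against a fake connector that fails a fixed number of times. The `FlakyConnector` helper and the standalone `send_with_retry` function below are illustrative stand-ins for the manager's retry path.

```python
class FlakyConnector:
    """Fails the first `failures` sends, then succeeds (test double)."""

    def __init__(self, failures: int):
        self.failures = failures
        self.attempts = 0

    def send(self, request_id, kv_blocks):
        self.attempts += 1
        return self.attempts > self.failures


def send_with_retry(connector, request_id, kv_blocks, max_retries=3):
    # Minimal stand-in for the manager's bounded-retry loop.
    for _ in range(max_retries):
        if connector.send(request_id, kv_blocks):
            return True
    return False


def test_send_with_retry():
    conn = FlakyConnector(failures=2)
    assert send_with_retry(conn, "req-1", {}, max_retries=3)
    assert conn.attempts == 3  # two failures plus the final success


def test_send_exhausts_retries():
    conn = FlakyConnector(failures=5)
    assert not send_with_retry(conn, "req-1", {}, max_retries=3)
```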

4 Files Changed

| File | Change |
|---|---|
| vllm_omni/distributed/omni_connectors/omni_kv_cache_manager.py | NEW |
| vllm_omni/worker/gpu_ar_model_runner.py | MODIFY |
| vllm_omni/diffusion/worker/gpu_diffusion_model_runner.py | MODIFY |

Last Updated: Jan 25, 2026
Author: Wang Zhipeng

Proposed Change.

rfc.md

Feedback Period.

No response

CC List.

@hsliuustc0106 @ZJY0516 @tzhouam @natureofnature @Gaohan123

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
