Merged
4 changes: 2 additions & 2 deletions README.md
@@ -54,7 +54,7 @@ Built in Rust for performance and Python for extensibility, Dynamo is fully open
 | [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
 | [**KV-Aware Routing**](docs/router/kv_cache_routing.md) | ✅ | ✅ | ✅ |
 | [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
-| [**KVBM**](docs/kvbm/kvbm_architecture.md) | 🚧 | ✅ | ✅ |
+| [**KVBM**](docs/kvbm/README.md) | 🚧 | ✅ | ✅ |
 | [**Multimodal**](docs/multimodal/index.md) | ✅ | ✅ | ✅ |
 | [**Tool Calling**](docs/agents/tool-calling.md) | ✅ | ✅ | ✅ |

@@ -390,7 +390,7 @@ See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LL
 [disagg]: docs/design_docs/disagg_serving.md
 [kv-routing]: docs/router/kv_cache_routing.md
 [planner]: docs/planner/sla_planner.md
-[kvbm]: docs/kvbm/kvbm_architecture.md
+[kvbm]: docs/kvbm/README.md
 [mm]: examples/multimodal/
 [migration]: docs/fault_tolerance/request_migration.md
 [lora]: examples/backends/vllm/deploy/lora/README.md
2 changes: 1 addition & 1 deletion docs/backends/sglang/README.md
@@ -39,7 +39,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | [**KV-Aware Routing**](../../router/kv_cache_routing.md) | ✅ | |
 | [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ | |
 | [**Multimodal Support**](../../multimodal/sglang.md) | ✅ | |
-| [**KVBM**](../../kvbm/kvbm_architecture.md) | ❌ | Planned |
+| [**KVBM**](../../kvbm/README.md) | ❌ | Planned |


 ## Dynamo SGLang Integration
4 changes: 2 additions & 2 deletions docs/backends/trtllm/README.md
@@ -57,7 +57,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ | |
 | [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
 | [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
-| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | |
+| [**KVBM**](../../../docs/kvbm/README.md) | ✅ | |

 ### Large Scale P/D and WideEP Features

@@ -297,7 +297,7 @@ For detailed instructions on running comprehensive performance sweeps across bot

 Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.

-Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/trtllm-setup.md) .
+Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .

 ## Known Issues and Mitigations

4 changes: 2 additions & 2 deletions docs/backends/vllm/README.md
@@ -40,8 +40,8 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ | |
 | [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
 | [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP |
-| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | |
-| [**LMCache**](./LMCache_Integration.md) | ✅ | |
+| [**KVBM**](../../../docs/kvbm/README.md) | ✅ | |
+| [**LMCache**](../../integrations/lmcache_integration.md) | ✅ | |
 | [**Prompt Embeddings**](./prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag |

 ### Large Scale P/D and WideEP Features
6 changes: 3 additions & 3 deletions docs/backends/vllm/prometheus.md
@@ -11,7 +11,7 @@ When running vLLM through Dynamo, vLLM engine metrics are automatically passed t

 **For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).

-**For LMCache metrics and integration**, see the [LMCache Integration Guide](LMCache_Integration.md).
+**For LMCache metrics and integration**, see the [LMCache Integration Guide](../../integrations/lmcache_integration.md).

 **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).

@@ -133,10 +133,10 @@ curl -s localhost:8081/metrics | grep "^lmcache:"

 Troubleshooting LMCache-related metrics and logs (including `PrometheusLogger instance already created with different metadata` and `PROMETHEUS_MULTIPROC_DIR` warnings) is documented in:

-- [LMCache Integration Guide](LMCache_Integration.md#troubleshooting)
+- [LMCache Integration Guide](../../integrations/lmcache_integration.md#troubleshooting)

 **For complete LMCache configuration and metric details**, see:
-- [LMCache Integration Guide](LMCache_Integration.md) - Setup and configuration
+- [LMCache Integration Guide](../../integrations/lmcache_integration.md) - Setup and configuration
 - [LMCache Observability Documentation](https://docs.lmcache.ai/production/observability/vllm_endpoint.html) - Complete metrics reference

 ## Implementation Details
3 changes: 0 additions & 3 deletions docs/hidden_toctree.rst
@@ -37,8 +37,6 @@
 kubernetes/README.md
 reference/cli.md
 observability/metrics.md
-kvbm/vllm-setup.md
-kvbm/trtllm-setup.md
 agents/tool-calling.md
 development/jail_stream.md

@@ -77,7 +75,6 @@

 backends/vllm/deepseek-r1.md
 backends/vllm/gpt-oss.md
-backends/vllm/LMCache_Integration.md
 backends/vllm/multi-node.md
 backends/vllm/prometheus.md
 backends/vllm/prompt-embeddings.md
1 change: 1 addition & 0 deletions docs/index.rst
@@ -58,6 +58,7 @@ Quickstart
 :hidden:
 :caption: User Guides

+KV Cache Offloading <kvbm/kvbm_guide.md>
 Tool Calling <agents/tool-calling.md>
 Multimodality Support <multimodal/index.md>
 Finding Best Initial Configs <performance/aiconfigurator.md>
219 changes: 219 additions & 0 deletions docs/integrations/flexkv_integration.md
@@ -0,0 +1,219 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# FlexKV Integration in Dynamo

## Introduction

[FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed runtime for KV cache offloading developed by Tencent Cloud's TACO team in collaboration with the community. It acts as a unified KV caching layer for inference engines like vLLM, TensorRT-LLM, and SGLang.

### Key Features

- **Multi-level caching**: offloads KV cache across CPU memory, local SSD, and scalable cloud storage
- **Distributed KV cache reuse**: Share KV cache across multiple nodes using distributed RadixTree
- **High-performance I/O**: Supports io_uring and GPU Direct Storage (GDS) for accelerated data transfer
- **Asynchronous operations**: Get and put operations can overlap with computation through prefetching


## Prerequisites

1. **Dynamo installed** with vLLM support
2. **Infrastructure services running**:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
3. **FlexKV dependencies** (for SSD offloading):
```bash
apt install liburing-dev libxxhash-dev
```

## Quick Start

### Enable FlexKV

Set the `DYNAMO_USE_FLEXKV` environment variable and use the `--connector flexkv` flag:

```bash
export DYNAMO_USE_FLEXKV=1
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector flexkv
```

## Aggregated Serving

### Basic Setup

```bash
# Terminal 1: Start frontend
python -m dynamo.frontend &

# Terminal 2: Start vLLM worker with FlexKV
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector flexkv
```

### With KV-Aware Routing

For multi-worker deployments with KV-aware routing to maximize cache reuse:

```bash
# Terminal 1: Start frontend with KV router
python -m dynamo.frontend \
--router-mode kv \
--router-reset-states &

# Terminal 2: Worker 1
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_0" \
CUDA_VISIBLE_DEVICES=0 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--connector flexkv \
--gpu-memory-utilization 0.2 \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080","enable_kv_cache_events":true}' &

# Terminal 3: Worker 2
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_1" \
CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--connector flexkv \
--gpu-memory-utilization 0.2 \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
```

## Disaggregated Serving

FlexKV can be used with disaggregated prefill/decode serving. The prefill worker uses FlexKV for KV cache offloading, while NIXL handles KV transfer between prefill and decode workers.

```bash
# Terminal 1: Start frontend
python -m dynamo.frontend &

# Terminal 2: Decode worker (without FlexKV)
CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector nixl &

# Terminal 3: Prefill worker (with FlexKV)
DYN_VLLM_KV_EVENT_PORT=20081 \
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--is-prefill-worker \
--connector nixl flexkv
```

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DYNAMO_USE_FLEXKV` | Enable FlexKV integration | `0` (disabled) |
| `FLEXKV_CPU_CACHE_GB` | CPU memory cache size in GB | Required |
| `FLEXKV_CONFIG_PATH` | Path to FlexKV YAML config file | Not set |
| `FLEXKV_SERVER_RECV_PORT` | IPC port for FlexKV server | Auto |

### CPU-Only Offloading

For simple CPU memory offloading:

```bash
unset FLEXKV_CONFIG_PATH
export FLEXKV_CPU_CACHE_GB=32
```

### CPU + SSD Tiered Offloading

For multi-tier offloading with SSD storage, create a configuration file:

```bash
cat > ./flexkv_config.yml <<EOF
cpu_cache_gb: 32
ssd_cache_gb: 1024
ssd_cache_dir: /data0/flexkv_ssd/;/data1/flexkv_ssd/
enable_gds: false
EOF

export FLEXKV_CONFIG_PATH="./flexkv_config.yml"
```

### Configuration Options

| Option | Description |
|--------|-------------|
| `cpu_cache_gb` | CPU memory cache size in GB |
| `ssd_cache_gb` | SSD cache size in GB |
| `ssd_cache_dir` | SSD cache directories (semicolon-separated for multiple SSDs) |
| `enable_gds` | Enable GPU Direct Storage for SSD I/O |

> **Note:** For full configuration options, see the [FlexKV Configuration Reference](https://github.com/taco-project/FlexKV/blob/main/docs/flexkv_config_reference/README_en.md).

## Distributed KV Cache Reuse

FlexKV supports distributed KV cache reuse, sharing cache across multiple nodes. This is built on:

- **Distributed RadixTree**: Each node maintains a local snapshot of the global index
- **Lease Mechanism**: Ensures data validity during cross-node transfers
- **RDMA-based Transfer**: Uses Mooncake Transfer Engine for high-performance KV cache transfer

For setup instructions, see the [FlexKV Distributed Reuse Guide](https://github.com/taco-project/FlexKV/blob/main/docs/dist_reuse/README_en.md).
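
The lease mechanism above can be illustrated with a toy sketch (hypothetical names; FlexKV's actual protocol is described in the guide linked above): a remote block may only be read while its lease is unexpired, so the owning node will not evict or rewrite it mid-transfer.

```python
# Toy lease sketch (purely illustrative, not FlexKV's implementation):
# a transfer is allowed to start only while the lease is still valid.
import time

class Lease:
    def __init__(self, block_key, ttl_s):
        self.block_key = block_key
        self.expires_at = time.monotonic() + ttl_s

    def valid(self):
        return time.monotonic() < self.expires_at

lease = Lease("blk-42", ttl_s=0.05)
print(lease.valid())   # -> True: safe to begin the cross-node transfer
time.sleep(0.06)
print(lease.valid())   # -> False: lease expired; re-acquire before reading
```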

## Architecture

FlexKV consists of three core modules:

### StorageEngine

Initializes the three-level cache (GPU → CPU → SSD/Cloud). It groups multiple tokens into blocks and stores KV cache at the block level, maintaining the same KV shape as in GPU memory.
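
Block-level keying can be illustrated with a short sketch. The chained-hash scheme below is an assumption borrowed from common prefix-caching designs, not FlexKV's documented algorithm: tokens are chunked into fixed-size blocks, and each block key folds in the previous block's key, so identical prefixes yield identical keys across requests.

```python
# Hedged sketch of block-level keying (illustrative only; FlexKV's real
# hashing is not shown here). Trailing tokens that don't fill a block
# are ignored, mirroring block-granular storage.
import hashlib

BLOCK_SIZE = 4  # real deployments typically use larger blocks

def block_keys(token_ids):
    keys, prev = [], b""
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, usable, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        # Chain the previous key so a block's key encodes its full prefix.
        digest = hashlib.sha256(prev + str(block).encode()).hexdigest()[:16]
        keys.append(digest)
        prev = digest.encode()
    return keys

a = block_keys([1, 2, 3, 4, 5, 6, 7, 8, 9])  # last partial block dropped
b = block_keys([1, 2, 3, 4, 5, 6, 7, 8])
print(a == b)  # -> True: a shared prefix maps to identical block keys
```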

### GlobalCacheEngine

The control plane that determines data transfer direction and identifies source/destination block IDs. Includes:
- RadixTree for prefix matching
- Memory pool to track space usage and trigger eviction
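
A minimal prefix-matching sketch in the spirit of the RadixTree described above (hypothetical, not FlexKV's implementation): a trie over block keys answers "how many leading blocks of this request are already cached?", which is exactly what the control plane needs to decide what to transfer.

```python
# Toy prefix index over block keys (illustrative only). insert() records
# a cached block sequence; longest_prefix() counts reusable leading blocks.
class PrefixIndex:
    def __init__(self):
        self.root = {}

    def insert(self, block_keys):
        node = self.root
        for key in block_keys:
            node = node.setdefault(key, {})

    def longest_prefix(self, block_keys):
        node, matched = self.root, 0
        for key in block_keys:
            if key not in node:
                break
            node, matched = node[key], matched + 1
        return matched

index = PrefixIndex()
index.insert(["b0", "b1", "b2"])
print(index.longest_prefix(["b0", "b1", "b9"]))  # -> 2 blocks reusable
```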

### TransferEngine

The data plane that executes data transfers:
- Multi-threading for parallel transfers
- High-performance I/O (io_uring, GDS)
- Asynchronous operations overlapping with computation
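
The overlap of transfer and computation can be sketched with standard Python threading (hypothetical stand-in functions, not FlexKV's API): the fetch is issued asynchronously, other work proceeds while it runs, and the result is awaited only when the blocks are needed.

```python
# Illustrative overlap of KV transfer with compute (not FlexKV's API).
# fetch_blocks() stands in for an SSD/CPU -> GPU transfer; the sleep
# calls model I/O and compute latency.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_blocks(keys):
    time.sleep(0.05)
    return {k: f"kv[{k}]" for k in keys}

def compute_other_work():
    time.sleep(0.05)

with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    future = pool.submit(fetch_blocks, ["b0", "b1"])  # async "get"
    compute_other_work()                              # overlaps the fetch
    blocks = future.result()                          # block only when needed
    elapsed = time.perf_counter() - start

print(sorted(blocks))  # -> ['b0', 'b1']
# elapsed is ~0.05s rather than ~0.1s because fetch and compute overlapped
```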

## Verify Deployment

```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false,
"max_tokens": 30
}'
```

## See Also

- [FlexKV GitHub Repository](https://github.com/taco-project/FlexKV)
- [FlexKV vLLM Adapter Documentation](https://github.com/taco-project/FlexKV/blob/main/docs/vllm_adapter/README_en.md)
