Merged
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -1,4 +1,4 @@
# Contributing to Mooncake
# Contributing to FlexKV

Thank you for your interest in contributing to FlexKV!

8 changes: 8 additions & 0 deletions README.md
@@ -8,10 +8,18 @@ FlexKV is released under the **Apache-2.0 License**. See the [LICENSE](LICENSE)

## How to Use

### Install Dependencies

```bash
apt install liburing-dev
apt install libxxhash-dev
```

### Build FlexKV

```bash
./build.sh
# Use ./build.sh --release to build the Cython package
```

### Use FlexKV with vLLM
8 changes: 8 additions & 0 deletions README_zh.md
@@ -8,10 +8,18 @@ FlexKV is released under the **Apache-2.0 License**. For details, see [LICENSE](LICE

## How to Use

### Install Dependencies

```bash
apt install liburing-dev
apt install libxxhash-dev
```

### Build FlexKV

```bash
./build.sh
# Use ./build.sh --release to build the Cython package
```

### Using FlexKV with vLLM (Example)
4 changes: 3 additions & 1 deletion docs/dynamo_integration/README_en.md
@@ -39,7 +39,7 @@ git apply /your/path/to/FlexKV/examples/vllm_adaption/vllm_0_10_1_1-flexkv-conne

### FlexKV Verification

Please refer to the test scripts in [vLLM online serving](https://github.com/taco-project/FlexKV/blob/dev/docs/vllm_adapter/README_zh.md#%E7%A4%BA%E4%BE%8B).
Please refer to the test scripts in [vLLM online serving](../../docs/vllm_adapter/README_zh.md#%E7%A4%BA%E4%BE%8B).

## 2. Dynamo Modifications

@@ -123,6 +123,8 @@ for i in $(seq 0 $((NUM_WORKERS-1))); do
done
```

> Note: The `flexkv_config.json` configuration is provided as a simple example only. For full parameter options, please refer to [`docs/flexkv_config_reference/README_en.md`](../../docs/flexkv_config_reference/README_en.md)

### Verification

You can verify that the Dynamo service has started correctly with the following command:
4 changes: 3 additions & 1 deletion docs/dynamo_integration/README_zh.md
@@ -39,7 +39,7 @@ git apply /your/path/to/FlexKV/examples/vllm_adaption/vllm_0_10_1_1-flexkv-conne

### FlexKV Verification

Please refer to the test scripts in [vLLM online serving](https://github.com/taco-project/FlexKV/blob/dev/docs/vllm_adapter/README_zh.md#%E7%A4%BA%E4%BE%8B).
Please refer to the test scripts in [vLLM online serving](../../docs/vllm_adapter/README_zh.md#%E7%A4%BA%E4%BE%8B).


## 2. Dynamo Configuration Changes
@@ -124,6 +124,8 @@ for i in $(seq 0 $((NUM_WORKERS-1))); do
done
```

> Note: The `flexkv_config.json` configuration is only a simple example. For full parameter options, please refer to [`docs/flexkv_config_reference/README_zh.md`](../../docs/flexkv_config_reference/README_zh.md)

### Verification

You can verify that the Dynamo service has started correctly with the following command:
147 changes: 147 additions & 0 deletions docs/flexkv_config_reference/README_en.md
@@ -0,0 +1,147 @@
# FlexKV Configuration Guide

This guide explains how to configure and use the FlexKV online serving configuration file (`flexkv_config.json`), including the meaning of all parameters, recommended values, and typical usage scenarios.

---

## Recommended Configuration

Below is a production-grade recommended configuration that balances performance and stability:

```json
{
"enable_flexkv": true,
"server_recv_port": "ipc:///tmp/flexkv_test",
"cache_config": {
"enable_cpu": true,
"enable_ssd": true,
"enable_remote": false,
"use_gds": false,
"enable_trace": false,
"ssd_cache_iouring_entries": 512,
"tokens_per_block": 64,
"num_cpu_blocks": 233000,
"num_ssd_blocks": 4096000,
"ssd_cache_dir": "/data/flexkv_ssd/",
"evict_ratio": 0.05,
"index_accel": true
},
"num_log_interval_requests": 2000
}
```
- `num_cpu_blocks` and `num_ssd_blocks` represent the total number of blocks in CPU memory and SSD respectively. These values must be configured according to your machine specs and model size. See [Cache Capacity Configuration](#cache-capacity-config) for calculation details.
- `ssd_cache_dir` specifies the directory where SSD-stored KV cache files are saved.

---

## Configuration File Structure Overview

The FlexKV configuration file is a JSON file, primarily consisting of four parts:

- `enable_flexkv`: Whether to enable FlexKV (must be set to `true` to take effect).
- `server_recv_port`: The IPC port on which the FlexKV service listens.
- `cache_config`: The core cache configuration object, containing all cache behavior parameters.
- `num_log_interval_requests`: Log statistics interval (outputs performance log every N requests).

---

## Complete `cache_config` Parameter Reference (from [`flexkv/common/config.py`](../../flexkv/common/config.py))

### Basic Configuration

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `tokens_per_block` | int | 16 | Number of tokens per KV block. Must match the `block_size` used in the acceleration framework (e.g., vLLM). |
| `enable_cpu` | bool | true | Whether to enable CPU memory as a cache layer. Strongly recommended to enable. |
| `enable_ssd` | bool | false | Whether to enable SSD as a cache layer. Recommended if NVMe SSD is available. |
| `enable_remote` | bool | false | Whether to enable remote cache (e.g., scalable cloud storage). Requires remote cache engine and custom implementation. |
| `use_gds` | bool | false | Whether to use GPU Direct Storage (GDS) to accelerate SSD I/O. Not currently supported. |
| `index_accel` | bool | false | Whether to enable C++ RadixTree. Recommended to enable. |

---

### KV Cache Layout Types (Generally No Need to Modify)

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `gpu_kv_layout_type` | enum | LAYERWISE | Organization of KV cache on GPU (layer-wise or block-wise). Must match vLLM’s layout (currently `LAYERWISE`). |
| `cpu_kv_layout_type` | enum | BLOCKWISE | Organization on CPU. Recommended to use `BLOCKWISE`. Does not need to match vLLM. |
| `ssd_kv_layout_type` | enum | BLOCKWISE | Organization on SSD. Recommended to use `BLOCKWISE`. Does not need to match vLLM. |
| `remote_kv_layout_type` | enum | BLOCKWISE | Organization for remote cache. Must be defined according to remote backend’s layout. |

> Note: Do not modify layout types unless you have specific performance requirements.

---

### Cache Capacity Configuration <a id="cache-capacity-config"></a>

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `num_cpu_blocks` | int | 1000000 | Number of blocks allocated in CPU memory. Adjust based on available RAM. |
| `num_ssd_blocks` | int | 10000000 | Number of blocks allocated on SSD. |
| `num_remote_blocks` | int \| None | None | Number of blocks allocated in remote cache. |

> Note: Block size in all cache levels (CPU/SSD/Remote) matches the GPU block size. Estimate cache capacities based on GPU KV cache memory usage and block count.

> Note: `block_size = num_layers * kv_dim * tokens_per_block * num_heads * head_size * dtype_size`, where `kv_dim` is 2 (one plane each for K and V).
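As a sanity check on this formula, here is a quick back-of-the-envelope calculation. The model dimensions below are illustrative assumptions, not values taken from FlexKV or any particular model:

```python
# Estimate the per-block size and a cache block budget.
# All model dimensions here are illustrative assumptions.
num_layers = 64          # transformer layers
kv_dim = 2               # one plane each for K and V
tokens_per_block = 64    # must match the framework's block_size
num_heads = 8            # KV heads (with GQA, this is the KV-head count)
head_size = 128
dtype_size = 2           # bytes per element, e.g. fp16/bf16

block_size = num_layers * kv_dim * tokens_per_block * num_heads * head_size * dtype_size
print(block_size)        # 16777216 bytes = 16 MiB per block

# Derive num_cpu_blocks from a hypothetical RAM budget for the CPU cache layer:
cpu_budget_bytes = 500 * 1024**3   # assume 500 GiB reserved for the CPU cache
num_cpu_blocks = cpu_budget_bytes // block_size
print(num_cpu_blocks)    # 32000
```

The same arithmetic, applied against your SSD capacity, gives a starting point for `num_ssd_blocks`.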

---

### CPU-GPU Transfer Optimization

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `use_ce_transfer_h2d` | bool | false | Whether to use CUDA Copy Engine for Host→Device transfers. Reduces SM usage but may slightly reduce bandwidth. Real-world difference is minimal. |
| `use_ce_transfer_d2h` | bool | false | Whether to use CUDA Copy Engine for Device→Host transfers. |
| `transfer_sms_h2d` | int | 8 | Number of SMs (Streaming Multiprocessors) allocated for H2D transfers. |
| `transfer_sms_d2h` | int | 8 | Number of SMs allocated for D2H transfers. |

---

### SSD Cache Configuration

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `max_blocks_per_file` | int | 32000 | Maximum number of blocks per SSD file. `-1` means unlimited. |
| `ssd_cache_dir` | str \| List[str] | None | **Required.** Path to SSD cache directory, e.g., `"/data/flexkv_ssd/"`. |
| `ssd_cache_iouring_entries` | int | 0 | io_uring queue depth. Recommended: `512` for significantly improved concurrent I/O performance. |
| `ssd_cache_iouring_flags` | int | 0 | io_uring flags. Keep as `0` in most cases. |

> Note: To maximize bandwidth across multiple SSDs, bind each SSD to a separate directory and specify them as a list:
> `"ssd_cache_dir": ["/data0/flexkv_ssd/", "/data1/flexkv_ssd/"]`.
> KV blocks will be evenly distributed across all SSDs.

> Note: Setting `ssd_cache_iouring_entries` to `0` disables io_uring. Not recommended.
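Putting the two notes above together, a `cache_config` fragment for a host with two NVMe drives might look like the following (paths and block count are placeholders to adapt to your hardware):

```json
{
  "cache_config": {
    "enable_ssd": true,
    "ssd_cache_dir": ["/data0/flexkv_ssd/", "/data1/flexkv_ssd/"],
    "ssd_cache_iouring_entries": 512,
    "num_ssd_blocks": 4096000
  }
}
```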

---

### Remote Cache Configuration (Skip if not enabled)

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `remote_cache_size_mode` | str | "file_size" | Allocate remote cache space by file size or block count. |
| `remote_file_size` | int \| None | None | Size (in bytes) of each remote file. |
| `remote_file_num` | int \| None | None | Number of remote files. |
| `remote_file_prefix` | str \| None | None | Prefix for remote file names. |
| `remote_cache_path` | str \| List[str] | None | Remote cache path (e.g., Redis URL, S3 path). |
| `remote_config_custom` | dict \| None | None | Custom remote cache configurations (e.g., timeout, authentication). |

---

### Tracing and Logging

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `enable_trace` | bool | true | Whether to enable performance tracing. Disable (`false`) in production to reduce overhead. |
| `trace_file_path` | str | "./flexkv_trace.log" | Path to trace log file. |
| `trace_max_file_size_mb` | int | 100 | Maximum size (MB) per trace log file. |
| `trace_max_files` | int | 5 | Maximum number of trace log files to retain. |
| `trace_flush_interval_ms` | int | 1000 | Trace log flush interval (milliseconds). |

---

### Cache Eviction Policy

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `evict_ratio` | float | 0.0 | Ratio of blocks to proactively evict from CPU/SSD per eviction cycle. `0.0` = evict only the minimal necessary blocks (more eviction cycles may impact performance). Recommended: `0.05` (evict 5% of least recently used blocks per cycle). |
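To illustrate the trade-off, here is a small sketch of how a non-zero `evict_ratio` turns many single-block eviction cycles into fewer batched ones. This is not FlexKV's actual eviction code; the names and structure are invented for illustration:

```python
# Sketch: batched LRU eviction driven by an evict_ratio parameter.
# Illustrative only; not FlexKV's implementation.
import math
from collections import OrderedDict

def evict(lru: "OrderedDict[int, bytes]", needed: int, evict_ratio: float) -> int:
    """Evict at least `needed` blocks. With evict_ratio > 0, evict a whole
    batch of ratio * capacity LRU blocks so the next few cache misses do
    not each trigger another eviction cycle."""
    batch = max(needed, math.ceil(evict_ratio * len(lru)))
    batch = min(batch, len(lru))
    for _ in range(batch):
        lru.popitem(last=False)  # drop the least recently used block
    return batch

cache = OrderedDict((i, b"") for i in range(1000))
evicted = evict(cache, needed=1, evict_ratio=0.05)
print(evicted)  # 50: one cycle frees 5% of capacity instead of a single block
```

With `evict_ratio = 0.0` the same call would free exactly one block, so a burst of misses would pay the eviction-cycle cost once per block.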
145 changes: 145 additions & 0 deletions docs/flexkv_config_reference/README_zh.md
@@ -0,0 +1,145 @@
# FlexKV Configuration Guide

This guide explains how to configure and use the FlexKV online serving configuration file (`flexkv_config.json`), covering the meaning of all parameters, recommended values, and typical usage scenarios.

---

## Recommended Configuration

Below is a production-grade recommended configuration that balances performance and stability:

```json
{
"enable_flexkv": true,
"server_recv_port": "ipc:///tmp/flexkv_test",
"cache_config": {
"enable_cpu": true,
"enable_ssd": true,
"enable_remote": false,
"use_gds": false,
"enable_trace": false,
"ssd_cache_iouring_entries": 512,
"tokens_per_block": 64,
"num_cpu_blocks": 233000,
"num_ssd_blocks": 4096000,
"ssd_cache_dir": "/data/flexkv_ssd/",
"evict_ratio": 0.05,
"index_accel": true
},
"num_log_interval_requests": 2000
}
```
- `num_cpu_blocks` and `num_ssd_blocks` are the total numbers of blocks in CPU memory and on SSD, respectively. Configure them according to your machine specs and model size; see [Cache Capacity Configuration](#cache-capacity-config) below for the calculation.
- `ssd_cache_dir` is the directory where the SSD-resident KV cache files are stored.

---

## Configuration File Structure Overview

The FlexKV configuration file is a JSON file, primarily consisting of four parts:

- `enable_flexkv`: Whether to enable FlexKV (must be set to `true` to take effect)
- `server_recv_port`: The IPC port on which the FlexKV service listens
- `cache_config`: The core cache configuration object, containing all cache behavior parameters
- `num_log_interval_requests`: Log statistics interval (a performance log is emitted every N requests)

---

## Complete `cache_config` Parameter Reference (from [`flexkv/common/config.py`](../../flexkv/common/config.py))

### Basic Configuration

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `tokens_per_block` | int | 16 | Number of tokens per KV block. Must match the `block_size` used in the acceleration framework (e.g., vLLM). |
| `enable_cpu` | bool | true | Whether to enable CPU memory as a cache layer. Strongly recommended to enable. |
| `enable_ssd` | bool | false | Whether to enable SSD as a cache layer. Recommended if an NVMe SSD is available. |
| `enable_remote` | bool | false | Whether to enable a remote cache (e.g., scalable cloud storage). Requires a remote cache and a custom remote cache engine. |
| `use_gds` | bool | false | Whether to use GPU Direct Storage (GDS) to accelerate SSD I/O. Not currently supported. |
| `index_accel` | bool | false | Whether to enable the C++ RadixTree index. Recommended to enable. |

---

### KV Cache Layout Types (Generally No Need to Modify)

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `gpu_kv_layout_type` | enum | LAYERWISE | How the KV cache is organized on the GPU (layer-wise or block-wise). vLLM currently organizes the GPU KV cache as `LAYERWISE`, so FlexKV's `gpu_kv_layout_type` must match vLLM. |
| `cpu_kv_layout_type` | enum | BLOCKWISE | Block-wise organization on the CPU. `BLOCKWISE` is recommended; does not need to match vLLM. |
| `ssd_kv_layout_type` | enum | BLOCKWISE | Block-wise organization on SSD. `BLOCKWISE` is recommended; does not need to match vLLM. |
| `remote_kv_layout_type` | enum | BLOCKWISE | Block-wise organization for the remote cache. Must be defined according to the remote backend's layout. |

> Note: Do not modify the layout types unless you have specific performance requirements.

---

### Cache Capacity Configuration <a id="cache-capacity-config"></a>

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `num_cpu_blocks` | int | 1000000 | Number of cache blocks in CPU memory. Adjust based on available RAM. |
| `num_ssd_blocks` | int | 10000000 | Number of cache blocks on SSD. |
| `num_remote_blocks` | int \| None | None | Number of cache blocks in the remote cache. |

> Note: The block size in every FlexKV cache level matches the GPU block size, so you can estimate the block count for each level from the GPU KV cache memory usage and block count.

> Note: `block_size = num_layers * kv_dim * tokens_per_block * num_heads * head_size * dtype_size`, where `kv_dim` is 2 (one plane each for K and V).

---

### CPU-GPU Transfer Optimization

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `use_ce_transfer_h2d` | bool | false | Whether to use the CUDA copy engine for Host→Device transfers. The copy engine reduces GPU SM usage for transfers at a slight cost in transfer speed; in practice the difference is small. |
| `use_ce_transfer_d2h` | bool | false | Whether to use the CUDA copy engine for Device→Host transfers. |
| `transfer_sms_h2d` | int | 8 | Number of SMs (streaming multiprocessors) used for H2D transfers. |
| `transfer_sms_d2h` | int | 8 | Number of SMs used for D2H transfers. |

---

### SSD Cache Configuration

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `max_blocks_per_file` | int | 32000 | Maximum number of blocks per SSD file. `-1` means unlimited. |
| `ssd_cache_dir` | str \| List[str] | None | **Required.** SSD cache directory path, e.g., `"/data/flexkv_ssd/"`. |
| `ssd_cache_iouring_entries` | int | 0 | io_uring queue depth. `512` is recommended to improve concurrent I/O performance; in our tests, this is substantially faster than not using io_uring. |
| `ssd_cache_iouring_flags` | int | 0 | io_uring flags. Keep as `0` in most cases. |

> Note: To fully utilize the bandwidth of multiple SSDs, bind each SSD to a separate directory and initialize with a list, e.g., `"ssd_cache_dir": ["/data0/flexkv_ssd/", "/data1/flexkv_ssd/"]`. KV cache blocks are distributed evenly across all SSDs.

> Note: Setting `ssd_cache_iouring_entries` to `0` disables io_uring; this is not recommended.

---

### Remote Cache Configuration (Skip if not enabled)

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `remote_cache_size_mode` | str | "file_size" | Allocate remote cache space by file size or by block count. |
| `remote_file_size` | int \| None | None | Size (in bytes) of each remote file. |
| `remote_file_num` | int \| None | None | Number of remote files. |
| `remote_file_prefix` | str \| None | None | Prefix for remote file names. |
| `remote_cache_path` | str \| List[str] | None | Remote cache path (e.g., Redis URL, S3 path). |
| `remote_config_custom` | dict \| None | None | Custom remote cache settings (e.g., timeouts, authentication). |

---

### Tracing and Logging

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `enable_trace` | bool | true | Whether to enable performance tracing. Disable (`false`) in production to reduce overhead. |
| `trace_file_path` | str | "./flexkv_trace.log" | Path to the trace log file. |
| `trace_max_file_size_mb` | int | 100 | Maximum size (MB) per trace file. |
| `trace_max_files` | int | 5 | Maximum number of trace files to retain. |
| `trace_flush_interval_ms` | int | 1000 | Trace log flush interval (milliseconds). |

---

### Cache Eviction Policy

| Parameter Name | Type | Default | Description |
|----------------|------|---------|-------------|
| `evict_ratio` | float | 0.0 | Fraction of blocks proactively evicted from CPU/SSD per eviction cycle. `0.0` evicts only the minimum necessary blocks, which triggers more eviction cycles and can hurt performance. `0.05` is recommended, i.e., evict the 5% least recently used blocks per cycle. |
2 changes: 2 additions & 0 deletions docs/vllm_adapter/README_en.md
@@ -63,6 +63,8 @@ VLLM_USE_V1=1 python -m vllm.entrypoints.cli.main serve Qwen3/Qwen3-32B \

```

> Note: The `flexkv_config.json` configuration is provided as a simple example only. For full parameter options, please refer to [`docs/flexkv_config_reference/README_en.md`](../../docs/flexkv_config_reference/README_en.md)

## Legacy Version (<= 0.1.0) – Not Recommended for Current Use

### Supported Versions
2 changes: 2 additions & 0 deletions docs/vllm_adapter/README_zh.md
@@ -62,6 +62,8 @@ VLLM_USE_V1=1 python -m vllm.entrypoints.cli.main serve Qwen3/Qwen3-32B \

```

> Note: The `flexkv_config.json` configuration is only a simple example. For full parameter options, please refer to [`docs/flexkv_config_reference/README_zh.md`](../../docs/flexkv_config_reference/README_zh.md)

## Legacy Version (<= 0.1.0) – Not Recommended for Current Use

### Supported Versions