diff --git a/README_zh.md b/README_zh.md
index 0618a83220..1654ff17ae 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -8,16 +8,28 @@ FlexKV 采用 **Apache-2.0 开源协议**,详细信息请参见 [LICENSE](LICE
## 如何使用
+### 安装依赖
+
+```bash
+apt install liburing-dev
+apt install libxxhash-dev
+```
+
### 编译 FlexKV
```bash
./build.sh
+# ./build.sh --release   # builds the Cython package
```
### 以 vLLM 为例使用 FlexKV
见[docs/vllm_adapter/README_zh.md](docs/vllm_adapter/README_zh.md)
+### FlexKV和Dynamo框架的集成
+
+见[docs/dynamo_integration/README_zh.md](docs/dynamo_integration/README_zh.md)
+
## 设计框架
diff --git a/docs/dynamo_integration/README_en.md b/docs/dynamo_integration/README_en.md
new file mode 100644
index 0000000000..6f3988e23e
--- /dev/null
+++ b/docs/dynamo_integration/README_en.md
@@ -0,0 +1,155 @@
+# FlexKV and Dynamo Integration Guide
+
+This document demonstrates how to integrate FlexKV with NVIDIA's [Dynamo](https://github.com/ai-dynamo/dynamo) framework and how to run performance tests.
+
+Dynamo is a framework designed by NVIDIA for large-scale disaggregated deployment, supporting multiple backend engines including TensorRT-LLM, vLLM, and SGLang. Its KV Router is an intelligent request-routing component that tracks and manages the KV caches stored on different workers. Based on the overlap between a request and each worker's cached KV blocks, as well as the current worker load, it assigns the request to the most suitable worker, reducing expensive KV cache recomputation and improving inference efficiency. This document also explains how to integrate FlexKV into Dynamo when the KV Router is enabled.
+
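As a rough illustration of the overlap-and-load routing described above (a minimal sketch only; the `route` function, its inputs, and the scoring are assumptions, not Dynamo's actual KV Router implementation):

```python
# Hypothetical sketch of KV-aware routing: pick the worker whose cached
# blocks overlap most with the request, penalized by its current load.
# The scoring weights and inputs are assumptions, not Dynamo's real logic.
def route(request_blocks: set, workers: dict) -> str:
    """workers maps worker_id -> {"cached_blocks": set, "load": int}."""
    def score(worker_id: str) -> int:
        info = workers[worker_id]
        overlap = len(request_blocks & info["cached_blocks"])  # reusable blocks
        return overlap - info["load"]  # higher overlap and lower load win
    return max(workers, key=score)

workers = {
    "worker0": {"cached_blocks": {1, 2, 3, 4}, "load": 2},
    "worker1": {"cached_blocks": {1, 2}, "load": 0},
}
# worker0 scores 3 - 2 = 1, worker1 scores 2 - 0 = 2
print(route({1, 2, 3}, workers))  # -> worker1
```
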
+## 1. Environment Setup
+
+### Dynamo Image
+
+We use the Dynamo 0.4.1 image with the vLLM backend, which includes vLLM 0.10.1.1.
+
+```bash
+docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
+```
+
+### FlexKV Code Preparation
+
+```bash
+git clone https://github.com/taco-project/FlexKV
+```
+
+### Install FlexKV
+
+```bash
+apt update && apt install liburing-dev
+
+cd FlexKV && ./build.sh
+```
+
+### vLLM Apply Patch
+
+```bash
+# Navigate to vLLM directory
+cd /opt/vllm
+# apply patch
+git apply /your/path/to/FlexKV/examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
+```
+
+### FlexKV Verification
+
+Please refer to the test scripts in [vLLM online serving](../../docs/vllm_adapter/README_en.md).
+
+## 2. Dynamo Modifications
+
+### kv_transfer_config
+
+To integrate with FlexKV, you need to modify the `kv_transfer_config` in the Dynamo image. Change lines 245-248 in `/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/args.py` to:
+
+```python
+kv_transfer_config = KVTransferConfig(
+ kv_connector="FlexKVConnectorV1", kv_role="kv_both"
+)
+logger.info("Using FlexKVConnectorV1 configuration")
+```
+
+### CPU Offloading
+
+In Dynamo, the KV Router updates its KV index from events sent by the workers, which lets it track the KV cache status on each worker. When CPU offloading is enabled in FlexKV, we remove the block-removal call ([BlockRemove](https://github.com/vllm-project/vllm/blob/v0.10.1.1/vllm/v1/core/block_pool.py#L221)) in vLLM, so that FlexKV can cache all KV blocks in CPU memory during serving. This keeps the index maintained by the KV Router consistent with the actual index in FlexKV.
+
+## 3. Starting and Verifying Dynamo Services
+
+### Starting Dynamo + FlexKV
+
+```bash
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+set -e
+trap 'echo Cleaning up...; kill 0' EXIT
+
+# Start nats and etcd
+nats-server -js &
+
+etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
+
+sleep 3
+
+# run ingress, set routing mode with --router-mode, options include kv, round-robin, random
+python -m dynamo.frontend --router-mode kv --http-port 8000 &
+
+# Define number of worker nodes
+NUM_WORKERS=4
+
+# When using multiple workers, ensure FlexKV ports are different to avoid hanging at flexkv init
+# Adjust num_cpu_blocks and num_ssd_blocks values according to your server configuration
+for i in $(seq 0 $((NUM_WORKERS-1))); do
+    cat <<EOF > ./flexkv_config_${i}.json
+{
+ "enable_flexkv": true,
+ "server_recv_port": "ipc:///tmp/flexkv_${i}_test",
+ "cache_config": {
+ "enable_cpu": true,
+ "enable_ssd": false,
+ "enable_remote": false,
+ "use_gds": false,
+ "enable_trace": false,
+ "ssd_cache_iouring_entries": 512,
+ "tokens_per_block": 64,
+ "num_cpu_blocks": 10240,
+ "num_ssd_blocks": 256000,
+ "ssd_cache_dir": "/data/flexkv_ssd/",
+ "evict_ratio": 0.05,
+ "index_accel": true
+ },
+ "num_log_interval_requests": 200
+}
+EOF
+done
+
+# Use a loop to start worker nodes
+for i in $(seq 0 $((NUM_WORKERS-1))); do
+ # Calculate GPU device IDs
+ GPU_START=$((i*2))
+ GPU_END=$((i*2+1))
+
+ if [ $i -lt $((NUM_WORKERS-1)) ]; then
+ FLEXKV_CONFIG_PATH="./flexkv_config_${i}.json" CUDA_VISIBLE_DEVICES=${GPU_START},${GPU_END} python3 -m dynamo.vllm --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --tensor_parallel_size 2 --block-size 64 --gpu-memory-utilization 0.9 --max-model-len 100310 &
+ else
+ FLEXKV_CONFIG_PATH="./flexkv_config_${i}.json" CUDA_VISIBLE_DEVICES=${GPU_START},${GPU_END} python3 -m dynamo.vllm --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --tensor_parallel_size 2 --block-size 64 --gpu-memory-utilization 0.9 --max-model-len 100310
+ fi
+done
+```
+
+> Note: The `flexkv_config.json` configuration is provided as a simple example only. For full parameter options, please refer to [`docs/flexkv_config_reference/README_en.md`](../../docs/flexkv_config_reference/README_en.md)
+
+### Verification
+
+You can verify that the Dynamo service has started correctly with the following command:
+```bash
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+ "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Tell me a joke."
+ }
+ ],
+ "stream":false,
+ "max_tokens": 30
+ }'
+```
+
+## 4. Benchmark
+
+We use [genai-perf](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf) as our benchmark tool and [mooncake trace](https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#-open-source-trace) as our dataset to evaluate the performance of Dynamo + FlexKV.
+
+Mooncake Trace is an open-source request trace saved in jsonl format. Each record contains the request arrival timestamp, input sequence length (ISL), output sequence length (OSL), and KV-cache-related hash IDs; the full trace contains 23,608 requests over a 1-hour period. For our experiment with 4 LLaMA-70B workers, the concurrency of the full trace was too high, so we kept every 6th request from the trace to build our benchmark dataset.
+
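The 1-in-6 sampling step above can be sketched in a few lines of Python (the file names are examples, not files shipped with FlexKV):

```python
# Keep every `stride`-th request from a jsonl trace to reduce concurrency.
# File names below are examples; point them at your local copy of the trace.
def sample_trace(src: str, dst: str, stride: int = 6) -> int:
    kept = 0
    with open(src) as fin, open(dst, "w") as fout:
        for i, line in enumerate(fin):
            if i % stride == 0:  # keep requests 0, 6, 12, ...
                fout.write(line)
                kept += 1
    return kept

# sample_trace("mooncake_trace.jsonl", "mooncake_trace_1_6.jsonl")
```

The resulting `mooncake_trace_1_6.jsonl` is the input file passed to genai-perf.
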
+genai-perf can send requests according to the timestamps in the trace file and calculate metrics such as TTFT (Time To First Token) and TPOT (Time Per Output Token) for the LLM service. The command is as follows. Please use genai-perf==0.0.13, as newer versions have a bug in timestamp parsing.
+
+```bash
+genai-perf profile --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-70B --endpoint-type chat --endpoint /v1/chat/completions --streaming --url http://localhost:8000 --input-file payload:mooncake_trace_1_6.jsonl --random-seed 100 -v -H 'Authorization: Bearer NOT USED' -H 'Accept: text/event-stream' -- --stability-percentage 99
+```
\ No newline at end of file
diff --git a/docs/dynamo_integration/README_zh.md b/docs/dynamo_integration/README_zh.md
new file mode 100644
index 0000000000..651b0d9aef
--- /dev/null
+++ b/docs/dynamo_integration/README_zh.md
@@ -0,0 +1,155 @@
+# FlexKV 与 Dynamo 集成指南
+
+该文档展示了如何将FlexKV和NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) 框架集成,并完成性能测试的步骤。
+
+Dynamo是NVIDIA专为大规模分离式部署而设计的框架,支持TensorRT-LLM, vLLM, SGLang等多个后端引擎。其中KV 路由器(KV Router)是一个智能的请求路由组件, 它能够追踪和管理存储在不同worker上的 KV cache,并根据请求与缓存的重叠程度和worker当前负载,智能地将请求分配给最合适的 GPU 节点,从而减少昂贵的 KV 缓存重新计算,提高推理效率。文档也介绍了如何在开启KV Router时,将FlexKV集成进Dynamo。
+
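上述“重叠程度 + 负载”的路由决策可以用如下Python草图示意(仅为示意性假设,`route`函数及其打分方式并非Dynamo KV Router的真实实现):

```python
# Hypothetical sketch of KV-aware routing: pick the worker whose cached
# blocks overlap most with the request, penalized by its current load.
# The scoring weights and inputs are assumptions, not Dynamo's real logic.
def route(request_blocks: set, workers: dict) -> str:
    """workers maps worker_id -> {"cached_blocks": set, "load": int}."""
    def score(worker_id: str) -> int:
        info = workers[worker_id]
        overlap = len(request_blocks & info["cached_blocks"])  # reusable blocks
        return overlap - info["load"]  # higher overlap and lower load win
    return max(workers, key=score)

workers = {
    "worker0": {"cached_blocks": {1, 2, 3, 4}, "load": 2},
    "worker1": {"cached_blocks": {1, 2}, "load": 0},
}
# worker0 scores 3 - 2 = 1, worker1 scores 2 - 0 = 2
print(route({1, 2, 3}, workers))  # -> worker1
```
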
+## 1. 环境准备
+
+### Dynamo 镜像
+
+该文档使用的是后端为vLLM的Dynamo 0.4.1 镜像,内置了vLLM 0.10.1.1。
+
+```bash
+docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
+```
+
+### FlexKV代码准备
+
+```bash
+git clone https://github.com/taco-project/FlexKV
+```
+
+### 安装 FlexKV
+
+```bash
+apt update && apt install liburing-dev
+
+cd FlexKV && ./build.sh
+```
+
+### vLLM Apply Patch
+
+```bash
+# 进入 vLLM 目录
+cd /opt/vllm
+# apply patch
+git apply /your/path/to/FlexKV/examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
+```
+
+### FlexKV 验证
+
+请参考[vLLM online serving](../../docs/vllm_adapter/README_zh.md#%E7%A4%BA%E4%BE%8B)里的测试脚本。
+
+
+## 2. Dynamo 配置修改
+
+### kv_transfer_config
+
+为了和FlexKV集成,需要修改Dynamo镜像内的`kv_transfer_config`。将 `/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/args.py` 的245-248行修改为:
+
+```python
+kv_transfer_config = KVTransferConfig(
+ kv_connector="FlexKVConnectorV1", kv_role="kv_both"
+)
+logger.info("Using FlexKVConnectorV1 configuration")
+```
+
+### CPU Offloading
+
+在Dynamo中,KV Router通过接收worker发送的event来更新KV index,从而感知每个worker上的KV cache情况。当FlexKV开启CPU offloading时,我们移除vLLM里的[BlockRemove](https://github.com/vllm-project/vllm/blob/v0.10.1.1/vllm/v1/core/block_pool.py#L221),让FlexKV能够通过CPU缓存serving过程中的所有KV block,这样KV Router维护的index就能反映FlexKV的真实index。
+
+## 3. 启动和验证Dynamo服务
+
+### 启动Dynamo + FlexKV
+
+```bash
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+set -e
+trap 'echo Cleaning up...; kill 0' EXIT
+
+# 启动nats和etcd
+nats-server -js &
+
+etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
+
+sleep 3
+
+# run ingress, 通过--router-mode设置路由方式,可选项为kv, round-robin, random
+python -m dynamo.frontend --router-mode kv --http-port 8000 &
+
+# 定义工作节点数量
+NUM_WORKERS=4
+
+# 多个worker时注意FlexKV的端口应不同,否则会卡在flexkv init这一步
+# 请根据服务器的配置,调整num_cpu_blocks和num_ssd_blocks的数值
+for i in $(seq 0 $((NUM_WORKERS-1))); do
+    cat <<EOF > ./flexkv_config_${i}.json
+{
+ "enable_flexkv": true,
+ "server_recv_port": "ipc:///tmp/flexkv_${i}_test",
+ "cache_config": {
+ "enable_cpu": true,
+ "enable_ssd": false,
+ "enable_remote": false,
+ "use_gds": false,
+ "enable_trace": false,
+ "ssd_cache_iouring_entries": 512,
+ "tokens_per_block": 64,
+ "num_cpu_blocks": 10240,
+ "num_ssd_blocks": 256000,
+ "ssd_cache_dir": "/data/flexkv_ssd/",
+ "evict_ratio": 0.05,
+ "index_accel": true
+ },
+ "num_log_interval_requests": 200
+}
+EOF
+done
+
+# 使用for循环启动工作节点
+for i in $(seq 0 $((NUM_WORKERS-1))); do
+ # 计算GPU设备ID
+ GPU_START=$((i*2))
+ GPU_END=$((i*2+1))
+
+ if [ $i -lt $((NUM_WORKERS-1)) ]; then
+ FLEXKV_CONFIG_PATH="./flexkv_config_${i}.json" CUDA_VISIBLE_DEVICES=${GPU_START},${GPU_END} python3 -m dynamo.vllm --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --tensor_parallel_size 2 --block-size 64 --gpu-memory-utilization 0.9 --max-model-len 100310 &
+ else
+ FLEXKV_CONFIG_PATH="./flexkv_config_${i}.json" CUDA_VISIBLE_DEVICES=${GPU_START},${GPU_END} python3 -m dynamo.vllm --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --tensor_parallel_size 2 --block-size 64 --gpu-memory-utilization 0.9 --max-model-len 100310
+ fi
+done
+```
+
+> 注:`flexkv_config.json`配置仅为简单示例,选项请参考[`docs/flexkv_config_reference/README_zh.md`](../../docs/flexkv_config_reference/README_zh.md)
+
+### 验证
+
+可通过如下命令验证Dynamo服务是否正确启动:
+```bash
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+ "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Tell me a joke."
+ }
+ ],
+ "stream":false,
+ "max_tokens": 30
+ }'
+```
+## 4. Benchmark
+
+我们使用[genai-perf](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf)作为benchmark工具、[mooncake trace](https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#-open-source-trace)作为数据集来评估Dynamo + FlexKV的性能。
+
+Mooncake Trace 是一个开源请求记录文件,以jsonl格式保存。它记录了请求到达的时间戳、输入文本长度、输出文本长度以及与缓存有关的hash id等信息,包含了1小时内的23608个请求。我们的实验资源是4个LLaMA-70B worker,mooncake trace对于该配置来说并发太高了,于是我们从mooncake trace里每6个抽取1个request,构建了用于benchmark的数据集。
+
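上述按1/6抽样的步骤可以用几行Python实现(文件名仅为示例):

```python
# Keep every `stride`-th request from a jsonl trace to reduce concurrency.
# File names below are examples; point them at your local copy of the trace.
def sample_trace(src: str, dst: str, stride: int = 6) -> int:
    kept = 0
    with open(src) as fin, open(dst, "w") as fout:
        for i, line in enumerate(fin):
            if i % stride == 0:  # keep requests 0, 6, 12, ...
                fout.write(line)
                kept += 1
    return kept

# sample_trace("mooncake_trace.jsonl", "mooncake_trace_1_6.jsonl")
```

生成的`mooncake_trace_1_6.jsonl`即为传给genai-perf的输入文件。
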
+genai-perf可以根据trace文件里的时间戳来发送请求,统计LLM服务的TTFT、TPOT等指标,命令如下。请使用genai-perf==0.0.13,更新的版本存在解析时间戳的bug。
+
+```bash
+genai-perf profile --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-70B --endpoint-type chat --endpoint /v1/chat/completions --streaming --url http://localhost:8000 --input-file payload:mooncake_trace_1_6.jsonl --random-seed 100 -v -H 'Authorization: Bearer NOT USED' -H 'Accept: text/event-stream' -- --stability-percentage 99
+```
\ No newline at end of file
diff --git a/docs/flexkv_config_reference/README_en.md b/docs/flexkv_config_reference/README_en.md
new file mode 100644
index 0000000000..f91ca77ba9
--- /dev/null
+++ b/docs/flexkv_config_reference/README_en.md
@@ -0,0 +1,147 @@
+# FlexKV Configuration Guide
+
+This guide explains how to configure and use the FlexKV online serving configuration file (`flexkv_config.json`), including the meaning of all parameters, recommended values, and typical usage scenarios.
+
+---
+
+## Recommended Configuration
+
+Below is a production-grade recommended configuration that balances performance and stability:
+
+```json
+{
+ "enable_flexkv": true,
+ "server_recv_port": "ipc:///tmp/flexkv_test",
+ "cache_config": {
+ "enable_cpu": true,
+ "enable_ssd": true,
+ "enable_remote": false,
+ "use_gds": false,
+ "enable_trace": false,
+ "ssd_cache_iouring_entries": 512,
+ "tokens_per_block": 64,
+ "num_cpu_blocks": 233000,
+ "num_ssd_blocks": 4096000,
+ "ssd_cache_dir": "/data/flexkv_ssd/",
+ "evict_ratio": 0.05,
+ "index_accel": true
+ },
+ "num_log_interval_requests": 2000
+}
+```
+- `num_cpu_blocks` and `num_ssd_blocks` represent the total number of blocks in CPU memory and SSD respectively. These values must be configured according to your machine specs and model size. See [Cache Capacity Configuration](#cache-capacity-config) for calculation details.
+- `ssd_cache_dir` specifies the directory where SSD-stored KV cache files are saved.
+
+---
+
+## Configuration File Structure Overview
+
+The FlexKV configuration file is a JSON file, primarily consisting of three parts:
+
+- `enable_flexkv`: Whether to enable FlexKV (must be set to `true` to take effect).
+- `server_recv_port`: The IPC port on which the FlexKV service listens.
+- `cache_config`: The core cache configuration object, containing all cache behavior parameters.
+- `num_log_interval_requests`: Log statistics interval (outputs performance log every N requests).
+
+---
+
+## Complete `cache_config` Parameter Reference (from [`flexkv/common/config.py`](../../flexkv/common/config.py))
+
+### Basic Configuration
+
+| Parameter Name | Type | Default | Description |
+|----------------|------|---------|-------------|
+| `tokens_per_block` | int | 16 | Number of tokens per KV block. Must match the `block_size` used in the acceleration framework (e.g., vLLM). |
+| `enable_cpu` | bool | true | Whether to enable CPU memory as a cache layer. Strongly recommended to enable. |
+| `enable_ssd` | bool | false | Whether to enable SSD as a cache layer. Recommended if NVMe SSD is available. |
+| `enable_remote` | bool | false | Whether to enable remote cache (e.g., scalable cloud storage). Requires remote cache engine and custom implementation. |
+| `use_gds` | bool | false | Whether to use GPU Direct Storage (GDS) to accelerate SSD I/O. Not currently supported. |
+| `index_accel` | bool | false | Whether to enable C++ RadixTree. Recommended to enable. |
+
+---
+
+### KV Cache Layout Types (Generally No Need to Modify)
+
+| Parameter Name | Type | Default | Description |
+|----------------|------|---------|-------------|
+| `gpu_kv_layout_type` | enum | LAYERWISE | Organization of KV cache on GPU (layer-wise or block-wise). Must match vLLM’s layout (currently `LAYERWISE`). |
+| `cpu_kv_layout_type` | enum | BLOCKWISE | Organization on CPU. Recommended to use `BLOCKWISE`. Does not need to match vLLM. |
+| `ssd_kv_layout_type` | enum | BLOCKWISE | Organization on SSD. Recommended to use `BLOCKWISE`. Does not need to match vLLM. |
+| `remote_kv_layout_type` | enum | BLOCKWISE | Organization for remote cache. Must be defined according to remote backend’s layout. |
+
+> Note: Do not modify layout types unless you have specific performance requirements.
+
+---
+
+### Cache Capacity Configuration
+
+| Parameter Name | Type | Default | Description |
+|----------------|------|---------|-------------|
+| `num_cpu_blocks` | int | 1000000 | Number of blocks allocated in CPU memory. Adjust based on available RAM. |
+| `num_ssd_blocks` | int | 10000000 | Number of blocks allocated on SSD. |
+| `num_remote_blocks` | int \| None | None | Number of blocks allocated in remote cache. |
+
+> Note: Block size in all cache levels (CPU/SSD/Remote) matches the GPU block size. Estimate cache capacities based on GPU KV cache memory usage and block count.
+
+> Note: `block_size = num_layer * _kv_dim * tokens_per_block * num_head * head_size * dtype_size`.
+
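As a worked example of this formula (all model dimensions below are assumed values for a hypothetical 70B-class GQA model, not taken from any specific checkpoint):

```python
# block_size for one hypothetical model configuration (all values assumed):
num_layer = 80         # transformer layers
kv_dim = 2             # K and V
tokens_per_block = 64  # must match vLLM's block_size
num_head = 8           # KV heads (GQA)
head_size = 128
dtype_size = 2         # bytes per element for fp16/bf16

block_size = num_layer * kv_dim * tokens_per_block * num_head * head_size * dtype_size
print(block_size)  # 20971520 bytes, i.e. 20 MiB per block
# With 20 MiB blocks, 200 GiB of CPU cache memory would hold
# 200 * 1024 // 20 = 10240 blocks.
```
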
+---
+
+### CPU-GPU Transfer Optimization
+
+| Parameter Name | Type | Default | Description |
+|----------------|------|---------|-------------|
+| `use_ce_transfer_h2d` | bool | false | Whether to use CUDA Copy Engine for Host→Device transfers. Reduces SM usage but may slightly reduce bandwidth. Real-world difference is minimal. |
+| `use_ce_transfer_d2h` | bool | false | Whether to use CUDA Copy Engine for Device→Host transfers. |
+| `transfer_sms_h2d` | int | 8 | Number of SMs (Streaming Multiprocessors) allocated for H2D transfers. |
+| `transfer_sms_d2h` | int | 8 | Number of SMs allocated for D2H transfers. |
+
+---
+
+### SSD Cache Configuration
+
+| Parameter Name | Type | Default | Description |
+|----------------|------|---------|-------------|
+| `max_blocks_per_file` | int | 32000 | Maximum number of blocks per SSD file. `-1` means unlimited. |
+| `ssd_cache_dir` | str \| List[str] | None | **Required.** Path to SSD cache directory, e.g., `"/data/flexkv_ssd/"`. |
+| `ssd_cache_iouring_entries` | int | 0 | io_uring queue depth. Recommended: `512` for significantly improved concurrent I/O performance. |
+| `ssd_cache_iouring_flags` | int | 0 | io_uring flags. Keep as `0` in most cases. |
+
+> Note: To maximize bandwidth across multiple SSDs, bind each SSD to a separate directory and specify them as a list:
+> `"ssd_cache_dir": ["/data0/flexkv_ssd/", "/data1/flexkv_ssd/"]`.
+> KV blocks will be evenly distributed across all SSDs.
+
+> Note: Setting `ssd_cache_iouring_entries` to `0` disables io_uring. Not recommended.
+
+---
+
+### Remote Cache Configuration (Skip if not enabled)
+
+| Parameter Name | Type | Default | Description |
+|----------------|------|---------|-------------|
+| `remote_cache_size_mode` | str | "file_size" | Allocate remote cache space by file size or block count. |
+| `remote_file_size` | int \| None | None | Size (in bytes) of each remote file. |
+| `remote_file_num` | int \| None | None | Number of remote files. |
+| `remote_file_prefix` | str \| None | None | Prefix for remote file names. |
+| `remote_cache_path` | str \| List[str] | None | Remote cache path (e.g., Redis URL, S3 path). |
+| `remote_config_custom` | dict \| None | None | Custom remote cache configurations (e.g., timeout, authentication). |
+
+---
+
+### Tracing and Logging
+
+| Parameter Name | Type | Default | Description |
+|----------------|------|---------|-------------|
+| `enable_trace` | bool | true | Whether to enable performance tracing. Disable (`false`) in production to reduce overhead. |
+| `trace_file_path` | str | "./flexkv_trace.log" | Path to trace log file. |
+| `trace_max_file_size_mb` | int | 100 | Maximum size (MB) per trace log file. |
+| `trace_max_files` | int | 5 | Maximum number of trace log files to retain. |
+| `trace_flush_interval_ms` | int | 1000 | Trace log flush interval (milliseconds). |
+
+---
+
+### Cache Eviction Policy
+
+| Parameter Name | Type | Default | Description |
+|----------------|------|---------|-------------|
+| `evict_ratio` | float | 0.0 | Ratio of blocks to proactively evict from CPU/SSD per eviction cycle. `0.0` = evict only the minimal necessary blocks (more eviction cycles may impact performance). Recommended: `0.05` (evict 5% of least recently used blocks per cycle). |
\ No newline at end of file
diff --git a/docs/flexkv_config_reference/README_zh.md b/docs/flexkv_config_reference/README_zh.md
new file mode 100644
index 0000000000..1752f844bf
--- /dev/null
+++ b/docs/flexkv_config_reference/README_zh.md
@@ -0,0 +1,145 @@
+# FlexKV 配置使用指南
+
+本指南详细说明如何配置和使用 FlexKV 的在线服务配置文件(`flexkv_config.json`),涵盖所有参数的含义、推荐值及典型使用场景。
+
+---
+
+## 推荐配置方案
+
+以下是一个兼顾性能与稳定性的生产级推荐配置:
+
+```json
+{
+ "enable_flexkv": true,
+ "server_recv_port": "ipc:///tmp/flexkv_test",
+ "cache_config": {
+ "enable_cpu": true,
+ "enable_ssd": true,
+ "enable_remote": false,
+ "use_gds": false,
+ "enable_trace": false,
+ "ssd_cache_iouring_entries": 512,
+ "tokens_per_block": 64,
+ "num_cpu_blocks": 233000,
+ "num_ssd_blocks": 4096000,
+ "ssd_cache_dir": "/data/flexkv_ssd/",
+ "evict_ratio": 0.05,
+ "index_accel": true
+ },
+ "num_log_interval_requests": 2000
+}
+```
+- 其中的`num_cpu_blocks`和`num_ssd_blocks`分别代表内存和SSD中block的总数量,需要根据实际机器配置和模型来配置,具体计算方式见下文[缓存容量配置](#cache-capacity-config)
+- `ssd_cache_dir`为ssd中KVCache存放的文件目录
+
+---
+
+## 配置文件结构概览
+
+FlexKV 的配置文件是一个 JSON 文件,主要包含三个部分:
+
+- `enable_flexkv`: 是否启用 FlexKV 功能(必须设为 `true` 才生效)
+- `server_recv_port`: FlexKV 服务监听的 IPC 端口
+- `cache_config`: 核心缓存配置对象,包含所有缓存行为参数
+- `num_log_interval_requests`: 日志统计间隔(每处理 N 个请求输出一次性能日志)
+
+---
+
+## cache_config完整参数详解(来自 [`flexkv/common/config.py`](../../flexkv/common/config.py))
+
+### 基础配置
+
+| 参数名 | 类型 | 默认值 | 说明 |
+|--------|------|--------|------|
+| `tokens_per_block` | int | 16 | 每个 KV Block 包含的 token 数量。需要与加速框架(如vLLM)中`block_size`保持一致 |
+| `enable_cpu` | bool | true | 是否启用 CPU 内存作为缓存层。强烈建议开启。 |
+| `enable_ssd` | bool | false | 是否启用 SSD 作为缓存层。如配备 NVMe SSD,建议开启。 |
+| `enable_remote` | bool | false | 是否启用远程缓存(如可扩展云存储等)。需要配合远程缓存和自定义的远程缓存引擎使用 |
+| `use_gds` | bool | false | 是否使用 GPU Direct Storage(GDS)加速 SSD 读写。目前暂不支持。 |
+| `index_accel` | bool | false | 是否启用C++ RadixTree。推荐开启。 |
+
+---
+
+### KV 缓存布局类型(一般无需修改)
+
+| 参数名 | 类型 | 默认值 | 说明 |
+|--------|------|--------|------|
+| `gpu_kv_layout_type` | enum | LAYERWISE | GPU 上 KV Cache 的组织方式(按层或按块)。目前vLLM在GPU组织方式为`LAYERWISE`,因此FlexKV的`gpu_kv_layout_type`须与vLLM保持一致 |
+| `cpu_kv_layout_type` | enum | BLOCKWISE | CPU 上按块组织, 推荐使用`BLOCKWISE`,不需要与vLLM保持一致 |
+| `ssd_kv_layout_type` | enum | BLOCKWISE | SSD 上按块组织, 推荐使用`BLOCKWISE`,不需要与vLLM保持一致 |
+| `remote_kv_layout_type` | enum | BLOCKWISE | 远程缓存按块组织, 需要按照remote组织形式定义 |
+
+> 注:除非有特殊性能需求,否则不建议修改布局类型。
+
+---
+
+### 缓存容量配置
+
+| 参数名 | 类型 | 默认值 | 说明 |
+|--------|------|--------|------|
+| `num_cpu_blocks` | int | 1000000 | CPU 缓存块数。根据内存大小调整。|
+| `num_ssd_blocks` | int | 10000000 | SSD 缓存块数。|
+| `num_remote_blocks` | int \| None | None | 远程缓存块数。|
+
+> 注:FlexKV里的各级缓存的block大小与GPU中的block大小保持一致,可以参考GPU的KVCache显存大小与block数量估算各级缓存中的block数量。
+
+> 注:block_size = num_layer * _kv_dim * tokens_per_block * num_head * head_size * dtype_size。
+
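以该公式做一次示例计算(以下模型参数均为假设值,仅作演示):

```python
# block_size for one hypothetical model configuration (all values assumed):
num_layer = 80         # transformer layers
kv_dim = 2             # K and V
tokens_per_block = 64  # must match vLLM's block_size
num_head = 8           # KV heads (GQA)
head_size = 128
dtype_size = 2         # bytes per element for fp16/bf16

block_size = num_layer * kv_dim * tokens_per_block * num_head * head_size * dtype_size
print(block_size)  # 20971520 bytes, i.e. 20 MiB per block
# With 20 MiB blocks, 200 GiB of CPU cache memory would hold
# 200 * 1024 // 20 = 10240 blocks.
```
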
+---
+
+### CPU-GPU 传输优化
+
+| 参数名 | 类型 | 默认值 | 说明 |
+|--------|------|--------|------|
+| `use_ce_transfer_h2d` | bool | false | 是否使用 cuda copy engine 优化 Host→Device 传输,使用CE可以减少GPU SM在传输上的使用,但是传输速度会降低,实际测试差距不大 |
+| `use_ce_transfer_d2h` | bool | false | 是否使用 cuda copy engine 优化 Device→Host 传输 |
+| `transfer_sms_h2d` | int | 8 | H2D 传输使用的流处理器数量 |
+| `transfer_sms_d2h` | int | 8 | D2H 传输使用的流处理器数量 |
+
+---
+
+### SSD 缓存配置
+
+| 参数名 | 类型 | 默认值 | 说明 |
+|--------|------|--------|------|
+| `max_blocks_per_file` | int | 32000 | 单个 SSD 文件最多包含的 block 数。-1 表示无限制 |
+| `ssd_cache_dir` | str \| List[str] | None | SSD 缓存目录路径,**必须设置**,如 `"/data/flexkv_ssd/"` |
+| `ssd_cache_iouring_entries` | int | 0 | io_uring 队列深度。推荐设为 `512`,实测相比不使用 io_uring,并发 IO 性能提升明显 |
+| `ssd_cache_iouring_flags` | int | 0 | io_uring 标志位,一般保持 0 |
+
+> 注:为了充分利用多块SSD的带宽,可以将多块SSD绑定至不同目录,并以列表形式指定,如 `"ssd_cache_dir": ["/data0/flexkv_ssd/", "/data1/flexkv_ssd/"]`,KV block 会均匀分布在所有SSD中。
+
+> 注:`ssd_cache_iouring_entries` 设置为 0 即不使用 io_uring,不推荐。
+
+---
+
+### 远程缓存配置(不启用时无需配置)
+
+| 参数名 | 类型 | 默认值 | 说明 |
+|--------|------|--------|------|
+| `remote_cache_size_mode` | str | "file_size" | 按文件大小或块数分配远程缓存空间 |
+| `remote_file_size` | int \| None | None | 单个远程文件大小(字节) |
+| `remote_file_num` | int \| None | None | 远程文件数量 |
+| `remote_file_prefix` | str \| None | None | 远程文件名前缀 |
+| `remote_cache_path` | str \| List[str] | None | 远程缓存路径(如 Redis URL、S3 路径等) |
+| `remote_config_custom` | dict \| None | None | 自定义远程缓存配置(如超时、认证等) |
+
+---
+
+### 追踪与日志
+
+| 参数名 | 类型 | 默认值 | 说明 |
+|--------|------|--------|------|
+| `enable_trace` | bool | true | 是否启用性能追踪。生产环境建议关闭(`false`)以减少开销 |
+| `trace_file_path` | str | "./flexkv_trace.log" | 追踪日志路径 |
+| `trace_max_file_size_mb` | int | 100 | 单个追踪文件最大大小(MB) |
+| `trace_max_files` | int | 5 | 最多保留的追踪文件数 |
+| `trace_flush_interval_ms` | int | 1000 | 追踪日志刷新间隔(毫秒) |
+
+---
+
+### 缓存淘汰策略
+
+| 参数名 | 类型 | 默认值 | 说明 |
+|--------|------|--------|------|
+| `evict_ratio` | float | 0.0 | CPU/SSD 每次 evict 时主动淘汰的 block 比例。`0.0` 表示每次只淘汰最少必要数量的 block(淘汰次数过多会影响性能)。建议设为 `0.05`,即每次淘汰 5% 最久未使用的 block |
diff --git a/docs/vllm_adapter/README_en.md b/docs/vllm_adapter/README_en.md
index 781b3ad3ee..972cade803 100644
--- a/docs/vllm_adapter/README_en.md
+++ b/docs/vllm_adapter/README_en.md
@@ -63,6 +63,8 @@ VLLM_USE_V1=1 python -m vllm.entrypoints.cli.main serve Qwen3/Qwen3-32B \
```
+> Note: The `flexkv_config.json` configuration is provided as a simple example only. For full parameter options, please refer to [`docs/flexkv_config_reference/README_en.md`](../../docs/flexkv_config_reference/README_en.md)
+
## Legacy Version (<= 0.1.0) – Not Recommended for Current Use
### Supported Versions
diff --git a/docs/vllm_adapter/README_zh.md b/docs/vllm_adapter/README_zh.md
index 0e7ce7687e..bb9b51c292 100644
--- a/docs/vllm_adapter/README_zh.md
+++ b/docs/vllm_adapter/README_zh.md
@@ -62,6 +62,8 @@ VLLM_USE_V1=1 python -m vllm.entrypoints.cli.main serve Qwen3/Qwen3-32B \
```
+> 注:`flexkv_config.json`配置仅为简单示例,选项请参考[`docs/flexkv_config_reference/README_zh.md`](../../docs/flexkv_config_reference/README_zh.md)
+
## Legacy版本(<= 0.1.0),目前的版本尽量不要使用
### 适用版本