Merged
Changes from 12 commits
8 changes: 6 additions & 2 deletions README.md
@@ -12,7 +12,7 @@ FlexKV is released under the **Apache-2.0 License**. See the [LICENSE](LICENSE)

```bash
apt install liburing-dev
apt install libxxhash-dev
```
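The package names above are the Debian/Ubuntu ones; other distros may install the headers elsewhere. A hedged sanity check (default `/usr/include` paths assumed) that both development headers are visible before building:

```bash
# Report whether each required header is present (Debian/Ubuntu default paths).
# Prints one line per header, e.g. "liburing.h found" or "xxhash.h missing".
for h in liburing.h xxhash.h; do
  if [ -f "/usr/include/$h" ]; then
    echo "$h found"
  else
    echo "$h missing"
  fi
done
```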

### Build FlexKV
@@ -26,6 +26,10 @@ apt install libxxhash-dev

See [docs/vllm_adapter/README_en.md](docs/vllm_adapter/README_en.md)

### Use FlexKV with TensorRT-LLM

See [docs/trtllm_adaption/README_en.md](docs/trtllm_adaption/README_en.md)

### FlexKV Integration with Dynamo

See [docs/dynamo_integration/README_en.md](docs/dynamo_integration/README_en.md)
@@ -90,7 +94,7 @@ FlexKV performs:

## Roadmap

- **In-Process Cache Engine Integration**: In the dev branch, the implementation, integration, and invocation of the Cache Engine will be further optimized, along with synchronized updates to related APIs.
- **Framework Integration**: Support for vLLM, SGLang, and other acceleration frameworks will be released soon.
- **Distributed Query Support**: Enable scalable, distributed KVCache lookup.
- **Latency Optimization**: Further reduce *get* latency via smarter prefetching and compression.
10 changes: 7 additions & 3 deletions README_zh.md
@@ -12,7 +12,7 @@ FlexKV is released under the **Apache-2.0 License**; for details see [LICENSE](LICE

```bash
apt install liburing-dev
apt install libxxhash-dev
```

### Build FlexKV
@@ -22,10 +22,14 @@ apt install libxxhash-dev
#./build.sh --release for cython package
```

### Using FlexKV with vLLM as an Example
### Using FlexKV in vLLM

See [docs/vllm_adapter/README_zh.md](docs/vllm_adapter/README_zh.md)

### Using FlexKV in TensorRT-LLM

See [docs/trtllm_adaption/README_zh.md](docs/trtllm_adaption/README_zh.md)

### FlexKV Integration with Dynamo

See [docs/dynamo_integration/README_zh.md](docs/dynamo_integration/README_zh.md)
@@ -93,4 +97,4 @@ When FlexKV handles a *get* request:
- **In-Process Cache Engine Integration**: The dev branch will further optimize the implementation, integration, and invocation of the Cache Engine, with synchronized updates to the related APIs
- **Framework Support**: Adapters for vLLM, SGLang, and other mainstream inference frameworks will be released progressively
- **Distributed Query Support**: Enable scalable, distributed KVCache lookup
- **Latency Optimization**: Further reduce *get* latency via prefetching, compression, and other techniques
74 changes: 74 additions & 0 deletions docs/trtllm_adaption/README_en.md
@@ -0,0 +1,74 @@
# Using FlexKV in TensorRT-LLM
## 1. Environment Setup

### 1.1 Install TensorRT-LLM (Tag v1.1.0.rc2)
We are currently working with the community to merge the TensorRT-LLM adaptation code. Until it is merged into the main branch, there are two options:
#### 1.1.1 Method 1
You can apply the patch we provide and recompile:
```bash
cd TensorRT-LLM
git apply FLEXKV_DIR/examples/trtllm_adaption/trtllm_v1.1.0rc2.patch
```
Note: for TensorRT-LLM build instructions, see the [build-from-source guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#build-from-source-linux).

#### 1.1.2 Method 2
You can also install our pre-compiled package:
```bash
pip install https://flexkv-1252113659.cos.ap-shanghai.myqcloud.com/TensorRT-LLM/tensorrt_llm-1.1.0rc2-cp312-cp312-linux_x86_64.whl
```
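The wheel filename encodes its target platform: `cp312` means CPython 3.12 and `linux_x86_64` means 64-bit x86 Linux. A quick pre-check (a `python3` on PATH assumed) that your interpreter matches, so pip does not reject the wheel as unsupported:

```bash
# Print the interpreter's tag components; they should read "cp312 x86_64"
# to match the wheel above.
python3 -c 'import sys, platform; print("cp%d%d" % sys.version_info[:2], platform.machine())'
```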

## 2. Running

### 2.1 Configure FlexKV

First, set the environment variable `TENSORRT_LLM_USE_FLEXKV` to enable FlexKV:
```bash
export TENSORRT_LLM_USE_FLEXKV=1
```

FlexKV can be configured via environment variables or a configuration file. For details, see [`docs/flexkv_config_reference/README_zh.md`](../../docs/flexkv_config_reference/README_zh.md). Below are two simple configuration examples.
##### Example 1: Enable CPU Offloading Only
Use 32GB of CPU memory as a secondary cache.
```bash
unset FLEXKV_CONFIG_PATH
export FLEXKV_CPU_CACHE_GB=32
```
##### Example 2: Enable SSD Offloading
Use 32GB of CPU memory and 1TB of SSD storage as secondary and tertiary caches respectively. (Assuming the machine has two SSDs mounted at /data0 and /data1.)
```bash
# generate config
cat <<EOF > ./flexkv_config.yml
cpu_cache_gb: 32
ssd_cache_gb: 1024
ssd_cache_dir: /data0/flexkv_ssd/;/data1/flexkv_ssd/
enable_gds: false
EOF
export FLEXKV_CONFIG_PATH="./flexkv_config.yml"
```
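Note that `ssd_cache_dir` takes a `;`-separated list of directories, one per SSD. A small sketch (illustrative paths under /tmp, not the real mount points) of how such a list splits and how the directories can be created ahead of time:

```bash
# Split a ';'-separated ssd_cache_dir value and create each directory (bash).
dirs="/tmp/data0/flexkv_ssd/;/tmp/data1/flexkv_ssd/"
IFS=';' read -ra parts <<< "$dirs"   # bash-specific array split on ';'
for d in "${parts[@]}"; do
  mkdir -p "$d"
done
echo "${#parts[@]} cache dirs"   # prints: 2 cache dirs
```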

### 2.2 Launch TensorRT-LLM
#### 2.2.1 Method 1: Using the Provided Example Script
```bash
cd FLEXKV_DIR/examples/trtllm_adaption
bash launch.sh YOUR_MODEL_PATH
```
Note: The `launch.sh` script will launch both TensorRT-LLM and FlexKV, and configure FlexKV through `flexkv_config.json` in the same directory.
#### 2.2.2 Method 2: Custom Launch
After configuring FlexKV according to the instructions in section [2.1](#21-configure-flexkv), add the following content to your `extra-llm-api-config.yml`:
```yaml
kv_cache_config:
enable_partial_reuse: false
kv_connector_config:
connector_module: "flexkv.integration.tensorrt_llm.trtllm_adapter"
connector_scheduler_class: "FlexKVSchedulerConnector"
connector_worker_class: "FlexKVWorkerConnector"
```

### 2.3 Potential TensorRT-LLM Issues
If you send TensorRT-LLM a request longer than `max_seq_len`, you may encounter an error similar to the following:
```
[W] `default_max_tokens` (-40205) should be greater than 0, `default_max_tokens` (-40205) = max_seq_len (40961) - `splited_prompt_len` (81166) - `query_token_len` (0)
[W] User-specified `max_tokens` (16384) is greater than deduced `default_max_tokens` (-40205), using default_max_tokens instead.
[E] submit request failed: [TensorRT-LLM][ERROR] Assertion failed: mMaxNewTokens > 0
```
This is caused by TensorRT-LLM itself not filtering requests longer than `max_seq_len`; it is unrelated to FlexKV. We are working with the community to fix this issue.
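The numbers in the log are consistent with the formula the warning prints: `default_max_tokens = max_seq_len - splited_prompt_len - query_token_len`. Plugging in the values from the log above:

```bash
# Reproduce the deduced default_max_tokens from the warning message.
max_seq_len=40961
splited_prompt_len=81166
query_token_len=0
echo $(( max_seq_len - splited_prompt_len - query_token_len ))   # prints -40205
```

Since the deduced value is negative, the assertion `mMaxNewTokens > 0` fails and the request is rejected.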
74 changes: 74 additions & 0 deletions docs/trtllm_adaption/README_zh.md
@@ -0,0 +1,74 @@
# Using FlexKV in TensorRT-LLM
## 1. Environment Setup

### 1.1 Install TensorRT-LLM (Tag v1.1.0.rc2)
We are currently working with the community to merge the TensorRT-LLM adaptation code. Until it is merged into the main branch, there are two options:
#### 1.1.1 Method 1
You can apply the patch we provide and recompile:
```bash
cd TensorRT-LLM
git apply FLEXKV_DIR/examples/trtllm_adaption/trtllm_v1.1.0rc2.patch
```
Note: for TensorRT-LLM build instructions, see the [build-from-source guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#build-from-source-linux).

#### 1.1.2 Method 2
You can also install our pre-compiled package:
```bash
pip install https://flexkv-1252113659.cos.ap-shanghai.myqcloud.com/TensorRT-LLM/tensorrt_llm-1.1.0rc2-cp312-cp312-linux_x86_64.whl
```

## 2. Running

### 2.1 Configure FlexKV

First, set the environment variable `TENSORRT_LLM_USE_FLEXKV` to enable FlexKV:
```bash
export TENSORRT_LLM_USE_FLEXKV=1
```

FlexKV can be configured via environment variables or a configuration file. For details, see [`docs/flexkv_config_reference/README_zh.md`](../../docs/flexkv_config_reference/README_zh.md). Below are two simple configuration examples.
##### Example 1: Enable CPU Offloading Only
Use 32GB of CPU memory as a secondary cache.
```bash
unset FLEXKV_CONFIG_PATH
export FLEXKV_CPU_CACHE_GB=32
```
##### Example 2: Enable SSD Offloading
Use 32GB of CPU memory and 1TB of SSD storage as secondary and tertiary caches, respectively. (Assuming the machine has two SSDs, mounted at /data0 and /data1.)
```bash
# generate config
cat <<EOF > ./flexkv_config.yml
cpu_cache_gb: 32
ssd_cache_gb: 1024
ssd_cache_dir: /data0/flexkv_ssd/;/data1/flexkv_ssd/
enable_gds: false
EOF
export FLEXKV_CONFIG_PATH="./flexkv_config.yml"
```

### 2.2 Launch TensorRT-LLM
#### 2.2.1 Method 1: Using the Provided Example Script
```bash
cd FLEXKV_DIR/examples/trtllm_adaption
bash launch.sh YOUR_MODEL_PATH
```
Note: the `launch.sh` script launches both TensorRT-LLM and FlexKV, and configures FlexKV through the `flexkv_config.json` in the same directory.
#### 2.2.2 Method 2: Custom Launch
After configuring FlexKV according to the instructions in section [2.1](#21-配置flexkv), add the following content to your `extra-llm-api-config.yml`:
```yaml
kv_cache_config:
enable_partial_reuse: false
kv_connector_config:
connector_module: "flexkv.integration.tensorrt_llm.trtllm_adapter"
connector_scheduler_class: "FlexKVSchedulerConnector"
connector_worker_class: "FlexKVWorkerConnector"
```

### 2.3 Potential TensorRT-LLM Issues
If you send TensorRT-LLM a request longer than `max_seq_len`, you may encounter an error similar to the following:
```
[W] `default_max_tokens` (-40205) should be greater than 0, `default_max_tokens` (-40205) = max_seq_len (40961) - `splited_prompt_len` (81166) - `query_token_len` (0)
[W] User-specified `max_tokens` (16384) is greater than deduced `default_max_tokens` (-40205), using default_max_tokens instead.
[E] submit request failed: [TensorRT-LLM][ERROR] Assertion failed: mMaxNewTokens > 0
```
This is caused by TensorRT-LLM itself not filtering requests longer than `max_seq_len`; it is unrelated to FlexKV. We are working with the community to fix this issue.
7 changes: 4 additions & 3 deletions docs/vllm_adapter/README_en.md
@@ -43,10 +43,11 @@ export FLEXKV_CONFIG_PATH="./flexkv_config.yml"
### Running
We provide an adaptation example based on **vLLM 0.10.1.1**:

1. apply patch
1. apply patch && install
```bash
# FLEXKV_DIR/examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
git apply examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
cd vllm
git apply FLEXKV_DIR/examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
pip install -e . # build and install vllm from source
```

2. offline test
7 changes: 4 additions & 3 deletions docs/vllm_adapter/README_zh.md
@@ -42,10 +42,11 @@ export FLEXKV_CONFIG_PATH="./flexkv_config.yml"
### Running
We provide an adaptation example based on **vLLM 0.10.1.1**:

1. apply patch
1. apply patch && install
```bash
# FLEXKV_DIR/examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
git apply examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
cd vllm
git apply FLEXKV_DIR/examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
pip install -e . # build and install vllm from source
```

2. offline test
@@ -15,6 +15,6 @@ kv_connector_config:
connector_module: "flexkv.integration.tensorrt_llm.trtllm_adapter"
connector_scheduler_class: "FlexKVSchedulerConnector"
connector_worker_class: "FlexKVWorkerConnector"
speculative_config:
decoding_type: MTP
num_nextn_predict_layers: 3
# speculative_config:
# decoding_type: MTP
# num_nextn_predict_layers: 3
2 changes: 1 addition & 1 deletion examples/trtllm_adaption/launch.sh
@@ -20,4 +20,4 @@ trtllm-serve serve $MODEL_PATH \
--max_seq_len $MAX_SEQ_LEN \
--max_num_tokens $MAX_NUM_TOKENS \
--max_batch_size $BATCH_SIZE \
--extra_llm_api_options extra-llm-api-config-cg.yml 2>&1 | tee logs/$TIMESTAMP.log
--extra_llm_api_options extra-llm-api-config.yml 2>&1 | tee logs/$TIMESTAMP.log