Merged
Changes from 12 commits
8 changes: 6 additions & 2 deletions README.md
@@ -12,7 +12,7 @@ FlexKV is released under the **Apache-2.0 License**. See the [LICENSE](LICENSE)

```bash
apt install liburing-dev
apt install libxxhash-dev
```
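The package names above are the Debian/Ubuntu ones; other distros may install the headers elsewhere. A hedged sanity check (default `/usr/include` paths assumed) that both development headers are visible before building:

```bash
# Report whether each required header is present (Debian/Ubuntu default paths).
# Prints one line per header, e.g. "liburing.h found" or "xxhash.h missing".
for h in liburing.h xxhash.h; do
  if [ -f "/usr/include/$h" ]; then
    echo "$h found"
  else
    echo "$h missing"
  fi
done
```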

### Build FlexKV
@@ -26,6 +26,10 @@ apt install libxxhash-dev

See [docs/vllm_adapter/README_en.md](docs/vllm_adapter/README_en.md)

### Use FlexKV with TensorRT-LLM

See [docs/trtllm_adaption/README_en.md](docs/trtllm_adaption/README_en.md)

### FlexKV Integration with Dynamo

See [docs/dynamo_integration/README_en.md](docs/dynamo_integration/README_en.md)
@@ -90,7 +94,7 @@ FlexKV performs:

## Roadmap

- **In-Process Cache Engine Integration**: In the dev branch, the implementation, integration, and invocation of the Cache Engine will be further optimized, along with synchronized updates to related APIs.
- **Framework Integration**: Support for vLLM, SGLang, and other acceleration frameworks will be released soon.
- **Distributed Query Support**: Enable scalable, distributed KVCache lookup.
- **Latency Optimization**: Further reduce *get* latency via smarter prefetching and compression.
10 changes: 7 additions & 3 deletions README_zh.md
@@ -12,7 +12,7 @@ FlexKV is released under the **Apache-2.0 License**; for details see [LICENSE](LICE

```bash
apt install liburing-dev
apt install libxxhash-dev
```

### Build FlexKV
@@ -22,10 +22,14 @@ apt install libxxhash-dev
#./build.sh --release for cython package
```

### Using FlexKV with vLLM as an Example
### Using FlexKV in vLLM

See [docs/vllm_adapter/README_zh.md](docs/vllm_adapter/README_zh.md)

### Using FlexKV in TensorRT-LLM

See [docs/trtllm_adaption/README_zh.md](docs/trtllm_adaption/README_zh.md)

### FlexKV Integration with Dynamo

See [docs/dynamo_integration/README_zh.md](docs/dynamo_integration/README_zh.md)
@@ -93,4 +97,4 @@ When FlexKV handles a *get* request:
- **In-Process Cache Engine Integration**: The dev branch will further optimize the implementation, integration, and invocation of the Cache Engine, with synchronized updates to the related APIs
- **Framework Support**: Adapters for vLLM, SGLang, and other mainstream inference frameworks will be released progressively
- **Distributed Query Support**: Enable scalable, distributed KVCache lookup
- **Latency Optimization**: Further reduce *get* latency via prefetching, compression, and other techniques
74 changes: 74 additions & 0 deletions docs/trtllm_adaption/README_en.md
@@ -0,0 +1,74 @@
# Using FlexKV in TensorRT-LLM
## 1. Environment Setup

### 1.1 Install TensorRT-LLM (Tag v1.1.0.rc2)
We are currently working with the community to merge the TensorRT-LLM adaptation code. Until it is merged into the main branch, there are two options:
#### 1.1.1 Method 1
You can apply the patch we provide and recompile:
```bash
cd TensorRT-LLM
git apply FLEXKV_DIR/examples/trtllm_adaption/trtllm_v1.1.0rc2.patch
```
Note: for TensorRT-LLM build instructions, see the [build-from-source guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#build-from-source-linux).

#### 1.1.2 Method 2
You can also install our pre-compiled package:
```bash
pip install https://flexkv-1252113659.cos.ap-shanghai.myqcloud.com/TensorRT-LLM/tensorrt_llm-1.1.0rc2-cp312-cp312-linux_x86_64.whl
```
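The wheel filename encodes its target platform: `cp312` means CPython 3.12 and `linux_x86_64` means 64-bit x86 Linux. A quick pre-check (a `python3` on PATH assumed) that your interpreter matches, so pip does not reject the wheel as unsupported:

```bash
# Print the interpreter's tag components; they should read "cp312 x86_64"
# to match the wheel above.
python3 -c 'import sys, platform; print("cp%d%d" % sys.version_info[:2], platform.machine())'
```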

## 2. Running

### 2.1 Configure FlexKV

First, set the environment variable `TENSORRT_LLM_USE_FLEXKV` to enable FlexKV:
```bash
export TENSORRT_LLM_USE_FLEXKV=1
```

FlexKV can be configured via environment variables or a configuration file. For details, see [`docs/flexkv_config_reference/README_zh.md`](../../docs/flexkv_config_reference/README_zh.md). Below are two simple configuration examples.
##### Example 1: Enable CPU Offloading Only
Use 32GB of CPU memory as a secondary cache.
```bash
unset FLEXKV_CONFIG_PATH
export FLEXKV_CPU_CACHE_GB=32
```
##### Example 2: Enable SSD Offloading
Use 32GB of CPU memory and 1TB of SSD storage as secondary and tertiary caches respectively. (Assuming the machine has two SSDs mounted at /data0 and /data1.)
```bash
# generate config
cat <<EOF > ./flexkv_config.yml
cpu_cache_gb: 32
ssd_cache_gb: 1024
ssd_cache_dir: /data0/flexkv_ssd/;/data1/flexkv_ssd/
enable_gds: false
EOF
export FLEXKV_CONFIG_PATH="./flexkv_config.yml"
```
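Note that `ssd_cache_dir` takes a `;`-separated list of directories, one per SSD. A small sketch (illustrative paths under /tmp, not the real mount points) of how such a list splits and how the directories can be created ahead of time:

```bash
# Split a ';'-separated ssd_cache_dir value and create each directory (bash).
dirs="/tmp/data0/flexkv_ssd/;/tmp/data1/flexkv_ssd/"
IFS=';' read -ra parts <<< "$dirs"   # bash-specific array split on ';'
for d in "${parts[@]}"; do
  mkdir -p "$d"
done
echo "${#parts[@]} cache dirs"   # prints: 2 cache dirs
```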

### 2.2 Launch TensorRT-LLM
#### 2.2.1 Method 1: Using the Provided Example Script
```bash
cd FLEXKV_DIR/examples/trtllm_adaption
bash launch.sh YOUR_MODEL_PATH
```
Note: The `launch.sh` script will launch both TensorRT-LLM and FlexKV, and configure FlexKV through `flexkv_config.json` in the same directory.
#### 2.2.2 Method 2: Custom Launch
After configuring FlexKV according to the instructions in section [2.1](#21-configure-flexkv), add the following content to your `extra-llm-api-config.yml`:
```yaml
kv_cache_config:
enable_partial_reuse: false
kv_connector_config:
connector_module: "flexkv.integration.tensorrt_llm.trtllm_adapter"
connector_scheduler_class: "FlexKVSchedulerConnector"
connector_worker_class: "FlexKVWorkerConnector"
```

### 2.3 Potential TensorRT-LLM Issues
If you send TensorRT-LLM a request longer than `max_seq_len`, you may encounter an error similar to the following:
```
[W] `default_max_tokens` (-40205) should be greater than 0, `default_max_tokens` (-40205) = max_seq_len (40961) - `splited_prompt_len` (81166) - `query_token_len` (0)
[W] User-specified `max_tokens` (16384) is greater than deduced `default_max_tokens` (-40205), using default_max_tokens instead.
[E] submit request failed: [TensorRT-LLM][ERROR] Assertion failed: mMaxNewTokens > 0
```
This is caused by TensorRT-LLM itself not filtering requests longer than `max_seq_len`; it is unrelated to FlexKV. We are working with the community to fix this issue.
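The numbers in the log are consistent with the formula the warning prints: `default_max_tokens = max_seq_len - splited_prompt_len - query_token_len`. Plugging in the values from the log above:

```bash
# Reproduce the deduced default_max_tokens from the warning message.
max_seq_len=40961
splited_prompt_len=81166
query_token_len=0
echo $(( max_seq_len - splited_prompt_len - query_token_len ))   # prints -40205
```

Since the deduced value is negative, the assertion `mMaxNewTokens > 0` fails and the request is rejected.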
74 changes: 74 additions & 0 deletions docs/trtllm_adaption/README_zh.md
@@ -0,0 +1,74 @@
# Using FlexKV in TensorRT-LLM
## 1. Environment Setup

### 1.1 Install TensorRT-LLM (Tag v1.1.0.rc2)
We are currently working with the community to merge the TensorRT-LLM adaptation code. Until it is merged into the main branch, there are two options:
#### 1.1.1 Method 1
You can apply the patch we provide and recompile:
```bash
cd TensorRT-LLM
git apply FLEXKV_DIR/examples/trtllm_adaption/trtllm_v1.1.0rc2.patch
```
Note: for TensorRT-LLM build instructions, see the [build-from-source guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#build-from-source-linux).

#### 1.1.2 Method 2
You can also install our pre-compiled package:
```bash
pip install https://flexkv-1252113659.cos.ap-shanghai.myqcloud.com/TensorRT-LLM/tensorrt_llm-1.1.0rc2-cp312-cp312-linux_x86_64.whl
```

## 2. Running

### 2.1 Configure FlexKV

First, set the environment variable `TENSORRT_LLM_USE_FLEXKV` to enable FlexKV:
```bash
export TENSORRT_LLM_USE_FLEXKV=1
```

FlexKV can be configured via environment variables or a configuration file. For details, see [`docs/flexkv_config_reference/README_zh.md`](../../docs/flexkv_config_reference/README_zh.md). Below are two simple configuration examples.
##### Example 1: Enable CPU Offloading Only
Use 32GB of CPU memory as a secondary cache.
```bash
unset FLEXKV_CONFIG_PATH
export FLEXKV_CPU_CACHE_GB=32
```
##### Example 2: Enable SSD Offloading
Use 32GB of CPU memory and 1TB of SSD storage as secondary and tertiary caches, respectively. (Assuming the machine has two SSDs, mounted at /data0 and /data1.)
```bash
# generate config
cat <<EOF > ./flexkv_config.yml
cpu_cache_gb: 32
ssd_cache_gb: 1024
ssd_cache_dir: /data0/flexkv_ssd/;/data1/flexkv_ssd/
enable_gds: false
EOF
export FLEXKV_CONFIG_PATH="./flexkv_config.yml"
```

### 2.2 Launch TensorRT-LLM
#### 2.2.1 Method 1: Using the Provided Example Script
```bash
cd FLEXKV_DIR/examples/trtllm_adaption
bash launch.sh YOUR_MODEL_PATH
```
Note: the `launch.sh` script launches both TensorRT-LLM and FlexKV, and configures FlexKV through the `flexkv_config.json` in the same directory.
#### 2.2.2 Method 2: Custom Launch
After configuring FlexKV according to the instructions in section [2.1](#21-配置flexkv), add the following content to your `extra-llm-api-config.yml`:
```yaml
kv_cache_config:
enable_partial_reuse: false
kv_connector_config:
connector_module: "flexkv.integration.tensorrt_llm.trtllm_adapter"
connector_scheduler_class: "FlexKVSchedulerConnector"
connector_worker_class: "FlexKVWorkerConnector"
```

### 2.3 Potential TensorRT-LLM Issues
If you send TensorRT-LLM a request longer than `max_seq_len`, you may encounter an error similar to the following:
```
[W] `default_max_tokens` (-40205) should be greater than 0, `default_max_tokens` (-40205) = max_seq_len (40961) - `splited_prompt_len` (81166) - `query_token_len` (0)
[W] User-specified `max_tokens` (16384) is greater than deduced `default_max_tokens` (-40205), using default_max_tokens instead.
[E] submit request failed: [TensorRT-LLM][ERROR] Assertion failed: mMaxNewTokens > 0
```
This is caused by TensorRT-LLM itself not filtering requests longer than `max_seq_len`; it is unrelated to FlexKV. We are working with the community to fix this issue.
7 changes: 4 additions & 3 deletions docs/vllm_adapter/README_en.md
@@ -43,10 +43,11 @@ export FLEXKV_CONFIG_PATH="./flexkv_config.yml"
### Running
We provide an adaptation example based on **vLLM 0.10.1.1**:

1. apply patch
1. apply patch && install
```bash
# FLEXKV_DIR/examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
git apply examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
cd vllm
git apply FLEXKV_DIR/examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
pip install -e . # build and install vllm from source
```

2. offline test
7 changes: 4 additions & 3 deletions docs/vllm_adapter/README_zh.md
@@ -42,10 +42,11 @@ export FLEXKV_CONFIG_PATH="./flexkv_config.yml"
### Running
We provide an adaptation example based on **vLLM 0.10.1.1**:

1. apply patch
1. apply patch && install
```bash
# FLEXKV_DIR/examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
git apply examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
cd vllm
git apply FLEXKV_DIR/examples/vllm_adaption/vllm_0_10_1_1-flexkv-connector.patch
pip install -e . # build and install vllm from source
```

2. offline test
@@ -15,6 +15,6 @@ kv_connector_config:
connector_module: "flexkv.integration.tensorrt_llm.trtllm_adapter"
connector_scheduler_class: "FlexKVSchedulerConnector"
connector_worker_class: "FlexKVWorkerConnector"
speculative_config:
decoding_type: MTP
num_nextn_predict_layers: 3
# speculative_config:
# decoding_type: MTP
# num_nextn_predict_layers: 3
2 changes: 1 addition & 1 deletion examples/trtllm_adaption/launch.sh
@@ -20,4 +20,4 @@ trtllm-serve serve $MODEL_PATH \
--max_seq_len $MAX_SEQ_LEN \
--max_num_tokens $MAX_NUM_TOKENS \
--max_batch_size $BATCH_SIZE \
--extra_llm_api_options extra-llm-api-config-cg.yml 2>&1 | tee logs/$TIMESTAMP.log
--extra_llm_api_options extra-llm-api-config.yml 2>&1 | tee logs/$TIMESTAMP.log