Add patch and doc for trtllm #52
Merged: peaceforeverCN merged 13 commits into taco-project:dev from axxx03:add_patch_and_doc_for_trtllm on Nov 21, 2025.

Commits (13):
- 55160a4 add patch (axxx03)
- cfb8e6d init docs (axxx03)
- aae7056 fin readme (axxx03)
- 490a003 rename yml (axxx03)
- 3ab6ed9 fix readme (axxx03)
- 23121d9 fix readme (axxx03)
- 5a40202 update docs (zhuofan1123)
- 921fa1c fix docs (zhuofan1123)
- 5959688 fix docs (zhuofan1123)
- 4a8f3c4 fix docs (zhuofan1123)
- 46a28f8 add title (zhuofan1123)
- 06fd5c1 add readme_en (zhuofan1123)
- 605fe26 Update docs/trtllm_adaption/README_en.md (zhuofan1123)
**README (English, new file, +74 lines):**

# Using FlexKV in TensorRT-LLM

## 1. Environment Setup
### 1.1 Install TensorRT-LLM (Tag v1.1.0.rc2)

We are currently working with the community to merge the TensorRT-LLM adaptation code. Until it is merged into the main branch, there are two options:

#### 1.1.1 Method 1

You can apply the patch we provide and recompile:

```bash
cd TensorRT-LLM
git apply FLEXKV_DIR/examples/trtllm_adaption/trtllm_v1.1.0rc2.patch
```

Note: For TensorRT-LLM compilation instructions, please refer to [the build-from-source guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#build-from-source-linux).
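Before applying a patch to a real source tree, `git apply --check` dry-runs it so that a version mismatch fails loudly without touching any files. The sketch below demonstrates that workflow on a throwaway temp repo with a toy patch (these paths and the toy diff are illustrative, not the actual TensorRT-LLM tree or the FlexKV patch):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
printf 'hello\n' > greeting.txt
git add greeting.txt
git -c user.email=demo@example.com -c user.name=demo commit -qm init

# A minimal unified diff standing in for trtllm_v1.1.0rc2.patch
cat <<'EOF' > fix.patch
diff --git a/greeting.txt b/greeting.txt
--- a/greeting.txt
+++ b/greeting.txt
@@ -1 +1 @@
-hello
+hello, patched
EOF

git apply --check fix.patch   # dry run: exits non-zero if the patch does not apply cleanly
git apply fix.patch           # actually modify the working tree
grep patched greeting.txt
```

If `--check` fails against TensorRT-LLM, the usual cause is a checkout that is not at the expected tag (v1.1.0.rc2 here).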
#### 1.1.2 Method 2

Alternatively, you can install our pre-compiled package:

```bash
pip install https://flexkv-1252113659.cos.ap-shanghai.myqcloud.com/TensorRT-LLM/tensorrt_llm-1.1.0rc2-cp312-cp312-linux_x86_64.whl
```
## 2. Running

### 2.1 Configure FlexKV

First, set the environment variable `TENSORRT_LLM_USE_FLEXKV` to enable FlexKV:

```bash
export TENSORRT_LLM_USE_FLEXKV=1
```

FlexKV can be configured through environment variables or a configuration file. For details, please refer to [`docs/flexkv_config_reference/README_zh.md`](../../docs/flexkv_config_reference/README_zh.md). Two simple configuration examples follow.
##### Example 1: Enable CPU Offloading Only

Use 32 GB of CPU memory as a secondary cache:

```bash
unset FLEXKV_CONFIG_PATH
export FLEXKV_CPU_CACHE_GB=32
```
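The `unset FLEXKV_CONFIG_PATH` step above implies a precedence rule: a config-file path wins when set, and the individual `FLEXKV_*` variables apply otherwise. A minimal sketch of that resolution order (the variable names come from this README, but the resolution function itself is our assumption, not FlexKV's actual loader):

```python
def resolve_flexkv_config(env):
    """Sketch: prefer FLEXKV_CONFIG_PATH when set, otherwise fall back
    to individual FLEXKV_* variables. Illustrative only -- the real
    FlexKV configuration loader may behave differently."""
    path = env.get("FLEXKV_CONFIG_PATH")
    if path:
        return {"source": "file", "path": path}
    cfg = {"source": "env"}
    if "FLEXKV_CPU_CACHE_GB" in env:
        cfg["cpu_cache_gb"] = int(env["FLEXKV_CPU_CACHE_GB"])
    return cfg

# Example 1: no config file, so the 32 GB CPU cache comes from the environment
print(resolve_flexkv_config({"FLEXKV_CPU_CACHE_GB": "32"}))
```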
##### Example 2: Enable SSD Offloading

Use 32 GB of CPU memory and 1 TB of SSD storage as the secondary and tertiary caches, respectively. (This assumes the machine has two SSDs, mounted at /data0 and /data1.)

```bash
# generate config
cat <<EOF > ./flexkv_config.yml
cpu_cache_gb: 32
ssd_cache_gb: 1024
ssd_cache_dir: /data0/flexkv_ssd/;/data1/flexkv_ssd/
enable_gds: false
EOF
export FLEXKV_CONFIG_PATH="./flexkv_config.yml"
```
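The config above is a flat `key: value` file, with `ssd_cache_dir` holding a semicolon-separated list (one directory per SSD). A quick stdlib-only sanity check of such a file, useful before pointing `FLEXKV_CONFIG_PATH` at it (the parser is ours, for illustration, and is not part of FlexKV):

```python
CONFIG = """\
cpu_cache_gb: 32
ssd_cache_gb: 1024
ssd_cache_dir: /data0/flexkv_ssd/;/data1/flexkv_ssd/
enable_gds: false
"""

def parse_flexkv_config(text):
    """Parse a flat key: value config; split ssd_cache_dir on ';'."""
    cfg = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        cfg[key.strip()] = value.strip()
    cfg["ssd_dirs"] = cfg["ssd_cache_dir"].split(";")
    return cfg

cfg = parse_flexkv_config(CONFIG)
assert len(cfg["ssd_dirs"]) == 2          # one entry per SSD mount
assert int(cfg["ssd_cache_gb"]) == 1024   # 1 TB total across both SSDs
```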
### 2.2 Launch TensorRT-LLM

#### 2.2.1 Method 1: Using Our Provided Example Script

```bash
cd FLEXKV_DIR/examples/trtllm_adaption
bash launch.sh YOUR_MODEL_PATH
```

Note: The `launch.sh` script launches both TensorRT-LLM and FlexKV, and configures FlexKV through the `flexkv_config.json` in the same directory.
#### 2.2.2 Method 2: Custom Launch

After configuring FlexKV as described in section [2.1](#21-configure-flexkv), add the following to your `extra-llm-api-config.yml`:

```yaml
kv_cache_config:
  enable_partial_reuse: false
kv_connector_config:
  connector_module: "flexkv.integration.tensorrt_llm.trtllm_adapter"
  connector_scheduler_class: "FlexKVSchedulerConnector"
  connector_worker_class: "FlexKVWorkerConnector"
```
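The `connector_module` / `connector_*_class` pairs above follow a common dynamic-loading pattern: import the module by its dotted path, then look the class up by name. A generic sketch of that pattern, using a stdlib class as a stand-in since the patched `tensorrt_llm`/`flexkv` packages may not be installed here (this is the typical mechanism, not TensorRT-LLM's verbatim loading code):

```python
import importlib

def load_connector(module_path, class_name):
    """Resolve a dotted module path plus class name to a class object,
    as frameworks typically resolve connector config entries."""
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# Stand-in for ("flexkv.integration.tensorrt_llm.trtllm_adapter",
# "FlexKVSchedulerConnector"), which requires the patched install:
cls = load_connector("json.decoder", "JSONDecoder")
print(cls.__name__)  # JSONDecoder
```

A typo in either string surfaces as `ModuleNotFoundError` or `AttributeError` at startup, which is the first thing to check if the connector fails to load.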
### 2.3 Potential TensorRT-LLM Issues

If you send TensorRT-LLM a request longer than `max_seq_len`, you may encounter an error similar to the following:

```
[W] `default_max_tokens` (-40205) should be greater than 0, `default_max_tokens` (-40205) = max_seq_len (40961) - `splited_prompt_len` (81166) - `query_token_len` (0)
[W] User-specified `max_tokens` (16384) is greater than deduced `default_max_tokens` (-40205), using default_max_tokens instead.
[E] submit request failed: [TensorRT-LLM][ERROR] Assertion failed: mMaxNewTokens > 0
```

This happens because the TensorRT-LLM framework itself does not reject requests that exceed `max_seq_len`; it is unrelated to FlexKV. We are currently working with the community to fix this issue.
**README (Chinese, new file, +74 lines):** the Chinese-language version of the same document; translated, its sections, commands, and configuration examples match the English README above one for one.