Merged
48 changes: 48 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,54 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [1.2.0] - 2025-11-25
### Feature
Universal:
- Add support for distributed KV cache sharing, enabling KV cache sharing across CPU and SSD as well as distributed sharing via PCFS ([#17](https://github.com/taco-project/FlexKV/pull/17))
- Add GDS (GPU Direct Storage) Support ([#25](https://github.com/taco-project/FlexKV/pull/25))
- TP16 support ([#26](https://github.com/taco-project/FlexKV/pull/26))
- Support more KV cache layouts, now including vLLM, SGLang, and TensorRT-LLM ([#27](https://github.com/taco-project/FlexKV/pull/27))
- GDS refactor & gtensor support ([#42](https://github.com/taco-project/FlexKV/pull/42))
- Support constructing a TensorSharedHandle directly from a CUDA IPC handle ([#44](https://github.com/taco-project/FlexKV/pull/44))


Targeting vLLM:
- Support dp > 1 when integrated with vLLM ([#18](https://github.com/taco-project/FlexKV/pull/18))
- Add launch scripts for the vLLM adaptation ([#47](https://github.com/taco-project/FlexKV/pull/47))
- Support TP16 for vLLM+FlexKV ([#59](https://github.com/taco-project/FlexKV/pull/59))

Targeting TensorRT-LLM:
- Support using FlexKV with TensorRT-LLM ([#48](https://github.com/taco-project/FlexKV/pull/48))
- Support TP16 for TensorRT-LLM+FlexKV ([#53](https://github.com/taco-project/FlexKV/pull/53))

### Optimization
- MLA D2H transfer optimization ([#19](https://github.com/taco-project/FlexKV/pull/19))
- Optimize SSD I/O ([#33](https://github.com/taco-project/FlexKV/pull/33))
- Enhance cache eviction with frequency-aware grace time mechanism ([#38](https://github.com/taco-project/FlexKV/pull/38))
- Replace std::map with std::unordered_map in RadixTree ([#41](https://github.com/taco-project/FlexKV/pull/41))

### Bugfix
- Fix wrong head number for DeepSeek for vllm integration ([#23](https://github.com/taco-project/FlexKV/pull/23))
- Fix an error triggered when the CPU match length exceeds the SSD match length during put ([#24](https://github.com/taco-project/FlexKV/pull/24))
- Fix benchmark_worker ([#31](https://github.com/taco-project/FlexKV/pull/31))
- Fix segfault caused by radix tree array out-of-bounds access ([#39](https://github.com/taco-project/FlexKV/pull/39))
- Fix cache_info ([#40](https://github.com/taco-project/FlexKV/pull/40))
- Fix port for GPU registration ([#45](https://github.com/taco-project/FlexKV/pull/45))
- Fix SSD allocator ([#46](https://github.com/taco-project/FlexKV/pull/46))
- Fix num_kv_heads initialization bug in the vLLM integration ([#67](https://github.com/taco-project/FlexKV/pull/67))
- Fix model_config for non-MLA models ([#68](https://github.com/taco-project/FlexKV/pull/68))

### Misc
- Add documentation for FlexKV + TensorRT-LLM ([#52](https://github.com/taco-project/FlexKV/pull/52))
- Simplify user configuration ([#37](https://github.com/taco-project/FlexKV/pull/37)), plus other minor config updates ([#43](https://github.com/taco-project/FlexKV/pull/43))

## [1.1.0] - 2025-09-15
- Add op-level callback for local get/put [#13](https://github.com/taco-project/FlexKV/pull/13)
- Add documentation for FlexKV + Dynamo ([#14](https://github.com/taco-project/FlexKV/pull/14)) and flexkv_config.json ([#15](https://github.com/taco-project/FlexKV/pull/15))

## [1.0.0] - 2025-09-11

### Added
44 changes: 39 additions & 5 deletions README.md
@@ -6,6 +6,36 @@ FlexKV is a distributed KV store and multi-level cache management system develop

FlexKV is released under the **Apache-2.0 License**. See the [LICENSE](LICENSE) file for details.


## Main Changes in the Latest Version
### Feature
Universal:
- Add op-level callback for local get/put [#13](https://github.com/taco-project/FlexKV/pull/13)
- Add support for distributed KV cache sharing, enabling KV cache sharing across CPU and SSD as well as distributed sharing via PCFS ([#17](https://github.com/taco-project/FlexKV/pull/17))
- Add GDS (GPU Direct Storage) Support ([#25](https://github.com/taco-project/FlexKV/pull/25))
- TP16 support ([#26](https://github.com/taco-project/FlexKV/pull/26))
- Support more KV cache layouts, now including vLLM, SGLang, and TensorRT-LLM ([#27](https://github.com/taco-project/FlexKV/pull/27))
- GDS refactor & gtensor support ([#42](https://github.com/taco-project/FlexKV/pull/42))
- Support constructing a TensorSharedHandle directly from a CUDA IPC handle ([#44](https://github.com/taco-project/FlexKV/pull/44))


Targeting vLLM:
- Support dp > 1 when integrated with vLLM ([#18](https://github.com/taco-project/FlexKV/pull/18))
- Add launch scripts for the vLLM adaptation ([#47](https://github.com/taco-project/FlexKV/pull/47))
- Support TP16 for vLLM+FlexKV ([#59](https://github.com/taco-project/FlexKV/pull/59))

Targeting TensorRT-LLM:
- Support using FlexKV with TensorRT-LLM ([#48](https://github.com/taco-project/FlexKV/pull/48))
- Support TP16 for TensorRT-LLM+FlexKV ([#53](https://github.com/taco-project/FlexKV/pull/53))

### Optimization
- MLA D2H transfer optimization ([#19](https://github.com/taco-project/FlexKV/pull/19))
- Optimize SSD I/O ([#33](https://github.com/taco-project/FlexKV/pull/33))
- Enhance cache eviction with frequency-aware grace time mechanism ([#38](https://github.com/taco-project/FlexKV/pull/38))
- Replace std::map with std::unordered_map in RadixTree ([#41](https://github.com/taco-project/FlexKV/pull/41))

For more details, see the [CHANGELOG](CHANGELOG.md).

## How to Use

### Install Dependencies
@@ -86,11 +116,15 @@ FlexKV performs:
- *get* requests can be called asynchronously; the time for matching and data transfer can overlap with prior computation through prefetching.
- *put* requests can be called asynchronously; the time to copy data from GPU to CPU memory can overlap with subsequent computation. Data transfers between CPU memory, SSD, and scalable storage are fully handled asynchronously by the TransferEngine and transparent to the main process.
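The overlap described above can be sketched with a toy example. All names here are illustrative stand-ins, not the actual FlexKV API; the point is only that an asynchronous put lets the D2H copy run concurrently with subsequent computation:

```python
# Sketch: an asynchronous "put" whose GPU-to-CPU copy overlaps with
# subsequent computation. Names are hypothetical, not the FlexKV API.
import time
from concurrent.futures import ThreadPoolExecutor

def put_kv_blocks(blocks):
    """Stand-in for copying KV blocks from GPU to CPU memory."""
    time.sleep(0.05)  # simulated D2H transfer time
    return len(blocks)

executor = ThreadPoolExecutor(max_workers=1)

start = time.perf_counter()
future = executor.submit(put_kv_blocks, ["blk0", "blk1"])  # async put
time.sleep(0.05)          # subsequent computation overlaps the transfer
copied = future.result()  # block only when the result is actually needed
elapsed = time.perf_counter() - start
executor.shutdown()

print(copied)  # 2
# Both 0.05s phases ran concurrently, so total time is ~0.05s, not 0.10s.
```

The same pattern applies to prefetching on the *get* path: issue the match-and-transfer early, then collect the result just before the data is consumed.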

## Branch
- The main branch is the stable branch, which maintains already tested commits. Please pull from main branch if you need stable code.
- The dev branch is the development branch, which contains newer features. Please branch from and merge into dev if you need new features or are developing new functionality.
- The bugfix branch is for bug fixes, maintaining urgent bugs that need immediate resolution or documentation that requires prompt updates. If you need to fix a bug or update documentation urgently, please branch from and merge into the bugfix branch.
- The stable branch refers to the previous main branch state, intended only for rollback or extremely conservative use cases (e.g., production deployment). Its use is discouraged.
## Branching Strategy

The branch management strategy of this project is as follows:

- **`main` branch**: The main development branch that contains the latest features and changes. All pull requests are merged directly into `main` to ensure rapid iteration and continuous integration.

- **`release-*` branches**: When `main` reaches a stable state, we create dedicated release branches (e.g., `release-1.0`, `release-1.1`) to provide stable, production-ready versions for users.

Note: Critical fixes discovered in released versions are applied directly to the corresponding `release-*` branch and then backported to `main` to maintain consistency across all active branches.
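As a concrete illustration, the fix-then-backport flow in the note above might look like the following, scripted with plain git commands. The branch name, commit messages, and use of empty commits are hypothetical, purely to show the cherry-pick direction:

```python
# Sketch of the release-branch fix + backport flow (hypothetical names).
# Creates a throwaway repo, lands a fix on release-1.2, backports to main.
import os
import subprocess
import tempfile

def git(*args):
    """Run a git command with a fixed CI identity and return its stdout."""
    cmd = ["git", "-c", "user.name=ci", "-c", "user.email=ci@example.com", *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

repo = tempfile.mkdtemp()
os.chdir(repo)
git("init", "-q", "-b", "main")
git("commit", "-q", "--allow-empty", "-m", "initial")
git("branch", "release-1.2")

# 1. Land the critical fix on the release branch first.
git("checkout", "-q", "release-1.2")
git("commit", "-q", "--allow-empty", "-m", "fix: critical cache bug")
fix_sha = git("rev-parse", "HEAD").strip()

# 2. Backport the same commit to main to keep branches consistent.
git("checkout", "-q", "main")
git("cherry-pick", "-x", "--allow-empty", fix_sha)
print(git("log", "--oneline", "-1"))  # the backported fix, now on main
```

The `-x` flag records the original commit hash in the backported commit message, which makes it easy to audit which release fixes have already landed on `main`.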

## Roadmap

44 changes: 38 additions & 6 deletions README_zh.md
@@ -6,6 +6,34 @@ FlexKV is a distributed KV store and multi-level cache management system developed jointly by the Tencent Cloud TACO team and the community for ultra-large-scale LLM

FlexKV 采用 **Apache-2.0 开源协议**,详细信息请参见 [LICENSE](LICENSE) 文件。

## Main Changes in the Latest Version
### Feature
Universal:
- Add op-level callback for local get/put [#13](https://github.com/taco-project/FlexKV/pull/13)
- Add support for distributed KV cache sharing, enabling KV cache sharing across CPU and SSD as well as distributed sharing via PCFS ([#17](https://github.com/taco-project/FlexKV/pull/17))
- Add GDS (GPU Direct Storage) support ([#25](https://github.com/taco-project/FlexKV/pull/25))
- TP16 support ([#26](https://github.com/taco-project/FlexKV/pull/26))
- Support more KV cache layouts, now including vLLM, SGLang, and TensorRT-LLM ([#27](https://github.com/taco-project/FlexKV/pull/27))
- GDS refactor and gtensor support ([#42](https://github.com/taco-project/FlexKV/pull/42))
- Support constructing a TensorSharedHandle directly from a CUDA IPC handle ([#44](https://github.com/taco-project/FlexKV/pull/44))


Targeting vLLM:
- Support dp > 1 when integrated with vLLM ([#18](https://github.com/taco-project/FlexKV/pull/18))
- Add launch scripts for the vLLM adaptation ([#47](https://github.com/taco-project/FlexKV/pull/47))
- Support TP16 for vLLM+FlexKV ([#59](https://github.com/taco-project/FlexKV/pull/59))

Targeting TensorRT-LLM:
- Support using FlexKV with TensorRT-LLM ([#48](https://github.com/taco-project/FlexKV/pull/48))
- Support TP16 for TensorRT-LLM+FlexKV ([#53](https://github.com/taco-project/FlexKV/pull/53))

### Optimization
- MLA D2H transfer optimization ([#19](https://github.com/taco-project/FlexKV/pull/19))
- Optimize SSD I/O ([#33](https://github.com/taco-project/FlexKV/pull/33))
- Enhance cache eviction with a frequency-aware grace-time mechanism ([#38](https://github.com/taco-project/FlexKV/pull/38))
- Replace std::map with std::unordered_map in RadixTree ([#41](https://github.com/taco-project/FlexKV/pull/41))

For more details, see the [CHANGELOG](CHANGELOG.md).

## How to Use

### Install Dependencies
@@ -86,15 +114,19 @@ When FlexKV handles a *get* request:
- *get* requests can be called asynchronously; matching and transfer time can overlap with prior computation through prefetching.
- *put* requests can be called asynchronously; the copy from GPU to CPU memory can overlap with subsequent computation. Transfers between CPU memory, SSD, and scalable storage are executed later entirely by the TransferEngine, transparently to the main process.

## Branch
- main is the stable branch, maintaining commits that have already been tested. Pull from this branch if you need stable code.
- dev is the development branch, maintaining newer features. Branch from and merge into it if you need new features or are developing them.
- bugfix is the bug-fix branch, maintaining bugs that need immediate resolution and documentation that needs prompt updates. Branch from and merge into it for such fixes.
- stable marks the previous position of the main branch, intended only for rollback and extremely conservative use cases (e.g., production deployment). Its use is discouraged.
## Branching Strategy

The branch management strategy of this project is as follows:

- **`main` branch**: The main development branch that contains the latest features and changes. All pull requests are merged directly into `main` to ensure rapid iteration and continuous integration.

- **`release-*` branches**: When `main` reaches a stable state, we create dedicated release branches (e.g., `release-1.0`, `release-1.1`) to provide stable, production-ready versions for users.

Note: Critical fixes discovered in released versions are applied directly to the corresponding `release-*` branch and then backported to `main` to maintain consistency across all active branches.

## Roadmap

- **Cache engine co-process integration**: the dev branch will further optimize the Cache Engine's implementation, integration, and invocation, with the related API support updated in sync
- **Acceleration framework support**: adaptations for mainstream inference frameworks such as vLLM and SGLang will be released progressively
- **Distributed query support**: scalable distributed KV cache query capability
- **Latency optimization**: further reduce *get* request latency via prefetching, compression, and other techniques
4 changes: 1 addition & 3 deletions examples/trtllm_adaption/launch.sh
@@ -1,5 +1,3 @@
mkdir -p logs
TIMESTAMP=$(date +%Y.%m.%d-%H:%M:%S)
MODEL_PATH=${1:-YOUR_MODEL_PATH}

BATCH_SIZE=4
@@ -20,4 +18,4 @@ trtllm-serve serve $MODEL_PATH \
--max_seq_len $MAX_SEQ_LEN \
--max_num_tokens $MAX_NUM_TOKENS \
--max_batch_size $BATCH_SIZE \
--extra_llm_api_options extra-llm-api-config.yml 2>&1 | tee logs/$TIMESTAMP.log
--extra_llm_api_options extra-llm-api-config.yml
54 changes: 54 additions & 0 deletions examples/trtllm_adaption/multi_node_launch.sh
@@ -0,0 +1,54 @@
BATCH_SIZE=4
TP_SIZE=16
EP_SIZE=$TP_SIZE
MAX_SEQ_LEN=155648
MAX_NUM_TOKENS=16384
# MAX_SEQ_LEN=8192
# MAX_NUM_TOKENS=8192
HOSTFILE=YOUR_HOSTFILE
MODEL_PATH=${1:-YOUR_MODEL_PATH}

export FLEXKV_CONFIG_PATH=$(realpath "./flexkv_config.json")
export TENSORRT_LLM_USE_FLEXKV=1
export FLEXKV_MASTER_HOST="172.16.0.30"
export FLEXKV_MASTER_PORTS="5556,5557,5558"
export FLEXKV_TRT_SUBPROCESS_HOST="172.16.0.30"
export FLEXKV_TRT_SUBPROCESS_PORTS="6667,6668,6669"
export TLLM_LOG_FIRST_RANK_ONLY=0

mpirun -np 16 \
--hostfile $HOSTFILE \
-mca plm_rsh_args "-p 9898" \
-mca btl tcp,self \
-mca btl_tcp_if_include eth0 \
-x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-x GLOO_SOCKET_IFNAME=eth0 \
-x NCCL_DEBUG=INFO \
-x NCCL_IBEXT_DISABLE=0 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_NET_GDR_LEVEL=2 \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
-x NCCL_IB_TC=160 \
-x NCCL_IB_TIMEOUT=22 \
-x NCCL_SOCKET_IFNAME=eth0 \
-x OMPI_MCA_btl=tcp,self \
-x OMPI_MCA_btl_tcp_if_include=eth0 \
-x FLEXKV_CONFIG_PATH \
-x TENSORRT_LLM_USE_FLEXKV \
-x FLEXKV_MASTER_HOST \
-x FLEXKV_MASTER_PORTS \
-x TLLM_LOG_FIRST_RANK_ONLY \
-x FLEXKV_TRT_SUBPROCESS_HOST \
-x FLEXKV_TRT_SUBPROCESS_PORTS \
--allow-run-as-root \
trtllm-llmapi-launch trtllm-serve $MODEL_PATH \
--host 0.0.0.0 \
--port 6000 \
--backend pytorch \
--tp_size $TP_SIZE \
--ep_size $EP_SIZE \
--max_seq_len $MAX_SEQ_LEN \
--max_num_tokens $MAX_NUM_TOKENS \
--max_batch_size $BATCH_SIZE \
--extra_llm_api_options extra-llm-api-config.yml
32 changes: 21 additions & 11 deletions flexkv/integration/config.py
@@ -8,7 +8,6 @@

from flexkv.common.debug import flexkv_logger
from flexkv.common.config import *
from transformers import AutoConfig as HFAutoConfig

if TYPE_CHECKING:
from vllm.v1.kv_cache_interface import KVCacheConfig, FullAttentionSpec
@@ -65,12 +64,15 @@ def post_init_from_vllm_config(
self.cache_config.tokens_per_block = vllm_config.cache_config.block_size

self.model_config.num_layers = vllm_config.model_config.get_num_layers(vllm_config.parallel_config)
self.model_config.num_kv_heads = vllm_config.model_config.get_num_kv_heads(vllm_config.parallel_config)
self.model_config.head_size = vllm_config.model_config.get_head_size()
self.model_config.dtype = vllm_config.model_config.dtype
self.model_config.use_mla = vllm_config.model_config.is_deepseek_mla
self.model_config.tp_size = vllm_config.parallel_config.tensor_parallel_size
self.model_config.dp_size = vllm_config.parallel_config.data_parallel_size
if self.model_config.use_mla:
self.model_config.num_kv_heads = 1
else:
self.model_config.num_kv_heads = vllm_config.model_config.get_total_num_kv_heads()

self.__post_init__()

@@ -119,9 +121,6 @@ def post_init_from_sglang_config(
def post_init_from_trt_config(
self,
config,
tp_size: int,
dp_size: int,
dp_rank: int,
):
self.cache_config.tokens_per_block = config.tokens_per_block
# Convert dtype string to torch.dtype
@@ -141,13 +140,19 @@ self.model_config.dtype = dtype_map.get(dtype_str, torch.bfloat16)
self.model_config.dtype = dtype_map.get(dtype_str, torch.bfloat16)
else:
self.model_config.dtype = dtype_str

self.model_config.tp_size = tp_size
self.model_config.dp_size = dp_size
self.model_config.dp_rank = dp_rank

# Set model config (parallel configs part)
if config.mapping.enable_attention_dp:
self.model_config.tp_size = 1
self.model_config.dp_size = config.mapping.tp_size
else:
self.model_config.tp_size = config.mapping.tp_size
self.model_config.dp_size = 1

# self.model_config (model configs part)
try:
model_path = getattr(config, 'hf_model_dir', None)
from transformers import AutoConfig as HFAutoConfig
hf_config = HFAutoConfig.from_pretrained(
str(model_path),
trust_remote_code=True
@@ -161,8 +166,13 @@ def post_init_from_trt_config(
self.model_config.head_size = hf_config.kv_lora_rank + hf_config.qk_rope_head_dim
self.model_config.num_kv_heads = 1
else:
self.model_config.head_size = hf_config.hidden_size // hf_config.num_key_value_heads // self.model_config.tp_size
self.model_config.num_kv_heads = hf_config.num_key_value_heads
if hasattr(hf_config, 'num_key_value_heads'):
assert hf_config.num_attention_heads != hf_config.num_key_value_heads, f"{hf_config.num_attention_heads=}, {hf_config.num_key_value_heads=}"
self.model_config.head_size = hf_config.head_dim
self.model_config.num_kv_heads = hf_config.num_key_value_heads
else:
self.model_config.head_size = hf_config.hidden_size // hf_config.num_attention_heads
self.model_config.num_kv_heads = hf_config.num_attention_heads

except Exception as e:
flexkv_logger.error(f"Failed to load config from {model_path}: {e}")