30 changes: 30 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,30 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [1.0.0] - 2025-09-11

### Added
- C++ radix tree index for fast prefix matching; set `"index_accel": true` in `cache_config` to enable it
- Synchronous kernel launch
- Major change: the cache engine is now a library that accelerators (e.g., vLLM) use directly, replacing the server-client mode.
  This speeds up `get` and `put` when no KVCache is matched. This version includes breaking API changes and is not backward compatible.
- `evict_ratio` option; set it via `"evict_ratio": 0.05` in `cache_config`
- Reduced the bubble inside kernel launches
- vLLM 0.10.1.1 adapter
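The two new `cache_config` switches above can be sketched together. The snippet below is illustrative only: `index_accel` and `evict_ratio` come from this changelog, while the surrounding key (`tokens_per_block`) mirrors `benchmarks/example_config.json` and may differ in your version:

```python
# Hypothetical cache_config fragment. Only "index_accel" and "evict_ratio"
# are documented in this changelog; the other keys are illustrative.
cache_config = {
    "tokens_per_block": 16,   # mirrors benchmarks/example_config.json
    "index_accel": True,      # enable the C++ radix tree for fast matching
    "evict_ratio": 0.05,      # fraction of cached blocks to evict at a time
}

# Sanity-check the example values.
assert isinstance(cache_config["index_accel"], bool)
assert 0.0 < cache_config["evict_ratio"] < 1.0
```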

### Fixed
- Cython release package build


## [0.1.0] - 2025-08-29

### Added
- Initial version
- License

13 changes: 13 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,13 @@
# Contributing to FlexKV

Thank you for your interest in contributing to FlexKV!

## PR Title and Classification
Use a prefixed PR title to indicate the type of changes. Please use one of the following:

- `[bugfix]` for bugfixes
- `[feature]` for new features
- `[test]` for test cases
- `[ci/build]` for build or continuous integration improvements
- `[doc]` for documentation fixes
- `[misc]` for PRs that do not fit the above categories. Please use this sparingly.
27 changes: 13 additions & 14 deletions README.md
@@ -8,29 +8,27 @@ FlexKV is released under the **Apache-2.0 License**. See the [LICENSE](LICENSE

## How to Use

### Build FlexKV
### Install Dependencies

```bash
./build.sh
apt install liburing-dev
apt install libxxhash-dev
```

### Use FlexKV with vLLM (v0.8.4)

Apply the patch `examples/vllm_adaption/flexkv_vllm_0_8_4.patch` to vLLM 0.8.4, then start FlexKV, vLLM, and the benchmark script:
### Build FlexKV

```bash
# Start FlexKV as server
bash benchmarks/flexkv_benchmark/run_flexkv_server.sh
./build.sh
# ./build.sh --release to build the Cython release package
```

# Start vLLM as client
bash benchmarks/flexkv_benchmark/serving_vllm.sh
### Use FlexKV with vLLM

# Start benchmark
bash benchmarks/flexkv_benchmark/multiturn_benchmark.sh
```
Apply the patch `examples/vllm_adaption/flexkv_vllm_0_10_0.patch` to vLLM 0.10.0, then test with the same method as above.
See [docs/vllm_adapter/README_en.md](docs/vllm_adapter/README_en.md) for details.

### FlexKV Integration with Dynamo

> **Note**: The current script is only compatible with the `main` branch. Support for the latest features in the `dev` branch is under development.
See [docs/dynamo_integration/README_en.md](docs/dynamo_integration/README_en.md)

## Design Architecture

@@ -88,6 +86,7 @@ FlexKV performs:
- The main branch is the stable branch, which maintains already tested commits. Please pull from main branch if you need stable code.
- The dev branch is the development branch, which contains newer features. Please branch from and merge into dev if you need new features or are developing new functionality.
- The bugfix branch is for bug fixes, maintaining urgent bugs that need immediate resolution or documentation that requires prompt updates. If you need to fix a bug or update documentation urgently, please branch from and merge into the bugfix branch.
- The stable branch refers to the previous main branch state, intended only for rollback or extremely conservative use cases (e.g., production deployment). Its use is discouraged.

## Roadmap

25 changes: 12 additions & 13 deletions README_zh.md
@@ -8,29 +8,27 @@ FlexKV is released under the **Apache-2.0 License**; for details see [LICENSE](LICE

## How to Use

### Install Dependencies

```bash
apt install liburing-dev
apt install libxxhash-dev
```

### Build FlexKV

```bash
./build.sh
# ./build.sh --release to build the Cython release package
```

### Use FlexKV with vLLM

Apply the patch `examples/vllm_adaption/flexkv_vllm_0_8_4.patch` to vLLM 0.8.4, then start FlexKV, vLLM, and the benchmark script:
See [docs/vllm_adapter/README_zh.md](docs/vllm_adapter/README_zh.md)

```bash
# Start FlexKV as the server
bash benchmarks/flexkv_benchmark/run_flexkv_server.sh

# Start vLLM as the client
bash benchmarks/flexkv_benchmark/serving_vllm.sh

# Run the benchmark
bash benchmarks/flexkv_benchmark/multiturn_benchmark.sh
```
Apply the patch `examples/vllm_adaption/flexkv_vllm_0_10_0.patch` to vLLM 0.10.0; the testing method is the same as above.
### FlexKV Integration with Dynamo

> **Note**: The current script is only compatible with the `main` branch. Support for the latest features of the `dev` branch is under development.
See [docs/dynamo_integration/README_zh.md](docs/dynamo_integration/README_zh.md)

## Design Architecture

@@ -88,6 +86,7 @@ When handling a *get* request, FlexKV:
- main is the stable branch, maintaining commits that have already been tested. Pull from this branch if you need stable code.
- dev is the development branch, maintaining newer features. Branch from and merge into dev if you need new features or are developing new functionality.
- bugfix is the bug-fix branch, maintaining urgent bugs that need immediate resolution or documentation that requires prompt updates. If you need to fix a bug or update documentation urgently, branch from and merge into bugfix.
- stable marks the previous position of the main branch, intended only for rollback and extremely conservative use cases (e.g., productization). Its use is discouraged.

## Roadmap

1 change: 1 addition & 0 deletions VERSION
@@ -0,0 +1 @@
1.0.0
1 change: 0 additions & 1 deletion benchmarks/example_config.json
@@ -14,7 +14,6 @@
"enable_remote": false,
"tokens_per_block": 16,
"use_gds": false,
"use_pinned_memory": true,
"gpu_kv_layout_type": "LAYERWISE",
"cpu_kv_layout_type": "BLOCKWISE",
"ssd_kv_layout_type": "BLOCKWISE",
43 changes: 30 additions & 13 deletions csrc/bindings.cpp
@@ -29,6 +29,7 @@

namespace py = pybind11;

#ifdef CUDA_AVAILABLE
void transfer_kv_blocks_binding(
torch::Tensor &gpu_block_id_tensor, torch::Tensor &gpu_layer_ptrs_tensor,
int64_t gpu_kv_stride_in_bytes, int64_t gpu_block_stride_in_bytes,
@@ -60,7 +61,9 @@ void transfer_kv_blocks_binding(
throw std::runtime_error(cudaGetErrorString(err));
}
}
#endif

#ifdef CUDA_AVAILABLE
void transfer_kv_blocks_ssd_binding(
flexkv::SSDIOCTX &ioctx,
const torch::Tensor &cpu_layer_id_list, int64_t cpu_tensor_ptr,
@@ -82,6 +85,7 @@ void transfer_kv_blocks_ssd_binding(
block_stride_in_bytes, is_read, num_blocks_per_file, round_robin,
num_threads_per_device, is_mla);
}
#endif
#ifdef FLEXKV_ENABLE_CFS
void transfer_kv_blocks_remote(
const py::list &file_nodeid_list, const torch::Tensor &cpu_layer_id_list,
@@ -162,6 +166,7 @@ void shared_transfer_kv_blocks_remote_read_binding(
#endif

PYBIND11_MODULE(c_ext, m) {
#ifdef CUDA_AVAILABLE
m.def("transfer_kv_blocks", &transfer_kv_blocks_binding,
"Transfer multi-layer KV-cache between CPU and GPU");
m.def("transfer_kv_blocks_ssd", &transfer_kv_blocks_ssd_binding,
@@ -174,6 +179,7 @@ PYBIND11_MODULE(c_ext, m) {
py::arg("block_stride_in_bytes"), py::arg("is_read"),
py::arg("num_blocks_per_file"), py::arg("round_robin") = 1,
py::arg("num_threads_per_device") = 16, py::arg("is_mla") = false);
#endif
#ifdef FLEXKV_ENABLE_CFS
m.def("transfer_kv_blocks_remote", &transfer_kv_blocks_remote,
"Transfer KV blocks between remote and CPU memory",
@@ -249,6 +255,7 @@ PYBIND11_MODULE(c_ext, m) {
m.def("call_pcfs_write", &flexkv::call_pcfs_write,
"Call Pcfs::write from C++", py::arg("file_nodeid"), py::arg("offset"),
py::arg("buffer"), py::arg("size"), py::arg("thread_id"));
#ifdef CUDA_AVAILABLE
m.def("shared_transfer_kv_blocks_remote_read",
&shared_transfer_kv_blocks_remote_read_binding,
"Shared transfer KV blocks from remote PCFS to CPU memory",
@@ -266,6 +273,7 @@
py::arg("total_layers"),
py::arg("is_mla") = false,
py::arg("num_threads_per_file") = 8);
#endif
#endif

py::class_<flexkv::CRadixTreeIndex>(m, "CRadixTreeIndex")
@@ -297,7 +305,7 @@ PYBIND11_MODULE(c_ext, m) {
.def("has_block_node_ids", &flexkv::CRadixNode::has_block_node_ids);
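The `CRadixTreeIndex` bindings above expose prefix matching over hashed token blocks. As a rough, self-contained sketch of the underlying idea (this is not the FlexKV API; all names below are illustrative):

```python
# Toy radix/prefix-tree match over token-block hashes -- illustrates the
# kind of "fast match" a radix tree index provides for KV-cache lookup.
class ToyRadixIndex:
    def __init__(self):
        self.children = {}   # block_hash -> ToyRadixIndex

    def insert(self, block_hashes):
        """Insert a sequence of block hashes as a path in the tree."""
        node = self
        for h in block_hashes:
            node = node.children.setdefault(h, ToyRadixIndex())

    def match_prefix(self, block_hashes):
        """Return how many leading blocks are already cached."""
        node, matched = self, 0
        for h in block_hashes:
            if h not in node.children:
                break
            node = node.children[h]
            matched += 1
        return matched

index = ToyRadixIndex()
index.insert([101, 102, 103])
matched = index.match_prefix([101, 102, 999])  # shares a 2-block prefix
```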

py::class_<flexkv::CMatchResult, std::shared_ptr<flexkv::CMatchResult>>(m, "CMatchResult")
.def(py::init<int, int, int, flexkv::CRadixNode *, flexkv::CRadixNode *, std::vector<int64_t> *>())
.def(py::init<int, int, int, flexkv::CRadixNode *, flexkv::CRadixNode *, torch::Tensor, torch::Tensor>())
.def_readonly("last_ready_node", &flexkv::CMatchResult::last_ready_node)
.def_readonly("last_node", &flexkv::CMatchResult::last_node)
.def_readonly("physical_blocks", &flexkv::CMatchResult::physical_blocks)
@@ -318,14 +326,14 @@

// RedisMetaChannel binding
py::class_<flexkv::RedisMetaChannel>(m, "RedisMetaChannel")
.def(py::init<const std::string&, int, uint32_t, const std::string&, const std::string&>(),
py::arg("host"), py::arg("port"), py::arg("node_id"), py::arg("local_ip"), py::arg("blocks_key") = std::string("blocks"))
.def(py::init<const std::string&, int, uint32_t, const std::string&, const std::string&, const std::string&>(),
py::arg("host"), py::arg("port"), py::arg("node_id"), py::arg("local_ip"), py::arg("blocks_key") = std::string("blocks"), py::arg("password") = std::string(""))
.def("connect", &flexkv::RedisMetaChannel::connect)
.def("get_node_id", &flexkv::RedisMetaChannel::get_node_id)
.def("get_local_ip", &flexkv::RedisMetaChannel::get_local_ip)
.def("make_block_key", &flexkv::RedisMetaChannel::make_block_key, py::arg("node_id"), py::arg("hash"))
.def("publish_one", [](flexkv::RedisMetaChannel &ch, const flexkv::BlockMeta &m){ ch.publish(m); })
.def("publish_batch", [](flexkv::RedisMetaChannel &ch, const std::vector<flexkv::BlockMeta> &metas, size_t batch_size){ ch.publish(metas, batch_size); }, py::arg("metas"), py::arg("batch_size")=100)
.def("publish_one", [](flexkv::RedisMetaChannel &ch, const flexkv::BlockMeta &m){ return ch.publish(m); })
.def("publish_batch", [](flexkv::RedisMetaChannel &ch, const std::vector<flexkv::BlockMeta> &metas, size_t batch_size){ return ch.publish(metas, batch_size); }, py::arg("metas"), py::arg("batch_size")=100)
.def("load", [](flexkv::RedisMetaChannel &ch, size_t max_items){ std::vector<flexkv::BlockMeta> out; ch.load(out, max_items); return out; }, py::arg("max_items"))
.def("renew_node_leases", &flexkv::RedisMetaChannel::renew_node_leases, py::arg("node_id"), py::arg("new_lt"), py::arg("batch_size")=200)
.def("list_keys", [](flexkv::RedisMetaChannel &ch, const std::string &pattern){ std::vector<std::string> keys; ch.list_keys(pattern, keys); return keys; }, py::arg("pattern"))
Expand All @@ -334,19 +342,18 @@ PYBIND11_MODULE(c_ext, m) {
.def("hmget_field_for_keys", [](flexkv::RedisMetaChannel &ch, const std::vector<std::string> &keys, const std::string &field){ std::vector<std::string> values; ch.hmget_field_for_keys(keys, field, values); return values; }, py::arg("keys"), py::arg("field"))
.def("hmget_two_fields_for_keys", [](flexkv::RedisMetaChannel &ch, const std::vector<std::string> &keys, const std::string &f1, const std::string &f2){ std::vector<std::pair<std::string,std::string>> out; ch.hmget_two_fields_for_keys(keys, f1, f2, out); return out; }, py::arg("keys"), py::arg("field1"), py::arg("field2"))
.def("load_metas_by_keys", [](flexkv::RedisMetaChannel &ch, const std::vector<std::string> &keys){ std::vector<flexkv::BlockMeta> out; ch.load_metas_by_keys(keys, out); return out; }, py::arg("keys"))
.def("update_block_state_batch", [](flexkv::RedisMetaChannel &ch, uint32_t node_id, const std::vector<int64_t> &hashes, flexkv::NodeState state, size_t batch_size){ std::deque<int64_t> dq(hashes.begin(), hashes.end()); ch.update_block_state_batch(node_id, &dq, state, batch_size); }, py::arg("node_id"), py::arg("hashes"), py::arg("state"), py::arg("batch_size")=200)
.def("delete_blockmeta_batch", [](flexkv::RedisMetaChannel &ch, uint32_t node_id, const std::vector<int64_t> &hashes, size_t batch_size){ std::deque<int64_t> dq(hashes.begin(), hashes.end()); ch.delete_blockmeta_batch(node_id, &dq, batch_size); }, py::arg("node_id"), py::arg("hashes"), py::arg("batch_size")=200);
.def("update_block_state_batch", [](flexkv::RedisMetaChannel &ch, uint32_t node_id, const std::vector<int64_t> &hashes, int state, size_t batch_size){ std::deque<int64_t> dq(hashes.begin(), hashes.end()); return ch.update_block_state_batch(node_id, &dq, state, batch_size); }, py::arg("node_id"), py::arg("hashes"), py::arg("state"), py::arg("batch_size")=200)
.def("delete_blockmeta_batch", [](flexkv::RedisMetaChannel &ch, uint32_t node_id, const std::vector<int64_t> &hashes, size_t batch_size){ std::deque<int64_t> dq(hashes.begin(), hashes.end()); return ch.delete_blockmeta_batch(node_id, &dq, batch_size); }, py::arg("node_id"), py::arg("hashes"), py::arg("batch_size")=200);

// LocalRadixTree bindings (derived from CRadixTreeIndex)
py::class_<flexkv::LocalRadixTree, flexkv::CRadixTreeIndex>(m, "LocalRadixTree")
.def(py::init<int, int, uint32_t, uint32_t, uint32_t, uint32_t, size_t>(),
.def(py::init<int, int, uint32_t, uint32_t, uint32_t, uint32_t>(),
py::arg("tokens_per_block"),
py::arg("max_num_blocks") = 1000000,
py::arg("lease_ttl_ms") = 100000,
py::arg("renew_lease_ms") = 0,
py::arg("refresh_batch_size") = 256,
py::arg("idle_sleep_ms") = 10,
py::arg("lt_pool_initial_capacity") = 0)
py::arg("idle_sleep_ms") = 10)
.def("set_meta_channel", &flexkv::LocalRadixTree::set_meta_channel, py::arg("channel"))
.def("start", &flexkv::LocalRadixTree::start, py::arg("channel"))
.def("stop", &flexkv::LocalRadixTree::stop)
@@ -374,14 +381,14 @@

// DistributedRadixTree bindings (remote reference tree manager)
py::class_<flexkv::DistributedRadixTree>(m, "DistributedRadixTree")
.def(py::init<int, int, uint32_t, size_t, size_t, uint32_t, uint32_t>(),
.def(py::init<int, int, uint32_t, size_t, uint32_t, uint32_t, uint32_t>(),
py::arg("tokens_per_block"),
py::arg("max_num_blocks"),
py::arg("node_id"),
py::arg("lt_pool_initial_capacity") = 0,
py::arg("refresh_batch_size") = 128,
py::arg("rebuild_interval_ms") = 1000,
py::arg("idle_sleep_ms") = 10)
py::arg("idle_sleep_ms") = 10,
py::arg("lease_renew_ms") = 5000)
.def("start", &flexkv::DistributedRadixTree::start, py::arg("channel"))
.def("stop", &flexkv::DistributedRadixTree::stop)
.def("remote_tree_refresh", &flexkv::DistributedRadixTree::remote_tree_refresh, py::return_value_policy::reference)
@@ -392,4 +399,14 @@
.def("unlock", &flexkv::DistributedRadixTree::unlock, py::arg("node"))
.def("is_empty", &flexkv::DistributedRadixTree::is_empty)
.def("set_ready", &flexkv::DistributedRadixTree::set_ready, py::arg("node"), py::arg("ready") = true, py::arg("ready_length") = -1);

// RefRadixTree bindings (for type information)
py::class_<flexkv::RefRadixTree, flexkv::CRadixTreeIndex>(m, "RefRadixTree")
.def(py::init<int, int, uint32_t, flexkv::LockFreeQueue<flexkv::CRadixNode*>*>(),
py::arg("tokens_per_block"),
py::arg("max_num_blocks") = 1000000,
py::arg("lease_renew_ms") = 5000,
py::arg("renew_lease_queue") = nullptr)
.def("dec_ref_cnt", &flexkv::RefRadixTree::dec_ref_cnt)
.def("inc_ref_cnt", &flexkv::RefRadixTree::inc_ref_cnt);
}
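The `lease_ttl_ms` / `renew_lease_ms` parameters bound above implement lease-based expiry for distributed block metadata: a block stays visible only while its lease is periodically renewed. A minimal, self-contained sketch of that idea (not the FlexKV implementation; all names are illustrative):

```python
import time

class LeasedBlock:
    """Toy lease-based block metadata: the block is considered live only
    while its lease has not expired (cf. lease_ttl_ms / renew_lease_ms)."""

    def __init__(self, block_hash, ttl_ms):
        self.block_hash = block_hash
        self.ttl_ms = ttl_ms
        self.expires_at = time.monotonic() + ttl_ms / 1000.0

    def renew(self):
        # A background thread would call this every renew_lease_ms.
        self.expires_at = time.monotonic() + self.ttl_ms / 1000.0

    def is_live(self):
        return time.monotonic() < self.expires_at

block = LeasedBlock(block_hash=0x1234, ttl_ms=100)
assert block.is_live()
time.sleep(0.15)                 # let the lease lapse without renewal
expired = not block.is_live()    # now considered evictable
block.renew()                    # renewal makes it live again
```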
4 changes: 2 additions & 2 deletions csrc/block_meta.h
@@ -2,7 +2,7 @@

#include <cstdint>

#include "radix_tree.h" // for NodeState
#include "lease_meta_mempool.h" // for NODE_STATE_* macros

namespace flexkv {

@@ -12,7 +12,7 @@ struct BlockMeta {
uint32_t nid; // node id
int64_t hash; // current block hash
uint32_t lt; // lease time
NodeState state; // lease state
int state; // lease state
};

} // namespace flexkv