
[KV Connector] Support using FlexKV as KV Cache Offloading option.#34328

Merged
vllm-bot merged 3 commits into vllm-project:main from feiqiangs:vllm_support_flexkv
Mar 12, 2026

Conversation

@feiqiangs
Contributor

@feiqiangs feiqiangs commented Feb 11, 2026

FlexKV is a distributed KV store and multi-level cache management system developed by Tencent Cloud's TACO team in collaboration with the community, designed for large-scale LLM inference scenarios. FlexKV leverages multi-level caching and a distributed KVCache pool to enable inference engines to achieve higher throughput and lower latency.

In our case, when integrated with FlexKV, we achieved the following improvements:

ISL=21K, OSL=1K, batch_size=8: TTFT decreases by 60%, TPOT increases by 13%, and QPM increases by 16%.
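For context, enabling a KV connector in vLLM is driven by a kv-transfer config. The sketch below shows what the FlexKV settings from this PR's example script might look like serialized as JSON; the `server_recv_port` and `kv_role` keys come from the example, while the surrounding schema mirrors vLLM's `KVTransferConfig` fields and should be treated as an assumption rather than an authoritative reference.

```python
import json
import os

# FlexKV connector settings; key names taken from this PR's example
# script, made unique per process to avoid IPC path collisions.
flexkv_config = {
    "server_recv_port": f"ipc:///tmp/flexkv_test_{os.getpid()}",
    "kv_role": "kv_both",
}

# Assumed shape of the kv-transfer config wrapping those settings.
kv_transfer_config = {
    "kv_connector": "FlexKVConnectorV1",
    "kv_role": "kv_both",
    "kv_connector_extra_config": flexkv_config,
}

# Serialized form, e.g. for passing on a CLI flag as a JSON string:
print(json.dumps(kv_transfer_config))
```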

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify

mergify bot commented Feb 11, 2026

Documentation preview: https://vllm--34328.org.readthedocs.build/en/34328/

@mergify mergify bot added the documentation and kv-connector labels Feb 11, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for FlexKV as a KV cache offloading option by adding a new FlexKVConnectorV1. The changes include the connector implementation, factory registration, documentation, and a new example script. The core connector is a thin wrapper around the flexkv library. The implementation looks good, but the example script prefix_caching_flexkv.py has a few issues that should be addressed to make it more robust and user-friendly. Specifically, it uses hardcoded paths for IPC and the model, which hurts portability, and relies on time.sleep() for synchronization, which is unreliable. My review provides suggestions to fix these issues.



flexkv_config = {
"server_recv_port": "ipc:///tmp/flexkv_test",
Contributor

high

The hardcoded IPC path ipc:///tmp/flexkv_test can lead to conflicts if multiple instances of this example are run concurrently on the same machine. This could cause unexpected failures or race conditions. To ensure the example is robust and can be run in parallel, a unique path should be generated for each run, for example by appending the process ID.

Suggested change
"server_recv_port": "ipc:///tmp/flexkv_test",
"server_recv_port": f"ipc:///tmp/flexkv_test_{os.getpid()}",

"kv_role": "kv_both",
}

model_path = os.environ.get("MODEL_PATH", "/data0/models/Qwen3/Qwen3-32B")
Contributor

high

The default model path "/data0/models/Qwen3/Qwen3-32B" is hardcoded to a specific developer's environment. This will cause the example to fail for other users who do not have this path. To make the example runnable for everyone, it's better to remove the default value and require the user to set the MODEL_PATH environment variable, raising an error if it's not set.

Suggested change
model_path = os.environ.get("MODEL_PATH", "/data0/models/Qwen3/Qwen3-32B")
model_path = os.environ.get("MODEL_PATH")
if model_path is None:
    raise ValueError("Please set the MODEL_PATH environment variable.")

prefix_cached_llm.generate(generating_prompts[0], sampling_params)

# wait for offload kv task finished.
time.sleep(2)
Contributor

high

Using time.sleep(2) to wait for an asynchronous offload task to complete is unreliable. The task might take more or less time depending on system load and other factors. This can lead to flaky behavior in the example, where it might fail intermittently or have unnecessary delays. A more robust synchronization mechanism should be used. If the FlexKV API provides a way to block until the operation is complete (e.g., a future, an event, or a blocking call), it should be used instead. This would make the example more reliable and demonstrate better programming practices.

Contributor Author

Absolutely right that time.sleep(2) is not a reliable synchronization mechanism. The offload task here is an asynchronous background operation managed internally by FlexKV. Unfortunately, the current LLM (offline inference) API does not expose a blocking interface to wait for the offload to complete — the synchronization (wait_for_save) happens internally at the worker level during the forward pass. In practice, 2 seconds is more than sufficient for the async KV transfer task to complete in typical scenarios, so this works reliably for the purpose of this example.
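If FlexKV ever exposes a way to query pending offload tasks, a bounded poll would be sturdier than a fixed sleep. A generic sketch, where the predicate is a hypothetical completion check (FlexKV's current offline API provides no such hook):

```python
import time

def wait_until(predicate, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Poll `predicate` until it returns True or `timeout` elapses.

    A bounded alternative to a fixed time.sleep(); `predicate` would be a
    hypothetical completion check, e.g. "no offload tasks still pending".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one last check at the deadline

# Demo with a predicate that turns true after ~0.3 seconds:
start = time.monotonic()
done = wait_until(lambda: time.monotonic() - start > 0.3, timeout=2.0)
print(done)
```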

print(f"Generated answers are the same: {generated_same}")

# wait for offload kv task finished.
time.sleep(2)
Contributor

high

Similar to the previous comment, using time.sleep(2) here is unreliable for synchronizing with an asynchronous task. This can make the example flaky. Please use a proper synchronization primitive from the FlexKV API if one is available.

Contributor Author

This is the same case as the previous comment — the 2-second sleep is intentional here for the same reason.

            vllm_config=vllm_config, role=role, kv_cache_config=kv_cache_config
        )
        try:
            from flexkv.integration.vllm.vllm_v1_adapter import FlexKVConnectorV1Impl
Member

Looking at the vLLM APIs used by this connector:

# vllm
from vllm.distributed.kv_transfer.kv_connector.v1.base import (
    KVConnectorMetadata, KVConnectorRole)
from vllm.distributed.kv_transfer.kv_connector.v1.metrics import KVConnectorStats
from vllm.distributed.parallel_state import get_tp_group

if TYPE_CHECKING:
    from vllm.config import VllmConfig
    from vllm.v1.core.sched.output import SchedulerOutput
    from vllm.attention.backends.abstract import AttentionMetadata
    from vllm.distributed.kv_events import KVCacheEvent
    from vllm.forward_context import ForwardContext
    from vllm.v1.core.kv_cache_manager import KVCacheBlocks
    from vllm.v1.request import Request
    from vllm.v1.outputs import KVConnectorOutput

I don't think we (the vLLM project) commit to backwards compatibility with these APIs, so it's quite likely the connector will need to be updated as new vLLM releases come out

(This is really just an FYI - I don't have an alternate recommendation, and I can't even be confidently specific about which of these APIs we do expect to remain backwards compatible)

Contributor Author

Thanks for the heads-up! We understand that these internal APIs are subject to change and vLLM does not guarantee backwards compatibility. To address this, we have added type checking and compatibility logic in the FlexKV-vLLM adapter code to better handle potential changes. We (the FlexKV team) will keep an eye on vLLM updates and maintain the connector to ensure compatibility with future releases.

Member

You could also setup a test job in CI that installs FlexKV so we know when it breaks

Contributor Author

@feiqiangs feiqiangs Feb 28, 2026

You could also setup a test job in CI that installs FlexKV so we know when it breaks

Thanks for the suggestion! We've added a CI workflow in the FlexKV repository to track vLLM API compatibility: .github/workflows/vllm-compat-test.yml

It runs on a daily schedule and covers three checks:

All vLLM internal symbols used by FlexKVConnectorV1 are still importable.

FlexKVConnectorV1 implements every abstract method required by KVConnectorBase_V1.

All overridden methods still exist in the base class.

We think it's more appropriate to maintain this CI test in the FlexKV repository itself rather than adding it to vLLM's CI, to avoid introducing external dependencies into vLLM's test pipeline.
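The first two checks can be sketched with the standard library alone. `Base` and `Partial` below are stand-ins for `KVConnectorBase_V1` and a connector class, since the real classes live in vLLM and FlexKV:

```python
import abc
import importlib

def missing_symbols(symbols: dict[str, list[str]]) -> list[str]:
    """Return 'module:name' entries that can no longer be imported,
    mirroring check 1 of the workflow (importability of used symbols)."""
    out = []
    for module_path, names in symbols.items():
        try:
            mod = importlib.import_module(module_path)
        except ImportError:
            out.extend(f"{module_path}:{n}" for n in names)
            continue
        out.extend(f"{module_path}:{n}" for n in names if not hasattr(mod, n))
    return out

def unimplemented_abstracts(impl: type) -> list[str]:
    """Return abstract methods `impl` still leaves unimplemented,
    mirroring check 2 (all abstract methods of the base are implemented)."""
    return sorted(getattr(impl, "__abstractmethods__", frozenset()))

# Demo with a stand-in base class (KVConnectorBase_V1 in the real check):
class Base(abc.ABC):
    @abc.abstractmethod
    def start_load_kv(self): ...
    @abc.abstractmethod
    def wait_for_save(self): ...

class Partial(Base):
    def start_load_kv(self):
        pass

print(missing_symbols({"json": ["dumps", "no_such_name"]}))  # ['json:no_such_name']
print(unimplemented_abstracts(Partial))  # ['wait_for_save']
```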

@mergify

mergify bot commented Feb 27, 2026

Hi @feiqiangs, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint


Comment on lines +11 to +12
1. git clone [email protected]:taco-project/FlexKV.git
2. cd FlexKV && bash build.sh
Member

Could you consider pinning to a branch or commit? I'm not sure how active development is

Contributor Author

Thanks for the suggestion! We intentionally keep this as a clone of the main branch rather than pinning to a specific tag/commit, because FlexKV maintains a dedicated CI workflow (vllm-compat-test.yml) that continuously validates compatibility with vLLM. This ensures the main branch of FlexKV is always compatible with the corresponding vLLM version. We (the FlexKV team) are committed to keeping the main branch in a release-ready state at all times. Pinning to a specific commit would require frequent updates to this example whenever FlexKV releases bug fixes.


@feiqiangs feiqiangs force-pushed the vllm_support_flexkv branch 5 times, most recently from b6cc98a to 0dfe92a Compare March 2, 2026 08:49
@linhu-nv

linhu-nv commented Mar 4, 2026

@markmc @mgoin we have addressed the comments, can you please help kick off CI? Thanks

@mgoin mgoin added the ready and nvidia labels Mar 9, 2026

Usage:
1. Run this script:
python examples/offline_inference/prefix_caching_flexkv.py
Member

Seems you are missing the MODEL_PATH env set in your example. You could also just make it a CLI arg

Contributor Author

@feiqiangs feiqiangs Mar 11, 2026

Thanks! I've updated the example to use --model as a required CLI argument instead of the MODEL_PATH env var. The usage is now:

python examples/offline_inference/prefix_caching_flexkv.py \
    --model /path/to/your/model
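The argument handling presumably reduces to a sketch like the one below; only the required `--model` flag is confirmed by this reply, so everything else here is an assumption about the example script:

```python
import argparse

def parse_args(argv=None) -> argparse.Namespace:
    """Minimal sketch of the required --model CLI argument described above."""
    parser = argparse.ArgumentParser(description="FlexKV prefix-caching example")
    parser.add_argument("--model", required=True, help="path to the model to load")
    return parser.parse_args(argv)

# Equivalent of running: python prefix_caching_flexkv.py --model /path/to/your/model
args = parse_args(["--model", "/path/to/your/model"])
print(args.model)
```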

            from flexkv.integration.vllm.vllm_v1_adapter import FlexKVConnectorV1Impl
        except ImportError as e:
            raise ImportError(
                "FlexKV is not installed. Please install it to use FlexKVConnectorV1. "
Member

It would be best to share a link or instructions to install so the user can be guided properly

Contributor Author

@feiqiangs feiqiangs Mar 11, 2026

Thanks for the feedback! I've updated the docstring to include both a link and step-by-step installation instructions. The current code now reads:

Installation:
See https://github.com/taco-project/FlexKV for installation instructions.
Quick start::

    git clone [email protected]:taco-project/FlexKV.git
    cd FlexKV && bash build.sh

Additionally, the ImportError message also directs users to the GitHub repo:

raise ImportError(
    "FlexKV is not installed. Please install it to use "
    "FlexKVConnectorV1. See https://github.com/taco-project/FlexKV "
    "for installation instructions."
) from e

This should give users clear guidance

Collaborator

@NickLucche NickLucche left a comment

Apologies for the delay in getting back to you on this one @feiqiangs .

Do we have any tests that we can run along with this connector?

Comment on lines +65 to +85
            **kwargs (Any): additional arguments for the load operation

        Note:
            The number of elements in kv_caches and layer_names should be
            the same.

        """
        self._flexkv_connector.start_load_kv(forward_context, **kwargs)

    def wait_for_layer_load(self, layer_name: str) -> None:
        """
        Block until the KV for a specific layer is loaded into vLLM's
        paged buffer. This is called from within attention layer to ensure
        async copying from start_load_kv is complete.

        This interface will be useful for layer-by-layer pipelining.

        Args:
            layer_name: the name of that layer
        """
        self._flexkv_connector.wait_for_layer_load(layer_name)
Collaborator

Are you making use of both these paths here?
Usually connectors either load by layer blocking or load all layers async, but I don't see a flag to switch logic here

Contributor Author

Thanks for the question, and apologies for the confusing docstrings!

To clarify the design: FlexKV uses a scheduler-side transfer model (similar to NIXL's scheduler-side coordination) — all KV transfers are coordinated by the scheduler connector — build_connector_meta calls launch_tasks() to kick off async transfers, and update_connector_output polls query_finished_task() for completion. KV blocks are moved directly between the FlexKV server and vLLM's GPU memory without any worker-side intervention during the forward pass.

So to answer your question: we are not using both paths — start_load_kv, wait_for_layer_load, save_kv_layer, and wait_for_save are all currently no-ops. We keep them because:

  1. KVConnectorBase_V1 requires implementing these methods
  2. They serve as extension points for a potential future worker-side layer-pipelining optimisation

We've updated the docstrings in the latest commit to make this explicit. Sorry for the confusion caused by the misleading comments!
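Structurally, the scheduler-side model described here reduces to something like the sketch below. `launch_tasks` and `query_finished_task` are the names mentioned in this reply, but the client class and method signatures are stand-ins, not FlexKV's real API:

```python
class SchedulerSideConnectorSketch:
    """Illustrative shape of a scheduler-coordinated connector.

    Transfers are launched and polled on the scheduler side; the
    worker-side hooks are deliberate no-ops, matching the design
    described in this reply.
    """

    def __init__(self, client):
        self._client = client  # hypothetical FlexKV scheduler client
        self._pending: set[str] = set()

    def build_connector_meta(self, request_ids):
        # Scheduler side: kick off async FlexKV <-> GPU transfers.
        for rid in request_ids:
            self._client.launch_tasks(rid)
            self._pending.add(rid)
        return {"pending": sorted(self._pending)}

    def update_connector_output(self):
        # Scheduler side: poll for finished transfers.
        finished = {r for r in self._pending if self._client.query_finished_task(r)}
        self._pending -= finished
        return finished

    # Worker side: no-ops, since transfers bypass the forward pass.
    def start_load_kv(self, *args, **kwargs): pass
    def wait_for_layer_load(self, layer_name): pass
    def save_kv_layer(self, *args, **kwargs): pass
    def wait_for_save(self): pass
```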

@feiqiangs feiqiangs force-pushed the vllm_support_flexkv branch from 16041e9 to d46be5f Compare March 11, 2026 06:05
@mergify mergify bot added the v1 label Mar 11, 2026
@feiqiangs feiqiangs force-pushed the vllm_support_flexkv branch from d46be5f to 88464f9 Compare March 11, 2026 06:16
@feiqiangs
Contributor Author

Apologies for the delay in getting back to you on this one @feiqiangs .

Do we have any tests that we can run along with this connector?

No worries at all @NickLucche — apologies for the delay on my end as well.

Yes, I've added unit tests for the FlexKV connector. You can find them at:

tests/v1/kv_connector/unit/test_flexkv_connector.py

The tests mock the external flexkv package (similar to how test_lmcache_connector.py mocks lmcache) and cover:

  • Import error handling: Verifies a clear ImportError is raised when flexkv is not installed.
  • Delegation of all public API methods: Ensures each method on FlexKVConnectorV1 correctly delegates to the underlying FlexKVConnectorV1Impl, including:
    • bind_connector_metadata
    • start_load_kv
    • wait_for_layer_load
    • save_kv_layer
    • wait_for_save
    • get_num_new_matched_tokens
    • update_state_after_alloc
    • build_connector_meta
    • request_finished
    • get_pending_request_count
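The delegation checks can be illustrated without vLLM or flexkv installed. `WrapperSketch` below is a stand-in for FlexKVConnectorV1's thin-wrapper shape, not the real class:

```python
from unittest.mock import MagicMock

class WrapperSketch:
    """Stand-in mirroring the thin-delegation pattern under test."""

    def __init__(self, impl):
        self._impl = impl

    def wait_for_layer_load(self, layer_name):
        self._impl.wait_for_layer_load(layer_name)

    def wait_for_save(self):
        self._impl.wait_for_save()

def assert_delegates(method_name, *args):
    """Check that calling a wrapper method forwards to the impl exactly once."""
    impl = MagicMock()
    wrapper = WrapperSketch(impl)
    getattr(wrapper, method_name)(*args)
    getattr(impl, method_name).assert_called_once_with(*args)

assert_delegates("wait_for_layer_load", "layers.0.attn")
assert_delegates("wait_for_save")
print("delegation checks passed")
```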

This commit introduces the FlexKV connector, enabling integration with FlexKV, a distributed KV Store and multi-level cache management system for ultra-large-scale LLM inference.

Signed-off-by: phaedonsun <[email protected]>
@feiqiangs feiqiangs force-pushed the vllm_support_flexkv branch from 3e069f5 to cac2953 Compare March 11, 2026 12:04
@linhu-nv

@mgoin @NickLucche Thanks for the review and suggestions. Seems the PR is ready now? Can you please approve it to push this forward? Thanks

Member

@mgoin mgoin left a comment

LGTM, thanks for iterating. Will keep the CI in the external package for now then

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 12, 2026
@vllm-bot vllm-bot merged commit 8cb24d3 into vllm-project:main Mar 12, 2026
46 of 49 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 12, 2026
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…llm-project#34328)

Signed-off-by: phaedonsun <[email protected]>
Co-authored-by: phaedonsun <[email protected]>
Signed-off-by: Monishver Chandrasekaran <[email protected]>
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
liuchenbing2026 pushed a commit to liuchenbing2026/vllm that referenced this pull request Apr 4, 2026

Labels

documentation (Improvements or additions to documentation), kv-connector, nvidia, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants