
[KV Connector] Support using FlexKV as KV Cache Offloading option.#34328

Merged
vllm-bot merged 3 commits into vllm-project:main from feiqiangs:vllm_support_flexkv
Mar 12, 2026

Conversation

@feiqiangs
Contributor

@feiqiangs feiqiangs commented Feb 11, 2026

FlexKV is a distributed KV store and multi-level cache management system developed by Tencent Cloud's TACO team in collaboration with the community, designed for large-scale LLM inference scenarios. FlexKV leverages multi-level caching and a distributed KVCache pool to enable inference engines to achieve higher throughput and lower latency.

In our case, when integrated with FlexKV, we achieved the following improvements:

ISL=21K, OSL=1K, batch_size=8: TTFT decreases by 60%, TPOT increases by 13%, and QPM increases by 16%.
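For context, enabling a KV connector in vLLM is driven by a kv-transfer config. The sketch below shows what the FlexKV settings from this PR's example script might look like serialized as JSON; the `server_recv_port` and `kv_role` keys come from the example, while the surrounding schema mirrors vLLM's `KVTransferConfig` fields and should be treated as an assumption rather than an authoritative reference.

```python
import json
import os

# FlexKV connector settings; key names taken from this PR's example
# script, made unique per process to avoid IPC path collisions.
flexkv_config = {
    "server_recv_port": f"ipc:///tmp/flexkv_test_{os.getpid()}",
    "kv_role": "kv_both",
}

# Assumed shape of the kv-transfer config wrapping those settings.
kv_transfer_config = {
    "kv_connector": "FlexKVConnectorV1",
    "kv_role": "kv_both",
    "kv_connector_extra_config": flexkv_config,
}

# Serialized form, e.g. for passing on a CLI flag as a JSON string:
print(json.dumps(kv_transfer_config))
```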

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify

mergify bot commented Feb 11, 2026

Documentation preview: https://vllm--34328.org.readthedocs.build/en/34328/

@mergify mergify bot added the documentation and kv-connector labels Feb 11, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for FlexKV as a KV cache offloading option by adding a new FlexKVConnectorV1. The changes include the connector implementation, factory registration, documentation, and a new example script. The core connector is a thin wrapper around the flexkv library. The implementation looks good, but the example script prefix_caching_flexkv.py has a few issues that should be addressed to make it more robust and user-friendly. Specifically, it uses hardcoded paths for IPC and the model, which hurts portability, and relies on time.sleep() for synchronization, which is unreliable. My review provides suggestions to fix these issues.



flexkv_config = {
"server_recv_port": "ipc:///tmp/flexkv_test",
Contributor

high

The hardcoded IPC path ipc:///tmp/flexkv_test can lead to conflicts if multiple instances of this example are run concurrently on the same machine. This could cause unexpected failures or race conditions. To ensure the example is robust and can be run in parallel, a unique path should be generated for each run, for example by appending the process ID.

Suggested change
"server_recv_port": "ipc:///tmp/flexkv_test",
"server_recv_port": f"ipc:///tmp/flexkv_test_{os.getpid()}",

"kv_role": "kv_both",
}

model_path = os.environ.get("MODEL_PATH", "/data0/models/Qwen3/Qwen3-32B")
Contributor

high

The default model path "/data0/models/Qwen3/Qwen3-32B" is hardcoded to a specific developer's environment. This will cause the example to fail for other users who do not have this path. To make the example runnable for everyone, it's better to remove the default value and require the user to set the MODEL_PATH environment variable, raising an error if it's not set.

Suggested change
model_path = os.environ.get("MODEL_PATH", "/data0/models/Qwen3/Qwen3-32B")
model_path = os.environ.get("MODEL_PATH")
if model_path is None:
    raise ValueError("Please set the MODEL_PATH environment variable.")

prefix_cached_llm.generate(generating_prompts[0], sampling_params)

# wait for offload kv task finished.
time.sleep(2)
Contributor

high

Using time.sleep(2) to wait for an asynchronous offload task to complete is unreliable. The task might take more or less time depending on system load and other factors. This can lead to flaky behavior in the example, where it might fail intermittently or have unnecessary delays. A more robust synchronization mechanism should be used. If the FlexKV API provides a way to block until the operation is complete (e.g., a future, an event, or a blocking call), it should be used instead. This would make the example more reliable and demonstrate better programming practices.

Contributor Author

Absolutely right that time.sleep(2) is not a reliable synchronization mechanism. The offload task here is an asynchronous background operation managed internally by FlexKV. Unfortunately, the current LLM (offline inference) API does not expose a blocking interface to wait for the offload to complete — the synchronization (wait_for_save) happens internally at the worker level during the forward pass. In practice, 2 seconds is more than sufficient for the async KV transfer task to complete in typical scenarios, so this works reliably for the purpose of this example.
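If FlexKV ever exposes a way to query pending offload tasks, a bounded poll would be sturdier than a fixed sleep. A generic sketch, where the predicate is a hypothetical completion check (FlexKV's current offline API provides no such hook):

```python
import time

def wait_until(predicate, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Poll `predicate` until it returns True or `timeout` elapses.

    A bounded alternative to a fixed time.sleep(); `predicate` would be a
    hypothetical completion check, e.g. "no offload tasks still pending".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one last check at the deadline

# Demo with a predicate that turns true after ~0.3 seconds:
start = time.monotonic()
done = wait_until(lambda: time.monotonic() - start > 0.3, timeout=2.0)
print(done)
```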

print(f"Generated answers are the same: {generated_same}")

# wait for offload kv task finished.
time.sleep(2)
Contributor

high

Similar to the previous comment, using time.sleep(2) here is unreliable for synchronizing with an asynchronous task. This can make the example flaky. Please use a proper synchronization primitive from the FlexKV API if one is available.

Contributor Author

This is the same case as the previous comment — the 2-second sleep is intentional here for the same reason.

            vllm_config=vllm_config, role=role, kv_cache_config=kv_cache_config
        )
        try:
            from flexkv.integration.vllm.vllm_v1_adapter import FlexKVConnectorV1Impl
Member

Looking at the vLLM APIs used by this connector:

# vllm
from vllm.distributed.kv_transfer.kv_connector.v1.base import (
    KVConnectorMetadata, KVConnectorRole)
from vllm.distributed.kv_transfer.kv_connector.v1.metrics import KVConnectorStats
from vllm.distributed.parallel_state import get_tp_group

if TYPE_CHECKING:
    from vllm.config import VllmConfig
    from vllm.v1.core.sched.output import SchedulerOutput
    from vllm.attention.backends.abstract import AttentionMetadata
    from vllm.distributed.kv_events import KVCacheEvent
    from vllm.forward_context import ForwardContext
    from vllm.v1.core.kv_cache_manager import KVCacheBlocks
    from vllm.v1.request import Request
    from vllm.v1.outputs import KVConnectorOutput

I don't think we (the vLLM project) commit to backwards compatibility with these APIs, so it's quite likely the connector will need to be updated as new vLLM releases come out

(This is really just an FYI - I don't have an alternate recommendation, and I can't even be confidently specific about which of these APIs we do expect to remain backwards compatible)

Contributor Author

Thanks for the heads-up! We understand that these internal APIs are subject to change and vLLM does not guarantee backwards compatibility. To address this, we have added type checking and compatibility logic in the FlexKV-vLLM adapter code to better handle potential changes. We (the FlexKV team) will keep an eye on vLLM updates and maintain the connector to ensure compatibility with future releases.

Member

You could also setup a test job in CI that installs FlexKV so we know when it breaks

Contributor Author

@feiqiangs feiqiangs Feb 28, 2026

You could also setup a test job in CI that installs FlexKV so we know when it breaks

Thanks for the suggestion! We've added a CI workflow in the FlexKV repository to track vLLM API compatibility: .github/workflows/vllm-compat-test.yml

It runs on a daily schedule and covers three checks:

All vLLM internal symbols used by FlexKVConnectorV1 are still importable.

FlexKVConnectorV1 implements every abstract method required by KVConnectorBase_V1.

All overridden methods still exist in the base class.

We think it's more appropriate to maintain this CI test in the FlexKV repository itself rather than adding it to vLLM's CI, to avoid introducing external dependencies into vLLM's test pipeline.
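The first two checks can be sketched with the standard library alone. `Base` and `Partial` below are stand-ins for `KVConnectorBase_V1` and a connector class, since the real classes live in vLLM and FlexKV:

```python
import abc
import importlib

def missing_symbols(symbols: dict[str, list[str]]) -> list[str]:
    """Return 'module:name' entries that can no longer be imported,
    mirroring check 1 of the workflow (importability of used symbols)."""
    out = []
    for module_path, names in symbols.items():
        try:
            mod = importlib.import_module(module_path)
        except ImportError:
            out.extend(f"{module_path}:{n}" for n in names)
            continue
        out.extend(f"{module_path}:{n}" for n in names if not hasattr(mod, n))
    return out

def unimplemented_abstracts(impl: type) -> list[str]:
    """Return abstract methods `impl` still leaves unimplemented,
    mirroring check 2 (all abstract methods of the base are implemented)."""
    return sorted(getattr(impl, "__abstractmethods__", frozenset()))

# Demo with a stand-in base class (KVConnectorBase_V1 in the real check):
class Base(abc.ABC):
    @abc.abstractmethod
    def start_load_kv(self): ...
    @abc.abstractmethod
    def wait_for_save(self): ...

class Partial(Base):
    def start_load_kv(self):
        pass

print(missing_symbols({"json": ["dumps", "no_such_name"]}))  # ['json:no_such_name']
print(unimplemented_abstracts(Partial))  # ['wait_for_save']
```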

@mergify

mergify bot commented Feb 27, 2026

Hi @feiqiangs, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint


Comment on lines +11 to +12
1. git clone [email protected]:taco-project/FlexKV.git
2. cd FlexKV && bash build.sh
Member

Could you consider pinning to a branch or commit? I'm not sure how active development is

Contributor Author

Thanks for the suggestion! We intentionally keep this as a clone of the main branch rather than pinning to a specific tag/commit, because FlexKV maintains a dedicated CI workflow (vllm-compat-test.yml) that continuously validates compatibility with vLLM. This ensures the main branch of FlexKV is always compatible with the corresponding vLLM version. We (the FlexKV team) are committed to keeping the main branch in a release-ready state at all times. Pinning to a specific commit would require frequent updates to this example whenever FlexKV releases bug fixes.


@feiqiangs feiqiangs force-pushed the vllm_support_flexkv branch 5 times, most recently from b6cc98a to 0dfe92a Compare March 2, 2026 08:49
@linhu-nv

linhu-nv commented Mar 4, 2026

@markmc @mgoin we have addressed the comments, can you please help kick off CI? Thanks

@mgoin mgoin added the ready and nvidia labels Mar 9, 2026

Usage:
1. Run this script:
python examples/offline_inference/prefix_caching_flexkv.py
Member

Seems you are missing the MODEL_PATH env set in your example. You could also just make it a CLI arg

Contributor Author

@feiqiangs feiqiangs Mar 11, 2026

Thanks! I've updated the example to use --model as a required CLI argument instead of the MODEL_PATH env var. The usage is now:

python examples/offline_inference/prefix_caching_flexkv.py \
    --model /path/to/your/model
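The argument handling presumably reduces to a sketch like the one below; only the required `--model` flag is confirmed by this reply, so everything else here is an assumption about the example script:

```python
import argparse

def parse_args(argv=None) -> argparse.Namespace:
    """Minimal sketch of the required --model CLI argument described above."""
    parser = argparse.ArgumentParser(description="FlexKV prefix-caching example")
    parser.add_argument("--model", required=True, help="path to the model to load")
    return parser.parse_args(argv)

# Equivalent of running: python prefix_caching_flexkv.py --model /path/to/your/model
args = parse_args(["--model", "/path/to/your/model"])
print(args.model)
```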

            from flexkv.integration.vllm.vllm_v1_adapter import FlexKVConnectorV1Impl
        except ImportError as e:
            raise ImportError(
                "FlexKV is not installed. Please install it to use FlexKVConnectorV1. "
Member

It would be best to share a link or instructions to install so the user can be guided properly

Contributor Author

@feiqiangs feiqiangs Mar 11, 2026

Thanks for the feedback! I've updated the docstring to include both a link and step-by-step installation instructions. The current code now reads:

Installation:
See https://github.com/taco-project/FlexKV for installation instructions.
Quick start::

    git clone [email protected]:taco-project/FlexKV.git
    cd FlexKV && bash build.sh

Additionally, the ImportError message also directs users to the GitHub repo:

raise ImportError(
    "FlexKV is not installed. Please install it to use "
    "FlexKVConnectorV1. See https://github.com/taco-project/FlexKV "
    "for installation instructions."
) from e

This should give users clear guidance

Collaborator

@NickLucche NickLucche left a comment

Apologies for the delay in getting back to you on this one @feiqiangs .

Do we have any tests that we can run along with this connector?

Comment on lines +65 to +85
            **kwargs (Any): additional arguments for the load operation

        Note:
            The number of elements in kv_caches and layer_names should be
            the same.

        """
        self._flexkv_connector.start_load_kv(forward_context, **kwargs)

    def wait_for_layer_load(self, layer_name: str) -> None:
        """
        Block until the KV for a specific layer is loaded into vLLM's
        paged buffer. This is called from within attention layer to ensure
        async copying from start_load_kv is complete.

        This interface will be useful for layer-by-layer pipelining.

        Args:
            layer_name: the name of that layer
        """
        self._flexkv_connector.wait_for_layer_load(layer_name)
Collaborator

Are you making use of both these paths here?
Usually connectors either load by layer blocking or load all layers async, but I don't see a flag to switch logic here

Contributor Author

Thanks for the question, and apologies for the confusing docstrings!

To clarify the design: FlexKV uses a scheduler-side transfer model (similar to NIXL's scheduler-side coordination) — all KV transfers are coordinated by the scheduler connector — build_connector_meta calls launch_tasks() to kick off async transfers, and update_connector_output polls query_finished_task() for completion. KV blocks are moved directly between the FlexKV server and vLLM's GPU memory without any worker-side intervention during the forward pass.

So to answer your question: we are not using both paths — start_load_kv, wait_for_layer_load, save_kv_layer, and wait_for_save are all currently no-ops. We keep them because:

  1. KVConnectorBase_V1 requires implementing these methods
  2. They serve as extension points for a potential future worker-side layer-pipelining optimisation

We've updated the docstrings in the latest commit to make this explicit. Sorry for the confusion caused by the misleading comments!
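Structurally, the scheduler-side model described here reduces to something like the sketch below. `launch_tasks` and `query_finished_task` are the names mentioned in this reply, but the client class and method signatures are stand-ins, not FlexKV's real API:

```python
class SchedulerSideConnectorSketch:
    """Illustrative shape of a scheduler-coordinated connector.

    Transfers are launched and polled on the scheduler side; the
    worker-side hooks are deliberate no-ops, matching the design
    described in this reply.
    """

    def __init__(self, client):
        self._client = client  # hypothetical FlexKV scheduler client
        self._pending: set[str] = set()

    def build_connector_meta(self, request_ids):
        # Scheduler side: kick off async FlexKV <-> GPU transfers.
        for rid in request_ids:
            self._client.launch_tasks(rid)
            self._pending.add(rid)
        return {"pending": sorted(self._pending)}

    def update_connector_output(self):
        # Scheduler side: poll for finished transfers.
        finished = {r for r in self._pending if self._client.query_finished_task(r)}
        self._pending -= finished
        return finished

    # Worker side: no-ops, since transfers bypass the forward pass.
    def start_load_kv(self, *args, **kwargs): pass
    def wait_for_layer_load(self, layer_name): pass
    def save_kv_layer(self, *args, **kwargs): pass
    def wait_for_save(self): pass
```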

@feiqiangs feiqiangs force-pushed the vllm_support_flexkv branch from 16041e9 to d46be5f Compare March 11, 2026 06:05
@mergify mergify bot added the v1 label Mar 11, 2026
@feiqiangs feiqiangs force-pushed the vllm_support_flexkv branch from d46be5f to 88464f9 Compare March 11, 2026 06:16
@feiqiangs
Contributor Author

Apologies for the delay in getting back to you on this one @feiqiangs .

Do we have any tests that we can run along with this connector?

No worries at all @NickLucche — apologies for the delay on my end as well.

Yes, I've added unit tests for the FlexKV connector. You can find them at:

tests/v1/kv_connector/unit/test_flexkv_connector.py

The tests mock the external flexkv package (similar to how test_lmcache_connector.py mocks lmcache) and cover:

  • Import error handling: Verifies a clear ImportError is raised when flexkv is not installed.
  • Delegation of all public API methods: Ensures each method on FlexKVConnectorV1 correctly delegates to the underlying FlexKVConnectorV1Impl, including:
    • bind_connector_metadata
    • start_load_kv
    • wait_for_layer_load
    • save_kv_layer
    • wait_for_save
    • get_num_new_matched_tokens
    • update_state_after_alloc
    • build_connector_meta
    • request_finished
    • get_pending_request_count
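The delegation checks can be illustrated without vLLM or flexkv installed. `WrapperSketch` below is a stand-in for FlexKVConnectorV1's thin-wrapper shape, not the real class:

```python
from unittest.mock import MagicMock

class WrapperSketch:
    """Stand-in mirroring the thin-delegation pattern under test."""

    def __init__(self, impl):
        self._impl = impl

    def wait_for_layer_load(self, layer_name):
        self._impl.wait_for_layer_load(layer_name)

    def wait_for_save(self):
        self._impl.wait_for_save()

def assert_delegates(method_name, *args):
    """Check that calling a wrapper method forwards to the impl exactly once."""
    impl = MagicMock()
    wrapper = WrapperSketch(impl)
    getattr(wrapper, method_name)(*args)
    getattr(impl, method_name).assert_called_once_with(*args)

assert_delegates("wait_for_layer_load", "layers.0.attn")
assert_delegates("wait_for_save")
print("delegation checks passed")
```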

This commit introduces the FlexKV connector, enabling integration with FlexKV, a distributed KV Store and multi-level cache management system for ultra-large-scale LLM inference.

Signed-off-by: phaedonsun <[email protected]>
@feiqiangs feiqiangs force-pushed the vllm_support_flexkv branch from 3e069f5 to cac2953 Compare March 11, 2026 12:04
@linhu-nv

@mgoin @NickLucche Thanks for the review and suggestions. Seems the PR is ready now? Can you please approve it to push this forward? Thanks

Member

@mgoin mgoin left a comment

LGTM, thanks for iterating. Will keep the CI in the external package for now then

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 12, 2026
@vllm-bot vllm-bot merged commit 8cb24d3 into vllm-project:main Mar 12, 2026
46 of 49 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 12, 2026
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…llm-project#34328)

Signed-off-by: phaedonsun <[email protected]>
Co-authored-by: phaedonsun <[email protected]>
Signed-off-by: Monishver Chandrasekaran <[email protected]>
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
liuchenbing2026 pushed a commit to liuchenbing2026/vllm that referenced this pull request Apr 4, 2026

Labels

documentation (Improvements or additions to documentation), kv-connector, nvidia, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants