[Bugfix][Nixl] Fix kernel physical<>logical block_size issue #28677
Conversation
Signed-off-by: NickLucche <[email protected]>
Code Review
This pull request addresses a critical bug where the NixlConnector fails to use the correct physical kernel block_size, particularly affecting deployments with block_size=128. The changes correctly detect the kernel block size at runtime and update the connector's state. It also introduces a necessary mapping from logical to physical block IDs to ensure correct block access.
My review identifies a critical issue in the logic used to determine the kernel_block_size. The current implementation is fragile and may fail for several attention backends, potentially leading to memory corruption. I've provided a more robust code suggestion to handle different KV cache layouts correctly. The rest of the changes for logical-to-physical block mapping appear to be well-implemented.
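For illustration, here is why reading a fixed shape index can be fragile across backends. The layouts and numbers below are assumptions used only to make the point, not vLLM's authoritative cache definitions:

```python
# Illustration only: the same shape index means different things in different
# KV-cache layouts (assumed shapes, hypothetical sizes).
num_blocks, block_size, num_kv_heads, head_size = 1024, 64, 8, 128

nhd_shape = (2, num_blocks, block_size, num_kv_heads, head_size)
hnd_shape = (2, num_blocks, num_kv_heads, block_size, head_size)

print(nhd_shape[-3])  # 64 -> block_size, matching the snippet's NHD assumption
print(hnd_shape[-3])  # 8  -> num_kv_heads, so the same index would be wrong here
```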
💡 Codex Review
Here are some automated review suggestions for this pull request.
# NHD default "view" for non-MLA cache
kernel_block_size = cache.shape[-2] if self.use_mla else cache.shape[-3]

if self.block_size != kernel_block_size:
I have a silly question: when will this scenario happen? What is the max kernel block size for CUDA, and where is it set?

if self.block_size != kernel_block_size

@jikunshang, could you check?
It's backend-dependent; it happens every time the supplied block_size is not one of the sizes listed in https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/flash_attn.py#L63 (in that case the kernel size is used for the physical tensors and block_size becomes only logical).
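To make that concrete with hypothetical numbers: if the user asks for block_size=128 but the selected kernel only supports smaller physical blocks, say 64, the cache is laid out with 64-token physical blocks and 128 remains purely logical:

```python
# Hypothetical numbers for illustration only.
logical_block_size = 128   # what the user configured
kernel_block_size = 64     # what the attention kernel actually supports
blocks_per_logical = logical_block_size // kernel_block_size
print(blocks_per_logical)  # 2: each logical block spans two physical kernel blocks
```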
Oh, I see
xuechendi left a comment
Thanks for the PR, looks good to me
Signed-off-by: NickLucche <[email protected]>
@codex review

@xuechendi I moved up the logical<>physical conversion so that permutation works as intended. Tested with
💡 Codex Review
Here are some automated review suggestions for this pull request.
Signed-off-by: NickLucche <[email protected]>
@codex review
💡 Codex Review
vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
Lines 1516 to 1535 in f28be20
def save_kv_to_host(self, metadata: NixlConnectorMetadata):
    """copy kv from device to host buffer."""
    assert self.use_host_buffer
    assert self.copy_blocks is not None
    for req_id, meta in metadata.reqs_to_save.items():
        if logger.isEnabledFor(logging.DEBUG):
            logger.debug(
                "save_load_kv for request[%s] to host xfer buffer."
                "local_block_ids: %s. ",
                req_id,
                ",".join(map(str, meta.local_block_ids)),
            )
        # blocking
        self.copy_blocks(
            self.device_kv_caches,
            self.host_xfer_buffers,
            meta.local_block_ids,
            meta.local_block_ids,
            "d2h",
The new _logical_to_kernel_block_ids conversion is applied when pulling remote blocks, but save_kv_to_host still uses the unconverted meta.local_block_ids when copying device KV into the host staging buffer. When the runtime shrinks the kernel block size (e.g. logical 128 → physical 64), those IDs are still logical, so the host buffer ends up containing different physical blocks than the ones requested later by _read_blocks, which now uses converted indices. Any deployment that relies on use_host_buffer will therefore transfer mismatched blocks after this change. The host-buffer path should run the same logical→physical mapping before calling copy_blocks.
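A minimal sketch of the suggested fix, assuming a _logical_to_kernel_block_ids helper as described in this PR (the exact signature and call site may differ from the actual code):

```python
# Sketch only: run the same logical->physical mapping on the host-buffer path
# before copying, so the staging buffer holds the same physical blocks that
# _read_blocks later requests. The helper's signature is assumed, not verbatim.
def save_kv_to_host(self, metadata: NixlConnectorMetadata):
    """Copy KV from device to the host staging buffer."""
    assert self.use_host_buffer
    assert self.copy_blocks is not None
    for req_id, meta in metadata.reqs_to_save.items():
        # Convert logical (manager) block IDs to kernel-space block IDs.
        block_ids = self._logical_to_kernel_block_ids(meta.local_block_ids)
        # blocking
        self.copy_blocks(
            self.device_kv_caches,
            self.host_xfer_buffers,
            block_ids,
            block_ids,
            "d2h",
        )
```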
Further separated logical<>physical blocks in ReqMeta so that failures can still be reported using logical blocks ("user/engine-space" rather than "kernel-space"). cc @wseaton

Also, tested hd2dh use-case with:
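To illustrate the ReqMeta separation described in the comment above, here is a hypothetical sketch; the field names are mine, not necessarily the PR's:

```python
from dataclasses import dataclass, field


# Hypothetical sketch: keep both views of the blocks in the request metadata,
# so transfers use kernel-space IDs while failures are still reported in
# logical ("user/engine-space") IDs. Field names are illustrative.
@dataclass
class ReqMeta:
    local_block_ids: list[int] = field(default_factory=list)            # logical / engine-space
    local_physical_block_ids: list[int] = field(default_factory=list)   # kernel-space, used for xfer
```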
Signed-off-by: NickLucche <[email protected]>
@codex review
Codex Review: Didn't find any major issues. Swish!
…oject#28677) Signed-off-by: NickLucche <[email protected]> Signed-off-by: George D. Torres <[email protected]>
Signed-off-by: NickLucche <[email protected]> (cherry picked from commit 96b23b8)
cc @nvjullin
…oject#28677) Signed-off-by: NickLucche <[email protected]> Signed-off-by: Bram Wasti <[email protected]>
…oject#28677) Signed-off-by: NickLucche <[email protected]>
…oject#28677) Signed-off-by: NickLucche <[email protected]>
Fix #28661.
PD deployments with block_size=128 on FA (as well as FI) are currently broken on main.
This is pretty bad because 128 is the suggested block_size, as it maximizes the xfer window.

This is due to the changes introduced in #24486, which adjust block_size at runtime based on backend constraints. The updated value is not picked up by the NixlConnector, which ends up using the one defined by the user (to be precise, the updated value is not stored anywhere).
Example
Furthermore, kv_cache_manager.get_block_ids() returns manager (logical) block IDs, but the connector (particularly nixl_connector) needs kernel-space block IDs. Currently there is no clean, accessible way to map from logical to physical block_ids anywhere in the code. It is also not clear, in terms of the Connector interface contract, whether the blocks supplied by the Scheduler should be logical or already physical.
For the sake of getting this bug fixed, I have implemented the mapping logic within the connector worker.
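A minimal sketch of the logical-to-kernel mapping idea, assuming the logical block_size is an exact multiple of the kernel block size; the helper name follows the _logical_to_kernel_block_ids conversion discussed above, but the body is illustrative rather than the PR's exact code:

```python
# Minimal sketch (illustrative, not the PR's exact implementation).
# Assumes the logical block_size is an exact multiple of the kernel block_size.
def _logical_to_kernel_block_ids(
    logical_ids: list[int], block_size: int, kernel_block_size: int
) -> list[int]:
    assert block_size % kernel_block_size == 0
    factor = block_size // kernel_block_size
    if factor == 1:
        return logical_ids
    # Each logical block maps onto `factor` consecutive kernel blocks.
    return [bid * factor + off for bid in logical_ids for off in range(factor)]


# Example: logical block 3 with block_size=128 over 64-token kernel blocks
# covers kernel blocks 6 and 7.
print(_logical_to_kernel_block_ids([3], 128, 64))  # [6, 7]
```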
The bug was spotted after #27753 landed, as it changed the default FA backend to no longer support 128.
Tested with the same procedure as described in issue #28661:
cc @heheda12345 @njhill @robertgshaw2-redhat