[P/D] [NixlConnector] kv load recovery integration #26171
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces failure recovery mechanisms for the NIXL KV connector. It adds error handling for transfer initiation failures and failures during block reads. When a failure occurs, the affected KV cache blocks are marked as invalid, and this information is propagated to the scheduler for retrying the request. The changes also include adding statistics for failed transfers and notifications, and rate-limiting for some log messages to prevent spam.
The overall approach is sound and significantly improves the robustness of the NIXL connector. However, I've found a critical issue where failed blocks are not reported correctly when use_host_buffer is disabled, which would prevent failure recovery in that configuration. I've left a comment with details on the issue and a suggested fix.
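The recovery flow summarized in this review can be sketched roughly as follows. This is only an illustration, not the connector's actual code: `RecoveryTracker`, `TransferStats`, `read_blocks`, and `take_invalid_blocks` are hypothetical names, and the transfer itself is stood in for by an arbitrary callable.

```python
from dataclasses import dataclass, field


@dataclass
class TransferStats:
    """Hypothetical counters; the PR's real stats fields may differ."""
    failed_transfers: int = 0


@dataclass
class RecoveryTracker:
    """Collects block IDs whose KV load failed so they can be reported."""
    invalid_block_ids: set = field(default_factory=set)
    stats: TransferStats = field(default_factory=TransferStats)

    def read_blocks(self, block_ids, transfer_fn):
        """Run transfer_fn; on failure, mark every requested block invalid."""
        try:
            transfer_fn(block_ids)
        except Exception:
            # Assumption: any exception from the transfer invalidates
            # all blocks that were part of that read.
            self.stats.failed_transfers += 1
            self.invalid_block_ids.update(block_ids)

    def take_invalid_blocks(self):
        """Hand the accumulated failures to the caller and reset the set."""
        reported, self.invalid_block_ids = self.invalid_block_ids, set()
        return reported
```

In this sketch the scheduler side would call `take_invalid_blocks()` once per step and retry any request that owns one of the returned block IDs. Reporting must happen on every code path, which is the concern raised above for the case where use_host_buffer is disabled.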
Signed-off-by: Will Eaton <[email protected]>
Co-authored-by: Nick Hill <[email protected]> Signed-off-by: Will Eaton <[email protected]>
@njhill this needs a manual merge, had to rebase because of formatting changes 😬
Signed-off-by: Will Eaton <[email protected]> Signed-off-by: Vladislav <[email protected]>
Signed-off-by: Will Eaton <[email protected]> Signed-off-by: 1994 <[email protected]>
Signed-off-by: Will Eaton <[email protected]> Signed-off-by: Dhruvil Bhatt <[email protected]>
Signed-off-by: Will Eaton <[email protected]> Signed-off-by: bbartels <[email protected]>
Signed-off-by: Will Eaton <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Signed-off-by: Will Eaton <[email protected]> Signed-off-by: 0xrushi <[email protected]>
Purpose
This integrates nixl_connector with additional scheduler features exposed in #19330 for retrying requests that have failed blocks.
This PR also includes a small bugfix: if P crashes during the zmq handshake, the D node's request status would get stuck in WAITING_FOR_REMOTE_KV forever.
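The shape of the handshake bugfix can be illustrated with a small sketch. It is a simplified model under stated assumptions, not the connector's implementation: `Status.READY_TO_RETRY`, `Status.LOADING`, and `on_handshake_done` are hypothetical names, and the handshake is modeled as a `concurrent.futures.Future`.

```python
from concurrent.futures import Future
from enum import Enum, auto


class Status(Enum):
    WAITING_FOR_REMOTE_KV = auto()
    LOADING = auto()         # hypothetical: handshake succeeded, transfer started
    READY_TO_RETRY = auto()  # hypothetical: handed back to the scheduler


def on_handshake_done(requests, req_id, fut):
    """Move the request out of the waiting state on success AND on failure."""
    if fut.exception() is not None:
        # Without this branch a crashed prefill node would leave the
        # request in WAITING_FOR_REMOTE_KV indefinitely.
        requests[req_id] = Status.READY_TO_RETRY
    else:
        requests[req_id] = Status.LOADING


# Usage: simulate P crashing mid-handshake.
requests = {"req-0": Status.WAITING_FOR_REMOTE_KV}
handshake = Future()
handshake.add_done_callback(lambda f: on_handshake_done(requests, "req-0", f))
handshake.set_exception(ConnectionError("prefill node went away"))
```

The point of the sketch is only that the failure path has an explicit state transition; what the real connector does with the request after that is governed by the scheduler features from #19330.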
Test Plan
For integration testing, faults were injected using a vllm process instrumented with https://github.com/wseaton/ucx-fault-injector/, which forces nixl exceptions to be thrown during transfer.
Logs
Future Work
Make this behavior opt-out via a global configuration option, and then enable aborting in the API server for the fail path, since the failure recovery mechanism currently results in local prefills on the decode node.