[P/D Disagg] Direct NIXL Connector #60
Conversation
Awesome work @tlrmchlsmth (and @robertgshaw2-redhat)! Looks really good. I've reviewed everything but nixl_connector.py, will look at that next.
All my comments so far are pretty minor things / shouldn't be blockers.
vllm/v1/worker/gpu_model_runner.py (Outdated)
```python
def maybe_wait_for_save():
    if has_kv_transfer_group():
        kv_connector = get_kv_transfer_group()
        kv_connector.wait_for_save()


def maybe_get_finished() -> tuple[set[str], set[str]]:
    if has_kv_transfer_group():
        kv_connector = get_kv_transfer_group()
        return kv_connector.get_finished()
    else:
        return set(), set()
```
I think these could all be global functions?
Not this PR, but I also don't think the get_kv_transfer_group / has_kv_transfer_group pair is needed; may as well just keep get_kv_transfer_group and have it return None if no connector is configured.
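A rough sketch of what that could look like with a single module-level getter that returns None when no connector is configured (the registry variable and the KVConnectorBase name here are placeholders for illustration, not the actual vLLM symbols):

```python
from typing import Optional

# Placeholder for whatever state the kv_transfer utilities already keep.
_KV_TRANSFER_GROUP: Optional["KVConnectorBase"] = None


def get_kv_transfer_group() -> Optional["KVConnectorBase"]:
    """Return the configured KV connector, or None if there isn't one."""
    return _KV_TRANSFER_GROUP


def maybe_wait_for_save() -> None:
    # With a None-returning getter, the separate has_kv_transfer_group()
    # check becomes unnecessary.
    if (kv_connector := get_kv_transfer_group()) is not None:
        kv_connector.wait_for_save()


def maybe_get_finished() -> tuple[set[str], set[str]]:
    if (kv_connector := get_kv_transfer_group()) is not None:
        return kv_connector.get_finished()
    return set(), set()
```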
```python
self._update_states(scheduler_output)
if not scheduler_output.total_num_scheduled_tokens:
    # KV send/recv even if no work to do.
    with set_forward_context(None, self.vllm_config):
```
Why do we need to set the forward context here?
The forward context gets passed into the KV connector.
We need to change this since it's not a stable API, and we should do so soon, before there's more KVConnector adoption.
```python
# These transfers are designed to be async and the requests
# involved may be disjoint from the running requests.
# Do this here to save a collective_rpc.
kv_connector.start_load_kv(get_forward_context())
```
Maybe better to just pass None? (make forward context optional?)
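A minimal, self-contained sketch of the optional-context idea (DemoConnector and its fields are made up for illustration; this is not the vLLM connector API):

```python
from typing import Optional


class DemoConnector:
    """Hypothetical stand-in for a KV connector, not the vLLM class."""

    def __init__(self) -> None:
        # Requests the connector already knows it must load, tracked
        # independently of any forward pass.
        self._pending_load_requests = ["req-0", "req-1"]

    def start_load_kv(self, forward_context: Optional[dict] = None) -> None:
        # If the context is optional, callers outside a forward pass can
        # pass None instead of constructing a dummy forward context.
        if forward_context is None:
            requests = self._pending_load_requests
        else:
            requests = forward_context.get("requests", [])
        for req in requests:
            print(f"issuing async KV read for {req}")


DemoConnector().start_load_kv()                          # no forward context
DemoConnector().start_load_kv({"requests": ["req-2"]})   # with one
```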
njhill left a comment
A few more minor things
```python
first_layer_name = next(iter(kv_caches))
first_kv_cache = kv_caches[first_layer_name]
```
nit:
```python
first_layer_name, first_kv_cache = next(iter(kv_caches.items()))
```
```python
# [2 (k and v), num_blocks, ...]
# TODO(tms): num_blocks will be in a different spot for MLA.
num_blocks = first_kv_cache.shape[1]
kv_elem_size = first_kv_cache[0].element_size()
```
```python
kv_elem_size = first_kv_cache.element_size()
```
```python
# hybrid attn, etc
self.block_len = kv_elem_size * math.prod(first_kv_cache.shape[-3:])


logger.debug("Per layer kv cache size: %s", first_kv_cache[0].shape)
```
I think?
```python
logger.debug("Per layer kv cache size: %s", first_kv_cache.shape)
```
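For what it's worth, a worked example of the block_len computation above, under an assumed cache layout of [2 (K and V), num_blocks, block_size, num_heads, head_dim]; the real layout depends on the attention backend:

```python
import math

import torch

# Assumed example layout: [2 (K and V), num_blocks, block_size, num_heads, head_dim]
first_kv_cache = torch.zeros(2, 4, 16, 8, 128, dtype=torch.float16)

kv_elem_size = first_kv_cache.element_size()                      # 2 bytes (fp16)
block_len = kv_elem_size * math.prod(first_kv_cache.shape[-3:])   # bytes per block of K (or V)
assert block_len == 2 * 16 * 8 * 128                              # 32768 bytes
```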
```python
for layer_name in kv_caches:
    for cache in kv_caches[layer_name]:
```
```python
for caches in kv_caches.values():
    for cache in caches:
```
And we are iterating over slices of the first dimension of the tensor here, is that right?
Going to leave this as-is (along with some of the other comments around here) since I'm not sure it's right. Here, order is important and I'm not sure the regions will end up being ordered by layer if we make this change
Sure! Though this particular change at least is equivalent. When you iterate over a dict it's always in the same order whether you're iterating over the keys, values or items. I guess I should have suggested layer rather than caches as the variable name
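A quick standalone illustration of the ordering point: Python dicts preserve insertion order, and keys(), values(), and items() all walk that same order, so the suggested rewrite cannot reorder the registered regions.

```python
kv_caches = {"layers.0": "cache0", "layers.1": "cache1", "layers.2": "cache2"}

# Iterating over keys and indexing, versus iterating over values directly,
# yields the caches in the same (insertion) order.
assert [kv_caches[name] for name in kv_caches] == list(kv_caches.values())
assert list(zip(kv_caches.keys(), kv_caches.values())) == list(kv_caches.items())
```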
```python
    kv_caches[first_layer_name][1, b, 0, 0, 0])
remote_engine_id = None  # HACK for debug send


if NIXL_ROLE == "SENDER":
```
Should this also be nested under the dev if above?
No, this is still needed for handshaking
njhill left a comment
Approving to unblock if needed; I don't think any of my comments are critical.
```python
# Note(tms): The remote_block_ids only contain full computed blocks,
# while the local_block_ids are all blocks allocated for this request,
# so truncate the local_block_ids to account for this.
del local_block_ids[len(remote_block_ids):]
```
Note: I'm not sure this is sound. We need to double-check this.
Also, now that num_preallocate_tokens=0, we should not have this case anymore.
So I think we should remove this. Just making a note here for posterity
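For posterity, a small made-up example of what the truncation does (the block IDs are invented; whether the head-aligned correspondence is actually sound is the open question above):

```python
# Prefill side has 3 fully computed blocks; decode side allocated 5 blocks
# for the same request (e.g. including preallocated ones).
remote_block_ids = [10, 11, 12]
local_block_ids = [3, 7, 9, 14, 21]

# Drop the trailing local blocks with no computed remote counterpart so the
# two lists can be paired 1:1 into transfer descriptors.
del local_block_ids[len(remote_block_ids):]
assert local_block_ids == [3, 7, 9]
```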