[NIXL] heterogeneous block_size support #26759
Conversation
Signed-off-by: Chendi Xue <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Chendi Xue <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 5431f65 to 9e5b623
Current code only works for NHD. Signed-off-by: Chendi Xue <[email protected]>
Force-pushed from f141077 to 0fc409c
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Force-pushed from 0fc409c to e6e3d92
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Force-pushed from d7c17ea to 984637d
DECODER_TP_SIZE=${DECODER_TP_SIZE:-1}
GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION:-0.2}
PREFILL_BLOCK_SIZE=${PREFILL_BLOCK_SIZE:-16}
DECODE_BLOCK_SIZE=${DECODE_BLOCK_SIZE:-16}
I switched to 128 but then switched back to 16.
I tested with 128 and noticed that, even on origin/main, accuracy is not correct at the moment.
I will see if I can find the root cause in a separate PR.
I verified that setting block_size=64 works, but somehow using block_size=128 for CUDA produces NaN tensors.
@NickLucche, do you want me to set it to 64? Actually, I prefer to use the current default for CUDA, which is 16.
@NickLucche, comments are mostly resolved; the only remaining one is about the default block_size, which I explained in another comment. I verified accuracy on this branch and on current main, and it is not correct on either, so I switched back to 16.
NickLucche left a comment
LGTM, thanks for your patience @xuechendi! Left two minor comments but it would be nice if you could address those quickly before we turn auto-merge on.
assert (
    self.block_size % remote_block_size == 0
    or remote_block_size % self.block_size == 0
), (
    f"Local block size {self.block_size} is not divisible "
    f"by remote block size {remote_block_size} or vice versa."
)
ret = self.block_size / remote_block_size
return ret if ret < 0 else int(ret)
I see your point, but I still think it could be confusing to readers.
Suggested change:

assert self.block_size % remote_block_size == 0, (
    f"Local block size {self.block_size} is not divisible "
    f"by remote block size {remote_block_size}."
)
return self.block_size // remote_block_size
Happy to add it back if/when we land the opposite case.
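For readers skimming the thread, here is a minimal, self-contained sketch of what the simplified helper above boils down to, written as a standalone function with illustrative names rather than the actual NixlConnector method:

```python
# Hedged sketch of the simplified ratio helper suggested above; the function
# name and arguments are hypothetical, not the connector's real signature.

def get_block_size_ratio(local_block_size: int, remote_block_size: int) -> int:
    """How many remote (prefill) blocks fit into one local (decode) block."""
    assert local_block_size % remote_block_size == 0, (
        f"Local block size {local_block_size} is not divisible "
        f"by remote block size {remote_block_size}."
    )
    return local_block_size // remote_block_size


if __name__ == "__main__":
    # e.g. decode on Gaudi with block_size=128, prefill on CUDA with block_size=16
    print(get_block_size_ratio(128, 16))  # -> 8
```

The integer result is what the block-id remapping in the design doc below relies on; the fractional ("or vice versa") branch was dropped here, to be added back if the opposite case lands, as noted above.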
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Force-pushed from 1c85cfb to a6641c2
Hi @NickLucche, thanks for the review. I have fixed the last two comments and tested with the test_nixl_connector.py UT, with small fixes to those tests as well. Please help review again.
Signed-off-by: Chendi Xue <[email protected]>
Test is failing on main, force merging.
Signed-off-by: Chendi Xue <[email protected]> Signed-off-by: Chendi.Xue <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Signed-off-by: George D. Torres <[email protected]>
Signed-off-by: Chendi Xue <[email protected]> Signed-off-by: Chendi.Xue <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Signed-off-by: Bram Wasti <[email protected]>
Signed-off-by: Chendi Xue <[email protected]> Signed-off-by: Chendi.Xue <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]>
Signed-off-by: Chendi Xue <[email protected]> Signed-off-by: Chendi.Xue <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Signed-off-by: Xingyu Liu <[email protected]>
Purpose
To support scenarios where prefill and decode each have their own preferred block_size, e.g. prefill with CUDA (block_size 16) and decode with Intel Gaudi (block_size 128).
More details are described in #26744.
Current status:
CMD for test:
use case 1:
Accuracy is Ok.
Accuracy is Ok.
use case 2:
When setting gpu_utilization=0.8, accuracy is good.
When setting gpu_utilization=0.3, we might use up all block_ids, so we can't use tail block_ids as a temp buffer to store prefill blocks before the permute; accuracy might be slightly impacted => for details, please refer to [Prefill block size > Decode block size] - 3. get_finished() below.
Design doc:
Case 1: nP < nD
Prefill block size < decode block size (example: block_size_ratio = 0.25)
1.1 We register the remote address using the remote layout.
1.2 We create a new local_xfer_handler using the remote block_len, so it can do a one-to-one remote-to-local copy.
read_blocks
2.1 Remap local_block_ids: block [1, 2, 3, 4] => block [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
-> In that case, local blocks and remote blocks are size- and count-aligned (a runnable sketch of this id expansion follows after this case).
get_finished()
3.1 For HND, do a permute on the 4-block buffer.
3.2 For NHD, no permute is needed.
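A runnable sketch of the step-2.1 id expansion for this case, assuming an expansion factor k = decode_block_size // prefill_block_size and 0-based sub-block ids; the helper name is illustrative, not the connector's code:

```python
# Hedged sketch of Case 1 (nP < nD): each local (decode) block is k times
# larger than a remote (prefill) block, so every local block id is expanded
# into k remote-block-sized sub-ids, making local and remote descriptors
# one-to-one for the NIXL read.

def expand_local_block_ids(local_block_ids: list[int], k: int) -> list[int]:
    """Expand each local block id into k remote-block-sized sub-ids."""
    return [bid * k + i for bid in local_block_ids for i in range(k)]


if __name__ == "__main__":
    # e.g. prefill block_size=16, decode block_size=64 -> k = 4
    print(expand_local_block_ids([1, 2, 3], k=4))
    # -> [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
```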
Case 2: nP > nD
Prefill block size > decode block size (example: block_size_ratio = 4)
read_blocks
2.1 Re-map remote_block_ids: block [0] => block [1, 2, 3, 4]
-> In that case, local blocks and remote blocks are size- and count-aligned.
2.2 If len(local_block_ids) < len(after_mapping_remote_block_ids)
-> allocate block_ids from the end of the blockAllocator and append them to local_block_ids (see the sketch after this case)
-> For example: [1, 2, pad, pad] => [1, 2, 12416, 12415] (the reason is explained in 3.1)
get_finished()
3.1 For HND, do a permute on the 4-block buffer.
///// marks unused tokens; they are hidden inside the copied buffer, and we permute to move them to the tail.
Example: we don't need block 3, but with the larger block_size it was part of each head, so we first have to copy the entire buffer and then permute to move block 3 to the tail and drop it.
That is also why we need a temp buffer locally that is large enough to hold the remote buffer.
3.2 For NHD, no permute is needed.
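And a runnable sketch of the step-2.1/2.2 id handling for this case; the function name, the 0-based sub-block convention, and the way scratch ids are taken from the tail of the block pool are illustrative assumptions, not the connector's actual code:

```python
# Hedged sketch of Case 2 (nP > nD): one remote (prefill) block spans k local
# (decode) blocks, so each remote block id expands into k local-sized slots.
# If the request needs fewer local blocks than that, the shortfall is padded
# with scratch block ids borrowed from the tail of the block pool, because the
# whole remote block must be copied locally before the HND permute can move
# the unused tokens to the tail and discard them.

def map_remote_to_local(
    remote_block_ids: list[int],
    local_block_ids: list[int],
    k: int,
    num_local_blocks: int,
) -> tuple[list[int], list[int]]:
    """Return (expanded_remote_ids, padded_local_ids) of equal length."""
    expanded_remote = [rid * k + i for rid in remote_block_ids for i in range(k)]
    padded_local = list(local_block_ids)
    scratch_id = num_local_blocks - 1  # borrow from the end of the pool
    while len(padded_local) < len(expanded_remote):
        padded_local.append(scratch_id)
        scratch_id -= 1
    return expanded_remote, padded_local


if __name__ == "__main__":
    # e.g. prefill block_size=64, decode block_size=16 -> k = 4; 12417-block pool
    remote_ids, local_ids = map_remote_to_local(
        remote_block_ids=[0], local_block_ids=[1, 2], k=4, num_local_blocks=12417
    )
    print(remote_ids)  # -> [0, 1, 2, 3]
    print(local_ids)   # -> [1, 2, 12416, 12415], matching the example above
```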
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.