Add decode req pool #6980
Changes from all commits
```diff
@@ -912,12 +912,26 @@ def init_memory_pool(
        )

        if self.req_to_token_pool is None:
-            self.req_to_token_pool = ReqToTokenPool(
-                size=max_num_reqs,
-                max_context_len=self.model_config.context_len + 4,
-                device=self.device,
-                enable_memory_saver=self.server_args.enable_memory_saver,
-            )
+            if self.server_args.disaggregation_mode == "decode":
+                from sglang.srt.disaggregation.decode import DecodeReqToTokenPool
+
+                # subscribe memory for pre-allocated requests
+                # if max_num_reqs <= 32, we pre-allocate 2x requests
+                pre_alloc_size = max_num_reqs * 2 if max_num_reqs <= 32 else 0
```
Contributor

The numbers here are magic numbers. Could these be refactored into constants? For example:

```python
# At module level or in a config class
_PRE_ALLOC_SIZE_MULTIPLIER = 2
_MAX_NUM_REQS_THRESHOLD_FOR_PRE_ALLOC = 32

# In the function
pre_alloc_size = (
    max_num_reqs * _PRE_ALLOC_SIZE_MULTIPLIER
    if max_num_reqs <= _MAX_NUM_REQS_THRESHOLD_FOR_PRE_ALLOC
    else 0
)
```
```diff
+
+                self.req_to_token_pool = DecodeReqToTokenPool(
+                    size=max_num_reqs,
+                    max_context_len=self.model_config.context_len + 4,
+                    device=self.device,
+                    enable_memory_saver=self.server_args.enable_memory_saver,
+                    pre_alloc_size=pre_alloc_size,
+                )
+            else:
+                self.req_to_token_pool = ReqToTokenPool(
+                    size=max_num_reqs,
+                    max_context_len=self.model_config.context_len + 4,
+                    device=self.device,
+                    enable_memory_saver=self.server_args.enable_memory_saver,
+                )
        else:
            # Draft worker shares req_to_token_pool with the target worker.
            assert self.is_draft_worker
```
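The sizing rule introduced by this branch is simple: for small deployments (`max_num_reqs <= 32`) the decode pool reserves twice as many request slots up front; for larger pools nothing extra is pre-allocated. A minimal standalone sketch of that rule (the function name `compute_pre_alloc_size` is illustrative, not from the PR):

```python
def compute_pre_alloc_size(max_num_reqs: int) -> int:
    """Mirror the PR's pre-allocation sizing rule.

    For small request pools (<= 32 slots), reserve 2x extra slots so
    decode requests handed over from the prefill side are less likely
    to wait for a free slot. Larger pools get no extra reservation.
    """
    return max_num_reqs * 2 if max_num_reqs <= 32 else 0


print(compute_pre_alloc_size(16))   # small pool -> 32 extra slots
print(compute_pre_alloc_size(256))  # large pool -> 0 extra slots
```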
The `alloc` method currently uses list slicing (`self.free_slots[:need_size]` and `self.free_slots = self.free_slots[need_size:]`) to manage `self.free_slots`. If `self.free_slots` can become very large (e.g., thousands of entries, which is possible given `max_num_reqs` can be large) and `alloc` is called frequently, these O(N) list operations (where N is the length of `free_slots`) could become a performance bottleneck.

Have you considered using `collections.deque` for `self.free_slots`? A `deque` would allow O(1) appends (for `free`) and O(1) `popleft` operations. Allocating `need_size` items would then be O(need_size), which could be more efficient than O(N) list slicing when N is large and `need_size` is relatively small. This would require changing the type of `self.free_slots` in `__init__` and `clear` as well.
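The deque-based approach suggested above can be sketched as follows. This is an illustrative rewrite of a hypothetical free-slot manager, not the PR's actual pool class (which holds additional per-request token mappings); only the slicing-vs-`deque` contrast is the point:

```python
from collections import deque
from typing import Optional


class FreeSlotPool:
    """Free-slot bookkeeping using a deque instead of list slicing."""

    def __init__(self, size: int):
        self.size = size
        # deque instead of list: O(1) popleft/append at the ends.
        self.free_slots: deque = deque(range(size))

    def alloc(self, need_size: int) -> Optional[list]:
        # Not enough free slots: caller must retry later.
        if need_size > len(self.free_slots):
            return None
        # O(need_size) poplefts instead of two O(N) list slices.
        return [self.free_slots.popleft() for _ in range(need_size)]

    def free(self, slots: list) -> None:
        # O(1) amortized append per returned slot.
        self.free_slots.extend(slots)

    def clear(self) -> None:
        # Reset to the full free list, keeping the deque type.
        self.free_slots = deque(range(self.size))
```

With list slicing, every `alloc` copies the remaining N-need_size entries; with a `deque`, only the allocated entries are touched, so the cost no longer grows with the total pool size.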