[fix] more mem for draft_extend cuda_graph #6726
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -271,6 +271,9 @@ def __post_init__(self): | |
| mem_fraction + 48 * 1024 * (1 - mem_fraction) / gpu_mem, | ||
| (gpu_mem - reserve_mem) / gpu_mem, | ||
| ) | ||
| else: | ||
| if self.speculative_algorithm is not None: | ||
| self.mem_fraction_static *= 0.95 | ||
Comment on lines +274 to +276
Contributor
This new logic adjusts memory allocation for speculative decoding on smaller GPUs. Could a brief comment be added here to explain why this reduction is necessary (e.g., to provide more headroom for CUDA graph capture and other speculative decoding overheads)? This would help future maintainers understand the rationale.

    else:
        # For GPUs with less than 96GiB memory (or if memory size is unknown),
        # reduce the static memory fraction if speculative decoding is active.
        # This provides more headroom for CUDA graph captures and other overheads
        # associated with speculative decoding, helping to prevent OOM errors.
        if self.speculative_algorithm is not None:
            self.mem_fraction_static *= 0.95
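To get a feel for how much headroom the 0.95 factor frees, the arithmetic can be sketched as follows. The helper name `static_headroom_gib` and the example inputs (an 80 GiB GPU with a starting static fraction of 0.88) are illustrative assumptions, not values taken from this PR.

```python
def static_headroom_gib(
    gpu_mem_gib: float,
    mem_fraction_static: float,
    factor: float = 0.95,  # the multiplier applied in the diff
) -> float:
    """Return the memory (in GiB) released from the static pool by scaling."""
    before = gpu_mem_gib * mem_fraction_static
    after = before * factor
    return before - after


# Hypothetical inputs: 80 GiB GPU, static fraction 0.88.
print(round(static_headroom_gib(80, 0.88), 2))
```

Because the factor multiplies the fraction rather than subtracting a fixed amount, the released memory scales with both the GPU size and the configured static fraction.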
| # Set chunked prefill size, which depends on the gpu memory capacity | ||
| if self.chunked_prefill_size is None: | ||
The magic number 0.95 reduces the static memory fraction by 5%. To improve code readability and maintainability, could this be defined as a named constant at the class or module level? For example, SPECULATIVE_MEM_FRACTION_ADJUSTMENT_FACTOR = 0.95.
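One way to act on this suggestion is sketched below. It is a stand-alone illustration rather than the project's code: `ServerArgsSketch` is a hypothetical, reduced stand-in for the real argument class, it omits the GPU-memory branch that the diff's `else:` belongs to, and the constant name is the one proposed in the comment above.

```python
from dataclasses import dataclass
from typing import Optional

# Name taken from the review comment; the value matches the diff.
SPECULATIVE_MEM_FRACTION_ADJUSTMENT_FACTOR = 0.95


@dataclass
class ServerArgsSketch:
    """Hypothetical, reduced stand-in for the real argument class."""

    mem_fraction_static: float
    speculative_algorithm: Optional[str] = None

    def __post_init__(self):
        # Shrink the static pool when speculative decoding is enabled, leaving
        # headroom for CUDA graph capture and related overheads.
        if self.speculative_algorithm is not None:
            self.mem_fraction_static *= SPECULATIVE_MEM_FRACTION_ADJUSTMENT_FACTOR
```

With the constant at module level, the factor can be documented and tuned in one place, and the condition in `__post_init__` reads as intent rather than as a bare number.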