Merged
3 changes: 3 additions & 0 deletions python/sglang/srt/server_args.py
@@ -271,6 +271,9 @@ def __post_init__(self):
mem_fraction + 48 * 1024 * (1 - mem_fraction) / gpu_mem,
(gpu_mem - reserve_mem) / gpu_mem,
)
else:
if self.speculative_algorithm is not None:
self.mem_fraction_static *= 0.95
Contributor

medium

The magic number 0.95 scales the static memory fraction down by 5% (a relative reduction, not five percentage points). To improve readability and maintainability, could this be defined as a named constant at the class or module level? For example, _SPECULATIVE_MEM_FRACTION_ADJUSTMENT_FACTOR = 0.95.

Suggested change
self.mem_fraction_static *= 0.95
self.mem_fraction_static *= _SPECULATIVE_MEM_FRACTION_ADJUSTMENT_FACTOR # Or a similarly named constant

Comment on lines +274 to +276
Contributor

medium

This new logic adjusts memory allocation for speculative decoding on smaller GPUs. Could a brief comment be added here to explain why this reduction is necessary (e.g., to provide more headroom for CUDA graph capture and other speculative decoding overheads)? This would help future maintainers understand the rationale.

            else:
                # For GPUs with less than 96GiB memory (or if memory size is unknown),
                # reduce the static memory fraction if speculative decoding is active.
                # This provides more headroom for CUDA graph captures and other overheads
                # associated with speculative decoding, helping to prevent OOM errors.
                if self.speculative_algorithm is not None:
                    self.mem_fraction_static *= 0.95
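
To give a rough sense of how much headroom the 0.95 factor frees, here is a small arithmetic sketch. The helper `extra_headroom_gib` and the example numbers (an 80 GiB GPU, a starting fraction of 0.88) are assumptions chosen for illustration; they do not come from the PR.

```python
def extra_headroom_gib(gpu_mem_gib: float, mem_fraction_static: float,
                       factor: float = 0.95) -> float:
    """Memory (GiB) released from the static pool after scaling by `factor`.

    Hypothetical helper: it only mirrors the multiplication in the diff.
    """
    before = gpu_mem_gib * mem_fraction_static
    after = gpu_mem_gib * mem_fraction_static * factor
    return before - after


# Assumed example values: 80 GiB GPU, static fraction 0.88.
freed = extra_headroom_gib(80, 0.88)
print(f"static fraction after scaling: {0.88 * 0.95:.3f}")
print(f"headroom freed: {freed:.2f} GiB")
```

Under these assumed values the static fraction drops from 0.88 to 0.836, which leaves about 3.5 GiB more for CUDA graph capture and other speculative decoding overheads.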


# Set chunked prefill size, which depends on the gpu memory capacity
if self.chunked_prefill_size is None: