
UPSTREAM PR #18095: llama-fit-params: fix underflow for dense models #590

Open
loci-dev wants to merge 1 commit into main from
upstream-PR18095-branch_JohannesGaessler-llama-fp-fix-dense-underflow

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18095

Fixes ggml-org/llama.cpp#18087 .

The way the number of unassigned layers is calculated is not quite correct, resulting in a numerical underflow: a value is unintentionally subtracted more than once.

@loci-review

loci-review bot commented Dec 16, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #590

Analysis Scope: Single file modification (src/llama.cpp) affecting the llama_params_fit_impl function in the memory allocation subsystem.

Code Changes: This PR fixes a numerical underflow bug in multi-GPU layer distribution logic. The fix moves the n_unassigned variable calculation inside the loop iteration, recalculating it explicitly rather than maintaining it as a loop-carried state variable. This prevents double-subtraction that caused incorrect layer allocation across devices.

Performance Impact: The modified function is not in the inference hot path. It executes once during model initialization, not during token generation. Analysis of the top 10 functions with highest response time changes shows no modifications to inference-critical functions such as llama_decode, llama_encode, or llama_tokenize.

Tokens Per Second Impact: No impact on inference throughput. The changed function handles device memory allocation during model loading and does not participate in the token generation pipeline. The fix adds an O(nd²) inner loop where nd represents the number of devices (typically 1-8), resulting in microsecond-level overhead during initialization only.

Power Consumption: Analysis shows build.bin.libllama.so has a 0.114% reduction in power consumption (186,068 nJ to 185,856 nJ), a 212 nJ improvement. This minor improvement is attributed to faster KV cache accessor functions (llama_kv_cells::is_empty, llama_kv_cells::get_shift, llama_kv_cells::pos_get), which showed 20-29% reductions in per-call execution time (37-49 ns in absolute terms). These functions are unrelated to the PR changes; the shift represents baseline measurement variance.

Conclusion: This correctness fix has no measurable impact on inference performance or tokens per second. The changes ensure proper multi-GPU memory allocation without affecting the token generation pipeline.

@loci-dev loci-dev force-pushed the main branch 27 times, most recently from 2e88b20 to e02e9be Compare December 19, 2025 08:12
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 15838f1 to 006b713 Compare December 24, 2025 23:08