
UPSTREAM PR #18095: llama-fit-params: fix underflow for dense models #590

Open
loci-dev wants to merge 1 commit into main from
upstream-PR18095-branch_JohannesGaessler-llama-fp-fix-dense-underflow

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18095

Fixes ggml-org/llama.cpp#18087 .

The way the number of unassigned layers is calculated is not quite correct, resulting in a numerical underflow: a value is unintentionally subtracted more than once.

@loci-review

loci-review bot commented Dec 16, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #590

Analysis Scope: Single file modification (src/llama.cpp) affecting the llama_params_fit_impl function in the memory allocation subsystem.

Code Changes: This PR fixes a numerical underflow bug in multi-GPU layer distribution logic. The fix moves the n_unassigned variable calculation inside the loop iteration, recalculating it explicitly rather than maintaining it as a loop-carried state variable. This prevents double-subtraction that caused incorrect layer allocation across devices.

Performance Impact: The modified function is not in the inference hot path. It executes once during model initialization, not during token generation. Analysis of the top 10 functions with highest response time changes shows no modifications to inference-critical functions such as llama_decode, llama_encode, or llama_tokenize.

Tokens Per Second Impact: No impact on inference throughput. The changed function handles device memory allocation during model loading and does not participate in the token generation pipeline. The fix adds an O(nd²) inner loop where nd represents the number of devices (typically 1-8), resulting in microsecond-level overhead during initialization only.

Power Consumption: Analysis shows build.bin.libllama.so has a 0.114% reduction in power consumption (186,068 nJ to 185,856 nJ), a 212 nJ improvement. This minor improvement is attributed to faster KV cache accessor functions (llama_kv_cells::is_empty, llama_kv_cells::get_shift, llama_kv_cells::pos_get), which showed 20-29% reductions in per-call execution time (37-49 ns in absolute terms). These functions are unrelated to the PR changes; the shift represents baseline measurement variance.

Conclusion: This correctness fix has no measurable impact on inference performance or tokens per second. The changes ensure proper multi-GPU memory allocation without affecting the token generation pipeline.

@loci-dev loci-dev force-pushed the main branch 27 times, most recently from 2e88b20 to e02e9be Compare December 19, 2025 08:12
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 15838f1 to 006b713 Compare December 24, 2025 23:08