UPSTREAM PR #18095: llama-fit-params: fix underflow for dense models#590
Conversation
Performance Analysis Summary: PR #590

Analysis Scope: Single file modification (src/llama.cpp).

Code Changes: This PR fixes a numerical underflow bug in multi-GPU layer distribution logic.

Performance Impact: The modified function is not in the inference hot path. It executes once during model initialization, not during token generation. Analysis of the top 10 functions with the highest response-time changes shows no modifications to inference-critical functions such as llama_decode, llama_encode, or llama_tokenize.

Tokens Per Second Impact: No impact on inference throughput. The changed function handles device memory allocation during model loading and does not participate in the token generation pipeline. The fix adds an O(nd²) inner loop, where nd is the number of devices (typically 1-8), resulting in microsecond-level overhead during initialization only.

Power Consumption: build.bin.libllama.so shows a 0.114% reduction in power consumption (186,068 nJ to 185,856 nJ), a 212 nJ improvement. This minor improvement is attributed to variation in KV-cache accessor functions (llama_kv_cells::is_empty, llama_kv_cells::get_shift, llama_kv_cells::pos_get), which showed 20-29% throughput reductions, 37-49 ns per call in absolute terms. These functions are unrelated to the PR changes and represent baseline measurement variance.

Conclusion: This correctness fix has no measurable impact on inference performance or tokens per second. The changes ensure proper multi-GPU memory allocation without affecting the token generation pipeline.
2e88b20 to e02e9be
15838f1 to 006b713
Mirrored from ggml-org/llama.cpp#18095
Fixes ggml-org/llama.cpp#18087.
The number of unassigned layers is not calculated quite correctly: a value is unintentionally subtracted more than once, resulting in a numerical underflow.