
UPSTREAM PR #17276: ggml : add GGML_NO_REALLOC option to disable reallocations in ggml-alloc #215

Open

DajanaV wants to merge 4 commits into `main` from `upstream-PR17276-branch_ggml-org-sl/realloc-error`

Conversation

@DajanaV (Collaborator) commented Nov 14, 2025

Mirrored from ggml-org/llama.cpp#17276

@loci-review bot commented Nov 14, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Pull Request #215 introduces the GGML_NO_REALLOC debugging feature to disable memory reallocations in the GGML allocator system. This change is intended to improve debugging capabilities by preventing dynamic memory growth during inference and enabling detection of unexpected reallocation patterns.
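The effect of such a flag can be sketched as follows — illustrative names only, assuming the flag simply turns any buffer-growth request into a hard failure; this is not ggml's actual code:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch: with the no-realloc flag defined, a request to
 * grow a buffer is refused instead of reallocating, so unexpected
 * reallocation patterns surface immediately during debugging.
 * Names here are hypothetical, not ggml's API. */
#define GGML_NO_REALLOC

static void *buf_grow(void *buf, size_t old_size, size_t new_size) {
    if (new_size <= old_size) {
        return buf; /* no growth needed, buffer untouched */
    }
#ifdef GGML_NO_REALLOC
    fprintf(stderr, "alloc: reallocation required (%zu -> %zu bytes)\n",
            old_size, new_size);
    return NULL; /* caller treats this as a fatal allocation failure */
#else
    return realloc(buf, new_size);
#endif
}
```

In the default build the `#ifdef` branch compiles away entirely, which is why the option is expected to be free when disabled.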

Key Findings

Performance Impact

The highest percentage changes occurred in non-core utility functions:

  • nearest_int function: +26.3% Response Time (+48 ns absolute), +28.5% Throughput Time (+49 ns absolute)
  • _M_get_insert_unique_pos function: +31.2% Throughput Time (+88 ns absolute), +9.6% Response Time (+88 ns absolute)
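For context, `nearest_int` is a tiny float-to-int rounding helper in ggml-quants.c; the classic magic-number rounding trick such helpers are based on can be sketched as follows (a sketch of the technique, not necessarily ggml's exact code):

```c
#include <assert.h>
#include <string.h>

/* Magic-number rounding: adding 1.5 * 2^23 forces the integer part of
 * the value into the float's mantissa bits, giving round-to-nearest at
 * a fixed cost. Valid roughly for |fval| <= 2^22 - 1. */
static inline int nearest_int_sketch(float fval) {
    float val = fval + 12582912.0f;  /* 1.5 * 2^23 */
    int i;
    memcpy(&i, &val, sizeof(int));   /* reinterpret the float's bits */
    return (i & 0x007fffff) - 0x00400000;
}
```

Because the function body is only a handful of instructions, the added stack-protection and assertion overhead described below is proportionally large even though it is tiny in absolute terms.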

Core Inference Functions: No performance changes were detected in critical inference functions (llama_decode, llama_encode, llama_tokenize), indicating no impact on tokens per second performance.

Power Consumption Analysis

Minimal power consumption changes across all binaries:

  • libggml-base.so: -0.154% decrease (87,918 vs 88,053 nJ)
  • All other binaries: 0.0% change, indicating stable energy efficiency

Technical Analysis

Flame Graph: The nearest_int function shows increased security overhead with stack protection (__stack_chk_fail) and assertion checking (__assert_fail) contributing 6% of execution time.

CFG Comparison: Assembly analysis reveals block separation in the new version, where stack canary validation was moved to a separate block, causing a 102% increase in main logic block execution time despite identical instruction count.

Code Review Insights

The performance regression is an indirect effect of the GGML_NO_REALLOC compilation flag rather than direct code changes. The flag alters memory allocation patterns and compiler optimization behavior, affecting cache locality and instruction scheduling for quantization operations.

Conclusion

While percentage changes appear significant, the absolute impact is minimal (under 100 ns). The changes represent a valid debugging enhancement with negligible impact on overall inference performance, as core tokenization and inference functions remain unaffected.

@DajanaV force-pushed the main branch 21 times, most recently from a6141bf to e336e72 on November 17, 2025 12:14
@DajanaV force-pushed the upstream-PR17276-branch_ggml-org-sl/realloc-error branch from 6d90fe9 to 0710d5f on November 17, 2025 20:36
@loci-review bot commented Nov 17, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: GGML_NO_REALLOC Option Implementation

Overview

PR #215 introduces a debugging option (GGML_SCHED_NO_REALLOC) to disable memory reallocations in GGML's allocator. The changes span 6 files with build system integration, enhanced debug logging, and conditional compilation guards. While the implementation is functionally sound, performance analysis reveals localized regressions in quantization functions.

Key Findings

Highest Performance Impact:

  • nearest_int (ggml-quants.c): +26.3% Response Time (185 ns → 233 ns), +28.5% Throughput (170 ns → 219 ns)
  • _M_get_insert_unique_pos (STL): +31.2% Throughput (283 ns → 372 ns)

Core Function Impact Assessment:
The performance regressions do not affect core inference functions (llama_decode, llama_encode, llama_tokenize) that directly impact tokens per second throughput. The affected functions are:

  • Quantization utilities (nearest_int) - used during model loading/conversion, not real-time inference
  • STL container operations - memory management overhead, not inference path

Tokens Per Second Impact: Negligible. Since core inference functions remain unaffected, the model's inference throughput should maintain baseline performance.

Power Consumption Analysis:
The build.bin.libggml-base.so binary shows a 0.56% power consumption reduction (2062 nJ → 2050 nJ), indicating net efficiency gains despite localized regressions.

Flame Graph & CFG Analysis:
The nearest_int regression stems from enhanced stack protection mechanisms and separated control flow for error handling, not algorithmic changes. The CFG comparison reveals additional basic blocks for stack canary validation, explaining the 102% execution time increase in the main computation block.

Code Review Insights:
The implementation correctly adds debugging capabilities without functional regressions. The performance impact in nearest_int appears to be a side effect of compiler optimization changes triggered by the new build configuration rather than direct code modifications.

Actionable Recommendations:

  • Fix typo in error message: "failured" → "failed"
  • Consider investigating the nearest_int regression separately as it's unrelated to the PR's core functionality
  • Ensure CI tests both configurations to maintain robustness

The changes successfully implement the intended debugging feature while maintaining overall system performance.

@DajanaV force-pushed the main branch 3 times, most recently from f333350 to 9c4623f on November 18, 2025 09:10
@loci-dev force-pushed the main branch 25 times, most recently from 7dd50b8 to 3163acc on November 26, 2025 21:07
@loci-review bot commented Nov 27, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #215

Analysis Scope: 8 files modified, focusing on memory allocation debugging and unified KV cache sequence handling.


Overview

This PR introduces a compile-time debugging flag GGML_SCHED_NO_REALLOC and modifies sequence count calculation in unified KV cache mode. The changes result in +0.012% power consumption increase in libllama.so with minimal impact on inference performance.


Key Findings

Performance-Critical Functions Impact

Memory Allocation Path:

  • ggml_backend_sched_alloc_splits: Added conditional abort path when strict allocation mode is enabled. No performance impact in default configuration.
  • ggml_gallocr_reserve_n: Modified debug logging to suppress initial allocation messages. Negligible impact on debug builds.
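A minimal sketch of what such a strict-allocation guard might look like (hypothetical names and a boolean stand-in for the realloc check; the real `ggml_backend_sched_alloc_splits` is considerably more involved):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical guard: when graph allocation would require growing an
 * existing buffer and strict mode is compiled in, fail loudly instead
 * of reallocating. In the default build the #ifdef branch compiles
 * away, so there is no runtime cost. */
#define GGML_SCHED_NO_REALLOC

static bool sched_alloc_splits(bool needs_realloc) {
    if (needs_realloc) {
#ifdef GGML_SCHED_NO_REALLOC
        fprintf(stderr, "sched: graph reallocation needed but disabled\n");
        return false; /* strict mode refuses to reallocate */
#endif
    }
    return true; /* default mode would reallocate and continue */
}
```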

Context Initialization:

  • llama_context::llama_context and llama_context::memory_update: changed from `n_seqs = kv_unified ? 1 : n_seq_max` to `n_seqs = n_seq_max`. This increases graph allocation size in unified cache mode by a factor of n_seq_max.
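The before/after of that calculation, written out as standalone functions (illustrative, not the actual llama_context members):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Before: unified KV cache mode sized graph allocations for a single
 * sequence. After: allocations are sized for all n_seq_max sequences
 * regardless of cache mode, trading memory for uniform behavior. */
static uint32_t n_seqs_before(bool kv_unified, uint32_t n_seq_max) {
    return kv_unified ? 1 : n_seq_max;
}

static uint32_t n_seqs_after(bool kv_unified, uint32_t n_seq_max) {
    (void) kv_unified; /* conditional removed in the PR */
    return n_seq_max;
}
```

With `kv_unified = true` and, say, `n_seq_max = 8`, the allocation count goes from 1 to 8 — which is what drives the larger container sizes observed in the STL measurements below.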

STL and Iterator Functions:
The observed performance changes in STL functions are indirect effects of larger data structures:

  • __gnu_cxx::__normal_iterator::operator-: +33 ns response time increase due to operations on larger vector containers
  • __gnu_cxx::__ops::__pred_iter: +32 ns response time increase in Mirostat v2 sampling from processing more sequences
  • llama_kv_cells::ext_get: +21 ns response time increase from additional cache cell access patterns
  • std::vector::assign: +19 ns throughput increase from larger character buffer operations
  • std::_Rb_tree::_M_insert_: +26 ns throughput increase from tracking more nodes in scheduling structures

Inference Performance Impact

Tokenization and Inference Functions:
No direct modifications to llama_decode, llama_encode, or llama_tokenize functions. The response time and throughput of these core inference functions remain unchanged. Based on the reference measurement where 2 ms degradation in llama_decode causes 7% reduction in tokens per second, the observed changes in this PR have negligible impact on inference throughput.

Affected inference-adjacent functions:

  • Sampling operations: +32 ns in predicate evaluation
  • Memory management: +21 ns in KV cache access
  • Batch processing: Indirect effects from larger graph allocations

Tokens per second impact: Negligible. The absolute time increases are in the 20-35 ns range for supporting functions, which translates to less than 0.001% impact on overall inference latency.
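As a back-of-envelope check of that claim (assuming the stated sensitivity — 2 ms of llama_decode degradation ≈ 7% tokens-per-second loss — scales linearly):

```c
#include <assert.h>

/* Scale the reference sensitivity (2 ms delta -> 7% tps loss) down to
 * the nanosecond-range deltas observed in this PR. */
static double tps_loss_pct(double delta_ns) {
    const double ref_delta_ns = 2e6; /* 2 ms expressed in ns */
    const double ref_loss_pct = 7.0;
    return ref_loss_pct * (delta_ns / ref_delta_ns);
}
```

Plugging in the worst observed delta of 35 ns gives roughly 0.0001% — comfortably below the 0.001% bound stated above.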

Power Consumption Analysis

Binary-level impact:

  • build.bin.libllama.so: +22 nJ (+0.012%) - Primary binary with minimal energy efficiency change
  • build.bin.libggml-base.so: +11 nJ (+0.019%) - GGML base operations remain stable
  • All other binaries: No measurable change

The power consumption increase is within measurement noise and reflects the cumulative effect of slightly larger data structure operations rather than algorithmic inefficiency.

Code Change Analysis

Primary modifications:

  1. Added GGML_SCHED_NO_REALLOC preprocessor flag for debugging allocation issues
  2. Simplified sequence count calculation by removing conditional logic for unified cache mode
  3. Enhanced debug logging to reduce noise from initial allocations
  4. Updated embedding application to set n_parallel = n_seq_max in unified mode

Semantic impact:
The sequence count change increases memory allocation in unified KV cache mode. When kv_unified = true and n_seq_max > 1, the system now allocates graph resources for all sequences rather than optimizing for single-sequence case. This explains the observed increases in container operations and memory management functions.
