
UPSTREAM PR #19109: llama : disable Direct IO by default #1040

Open
loci-dev wants to merge 1 commit into main from upstream-PR19109-branch_ggml-org-gg/llama-dio-off

Conversation


loci-review bot commented Jan 26, 2026

Performance Review Report: llama.cpp Base → Target Version

Impact Classification: Minor Impact

Commit: 5ef960f - "llama : disable Direct IO by default" by Georgi Gerganov
Files Changed: 5 modified, 37 added, 3 deleted
Functions Analyzed: 13 functions across 3 binaries (libllama.so, llama-cvector-generator, llama-tts)

Executive Summary

Performance changes stem from build configuration differences (Debug vs Release) and the I/O strategy change (Direct I/O → buffered I/O), not from algorithmic modifications. All analyzed functions lie on non-critical paths (initialization, argument parsing, utilities); core inference operations are unchanged.

Key Findings

Genuine Optimization (2 functions):

  • operator() mmap flag handler: -66.71ns (-80%) in both cvector-generator and llama-tts. Removing the coupling between the mmap and Direct I/O options simplified the handler, yielding a roughly 5x speedup.

Build Configuration Effects (2 functions):

  • std::_Rb_tree::begin(): +182ns (+220%) in both binaries. Debug builds with _GLIBCXX_ASSERTIONS add STL validation overhead. Zero impact on Release builds.

I/O Strategy Trade-offs (3 functions in libllama.so):

  • Token sorting comparator: +176ns (+138%) per comparison. Trades individual operation latency for bulk throughput, an appropriate trade-off for batch inference.
  • KV cache hashtable begin(): -186ns (-64%). Genuine improvement in moderately sensitive KV cache operations.
  • Bigram iterator: +57ns (+48%). Minimal overhead in tokenization preprocessing.

Compiler Variations (6 functions):

  • Mixed results from compiler optimization differences. Notable: vector::begin() improved -180ns (-68%), json::get<bool>() improved -181ns (-75%). Others show negligible changes in initialization code.

Performance-Critical Assessment

Zero impact on critical paths: Matrix multiplication, attention computation, quantization, and sampling algorithms unchanged. Analyzed functions contribute <0.1% to total inference time.

Power consumption: Negligible change (within measurement noise). Startup phase shows slight improvement from buffered I/O; inference overhead is <0.01%.

GPU/ML operations: Zero changes to CUDA, Metal, HIP, Vulkan backends or ML kernels.

Conclusion

Changes represent intentional I/O flexibility improvements with acceptable trade-offs. The 80% speedup in argument parsing demonstrates genuine optimization through architectural improvement. Other changes reflect build configuration differences (Debug assertions) or I/O strategy optimization for typical LLM workloads. No performance concerns for production deployments.

See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

loci-dev force-pushed the main branch 27 times, most recently from 62bf34b to 10471d1 (January 29, 2026 13:31)
loci-dev force-pushed the main branch 30 times, most recently from 9216bda to 6b41339 (February 1, 2026 00:52)
