
UPSTREAM PR #18012: Async DirectIO model loading on Linux #559

Open

loci-dev wants to merge 1 commit into main from
upstream-PR18012-branch_JTischbein-direct_io_model_read_linux

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18012

Implements Direct I/O (uncached) file reading on Linux to improve model loading performance by bypassing the page cache. This is especially beneficial for large model files.

While mmap is fast when loading the same model multiple times, uncached reads provide consistent model loading times bounded by the sequential disk read speed. On DGX Spark, loading GPT-OSS-120B-MXFP4 with mmap takes ~110s on the first load and ~67s on subsequent loads. With these changes it consistently takes ~10.5s. The speedup depends on the model size, the disk read speed, and, when loading sequentially, on the available RAM.

I would propose making uncached reads the default; Windows already has async uncached I/O (PR).

@loci-review

loci-review bot commented Dec 14, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #559 - Direct I/O Model Loading

Overview

PR #559 implements Direct I/O (O_DIRECT) model loading on Linux, replacing memory-mapped I/O as the default strategy. The changes span 5 files with 255 additions and 7 deletions, primarily affecting the model loading subsystem. Analysis reveals no impact on inference performance, as modifications target initialization paths rather than runtime execution.

Key Findings

Performance-Critical Areas Impact

Model Loading Functions:
The changes exclusively affect model loading infrastructure in llama-model-loader.cpp and llama-mmap.cpp. The llama_model_loader::load_all_data function now implements aligned buffer management and chunked reading for Direct I/O. On Linux, staging buffer size increased from 1MB to 64MB, with 4KB alignment requirements. The implementation adds alignment calculation overhead and temporary buffer allocation per tensor read, but these operations occur only during model initialization, not during inference.

Inference Path Functions:
Core inference functions (llama_decode, llama_encode, llama_tokenize) show no modifications in this PR. The flame graph and CFG analyses from the previous version comparison (showing regressions in arg.cpp lambda operators) are unrelated to this PR's Direct I/O changes. Those regressions stem from argument parsing enhancements, not model loading modifications.

Tokens Per Second Impact

No inference performance impact. The Direct I/O implementation affects only the model loading phase, which occurs once at startup. Functions responsible for token generation (llama_decode, llama_encode) remain unchanged with 0 ns difference in response time and throughput. Therefore, tokens per second during inference is unaffected by this PR.

The reference benchmark (7% tokens per second reduction when llama_decode is 2 ms slower) does not apply here, as llama_decode execution time remains constant.

Power Consumption Analysis

Binary-level changes:

  • libllama.so: No power consumption change (0.000%)
  • llama-cvector-generator: -0.021% change (253,377 nJ → 253,325 nJ)
  • llama-tts: -0.122% change (257,883 nJ → 257,568 nJ)
  • llama-run: -0.000% change

Power consumption changes are negligible across all binaries; the largest absolute change listed above is 315 nJ, in llama-tts. The slight reductions in some binaries reflect minor optimizations in non-critical paths unrelated to the Direct I/O implementation.

Code Implementation Analysis

The PR introduces platform-specific code for Linux with proper isolation using preprocessor directives. Key implementation details:

  • File I/O Layer: Added dual-mode file access (Direct I/O via file descriptor or standard buffered I/O via FILE*) in llama_file::impl
  • Alignment Handling: Implemented read_aligned_chunk helper function to satisfy O_DIRECT's 4KB alignment requirements
  • Buffer Management: Increased staging buffers from 4MB total to 256MB total on Linux for optimal NVMe throughput
  • Async Upload Path: Enhanced GPU tensor loading with aligned chunked reading while maintaining 4-buffer pipeline

The default behavior change (use_mmap = false) inverts the loading strategy, with explicit --mmap flag added for backward compatibility. This semantic shift may require user awareness but maintains functional compatibility.

Absolute Performance Changes

Model loading performance improvements are substantial for the target use case (NVMe storage, large models): claimed reduction from 110s to 10.5s represents 99.5s absolute improvement. However, this occurs during initialization only and does not affect steady-state inference performance measured in tokens per second.

@loci-dev force-pushed the main branch 27 times, most recently from 00e159b to 4c091dc (December 17, 2025)
@loci-dev force-pushed the main branch 30 times, most recently from 5b544dd to 26a6f0f (December 22, 2025)