
UPSTREAM PR #18012: Async DirectIO model loading on Linux #559

Open

loci-dev wants to merge 1 commit into main from
upstream-PR18012-branch_JTischbein-direct_io_model_read_linux

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18012

Implements Direct I/O (uncached) file reading on Linux to improve model loading performance by bypassing the page cache. This is especially beneficial for large model files.

While mmap is fast when loading the same model multiple times, uncached reads provide consistent model loading times bounded by the sequential disk read speed. On DGX Spark, loading GPT-OSS-120B-MXFP4 with mmap takes ~110s on the first load and ~67s on subsequent loads. With these changes it consistently takes ~10.5s. The speedup depends on the model size, the disk read speed, and, when loading sequentially, on the available RAM.

I would propose making uncached reads the default; Windows already has async uncached I/O (PR).

@loci-review

loci-review bot commented Dec 14, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #559 - Direct I/O Model Loading

Overview

PR #559 implements Direct I/O (O_DIRECT) model loading on Linux, replacing memory-mapped I/O as the default strategy. The changes span 5 files with 255 additions and 7 deletions, primarily affecting the model loading subsystem. Analysis reveals no impact on inference performance, as modifications target initialization paths rather than runtime execution.

Key Findings

Performance-Critical Areas Impact

Model Loading Functions:
The changes exclusively affect model loading infrastructure in llama-model-loader.cpp and llama-mmap.cpp. The llama_model_loader::load_all_data function now implements aligned buffer management and chunked reading for Direct I/O. On Linux, staging buffer size increased from 1MB to 64MB, with 4KB alignment requirements. The implementation adds alignment calculation overhead and temporary buffer allocation per tensor read, but these operations occur only during model initialization, not during inference.

Inference Path Functions:
Core inference functions (llama_decode, llama_encode, llama_tokenize) show no modifications in this PR. The flame graph and CFG analyses from the previous version comparison (showing regressions in arg.cpp lambda operators) are unrelated to this PR's Direct I/O changes. Those regressions stem from argument parsing enhancements, not model loading modifications.

Tokens Per Second Impact

No inference performance impact. The Direct I/O implementation affects only the model loading phase, which occurs once at startup. Functions responsible for token generation (llama_decode, llama_encode) remain unchanged with 0 ns difference in response time and throughput. Therefore, tokens per second during inference is unaffected by this PR.

The reference benchmark (7% tokens per second reduction when llama_decode is 2 ms slower) does not apply here, as llama_decode execution time remains constant.

Power Consumption Analysis

Binary-level changes:

  • libllama.so: No power consumption change (0.000%)
  • llama-cvector-generator: -0.021% change (253,377 nJ → 253,325 nJ)
  • llama-tts: -0.122% change (257,883 nJ → 257,568 nJ)
  • llama-run: -0.000% change

Power consumption changes are negligible across all binaries; the largest absolute change listed above is 315 nJ, in llama-tts. The slight reductions in some binaries reflect minor optimizations in non-critical paths unrelated to the Direct I/O implementation.

Code Implementation Analysis

The PR introduces platform-specific code for Linux with proper isolation using preprocessor directives. Key implementation details:

  • File I/O Layer: Added dual-mode file access (Direct I/O via file descriptor or standard buffered I/O via FILE*) in llama_file::impl
  • Alignment Handling: Implemented read_aligned_chunk helper function to satisfy O_DIRECT's 4KB alignment requirements
  • Buffer Management: Increased staging buffers from 4MB total to 256MB total on Linux for optimal NVMe throughput
  • Async Upload Path: Enhanced GPU tensor loading with aligned chunked reading while maintaining 4-buffer pipeline

The default behavior change (use_mmap = false) inverts the loading strategy, with explicit --mmap flag added for backward compatibility. This semantic shift may require user awareness but maintains functional compatibility.

Absolute Performance Changes

Model loading performance improvements are substantial for the target use case (NVMe storage, large models): claimed reduction from 110s to 10.5s represents 99.5s absolute improvement. However, this occurs during initialization only and does not affect steady-state inference performance measured in tokens per second.

@loci-dev force-pushed the main branch 27 times, most recently from 00e159b to 4c091dc (December 17, 2025)
@loci-dev force-pushed the main branch 30 times, most recently from 5b544dd to 26a6f0f (December 22, 2025)