UPSTREAM PR #18012: Async DirectIO model loading on Linux#559
## Performance Analysis Summary: PR #559 - Direct I/O Model Loading

### Overview

PR #559 implements Direct I/O (`O_DIRECT`) model loading on Linux, replacing memory-mapped I/O as the default strategy. The changes span 5 files with 255 additions and 7 deletions, primarily affecting the model-loading subsystem. Analysis reveals no impact on inference performance, as the modifications target initialization paths rather than runtime execution.

### Key Findings

#### Performance-Critical Areas

- Model loading functions: (…)
- Inference path functions: (…)

#### Tokens-Per-Second Impact

No inference performance impact. The Direct I/O implementation affects only the model-loading phase, which occurs once at startup. Functions responsible for token generation (…) are unchanged. The reference benchmark (7% tokens-per-second reduction when …).

#### Power Consumption Analysis

Binary-level changes: power-consumption changes are negligible across all binaries, with a maximum absolute change of 357 nJ in (…).

#### Code Implementation Analysis

The PR introduces platform-specific code for Linux, properly isolated with preprocessor directives. Key implementation details: (…). The default behavior change (…).

#### Absolute Performance Changes

Model-loading performance improvements are substantial for the target use case (NVMe storage, large models): the claimed reduction from 110 s to 10.5 s represents a 99.5 s absolute improvement. However, this occurs during initialization only and does not affect steady-state inference performance measured in tokens per second.
Mirrored from ggml-org/llama.cpp#18012
Implements Direct I/O (uncached) file reading on Linux to improve model loading performance by bypassing the page cache. This is especially beneficial for large model files.
While mmap is fast when loading the same model multiple times, uncached reads provide consistent model-loading times at the sequential read speed of the disk. On DGX Spark, loading GPT-OSS-120B-MXFP4 with mmap takes ~110 s on the first load and ~67 s on subsequent loads; with these changes it consistently takes ~10.5 s. The speedup depends on the model size, the disk read speed and, for sequential loading, the available RAM.
I would propose making uncached reads the default; Windows already has async uncached I/O (PR).