Async DirectIO model loading on Linux #18012

Merged

ggerganov merged 10 commits into ggml-org:master from JTischbein:direct_io_model_read_linux on Dec 18, 2025

Conversation

@JTischbein
Contributor

Implements Direct I/O (uncached) file reading on Linux to improve model loading performance by bypassing the page cache. This is especially beneficial for large model files.

While mmap is fast when loading the same model multiple times, uncached reads provide consistent model loading times at the sequential read speed of the disk. On DGX Spark, loading GPT-OSS-120B-MXFP4 using mmap takes ~110s on the first load and ~67s on subsequent loads. With these changes it consistently takes ~10.5s. The speedup depends on the model size, the disk read speed, and, for repeated loads, the available RAM.

I would propose setting uncached reads as the default; Windows already has async uncached I/O (PR).
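For illustration, the core idea is an O_DIRECT read loop into an aligned buffer. This is a minimal sketch, not the code in this PR; the 4096-byte alignment is an assumed logical block size:

#define _GNU_SOURCE   // exposes O_DIRECT in glibc headers
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char ** argv) {
    if (argc < 2) return 1;
    const size_t align = 4096;                    // assumed logical block size
    const size_t chunk = 1u << 20;                // 1 MiB per read, a multiple of align
    int fd = open(argv[1], O_RDONLY | O_DIRECT);  // bypass the page cache
    if (fd < 0) { perror("open"); return 1; }
    void * buf = nullptr;
    if (posix_memalign(&buf, align, chunk) != 0) { close(fd); return 1; }
    size_t total = 0;
    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0) {      // each read skips the cache
        total += (size_t) n;                      // consume the chunk here
    }
    printf("read %zu bytes uncached\n", total);
    free(buf);
    close(fd);
    return 0;
}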

Member

@ggerganov ggerganov left a comment

This results in a huge load speedup on DGX Spark and, at the end of the program, leaves the memory free instead of in buff/cache.

Currently, the implementation is gated behind defined(__linux__). Is this functionality generally supported across all Linux platforms? If I am reading this correctly, it boils down to having O_DIRECT support for open().

Also, do we expect this change to also have effect on non-DGX Spark systems?

@lemmi

lemmi commented Dec 14, 2025

On my Strix Halo machine with btrfs, this is strictly worse than both master and mmap: mmap shows the highest throughput while loading the model (~6 GByte/s), master reaches around 3 GByte/s, and this patch 2 GByte/s.

@ehoogeveen-medweb

IIRC, with Strix Halo and ROCm/HIP, loading a model into memory reserved for the GPU using mmap has a major performance issue, hanging basically indefinitely for larger models. Given that reserving memory for the GPU also means having less RAM available to the CPU, it would be great if this DirectIO doesn't have that issue, as it would make ROCm/HIP more viable for larger models. Vulkan doesn't have this issue.

@JTischbein
Contributor Author

@ggerganov I have added a fallback open() in case O_DIRECT is not available. O_DIRECT has been supported on Linux since kernel 2.4.10.

In my tests the first (cold) load time improved with every system configuration (PCIe 4.0/PCIe 5.0 SSD, RTX 5080/5090). On the second load mmap=true was faster again, but only when the model fit into VRAM. Overall, with a fast disk, the load time also comes close to the cached load with mmap=true.

@lemmi Which disk are you using? And I assume the 6 GB/s load with mmap=true was loading from cache, not the first cold load? The difference between std::fread and read is odd; I will look into it.

@lemmi

lemmi commented Dec 15, 2025

So, I ran a bunch of tests with the Vulkan backend to compare #18012 and #18047 against master.

  1. GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 /usr/bin/time -v build/bin/llama-cli -m ../models/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4.gguf -p "bla" -n 0 --single-turn --mmap
  2. GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 /usr/bin/time -v build/bin/llama-cli -m ../models/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4.gguf -p "bla" -n 0 --single-turn --no-mmap
  3. /usr/bin/time -v build/bin/llama-cli -m ../models/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4.gguf -p "bla" -n 0 --single-turn --no-mmap

Minisforum MS-S1 Max (AMD RYZEN AI MAX+ 395 w/ Radeon 8060S)
2x WD_BLACK SN850X HS 2000GB, RAID 0, btrfs
Kernel 6.18.0_1

Configuration                                      master (d6a1e18)  #18012   #18047                 #18012 + #18047
GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 + --mmap     0:43.37           0:30.31  0:31.02                0:37.36
GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 + --no-mmap  0:17.53           0:36.17  Error (out-of-memory)  0:17.98
--no-mmap                                          0:17.57           0:36.42  Error (out-of-memory)  0:18.08

ddrescue    avg GB/s  time
buffered    6.3       0:09.57
direct I/O  1.9       31.72

The --mmap case is weird. It starts out at >6 GB/s, then there is a short pause, and then, depending on the PR, it looks like the whole model is read again at 2-4 GB/s.
With --no-mmap, the throughput is also always below 4 GB/s, and direct I/O is the worst case. Direct I/O is a little tricky on a CoW FS, so maybe it's not a very optimized path.
(Ideally Vulkan could just use mmapped files, but I have no idea whether that's possible.)

(EDIT: of course I was a good boy and ran echo 3 > /proc/sys/vm/drop_caches between tests)

@JTischbein
Contributor Author

Thank you for testing this @lemmi! Looking at your numbers, it seems like read() with 18012 + 18047 is falling back to buffered I/O, leading to performance similar to master with fread(). I have implemented a filesystem check to decide whether to use read() with O_DIRECT or to use fopen()/fread(). Does it still make sense when read() and fread() perform equally?

Is --mmap on a warm start quicker than --no-mmap on your machine, @lemmi?
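For context, such a check essentially has to probe the filesystem at open time. A hypothetical sketch (names are illustrative, not this PR's API):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Try O_DIRECT first; fall back to buffered I/O if the filesystem rejects it.
static int open_model_file(const char * path, bool & direct_io) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0) {
        direct_io = true;            // this filesystem accepts O_DIRECT
        return fd;
    }
    direct_io = false;               // e.g. EINVAL where O_DIRECT is unsupported
    return open(path, O_RDONLY);     // buffered fallback
}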

bool kv_unified = false; // enable unified KV cache

bool input_prefix_bos = false; // prefix BOS to user inputs, preceding input_prefix
bool use_mmap = true; // use mmap for faster loads
Member

Changing this to false by default results in a huge slowdown on macOS with default arguments:

time ./bin/llama-completion -m ../models/gpt-oss-120b/ggml-model-mxfp4.gguf -p "hello" -n 1 -no-cnv

# master
real	0m4.648s

# PR
real	0m17.957s

Not sure what the best way to handle this is. If we keep it true, then Linux users would not get the benefit of Direct I/O. If we switch to false, Mac users will take the hit.

Contributor Author

Would it be OK to set the mmap default depending on the platform?

Member

We don't have such precedent atm for any of the parameters in common, so I would say it's not ideal.

Contributor Author

On an M4 Pro with GPT-OSS-20B, a cold load takes 4.168s with --no-mmap and 6.3s with --mmap. A warm load, however, takes 2.1s with --mmap (--no-mmap still ~4.1s).

Measured using time ./llama-cli -m /Users/jtischbein/Documents/models/openai_gpt-oss-20b-MXFP4.gguf --no-mmap -p "bla" -n 0 --single-turn, with the filesystem cache cleared using purge.

So the cold load time is still faster using --no-mmap, but unfortunately not by as much as on Linux.

Member

We can do the following:

  • Add new CLI argument --direct-io, -dio
  • Description: "Use DirectIO if available. Takes precedence over --mmap"
  • Keep use_mmap == true and use_direct_io == true
  • On Mac, the internal implementation will determine that DIO is not available, so it will fall back to mmap

Might want to do it in a separate PR, as it would require changes to the libllama API. This PR should keep use_mmap == true by default.
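A rough sketch of how those defaults could resolve at load time (field and helper names are assumptions, not the final libllama API):

struct load_params {
    bool use_mmap      = true;   // kept true so macOS keeps its fast path
    bool use_direct_io = true;   // takes precedence where supported
};

// Direct I/O wins only when requested *and* available; otherwise the
// existing mmap path is used unchanged.
static bool should_use_direct_io(const load_params & p, bool platform_has_dio) {
    return p.use_direct_io && platform_has_dio;
}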

Contributor Author

Sounds good

@JTischbein
Contributor Author

This commit removes the branching in llama-model-loader.cpp and reduces code duplication in llama-mmap.cpp. DirectIO is now easier to integrate on Windows and Mac.

Member

@ggerganov ggerganov left a comment

Let's restore use_mmap to true and we can merge.

@JTischbein
Contributor Author

I will file a PR that implements the use_direct_io argument later.

@ggerganov ggerganov merged commit 4d4f4ca into ggml-org:master Dec 18, 2025
71 checks passed
LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Dec 19, 2025
@rankaiyx

For an NVMe drive, you can partition one disk into multiple sections, create a RAID 0 array with mdadm, and use striping to automatically turn single-threaded I/O requests into parallel operations.

The kernel and NVMe's multi-queue feature handle this parallel processing without requiring any application changes.

This approach lets you reach the NVMe’s full bandwidth and benefits all applications.

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Dec 20, 2025
@NeoZhangJianyu
Contributor

NeoZhangJianyu commented Dec 24, 2025

@JTischbein
This PR leads to issue #18296.
When adding the parameter "--no-mmap", there is an error reading the file.

Could you fix it or revert it soon?

Thank you!

This code reports the error:

} else {
    bool successful = false;
    while (!successful) {
        off_t ret = read(fd, ptr, len);

        if (ret == -1) {
            if (errno == EINTR) {
                continue;  // Interrupted by signal, retry
            }
            throw std::runtime_error(format("read error: %s", strerror(errno)));
        }
        if (ret == 0) {
            throw std::runtime_error("unexpectedly reached end of file");
        }

        successful = true;
    }
}
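(For reference, a defensive version of such a loop would also advance through partial reads; note that with O_DIRECT, EINVAL from read() typically indicates a misaligned buffer, offset, or length. A sketch, not the fix that landed:)

#include <unistd.h>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

// Read exactly `len` bytes, retrying on EINTR and advancing on short reads.
static void read_exact(int fd, void * dst, size_t len) {
    char * ptr = static_cast<char *>(dst);
    while (len > 0) {
        ssize_t ret = read(fd, ptr, len);
        if (ret < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal, retry
            throw std::runtime_error(std::string("read error: ") + strerror(errno));
        }
        if (ret == 0) throw std::runtime_error("unexpectedly reached end of file");
        ptr += ret;                        // handle partial reads
        len -= (size_t) ret;
    }
}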

@JTischbein
Contributor Author

@NeoZhangJianyu There is now a fix for Vulkan in #18467, which uses a buffer from the host. Could you try the same in the SYCL backend? Thank you!

@NeoZhangJianyu
Contributor

> @NeoZhangJianyu There is a fix for Vulkan now in #18467 by using a buffer from the host. Could you try the same in the SYCL backend? Thank you
@JTischbein @jeffbolznv

#18467 only supports --mmap and --no-mmap, not -dio and -ndio.
I tested it with the SYCL backend.

--mmap is quicker than --no-mmap.
Both work well.

Thank you!

@JTischbein
Contributor Author

@NeoZhangJianyu -dio is enabled implicitly in the master branch when applying --no-mmap.

The performance of --mmap and --no-mmap heavily depends on the platform. When launching llama.cpp with a model for the first time since booting the machine, --no-mmap with direct I/O is usually faster. On the second launch --mmap is often faster, but only if the model is still in the filesystem cache (meaning CPU RAM is large enough and the cache did not get dropped because other processes requested too much memory).

Great to hear the PR also fixes the loading with the SYCL backend, thanks for testing!

@NeoZhangJianyu
Contributor

NeoZhangJianyu commented Dec 31, 2025

The PR #18467 or its base just fixes the crash issue.
It restores the model load performance (--no-mmap) of earlier versions of #18012.

@NeoZhangJianyu
Contributor

> @NeoZhangJianyu There is a fix for Vulkan now in #18467 by using a buffer from the host. Could you try the same in the SYCL backend? Thank you

@JTischbein
The Vulkan solution in #18467 can't be carried over to SYCL:

SYCL could use a buffer with a host_ptr like Vulkan, but the SYCL backend uses USM (Unified Shared Memory) to manage memory, which is better than buffers.
However, USM can't adopt a host pointer the way Vulkan's approach does.

Thank you!
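To illustrate the difference (a sketch of the general APIs, not the backend's actual code): a SYCL buffer can adopt an existing host allocation, while a USM device allocation cannot wrap one and needs an explicit copy:

#include <sycl/sycl.hpp>

void upload(sycl::queue & q, const char * host_ptr, size_t n) {
    // Buffer path: adopts the existing host allocation, like Vulkan's host_ptr.
    sycl::buffer<char, 1> buf(host_ptr, sycl::range<1>(n));

    // USM path: a fresh device allocation; the host data must be copied in.
    char * dev = sycl::malloc_device<char>(n, q);
    q.memcpy(dev, host_ptr, n).wait();
    sycl::free(dev, q);
}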

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
* Uncached model read

* Removing additional --mmap arg

* Removing trailing whitespaces

* Adding fallback when O_DIRECT is not supported

* Remove branching in llama-model-loader.cpp and reduce code duplications in llama-mmap.cpp

* Adding maybe unused keyword for Mac and Windows.

* File seek aligned

* Removing all branches for direct_io in llama-model-loader.cpp

* Always use alignment from llama_file

* use_mmap=true
iacopPBK added a commit to iacopPBK/llama.cpp-gfx906 that referenced this pull request Jan 16, 2026
Sync with upstream llama.cpp PR ggml-org#18012 (Async DirectIO model loading).
- Add use_direct_io field to llama_model_params struct
- Add has_direct_io() method to llama_file
- Update llama_model_loader to accept use_direct_io parameter
- Direct I/O takes precedence over mmap when enabled
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026