[WIP] DirectIO: faster model loading (Linux) #1301
abc-nix wants to merge 5 commits into ikawrakow:main
Conversation
What does this do to caching the model? My first load is painful, but after that it is instant. Similar schemes in exllama were a total flop for me with my spinning rust.
There is a WIP PR by the same author that combines mmap and DirectIO to get the best of both worlds, but I won't look into it until this is working properly.
Whether direct I/O is a better choice than mmap depends heavily on your hardware. We observed really slow loading times when loading gpt-oss-120b via mmap on DGX Spark (~60 s). There the model does not even fit into the disk cache, so subsequent loads were no faster. With DIO the load time is roughly 10 s. If you have a fast NVMe SSD (best case PCIe 5.0), the direct-I/O path will always be fast and keeps your page cache clean. While it is still possible that mmap is faster with a warm cache, the >10 GB/s loading speed with DIO will suffice. My PR for combining mmap and DIO does not aim to cache the GPU layers, but the CPU layers. Currently, with DIO enabled, the weights for CPU layers MUST fit into CPU RAM, but some people want to launch bigger models and stream their weights from disk. Please let me know in case you encounter any issues with DIO.
I actually do not use mmap; it's a massive slowdown. The hardware caching on my Xeon loads the model into RAM cache instead. Load once, cry once. If I'm reading this correctly, DIO would bypass this mechanism and load directly from the "SSD". In my case that is either a SATA SSD or a SATA HDD. No bueno, sadly.
Thanks, @JTischbein. The implementation works well here in ik_llama.cpp. I may mention you again in the future when and if I also port (copy) the other PR. Thanks.
I have mixed feelings about this PR. I do have an obsession with performance, so from that angle changes that improve performance are good. On the other hand, I have some reservations. So, let me think a bit more about this.
I didn't even consider that it would interfere with NUMA allocations. Is there an actual way to pin threads to work only on local copies of the weights? I thought with two nodes you just give it physical cores and numa distribute handles the rest.
I will close it then, until there is a bit more evolution of the NUMA optimizations in ik_llama.cpp. I think I should study a bit more how to generalize the async I/O path used for the cuda backend into a more general backend interface, as seen in llama.cpp. In my tests, only the RAM tensor loading uses the new path; tensor loading for cuda devices goes through a separate path.
I have ported (copied) the DirectIO code from mainline llama.cpp, originally added by @JTischbein (PR 18012, PR 18166, and other later PRs).
DirectIO is disabled by default. If used, it will automatically disable mmap. To use it, add the -dio or --direct-io flag to the llama-server command. This improves model loading speeds when using flags like -mqkv that disable mmap. It should improve model loading on Linux, especially for cuda devices. I am observing ~1.5x model loading speed compared to no-mmap.
Loading speeds to RAM are a bit lower compared to llama.cpp (but the time llama.cpp spends preparing the cache before loading the model makes the total time to start and load the model almost equal).
Some users with NUMA have reported slower loading speeds in mainline, so DirectIO is not recommended for them.
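For reference, enabling the direct-I/O path described above would look like this (model path is a placeholder):

```shell
# DirectIO is off by default; -dio / --direct-io turns it on
# and automatically disables mmap.
llama-server -m /models/my-model.gguf --direct-io
```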
TODO:
If you are a Windows user, please test whether this builds and runs without the --direct-io flag. Based on the changes I ported from llama.cpp it should work, but I don't have a Windows machine right now to test.