[WIP] DirectIO: faster model loading (Linux)#1301

Closed
abc-nix wants to merge 5 commits into ikawrakow:main from abc-nix:direct_io1
Conversation

@abc-nix (Contributor) commented Feb 22, 2026

I have ported (copied) the DirectIO code from mainline llama.cpp, originally added by @JTischbein (PR #18012, PR #18166, and other later PRs).

DirectIO is disabled by default. When enabled, it automatically disables mmap. To use it, add the -dio or --direct-io flag to the llama-server command. This improves model loading speed when using flags like -mqkv that disable mmap.

It should improve model loading on Linux, especially for CUDA devices. I am observing ~1.5x faster model loading compared to no-mmap.

Loading speeds to RAM are a bit lower than in llama.cpp (but the time spent preparing the cache before loading the model means the total time to start and load the model ends up almost equal).

Some users with NUMA have reported slower loading speeds in mainline, so DirectIO is not recommended for them.

TODO:

  • Add the flag to llama-bench and llama-sweep-bench
  • Faster model loading on CUDA devices for graph split-mode

If you are a Windows user, please test whether this builds and runs without the --direct-io flag. Based on the changes I ported from llama.cpp, it should work, but I don't have a Windows machine to test on right now.

@Ph0rk0z commented Feb 22, 2026

What does this do to caching the model? My first load is painful, but after that it is instant. Similar schemes in exllama were a total flop for me with my spinning rust.

@abc-nix (Contributor, Author) commented Feb 22, 2026

> What does this do to caching the model?

As mmap is disabled when using direct-io, there is no model caching. For large models that are constantly being swapped, mmap is the best option right now. That is why DirectIO should be disabled by default and only manually enabled when a user isn't going to use mmap.

There is a WIP PR by the same author that combines mmap and DirectIO to get the best of both worlds, but I won't look into it until this is working properly.

@JTischbein commented

Whether direct-IO is a better choice than mmap depends heavily on your hardware. We observed really slow loading times when loading gpt-oss-120b via mmap on DGX Spark (~60s). There the model does not even fit into the disk cache, so subsequent loads were not faster. With DIO the load time is roughly 10s.

If you have a fast NVMe SSD (best case PCIe 5.0), the direct-IO path will always be fast and keeps your cache clean. While it is still possible that mmap is faster with an existing cache, the >10 GB/s loading speed using DIO will suffice.

My PR for combining mmap and DIO does not aim to cache the GPU layers, but the CPU layers. Currently, with DIO enabled, the weights for CPU layers MUST fit into CPU RAM, but some people want to launch bigger models and stream their weights from disk.

Please let me know in case you run into any issues with DIO.

@Ph0rk0z commented Feb 22, 2026

I actually do not use mmap. It's a massive slowdown. Hardware caching on the Xeon loads the model into the RAM cache instead. Load once, cry once. If I'm reading this correctly, DIO would bypass this mechanism and load directly from the "SSD". In my case that is either a SATA SSD or a SATA HDD. No bueno, sadly.

@abc-nix (Contributor, Author) commented Feb 22, 2026

Thanks, @JTischbein. The implementation works well, though here in ik_llama.cpp loading to RAM uses read_aligned_chunk (so nothing like the 10 GB/s speeds seen when loading to the CUDA devices). I will look into improving this function in the future.

I may mention you again in the future if and when I also port (copy) the other PR. Thanks.

@ikawrakow (Owner) commented

I have mixed feelings about this PR. I do have an obsession with performance, so from that angle changes that improve performance are good. On the other hand, I don't subscribe to the idea that everything llama.cpp does is good (this project wouldn't exist if that were the case). Perhaps I don't have the right systems to feel the pain, but I have never felt that model loading is a major pain point.

I didn't follow the DIO-related changes in mainline very closely, but IIRC there were multiple issues created after DIO got merged, and by now it is no longer the default there.

Then there is the issue of model loading on NUMA systems. Based on preliminary work (I got access to a NUMA system quite recently), one can get quite far by just using mmap, pinning threads to specific NUMA nodes, and in that way easily ensuring that model weights end up on the NUMA node where they are needed (i.e., where the thread processing the given portion of the weights runs). DIO will interfere with that approach.

So, let me think a bit more about this.

@Ph0rk0z commented Feb 24, 2026

I didn't even consider that it would fuck with NUMA allocations. Is there an actual way to pin threads to work on only local copies of the weights? I thought that with two nodes you just give it physical cores and numa distribute handles the rest.

@abc-nix (Contributor, Author) commented Feb 25, 2026

I will close this then, until the NUMA optimizations in ik_llama.cpp have evolved a bit more. I also think I should study how to generalize the async I/O path used for the CUDA backend into a more general backend path, as is done in llama.cpp.

In my tests, only layer split-mode CUDA tensor loading uses the faster async I/O path (10-11 GB/s tensor loading from NVMe) in src/llama-model-loader.cpp.

RAM tensor loading goes through the ggml_backend_buffer_is_host path, which uses read_aligned_chunk (optimized for DirectIO), at around 7.2-7.8 GB/s from NVMe.

Tensor loading for CUDA devices in graph split-mode is much slower (4.8-5 GB/s from NVMe) because it uses the fallback (read_buf) method, which achieves half the loading speed possible with async I/O.

@abc-nix abc-nix closed this Feb 25, 2026