[WIP] DirectIO: faster model loading (Linux) #1301
abc-nix wants to merge 5 commits into ikawrakow:main
Conversation
What does this do to caching the model? My first load is painful, but after that it is instant. Similar schemes in exllama were a total flop for me with my spinning rust.
There is a WIP PR by the same author that combines mmap and DirectIO to get the best of both worlds, but I won't look into it until this is working properly.
Whether direct I/O is a better choice than mmap depends heavily on your hardware. We observed really slow loading times when loading gpt-oss-120b via mmap on DGX Spark (~60 s). There the model does not even fit into the disk cache, so subsequent loads were no faster. With DIO the load time is roughly 10 s. If you have a fast NVMe SSD (best case PCIe 5.0), the direct-I/O path will always be fast and keeps your page cache clean. While it is still possible that mmap is faster with a warm cache, the >10 GB/s loading speed with DIO will suffice. My PR for combining mmap and DIO does not aim to cache the GPU layers, but the CPU layers. Currently, with DIO enabled, the weights for CPU layers MUST fit into CPU RAM, but some people want to launch bigger models and stream their weights from disk. Please let me know in case you encounter any issues with DIO.
I actually do not use mmap; it's a massive slowdown. The hardware caching on my Xeon loads the model into RAM cache instead. Load once, cry once. If I'm reading this correctly, DIO would bypass this mechanism and load directly from the "SSD". In my case that is either a SATA SSD or a SATA HDD. No bueno, sadly.
Thanks, @JTischbein. The implementation works well here in ik_llama.cpp. I may mention you again in the future when and if I also port (copy) the other PR. Thanks.
I have mixed feelings about this PR. I do have an obsession with performance, so from that angle changes that improve performance are good. On the other hand, I have some reservations. So, let me think a bit more about this.
I didn't even consider that it would interfere with NUMA allocations. Is there an actual way to pin threads to work only on local copies of the weights? I thought with two nodes you just give it physical cores and numa distribute handles the rest.
I will close it then, until there is a bit more evolution of the NUMA optimizations in ik_llama.cpp. I think I should study a bit more how to generalize the async I/O path used for the cuda backend into a more general backend interface, as seen in llama.cpp. In my tests, only the RAM tensor loading uses the new path; tensor loading for cuda devices goes through a separate path.
I have ported (copied) the DirectIO code from mainline llama.cpp, originally added by @JTischbein (PR 18012, PR 18166, and other later PRs).
DirectIO is disabled by default. If used, it will automatically disable mmap. To use it, add the -dio or --direct-io flag to the llama-server command. This improves model loading speeds when using flags like -mqkv that disable mmap. It should improve model loading on Linux, especially for cuda devices. I am observing ~1.5x model loading speed compared to no-mmap.
Loading speeds to RAM are a bit lower compared to llama.cpp (but the time llama.cpp spends preparing the cache before loading the model makes the total time to start and load the model almost equal).
Some users with NUMA have reported slower loading speeds in mainline, so DirectIO is not recommended for them.
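For reference, enabling the direct-I/O path described above would look like this (model path is a placeholder):

```shell
# DirectIO is off by default; -dio / --direct-io turns it on
# and automatically disables mmap.
llama-server -m /models/my-model.gguf --direct-io
```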
TODO:
If you are a Windows user, please test whether this builds and runs without the --direct-io flag. Based on the changes I ported from llama.cpp it should work, but I don't have a Windows machine right now to test.