Auto-fit offloaded tensors to available VRAM (MoE models) #1501
Conversation
Aha, so the format and the algorithm for how the splits are made have changed? Before, it was like this: [EDIT]: I am using [EDIT2]: Here is the illustration that if one minimizes the latencies between the GPUs of each split (especially the first one), the prefill goes up: #1380 (comment). So if the algorithm changed, I have to find a new combination of
Nothing has changed in how the splits are set. It is just a different formatting that @Nexesenex added in #1494.
Ah, I see. It's not just a different formatting.
Finally figured out how to use the
Before:
After: The performance is a little bit off, because I was unable to guess the right order of the GPUs. So some slight change took place in the way the splits are made?
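For reference, the usual tensor-split semantics in mainline `llama.cpp` (and, as far as I can tell, unchanged here) normalize the given values into per-GPU fractions of the model. A minimal sketch of that normalization, purely illustrative and not code from the repo:

```python
# Sketch: how tensor-split values are typically turned into per-GPU
# fractions and layer counts (standard llama.cpp semantics; the exact
# rounding used by ik_llama.cpp may differ).
def normalize_splits(splits):
    total = sum(splits)
    return [s / total for s in splits]

def layers_per_gpu(splits, n_layers):
    # Assign layers proportionally, giving any remainder to the last GPU.
    fracs = normalize_splits(splits)
    counts = [int(f * n_layers) for f in fracs]
    counts[-1] += n_layers - sum(counts)
    return counts
```

So `-ts 1,1,2` and `-ts 25,25,50` describe the same proportions; only the ratios matter.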
Oh, I see. This is probably it: possibly this change related to the accounting for the experts took place, and now everything behaves more "unpredictably". Namely, I have been able to find only two configurations out of hundreds where the GPUs are placed exactly where I would like them to be.
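To see why accounting for the experts shifts the splits so much: in a MoE layer the expert tensors dominate the per-layer size, so including or excluding them changes every proportional split. A rough sketch of the per-layer size (all names and numbers hypothetical, not taken from the PR):

```python
# Sketch: rough VRAM estimate for one MoE layer, showing how the
# expert tensors dwarf the attention tensors. Hypothetical model of
# the accounting, not the PR's actual bookkeeping.
def moe_layer_bytes(d_model, d_ff, n_experts, bytes_per_weight):
    attn = 4 * d_model * d_model              # q/k/v/o projections
    experts = n_experts * 3 * d_model * d_ff  # up/gate/down per expert
    return (attn + experts) * bytes_per_weight
```

With many experts per layer, whether the experts are counted toward a GPU's share dominates where each split boundary lands.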
Okay, cool. One of the configurations works fine.
So, given the speeds (and the latencies), the system config is: the first split gets GPU#7, GPU#8 and the "main" GPUs (GPU#0 and GPU#1). So if we translate it from the CUDA_VISIBLE_DEVICES ...
So all of the GPUs are covered, and then again for each split. So the split config makes sure the first split gets all of the x8 GPUs, including the "main" ones.
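The mapping walked through above can be sketched as two steps: CUDA_VISIBLE_DEVICES reorders the physical GPUs into logical device indices, and the split sizes then carve those logical devices into groups. A hypothetical helper for illustration, not code from the repo:

```python
# Sketch: carve physical GPU ids (listed in CUDA_VISIBLE_DEVICES order)
# into consecutive splits. Hypothetical helper, not from ik_llama.cpp.
def assign_splits(visible_devices, split_sizes):
    # visible_devices: physical GPU ids in CUDA_VISIBLE_DEVICES order
    # split_sizes: number of logical devices per split
    splits, i = [], 0
    for n in split_sizes:
        splits.append(visible_devices[i:i + n])
        i += n
    return splits
```

E.g., putting GPU#7 and GPU#8 first in CUDA_VISIBLE_DEVICES and sizing the first split to four devices lands 7, 8 and the two "main" GPUs in split 0.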
Made a plot. There is, apparently, a correlation: lower total latency = lower total inference time.
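One way to put a number on that correlation is Pearson's r over the per-configuration totals. A self-contained sketch (the measurement lists are placeholders to be replaced with the values behind the plot):

```python
# Sketch: Pearson correlation coefficient between total inter-GPU
# latency and total inference time across configurations.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```

Values near +1 would support the "lower total latency, lower total inference time" reading.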
I got a small reference for Kimi K2.5, as I'm getting a segmentation fault. For 4x5090 + 2x4090 + A6000 + A40. With
On iklcpp, this is the log: Then it logs: And finally, I get: On llamacpp, for reference: And then: And then it works fine. What could be the issue? My system is pretty out of the ordinary, though.
I am having the same issue with GLM-5 after PR #1506. Try rolling back.
Thanks. I seem to get another issue. Using manual `-ot` after reverting #1506 works, but it took hours to get here, lol.
The segmentation fault is fixed in #1515. If you get an OOM while allocating the compute buffers, increase the safety margin, e.g., by passing a larger value to `--fit-margin`.
This PR adds the ability to automatically determine which tensors to offload to the GPUs based on the available VRAM. This can be enabled by adding `--fit` to the command line. Optionally, one can also specify a "safety" margin (amount of unused VRAM, to handle compute buffers that have not been accounted for) using `--fit-margin margin_in_MiB`. If `--fit-margin` is not specified, 1 GiB (1024 MiB) is used by default.

Auto-fit is not enabled by default for now because:

- Unlike `llama.cpp`, no worst-case compute graph is constructed. Hence, the size of the required compute buffers is only estimated, which may be off in some cases, so one may need to adjust `--fit-margin` for best results.
- Unlike `llama.cpp`, no provisions are made to adjust context and/or u-batch size if necessary.
- The behavior of `--override-tensor`, `--cpu-moe`, `--n-cpu-moe` together with `--fit` has not been fully sorted out.

Despite these limitations, it does a pretty decent job in basically all cases I tested with 1x3090 and 2x3090 setups. I had to adjust `--fit-margin` to 1536 MiB on one occasion; everything else just worked with split mode `graph` (when supported) and split mode `layer`.
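The kind of decision `--fit` makes can be sketched as a greedy pass: keep offloading tensors to the GPU while they fit within free VRAM minus the safety margin, and leave the rest on the CPU. This is purely illustrative of the idea, not the PR's actual algorithm:

```python
# Sketch: greedy auto-fit decision. Offload tensors (in order) while
# they fit within free VRAM minus a safety margin; the rest stay on
# the CPU. Illustrative only; not the code from this PR.
MiB = 1024 * 1024

def fit_tensors(tensor_sizes, free_vram, margin=1024 * MiB):
    budget = free_vram - margin  # default margin: 1 GiB, as in the PR
    gpu, cpu = [], []
    for i, size in enumerate(tensor_sizes):
        if size <= budget:
            gpu.append(i)
            budget -= size
        else:
            cpu.append(i)
    return gpu, cpu
```

Raising the margin shrinks the budget, which is why a larger `--fit-margin` resolves OOMs in the compute-buffer allocation at the cost of offloading fewer tensors.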