
Auto-fit offloaded tensors to available VRAM (MoE models)#1501

Merged
ikawrakow merged 3 commits into main from ik/model_fit
Mar 25, 2026

Conversation

@ikawrakow
Owner

This PR adds the ability to automatically determine which tensors to offload to the GPU(s) based on the available VRAM. This can be enabled by adding --fit to the command line. Optionally, one can also specify a "safety" margin (VRAM left unused to cover compute buffers that have not been accounted for) using --fit-margin margin_in_MiB. If --fit-margin is not specified, the default is 1 GiB (1024 MiB).

Auto-fit is not enabled by default for now because

  • It only handles MoE models for now
  • It is still somewhat rough around the edges:
    • Unlike llama.cpp, no worst-case compute graph is constructed. Hence, the size of the required compute buffers is only estimated, which may be off in some cases
    • Because of the above, one may need to experiment with --fit-margin for best results
    • The auto-fit will fail if not all non-MoE tensors (i.e., attention tensors, shared experts, dense FFN tensors, output tensor, normalization tensors) can be offloaded to the GPU(s)
    • Unlike llama.cpp, no provisions are made to adjust the context and/or u-batch size if necessary
    • None of --override-tensor, --cpu-moe, or --n-cpu-moe may be used together with --fit

Despite these limitations, it does a pretty decent job in basically all cases I tested with 1x3090 and 2x3090. I had to adjust --fit-margin to 1536 MiB on one occasion; everything else just worked with split mode graph (when supported) and split mode layer.
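In effect the fit acts as a greedy knapsack per device: offload everything, then push expert tensors back to the CPU, starting from the device's highest layer, until the device fits under its budget (free VRAM minus the margin). A minimal sketch of that idea, with invented names and sizes; the real logic in src/llama.cpp additionally estimates compute buffers, the KV cache, and handles multiple devices:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the per-device fitting step; the struct and
// function names are made up for illustration.
struct Layer { int id; double dense_mib; double experts_mib; };

// Start with every tensor of the device's layers offloaded, then move the
// expert tensors of the highest layers back to the CPU until the total
// fits under the budget (free VRAM minus --fit-margin).
std::vector<int> fit_device(const std::vector<Layer>& layers, double budget_mib) {
    double use = 0;
    for (const auto& l : layers) use += l.dense_mib + l.experts_mib;
    std::vector<int> cpu_overrides;  // layers whose experts stay in system RAM
    for (auto it = layers.rbegin(); it != layers.rend() && use > budget_mib; ++it) {
        use -= it->experts_mib;      // "experts CPU override" for this layer
        cpu_overrides.push_back(it->id);
    }
    return cpu_overrides;
}
```

With three 7.5 GiB layers and a 10000 MiB budget, the experts of the top two layers get pushed to the CPU, mirroring the "Adding experts CPU overrides for layer N in device D" messages in the load log.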

@magikRUKKOLA

magikRUKKOLA commented Mar 25, 2026

Aha, so the format and the algorithm for how the splits are made have been changed?

================================ max_gpu = 2
Adjusted split at layer  0:  GPU0:    0 ; GPU1:    0 ; GPU2:    0 ; GPU3:    0 ; GPU4:    0 ; GPU5:    0 ; GPU6:    0 ; GPU7:  0.5 ; GPU8:    0 ; GPU9:  0.5
Adjusted split at layer 12:  GPU0:    0 ; GPU1:    0 ; GPU2:    0 ; GPU3:    0 ; GPU4:  0.5 ; GPU5:    0 ; GPU6:  0.5 ; GPU7:    0 ; GPU8:    0 ; GPU9:    0
Adjusted split at layer 24:  GPU0:  0.5 ; GPU1:    0 ; GPU2:  0.5 ; GPU3:    0 ; GPU4:    0 ; GPU5:    0 ; GPU6:    0 ; GPU7:    0 ; GPU8:    0 ; GPU9:    0
Adjusted split at layer 36:  GPU0:    0 ; GPU1:    0 ; GPU2:    0 ; GPU3:  0.5 ; GPU4:    0 ; GPU5:  0.5 ; GPU6:    0 ; GPU7:    0 ; GPU8:    0 ; GPU9:    0
Adjusted split at layer 48:  GPU0:    0 ; GPU1:  0.5 ; GPU2:    0 ; GPU3:    0 ; GPU4:    0 ; GPU5:    0 ; GPU6:    0 ; GPU7:    0 ; GPU8:  0.5 ; GPU9:    0
Adjusted splits (total)   :  GPU0:  0.5 ; GPU1:  0.5 ; GPU2:  0.5 ; GPU3:  0.5 ; GPU4:  0.5 ; GPU5:  0.5 ; GPU6:  0.5 ; GPU7:  0.5 ; GPU8:  0.5 ; GPU9:  0.5

Before that it was like this:

Adjusted split at layer  0: 0 0 0 0 0 0 0 0.5 0.5 0
Adjusted split at layer 12: 0 0 0 0 0.5 0.5 0 0 0 0
Adjusted split at layer 24: 0 0.5 0.5 0 0 0 0 0 0 0
Adjusted split at layer 36: 0.5 0 0 0.5 0 0 0 0 0 0
Adjusted split at layer 48: 0 0 0 0 0 0 0.5 0 0 0.5

[EDIT]:

I am using llama-sweep-bench / llama-server right now to get this data. The order of the GPUs in CUDA_VISIBLE_DEVICES seems to influence the way the splits are made. I am not sure why.

[EDIT2]:

Here is an illustration that if one minimizes the latencies between the GPUs of each split (especially the first one), the prefill speed goes up: #1380 (comment)

So if the algorithm changes, I have to find a new combination of CUDA_VISIBLE_DEVICES that works.

@ikawrakow
Owner Author

So if the algorithm changes, I have to find a new combination of CUDA_VISIBLE_DEVICES that works.

Nothing has changed in how the splits are set. It is just different formatting that @Nexesenex added in #1494

@magikRUKKOLA

It is just different formatting that @Nexesenex added in #1494

Ah, I see. It's not just different formatting.
The way the GPU for a split is chosen depends on the free VRAM, etc.?
Well, something has changed between the PR I had been testing with and the current main. I am not sure what exactly.

@magikRUKKOLA

@ikawrakow

Finally figured out how to use git bisect. Double-checked a few times; the same result.

 git bisect good
86f4f516e508bead7c4c9597fed57ac788792173 is the first bad commit
commit 86f4f516e508bead7c4c9597fed57ac788792173
Author: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Date:   Wed Mar 25 07:29:29 2026 +0100

    Auto-fit offloaded tensors to available VRAM (MoE models) (#1501)
    
    * WIP: automatically fit model in available VRAM
    
    * WIP
    
    * This seems pretty solid

 common/common.cpp   |  21 +++++++
 common/common.h     |   2 +
 include/llama.h     |   2 +
 src/llama-arch.h    |  44 +++++++-------
 src/llama-model.cpp |   7 ++-
 src/llama.cpp       | 300 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
 6 files changed, 311 insertions(+), 65 deletions(-)

/opt/ubergarm/Qwen3.5-397B-A17B-GGUF/IQ4_KSS/bench.sh

Before:

export CUDA_VISIBLE_DEVICES=6,3,2,4,8,5,9,7,1,0

Adjusted split at layer  0:  GPU0:    0 ; GPU1:    0 ; GPU2:    0 ; GPU3:    0 ; GPU4:    0 ; GPU5:    0 ; GPU6:    0 ; GPU7:    0 ; GPU8:  0.5 ; GPU9:  0.5
Adjusted split at layer 12:  GPU0:    0 ; GPU1:    0 ; GPU2:    0 ; GPU3:    0 ; GPU4:    0 ; GPU5:  0.5 ; GPU6:  0.5 ; GPU7:    0 ; GPU8:    0 ; GPU9:    0
Adjusted split at layer 24:  GPU0:    0 ; GPU1:    0 ; GPU2:  0.5 ; GPU3:  0.5 ; GPU4:    0 ; GPU5:    0 ; GPU6:    0 ; GPU7:    0 ; GPU8:    0 ; GPU9:    0
Adjusted split at layer 36:  GPU0:  0.5 ; GPU1:    0 ; GPU2:    0 ; GPU3:    0 ; GPU4:  0.5 ; GPU5:    0 ; GPU6:    0 ; GPU7:    0 ; GPU8:    0 ; GPU9:    0
Adjusted split at layer 48:  GPU0:    0 ; GPU1:  0.5 ; GPU2:    0 ; GPU3:    0 ; GPU4:    0 ; GPU5:    0 ; GPU6:    0 ; GPU7:  0.5 ; GPU8:    0 ; GPU9:    0
Adjusted splits (total)   :  GPU0:  0.5 ; GPU1:  0.5 ; GPU2:  0.5 ; GPU3:  0.5 ; GPU4:  0.5 ; GPU5:  0.5 ; GPU6:  0.5 ; GPU7:  0.5 ; GPU8:  0.5 ; GPU9:  0.5
main: n_kv_max = 8192, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 99, n_threads = 1, n_threads_batch = 1

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  2048 |    512 |      0 |    1.845 |  1110.26 |    8.655 |    59.15 |

After:

export CUDA_VISIBLE_DEVICES=6,3,2,4,8,5,9,7,1,0

Adjusted split at layer  0:  GPU0:    0 ; GPU1:    0 ; GPU2:    0 ; GPU3:    0 ; GPU4:    0 ; GPU5:    0 ; GPU6:    0 ; GPU7:  0.5 ; GPU8:    0 ; GPU9:  0.5
Adjusted split at layer 12:  GPU0:    0 ; GPU1:    0 ; GPU2:    0 ; GPU3:    0 ; GPU4:  0.5 ; GPU5:    0 ; GPU6:  0.5 ; GPU7:    0 ; GPU8:    0 ; GPU9:    0
Adjusted split at layer 24:  GPU0:  0.5 ; GPU1:    0 ; GPU2:  0.5 ; GPU3:    0 ; GPU4:    0 ; GPU5:    0 ; GPU6:    0 ; GPU7:    0 ; GPU8:    0 ; GPU9:    0
Adjusted split at layer 36:  GPU0:    0 ; GPU1:    0 ; GPU2:    0 ; GPU3:  0.5 ; GPU4:    0 ; GPU5:  0.5 ; GPU6:    0 ; GPU7:    0 ; GPU8:    0 ; GPU9:    0
Adjusted split at layer 48:  GPU0:    0 ; GPU1:  0.5 ; GPU2:    0 ; GPU3:    0 ; GPU4:    0 ; GPU5:    0 ; GPU6:    0 ; GPU7:    0 ; GPU8:  0.5 ; GPU9:    0
Adjusted splits (total)   :  GPU0:  0.5 ; GPU1:  0.5 ; GPU2:  0.5 ; GPU3:  0.5 ; GPU4:  0.5 ; GPU5:  0.5 ; GPU6:  0.5 ; GPU7:  0.5 ; GPU8:  0.5 ; GPU9:  0.5

The performance is a little bit off, because I was unable to guess the right order of the GPUs.

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  2048 |    512 |      0 |    1.941 |  1055.02 |    8.724 |    58.69 |

So some slight change took place in the way the splits are made?

@magikRUKKOLA

Oh I see. This is probably it:

        auto [layer_sizes, max_compute] = get_layer_sizes(ml, model, cache_type_k, cache_type_v, max_ctx_size, mla_attn, n_seq_max, n_ubatch, amb, flash_attn, experts);

Possibly this change, related to accounting for the experts, is what happened. Now everything behaves more "unpredictably": I was able to find only two configurations out of hundreds where the GPUs are placed exactly where I would like them to be.

@magikRUKKOLA

magikRUKKOLA commented Mar 25, 2026

@ikawrakow

Okay cool. One of the configurations works fine.

export CUDA_VISIBLE_DEVICES=0,3,7,5,4,2,9,1,8,6

================================ max_gpu = 2
Adjusted split at layer  0:  GPU0:    0 ; GPU1:    0 ; GPU2:    0 ; GPU3:    0 ; GPU4:    0 ; GPU5:    0 ; GPU6:    0 ; GPU7:  0.5 ; GPU8:    0 ; GPU9:  0.5
Adjusted split at layer 12:  GPU0:    0 ; GPU1:    0 ; GPU2:    0 ; GPU3:    0 ; GPU4:  0.5 ; GPU5:    0 ; GPU6:  0.5 ; GPU7:    0 ; GPU8:    0 ; GPU9:    0
Adjusted split at layer 24:  GPU0:  0.5 ; GPU1:    0 ; GPU2:  0.5 ; GPU3:    0 ; GPU4:    0 ; GPU5:    0 ; GPU6:    0 ; GPU7:    0 ; GPU8:    0 ; GPU9:    0
Adjusted split at layer 36:  GPU0:    0 ; GPU1:    0 ; GPU2:    0 ; GPU3:  0.5 ; GPU4:    0 ; GPU5:  0.5 ; GPU6:    0 ; GPU7:    0 ; GPU8:    0 ; GPU9:    0
Adjusted split at layer 48:  GPU0:    0 ; GPU1:  0.5 ; GPU2:    0 ; GPU3:    0 ; GPU4:    0 ; GPU5:    0 ; GPU6:    0 ; GPU7:    0 ; GPU8:  0.5 ; GPU9:    0
Adjusted splits (total)   :  GPU0:  0.5 ; GPU1:  0.5 ; GPU2:  0.5 ; GPU3:  0.5 ; GPU4:  0.5 ; GPU5:  0.5 ; GPU6:  0.5 ; GPU7:  0.5 ; GPU8:  0.5 ; GPU9:  0.5
main: n_kv_max = 8192, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 99, n_threads = 1, n_threads_batch = 1

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  2048 |    512 |      0 |    1.848 |  1107.95 |    8.641 |    59.25 |
|  2048 |    512 |   2048 |    1.904 |  1075.75 |    8.711 |    58.77 |
|  2048 |    512 |   4096 |    1.904 |  1075.79 |    8.801 |    58.18 |
|  2048 |    512 |   6144 |    1.924 |  1064.27 |    8.915 |    57.43 |

So given the speeds (and the latencies), the system config is:

  GPU 0: 8x
  GPU 1: 8x
  GPU 2: 4x
  GPU 3: 8x
  GPU 4: 4x
  GPU 5: 4x
  GPU 6: 8x
  GPU 7: 8x
  GPU 8: 8x
  GPU 9: 4x

So the first split gets GPU#7, GPU#9 and the "main" GPUs (GPU#0 and GPU#1). So if we translate it from the CUDA_VISIBLE_DEVICES ...

export CUDA_VISIBLE_DEVICES=0,3,7,5,4,2,9,[1],8,6
GPU#7 -> GPU#1

export CUDA_VISIBLE_DEVICES=0,3,7,5,4,2,9,1,8,[6]
GPU#9 -> GPU#6

export CUDA_VISIBLE_DEVICES=[0],3,7,5,4,2,9,1,8,6
GPU#0 -> GPU#0

export CUDA_VISIBLE_DEVICES=0,[3],7,5,4,2,9,1,8,6
GPU#1 -> GPU#3

So all of the GPUs (#1, #6, #0, #3) are x8.

Then again for split #2:
#4,#6 -> #4,#9: so #4 and #9 are both x4 (good).

split #3:
#0, #7 are both x8. (good)

split #4:
#5, #2 are both x4. (good)

split #5:
#3, #8 are both x8. (good)

So the split config makes sure the first split gets all x8 GPUs, including the "main" ones.
And the rest of the splits make sure that the remaining x4 GPUs (I have four of them) belong to two separate splits. This way the speed is the highest.
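The logical→physical translation worked out above is just indexing into the CUDA_VISIBLE_DEVICES list. A small hypothetical helper (not part of the codebase) to check such mappings:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Parse a CUDA_VISIBLE_DEVICES value; the logical device GPU#i printed in
// the split log corresponds to the physical GPU visible[i].
std::vector<int> parse_visible(const std::string& s) {
    std::vector<int> visible;
    std::stringstream ss(s);
    std::string tok;
    while (std::getline(ss, tok, ',')) visible.push_back(std::stoi(tok));
    return visible;
}
```

For "0,3,7,5,4,2,9,1,8,6" this reproduces the mapping above: logical GPU#7 is physical GPU#1, logical GPU#9 is physical GPU#6, and so on.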

@magikRUKKOLA

Made a plot:
y axis: total time in ms for 8k prefill and 2k decode
x axis: roughly the accumulated latency between the GPUs participating in each split
About 100 permutations of CUDA_VISIBLE_DEVICES with 6 GPUs at x8 and 4 GPUs at x4.

[plot: total inference time vs. accumulated inter-GPU latency]

There is, apparently, a correlation -- lower total latency = lower total inference time.

@Panchovix

I got a small reference point for Kimi K2.5, as I'm getting a segmentation fault.

For 4x5090+2x4090+A6000+A40

With

./llama-server -m '/run/media/pancho/MyDrive/models_llm_2tb/Kimi-K2.5-UD-IQ3_XXS-00001-of-00009.gguf' -c 32768 --no-mmap -mg 0 -ub 2560 -b 2560 --fit

On iklcpp, this is the log:

======================================= HAVE_FANCY_SIMD is defined
------------------- Layer sizes:
Layer  0:    322.03,     36.00,    358.03     2048.00  MiB
Layer  1:   7621.47,     36.00,   7657.47     2048.00  MiB
Layer  2:   7634.95,     36.00,   7670.95     2048.00  MiB
Layer  3:   7072.08,     36.00,   7108.08     2048.00  MiB
Layer  4:   7634.95,     36.00,   7670.95     2048.00  MiB
Layer  5:   7076.89,     36.00,   7112.89     2048.00  MiB
Layer  6:   7626.28,     36.00,   7662.28     2048.00  MiB
Layer  7:   7072.53,     36.00,   7108.53     2048.00  MiB
Layer  8:   6314.22,     36.00,   6350.22     2048.00  MiB
Layer  9:   6314.22,     36.00,   6350.22     2048.00  MiB
Layer 10:   6314.22,     36.00,   6350.22     2048.00  MiB
Layer 11:   6309.86,     36.00,   6345.86     2048.00  MiB
Layer 12:   6314.22,     36.00,   6350.22     2048.00  MiB
Layer 13:   6309.86,     36.00,   6345.86     2048.00  MiB
Layer 14:   6309.86,     36.00,   6345.86     2048.00  MiB
Layer 15:   7617.11,     36.00,   7653.11     2048.00  MiB
Layer 16:   6314.22,     36.00,   6350.22     2048.00  MiB
Layer 17:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 18:   7611.90,     36.00,   7647.90     2048.00  MiB
Layer 19:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 20:   7107.90,     36.00,   7143.90     2048.00  MiB
Layer 21:   6309.86,     36.00,   6345.86     2048.00  MiB
Layer 22:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 23:   6558.51,     36.00,   6594.51     2048.00  MiB
Layer 24:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 25:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 26:   6309.86,     36.00,   6345.86     2048.00  MiB
Layer 27:   6558.51,     36.00,   6594.51     2048.00  MiB
Layer 28:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 29:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 30:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 31:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 32:   6309.86,     36.00,   6345.86     2048.00  MiB
Layer 33:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 34:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 35:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 36:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 37:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 38:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 39:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 40:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 41:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 42:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 43:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 44:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 45:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 46:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 47:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 48:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 49:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 50:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 51:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 52:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 53:   6304.65,     36.00,   6340.65     2048.00  MiB
Layer 54:   7060.65,     36.00,   7096.65     2048.00  MiB
Layer 55:   7060.65,     36.00,   7096.65     2048.00  MiB
Layer 56:   7060.65,     36.00,   7096.65     2048.00  MiB
Layer 57:   7065.86,     36.00,   7101.86     2048.00  MiB
Layer 58:   7608.51,     36.00,   7644.51     2048.00  MiB
Layer 59:   7070.14,     36.00,   7106.14     2048.00  MiB
Layer 60:   7621.92,     36.00,   7657.92     2048.00  MiB
Layer 61:    918.75,      0.00,    918.75 MiB (output layer)
--------------------------------------------------------------------------
Total   : 396633.33,   2196.00, 398829.33 MiB
Memory required for model tensors + cache: 399748 MiB
Memory available on all devices - compute: 244318 MiB
Adding experts CPU overrides for layer 6 in device 0
Adding experts CPU overrides for layer 5 in device 0
Adding experts CPU overrides for layer 4 in device 0
Memory use in device 0 is 23358 MiB after adding 3 overrides, which is less than available memory of 28302 MiB
Adding experts CPU overrides for layer 13 in device 1
Adding experts CPU overrides for layer 12 in device 1
Adding experts CPU overrides for layer 11 in device 1
Memory use in device 1 is 26679 MiB after adding 3 overrides, which is less than available memory of 28302 MiB
Adding experts CPU overrides for layer 20 in device 2
Adding experts CPU overrides for layer 19 in device 2
Adding experts CPU overrides for layer 18 in device 2
Memory use in device 2 is 27200 MiB after adding 3 overrides, which is less than available memory of 28302 MiB
Adding experts CPU overrides for layer 27 in device 3
Adding experts CPU overrides for layer 26 in device 3
Adding experts CPU overrides for layer 25 in device 3
Memory use in device 3 is 26128 MiB after adding 3 overrides, which is less than available memory of 28302 MiB
Adding experts CPU overrides for layer 32 in device 4
Adding experts CPU overrides for layer 31 in device 4
Memory use in device 4 is 19360 MiB after adding 2 overrides, which is less than available memory of 20478 MiB
Adding experts CPU overrides for layer 38 in device 5
Adding experts CPU overrides for layer 37 in device 5
Adding experts CPU overrides for layer 36 in device 5
Memory use in device 5 is 19521 MiB after adding 3 overrides, which is less than available memory of 20478 MiB
Adding experts CPU overrides for layer 49 in device 6
Adding experts CPU overrides for layer 48 in device 6
Adding experts CPU overrides for layer 47 in device 6
Adding experts CPU overrides for layer 46 in device 6
Memory use in device 6 is 45051 MiB after adding 4 overrides, which is less than available memory of 45076 MiB
Adding experts CPU overrides for layer 60 in device 7
Adding experts CPU overrides for layer 59 in device 7
Adding experts CPU overrides for layer 58 in device 7
Adding experts CPU overrides for layer 57 in device 7
Adding experts CPU overrides for layer 56 in device 7
Memory use in device 7 is 41339 MiB after adding 5 overrides, which is less than available memory of 45076 MiB

Then it logs:

Tensor blk.4.ffn_up_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.4.ffn_gate_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.4.ffn_down_exps.weight (size = 2856.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.5.ffn_up_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.5.ffn_gate_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.5.ffn_down_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.6.ffn_up_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.6.ffn_gate_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.6.ffn_down_exps.weight (size = 2856.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.11.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.11.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.11.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.12.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.12.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.12.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.13.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.13.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.13.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.18.ffn_up_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.18.ffn_gate_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.18.ffn_down_exps.weight (size = 2856.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.19.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.19.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.19.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.20.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.20.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.20.ffn_down_exps.weight (size = 2856.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.25.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.25.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.25.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.26.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.26.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.26.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.27.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.27.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.27.ffn_down_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.31.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.31.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.31.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.32.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.32.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.32.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.36.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.36.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.36.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.37.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.37.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.37.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.38.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.38.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.38.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.46.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.46.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.46.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.47.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.47.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.47.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.48.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.48.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.48.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.49.ffn_up_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.49.ffn_gate_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.49.ffn_down_exps.weight (size = 2058.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.56.ffn_up_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.56.ffn_gate_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.56.ffn_down_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.57.ffn_up_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.57.ffn_gate_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.57.ffn_down_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.58.ffn_up_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.58.ffn_gate_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.58.ffn_down_exps.weight (size = 2856.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.59.ffn_up_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.59.ffn_gate_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.59.ffn_down_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.60.ffn_up_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.60.ffn_gate_exps.weight (size = 2310.00 MiB) buffer type overriden to CUDA_Host
Tensor blk.60.ffn_down_exps.weight (size = 2856.00 MiB) buffer type overriden to CUDA_Host
llm_load_tensors: ggml ctx size =   43.32 MiB
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors:  CUDA_Host buffer size = 171738.00 MiB
llm_load_tensors:      CUDA0 buffer size = 22809.14 MiB
llm_load_tensors:      CUDA1 buffer size = 26129.61 MiB
llm_load_tensors:      CUDA2 buffer size = 26650.80 MiB
llm_load_tensors:      CUDA3 buffer size = 25579.21 MiB
llm_load_tensors:      CUDA4 buffer size = 18967.97 MiB
llm_load_tensors:      CUDA5 buffer size = 19050.93 MiB
llm_load_tensors:      CUDA6 buffer size = 44187.70 MiB
llm_load_tensors:      CUDA7 buffer size = 40476.29 MiB
.......................................... =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 0->1
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 0->2
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 0->3
.......~ggml_backend_cuda_context: have 0 graphs
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->0
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->2
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->3
......~ggml_backend_cuda_context: have 0 graphs
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 2->0
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 2->1
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 2->3
.......~ggml_backend_cuda_context: have 0 graphs
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 3->0
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 3->1
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 3->2
......~ggml_backend_cuda_context: have 0 graphs
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 4->5
.....~ggml_backend_cuda_context: have 0 graphs
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 5->4
.....~ggml_backend_cuda_context: have 0 graphs
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 6->7
...........~ggml_backend_cuda_context: have 0 graphs
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 7->6
..........~ggml_backend_cuda_context: have 0 graphs

And finally, I get:

llama_kv_cache_init:      CUDA0 KV buffer size =   252.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   252.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   252.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =   252.00 MiB
llama_kv_cache_init:      CUDA4 KV buffer size =   180.00 MiB
llama_kv_cache_init:      CUDA5 KV buffer size =   216.00 MiB
llama_kv_cache_init:      CUDA6 KV buffer size =   396.00 MiB
llama_kv_cache_init:      CUDA7 KV buffer size =   396.00 MiB
llama_init_from_model: KV self size  = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
llama_init_from_model:  CUDA_Host  output buffer size =     0.62 MiB
Segmentation fault (core dumped)

On llama.cpp, for reference:

srv    load_model: loading model '/run/media/pancho/MyDrive/models_llm_2tb/Kimi-K2.5-UD-IQ3_XXS-00001-of-00009.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 5090):  32110 total,  53439 used, -21839 free vs. target of   1024
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 5090):  32110 total,  45492 used, -13892 free vs. target of   1024
llama_params_fit_impl:   - CUDA2 (NVIDIA GeForce RTX 5090):  32110 total,  48876 used, -17275 free vs. target of   1024
llama_params_fit_impl:   - CUDA3 (NVIDIA GeForce RTX 5090):  32110 total,  45951 used, -14351 free vs. target of   1024
llama_params_fit_impl:   - CUDA4 (NVIDIA GeForce RTX 4090):  24082 total,  39145 used, -15457 free vs. target of   1024
llama_params_fit_impl:   - CUDA5 (NVIDIA GeForce RTX 4090):  24082 total,  32842 used,  -9154 free vs. target of   1024
llama_params_fit_impl:   - CUDA6 (NVIDIA A40)             :  48541 total,  70630 used, -22360 free vs. target of   1024
llama_params_fit_impl:   - CUDA7 (NVIDIA RTX A6000)       :  48541 total,  72126 used, -23855 free vs. target of   1024
llama_params_fit_impl: projected to use 408503 MiB of device memory vs. 270317 MiB of free device memory
llama_params_fit_impl: cannot meet free memory targets on all devices, need to use 146377 MiB less in total
llama_params_fit_impl: context size set by user to 32768 -> no change
llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 239106 MiB
llama_params_fit_impl: filling dense-only layers back-to-front:
llama_params_fit_impl:   - CUDA7 (NVIDIA RTX A6000)       : 62 layers,  10759 MiB used,  37510 MiB free
llama_params_fit_impl:   - CUDA6 (NVIDIA A40)             :  0 layers,      0 MiB used,  48270 MiB free
llama_params_fit_impl:   - CUDA5 (NVIDIA GeForce RTX 4090):  0 layers,      0 MiB used,  23688 MiB free
llama_params_fit_impl:   - CUDA4 (NVIDIA GeForce RTX 4090):  0 layers,      0 MiB used,  23688 MiB free
llama_params_fit_impl:   - CUDA3 (NVIDIA GeForce RTX 5090):  0 layers,      0 MiB used,  31600 MiB free
llama_params_fit_impl:   - CUDA2 (NVIDIA GeForce RTX 5090):  0 layers,      0 MiB used,  31600 MiB free
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 5090):  0 layers,      0 MiB used,  31600 MiB free
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 5090):  0 layers,   5239 MiB used,  26360 MiB free
llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 5090):  5 layers ( 1 overflowing),  30296 MiB used,   1303 MiB free
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 5090):  4 layers ( 1 overflowing),  30525 MiB used,   1074 MiB free
llama_params_fit_impl:   - CUDA2 (NVIDIA GeForce RTX 5090):  4 layers ( 1 overflowing),  28712 MiB used,   2887 MiB free
llama_params_fit_impl:   - CUDA3 (NVIDIA GeForce RTX 5090):  5 layers ( 1 overflowing),  30040 MiB used,   1559 MiB free
llama_params_fit_impl:   - CUDA4 (NVIDIA GeForce RTX 4090):  3 layers ( 1 overflowing),  21615 MiB used,   2072 MiB free
llama_params_fit_impl:   - CUDA5 (NVIDIA GeForce RTX 4090):  3 layers ( 1 overflowing),  21116 MiB used,   2571 MiB free
llama_params_fit_impl:   - CUDA6 (NVIDIA A40)             :  7 layers ( 1 overflowing),  46013 MiB used,   2256 MiB free
llama_params_fit_impl:   - CUDA7 (NVIDIA RTX A6000)       : 31 layers (26 overflowing),  45600 MiB used,   2670 MiB free
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 12.45 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:15:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 5090) (0000:16:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 5090) (0000:17:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA3 (NVIDIA GeForce RTX 5090) (0000:18:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA4 (NVIDIA GeForce RTX 4090) (0000:0b:00.0) - 23688 MiB free
llama_model_load_from_file_impl: using device CUDA5 (NVIDIA GeForce RTX 4090) (0000:0e:00.0) - 23688 MiB free
llama_model_load_from_file_impl: using device CUDA6 (NVIDIA A40) (0000:07:00.0) - 48270 MiB free
llama_model_load_from_file_impl: using device CUDA7 (NVIDIA RTX A6000) (0000:0f:00.0) - 48270 MiB free
llama_model_loader: additional 8 GGUFs metadata loaded
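The llama.cpp log above reports a two-phase fit: dense-only layers are first filled back-to-front, then upgraded to full layers front-to-back, overflowing to the next device or system memory when a device's margin would be violated. A greatly simplified greedy sketch of this kind of placement (all sizes, names, and the single-pass structure are illustrative, not llama.cpp's actual code):

```python
def fit_layers(free_mib, layer_mib, margin=1024):
    """Greedily place layers on devices front-to-back; overflow goes to 'CPU'.

    free_mib:  free VRAM per device, in MiB
    layer_mib: size of each layer's weights, in MiB
    margin:    safety margin left unused per device (compute buffers etc.)
    """
    budget = [f - margin for f in free_mib]
    placement = []
    dev = 0
    for size in layer_mib:
        # Advance to the first device that still has room for this layer.
        while dev < len(budget) and budget[dev] < size:
            dev += 1
        if dev < len(budget):
            budget[dev] -= size
            placement.append(dev)
        else:
            placement.append("CPU")  # no device fits: keep the layer in RAM
    return placement
```

With sample numbers like the log's (e.g. `fit_layers([31600, 23688], [6000] * 10)`), the first five layers land on device 0, three more on device 1, and the remaining two stay in system memory.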

And then:

load_tensors:          CPU model buffer size =   630.00 MiB
load_tensors:        CUDA0 model buffer size = 24877.11 MiB
load_tensors:        CUDA1 model buffer size = 28971.77 MiB
load_tensors:        CUDA2 model buffer size = 27158.88 MiB
load_tensors:        CUDA3 model buffer size = 28500.94 MiB
load_tensors:        CUDA4 model buffer size = 20093.71 MiB
load_tensors:        CUDA5 model buffer size = 19594.92 MiB
load_tensors:        CUDA6 model buffer size = 44348.00 MiB
load_tensors:        CUDA7 model buffer size = 42710.30 MiB
load_tensors:    CUDA_Host model buffer size = 158704.00 MiB

And then it works fine. What could be the issue? My system is pretty out of the ordinary, though.

@magikRUKKOLA

@Panchovix

I am having the same issue with GLM-5 after #1506.

Try to roll back.

@Panchovix

Panchovix commented Mar 25, 2026

Thanks, I seem to get another issue.

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 3686.00 MiB on device 6: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA6 buffer of size 3865055232
llama_init_from_model: failed to allocate compute buffers
~ggml_backend_cuda_context: have 0 graphs
~ggml_backend_cuda_context: have 0 graphs
~ggml_backend_cuda_context: have 0 graphs
~ggml_backend_cuda_context: have 0 graphs
~ggml_backend_cuda_context: have 0 graphs
~ggml_backend_cuda_context: have 0 graphs
~ggml_backend_cuda_context: have 0 graphs
~ggml_backend_cuda_context: have 0 graphs
llama_init_from_gpt_params: error: failed to create context with model '/run/media/pancho/MyDrive/models_llm_2tb/Kimi-K2.5-UD-IQ3_XXS-00001-of-00009.gguf'
 ERR [              load_model] unable to load model | tid="140162646126592" timestamp=1774483147 model="/run/media/pancho/MyDrive/models_llm_2tb/Kimi-K2.5-UD-IQ3_XXS-00001-of-00009.gguf"
Segmentation fault (core dumped)

Using a manual -ot after reverting #1506 works, but this took hours to get to lol

./llama-server \
  -m '/run/media/pancho/MyDrive/models_llm_2tb/Kimi-K2.5-UD-IQ3_XXS-00001-of-00009.gguf' \
  -c 32768 \
  -ngl 999 \
  --no-mmap \
  -ot "blk.(0|1|2|3).ffn.=CUDA0,blk.(4|5|6|7).ffn.=CUDA1,blk.(8|9|10|11).ffn.=CUDA2,blk.(12|13|14|15).ffn.=CUDA3,blk.(16|17|18).ffn.=CUDA4,blk.(19|20|21).ffn.=CUDA5,blk.(22|23|24|25|26|27).ffn.=CUDA6,blk.(28|29|30|31|32|33).ffn.=CUDA7,blk.7.ffn_up_exps.weight=CUDA4,blk.34.ffn_gate_exps.weight=CUDA2,blk.34.ffn_up_exps.weight=CUDA3,blk.34.ffn_down_exps.weight=CUDA6,blk.35.ffn_gate_exps.weight=CUDA6,blk.35.ffn_down_exps.weight=CUDA7,blk.35.ffn_up_exps.weight=CUDA7,blk.36.ffn_gate_exps.weight=CUDA0,blk.36.ffn_down_exps.weight=CUDA5,exps=CPU" \
  -mg 0 \
  -ub 2560 -b 2560 -no-fmoe -mla 1
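For context, `-ot`/`--override-tensor` maps tensor names to backends via comma-separated `regex=DEVICE` rules, with the first matching rule winning (which is why the broad `exps=CPU` catch-all comes last in the command above). A minimal, hypothetical sketch of that matching (`parse_overrides` and `place_tensor` are invented names for illustration, not the actual implementation):

```python
import re

def parse_overrides(spec):
    """Parse a comma-separated list of 'regex=DEVICE' rules."""
    return [(re.compile(pat), dev)
            for pat, dev in (rule.rsplit("=", 1) for rule in spec.split(","))]

def place_tensor(name, overrides, default="GPU"):
    """Return the device of the first rule whose regex matches the tensor name."""
    for pattern, device in overrides:
        if pattern.search(name):
            return device
    return default  # no override matched: fall back to normal placement

rules = parse_overrides("blk.(0|1).ffn.=CUDA0,exps=CPU")
```

Note that the dots in a pattern like `blk.(0|1).ffn.` are regex wildcards, so they match both `.` and `_` in tensor names such as `blk.0.ffn_up_exps.weight`.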

@ikawrakow
Owner Author

@Panchovix @magikRUKKOLA

The segmentation fault is fixed in #1515

@Panchovix

If you get OOM while allocating the compute buffers, increase the safety margin. E.g., --fit-margin 2048.
