
Allow empty splits #1029

Merged
ikawrakow merged 3 commits into main from ik/allow_empty_splits
Dec 3, 2025

Conversation

@ikawrakow
Owner

The main purpose of this PR is to allow empty splits when using split mode "graph". If, for instance, one has 4 GPUs and split mode "graph" gives bad performance, one could try using

-sm graph -ts 100,100,0,0 -ot "...=GPU2,...=GPU3"

to put all attention tensors, shared-expert tensors, and the KV cache on GPUs 0 and 1, and then use GPUs 2 and 3 to offload MoE tensors. In that case tensor parallelism (and the corresponding synchronization plus data-exchange overhead) for attention and shared experts is done on only 2 GPUs, hopefully resulting in better performance.
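To illustrate how empty splits behave, here is a simplified Python sketch (assumed proportional-split semantics, not the actual ik_llama.cpp code): devices whose `-ts` share is zero end up with an empty slice, leaving them free to hold only the tensors assigned via `-ot`.

```python
def split_ranges(ts, n_rows):
    """Partition n_rows across devices proportionally to the ts shares.

    Devices with a zero share get an empty range (start == end),
    which is what this PR now allows for split mode "graph".
    """
    total = sum(ts)
    ranges, prev, cum = [], 0, 0.0
    for share in ts:
        cum += share
        end = round(n_rows * cum / total)  # cumulative rounding, no drift
        ranges.append((prev, end))
        prev = end
    return ranges

# -ts 100,100,0,0 over e.g. 96 attention heads: GPUs 2 and 3 stay empty.
print(split_ranges([100, 100, 0, 0], 96))  # [(0, 48), (48, 96), (96, 96), (96, 96)]
```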

@Ph0rk0z Can you try this? Hopefully I have not forgotten to add checks for empty splits everywhere they are needed. Perhaps it is also worth trying on @magikRUKKOLA's system with 3 GPUs.

@Ph0rk0z

Ph0rk0z commented Dec 3, 2025

I tried and got this:

llama_kv_cache_init: KV cache size per device:
    Device 0:  3119.5 MiB
    Device 1:  3136.5 MiB
    Device 2:  0 MiB
    Device 3:  0 MiB
llama_new_context_with_model: KV self size  = 6256.00 MiB, K (q8_0): 3128.00 MiB, V (q8_0): 3128.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
/home/supermicro/ai/ik_llama.cpp/src/llama-build-context.cpp:656: GGML_ASSERT((!split_u && !split_g && split_d) || (split_u && split_g && split_d)) failed
Could not attach to process.  If your uid matches the uid of the target

I did also check over my system and found that SW_POWERCAP activates momentarily and then goes away in ik_llama, even at low utilization. It could be the undervolting/clock locking, but I observed it with that disabled and while using 2x GPU, etc. I will check other inference engines, but WAN over NCCL utilizes 90% GPU and runs the cards over 250 W for 10 minutes at a time (I also checked at the wall). IK is showing them at 140-180 W. Too cold to try without risers/fewer GPUs/etc.

Counters show it spent a lot of time in that state tho:


        SW Power Cap                      : Not Active
        SW Thermal Slowdown               : Not Active
        SW Power Capping                  : 22672859915 us
        SW Thermal Slowdown               : 0 us
        SW Power Cap                      : Not Active
        SW Thermal Slowdown               : Not Active
        SW Power Capping                  : 497266165 us
        SW Thermal Slowdown               : 0 us
        SW Power Cap                      : Not Active
        SW Thermal Slowdown               : Not Active
        SW Power Capping                  : 22551501190 us
        SW Thermal Slowdown               : 0 us
        SW Power Cap                      : Not Active
        SW Thermal Slowdown               : Not Active
        SW Power Capping                  : 15335928152 us
        SW Thermal Slowdown               : 0 us
        SW Power Cap                      : Not Active
        SW Thermal Slowdown               : Not Active
        SW Power Capping                  : 30061982 us
        SW Thermal Slowdown               : 0 us

About the only lead I have.

@ikawrakow
Owner Author

ikawrakow commented Dec 3, 2025

This assert was a typo, sorry. It should work now.

@Ph0rk0z

Ph0rk0z commented Dec 3, 2025

This was actually a bit of an ingenious workaround. At 8192 I get decent PP.

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 8192 | 256 |     0 | 36.739 |   222.98 | 26.913 |     9.51 |
| 8192 | 256 |  8192 | 38.925 |   210.46 | 28.675 |     8.93 |
| 8192 | 256 | 16384 | 41.160 |   199.03 | 30.652 |     8.35 |
| 8192 | 256 | 24576 | 43.579 |   187.98 | 32.596 |     7.85 |

Granted, the TG suffers.
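For reference, the throughput columns in the table are just the token counts divided by the elapsed times; e.g. for the first row:

```python
# First row of the sweep-bench table above.
PP, TG = 8192, 256          # prompt tokens, generated tokens
T_PP, T_TG = 36.739, 26.913  # elapsed seconds

s_pp = PP / T_PP  # prompt-processing speed, t/s
s_tg = TG / T_TG  # token-generation speed, t/s
print(f"S_PP = {s_pp:.2f} t/s, S_TG = {s_tg:.2f} t/s")  # S_PP = 222.98 t/s, S_TG = 9.51 t/s
```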
(attached screenshot: emptycap)

The power cap may just be the nvidia-persistence daemon doing its thing, though. Perhaps this is also a NUMA issue, since this model doesn't fit on one node, and TG on a single node is slower for me either way.

@ikawrakow
Owner Author

@Ph0rk0z All we need to do now is to figure out why your TG performance is so bad...

If you feel like experimenting, edit `ggml/src/ggml.c` and change this line

#define IK_PRINT_TIMING 0

to

#define IK_PRINT_TIMING 1

Rebuild, and run llama-bench for just one repetition without warm-up, redirecting the output to a file. E.g.,

./bin/llama-bench -m model $other_graph_split_model_loading_args -p 0 -n 128 -t 48 -w 0 -r 1 >output.txt

Then upload the generated file here (I think it needs to have a .txt extension, else GitHub will reject the upload). I need `llama-bench` and not `llama-sweep-bench` because we are only interested in what happens during TG.
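Once the timing file exists, something like the following could aggregate per-op totals to see which ops dominate TG. This is a hypothetical sketch: the actual `IK_PRINT_TIMING` line format may differ, so the regex would need adjusting.

```python
import re
from collections import defaultdict

# Assumed line format "OP_NAME <microseconds> us" -- adjust to the real output.
SAMPLE = """\
MUL_MAT     1234 us
SOFT_MAX      56 us
MUL_MAT     2000 us
"""

def aggregate(text):
    """Sum the reported microseconds per op name."""
    totals = defaultdict(int)
    for line in text.splitlines():
        m = re.match(r"(\S+)\s+(\d+)\s+us", line)
        if m:
            totals[m.group(1)] += int(m.group(2))
    return dict(totals)

print(aggregate(SAMPLE))  # {'MUL_MAT': 3234, 'SOFT_MAX': 56}
```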

@ikawrakow ikawrakow merged commit 0896171 into main Dec 3, 2025
@Ph0rk0z

Ph0rk0z commented Dec 3, 2025

Heh.. that's pretty cool.
output.txt

I guess it will show you what ops are really laggy.

@Panchovix

Panchovix commented Dec 4, 2025

Wow, this PR is such an ingenious idea! I have two 5090s at X8/X8 PCIe 5.0 from the CPU, and the rest connected via slower links.

I.e. I have my devices ordered like this: 5090, 4090, 4090, 5090, A6000, A40, and I load GLM 4.5 IQ4_XS like this

./llama-server \
  -m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.5-IQ4_XS.gguf' \
  -c 32768 \
  --no-mmap \
  -ngl 999 \
  -ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16).ffn.=CUDA0" \
  -ot "blk.(17|18|19|20|21|22|23|24|25|26).ffn.=CUDA1" \
  -ot "blk.(27|28|29|30|31|32|33|34|35|36).ffn.=CUDA2" \
  -ot "blk.(37|38|39|40|41|42|43|44|45|46|47|48|49|50).ffn.=CUDA3" \
  -ot "blk.(51|52|53|54|55|56|57|58|59|60).ffn.=CUDA4" \
  -ot "blk.(61|62|63|64|65|66|67|68|69|70).ffn.=CUDA4" \
  -ot "blk.(71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92).ffn.=CUDA5" \
  -mg 0 \
  -ub 1792

To make the TP part happen on the 5090s, I would have to add

-sm graph -ts 100,0,0,100,0,0

If I understood correctly?

Or would it be better to reorder the devices to have 5090,5090 first and then the other GPUs?
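If it helps, the positional mapping can be sanity-checked with a trivial sketch (assuming `-ts` entries correspond to devices in enumeration order), which suggests the 100,0,0,100,0,0 split would land on the two 5090s without reordering:

```python
devices = ["5090", "4090", "4090", "5090", "A6000", "A40"]
ts = [100, 0, 0, 100, 0, 0]

# Only devices with a nonzero -ts share take part in the graph-mode split.
participants = [d for d, share in zip(devices, ts) if share > 0]
print(participants)  # ['5090', '5090']
```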

EDIT: Okay, I tried a lot, but the 2nd 5090 always gets a CUDA illegal memory access, sadly.

running with

./llama-server \
  -m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.6-IQ4_XS.gguf' \
  -c 8192 \
  --no-mmap \
  -ngl 999 \
  -sm graph -ts 100,100,0,0,0,0 \
  -ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14).ffn.=CUDA0" \
  -ot "blk.(15|16|17|18|19|20|21|22|23|24).ffn.=CUDA1" \
  -ot "blk.(25|26|27|28|29|30|31|32|33|34).ffn.=CUDA2" \
  -ot "blk.(35|36|37|38|39|40|41|42|43|44).ffn.=CUDA3" \
  -ot "blk.(45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68).ffn.=CUDA4" \
  -ot "blk.(69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92).ffn.=CUDA5" \
  -mg 0 -ub 1792

I got


Estimated model buffer size per device:
    Device 0:   3521.96 MiB
    Device 1:   3523.24 MiB
    Device 2:      3.65 MiB
    Device 3:      3.65 MiB
    Device 4:      3.65 MiB
    Device 5:      3.65 MiB
llm_load_tensors: offloading 93 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 94/94 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   416.25 MiB
llm_load_tensors: CUDA_Split buffer size =  7061.06 MiB
llm_load_tensors:      CUDA0 buffer size = 23471.95 MiB
llm_load_tensors:      CUDA1 buffer size = 19320.90 MiB
llm_load_tensors:      CUDA2 buffer size = 19320.90 MiB
llm_load_tensors:      CUDA3 buffer size = 19320.90 MiB
llm_load_tensors:      CUDA4 buffer size = 46370.16 MiB
llm_load_tensors:      CUDA5 buffer size = 44438.07 MiB
....................................................................................................
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 1792
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: attn_max_b    = 0
llama_new_context_with_model: fused_moe     = 1
llama_new_context_with_model: grouped er    = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad    = 1
llama_new_context_with_model: rope_cache    = 0
llama_new_context_with_model: graph_reuse   = 0
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: ser           = -1, 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init: CUDA_Split KV buffer size =  2944.13 MiB
llama_kv_cache_init: KV cache size per device:
    Device 0:  1472 MiB
    Device 1:  1472 MiB
    Device 2:  0 MiB
    Device 3:  0 MiB
    Device 4:  0 MiB
    Device 5:  0 MiB
llama_new_context_with_model: KV self size  = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   586.91 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  2142.00 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA5 compute buffer size =   470.15 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    63.02 MiB
llama_new_context_with_model: graph nodes  = 6489
llama_new_context_with_model: graph splits = 587
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
CUDA error: an illegal memory access was encountered
  current device: 1, in function ggml_backend_cuda_synchronize at /run/media/pancho/MX500/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:3504
  cudaStreamSynchronize(cuda_ctx->stream())
/run/media/pancho/MX500/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:124: CUDA error

-sm row or default works fine.

@ikawrakow
Owner Author

@Panchovix It should work like that, so there is a remaining bug lurking around. If you could run the command you used under cuda-gdb, that would be helpful. Just

cuda-gdb --args paste_your_command_here

(and then type `run` when you see the cuda-gdb prompt).
When it crashes, it will tell you in which kernel the error occurred; post that here. Thanks!

@Panchovix

@ikawrakow Okay, here is the log!

Estimated model buffer size per device:
    Device 0:   3521.96 MiB
    Device 1:   3523.24 MiB
    Device 2:      3.65 MiB
    Device 3:      3.65 MiB
    Device 4:      3.65 MiB
    Device 5:      3.65 MiB
llm_load_tensors: offloading 93 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 94/94 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   416.25 MiB
llm_load_tensors: CUDA_Split buffer size =  7061.06 MiB
llm_load_tensors:      CUDA0 buffer size = 23471.95 MiB
llm_load_tensors:      CUDA1 buffer size = 19320.90 MiB
llm_load_tensors:      CUDA2 buffer size = 19320.90 MiB
llm_load_tensors:      CUDA3 buffer size = 19320.90 MiB
llm_load_tensors:      CUDA4 buffer size = 46370.16 MiB
llm_load_tensors:      CUDA5 buffer size = 44438.07 MiB
....................................................................................................
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 1792
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: attn_max_b    = 0
llama_new_context_with_model: fused_moe     = 1
llama_new_context_with_model: grouped er    = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad    = 1
llama_new_context_with_model: rope_cache    = 0
llama_new_context_with_model: graph_reuse   = 0
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: ser           = -1, 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init: CUDA_Split KV buffer size =  2944.13 MiB
llama_kv_cache_init: KV cache size per device:
    Device 0:  1472 MiB
    Device 1:  1472 MiB
    Device 2:  0 MiB
    Device 3:  0 MiB
    Device 4:  0 MiB
    Device 5:  0 MiB
llama_new_context_with_model: KV self size  = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   586.91 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  2142.00 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA5 compute buffer size =   470.15 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    63.02 MiB
llama_new_context_with_model: graph nodes  = 6489
llama_new_context_with_model: graph splits = 587
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: Returning 910 (CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE) from cuGraphExecUpdate_v2
warning: Cuda Runtime API error detected: cudaGraphExecUpdate returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

warning: Cuda Runtime API error detected: cudaGraphExecUpdate returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

warning: Cuda Runtime API error detected: cudaGetLastError returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: Returning 910 (CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE) from cuGraphExecUpdate_v2
warning: Cuda Runtime API error detected: cudaGraphExecUpdate returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

warning: Cuda Runtime API error detected: cudaGraphExecUpdate returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

warning: Cuda Runtime API error detected: cudaGetLastError returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: Returning 910 (CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE) from cuGraphExecUpdate_v2
warning: Cuda Runtime API error detected: cudaGraphExecUpdate returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

[Thread 0x7ffcccffd000 (LWP 4657) exited]
[Thread 0x7ffccd7fe000 (LWP 4656) exited]
[Thread 0x7ffccdfff000 (LWP 4655) exited]
[Thread 0x7ffd06dde000 (LWP 4653) exited]
[Thread 0x7ffd34dda000 (LWP 4652) exited]
[Thread 0x7ffd355db000 (LWP 4651) exited]
[Thread 0x7ffd36ffd000 (LWP 4650) exited]
[Thread 0x7ffd377fe000 (LWP 4649) exited]
[Thread 0x7ffd37fff000 (LWP 4648) exited]
[Thread 0x7fff74949000 (LWP 4647) exited]
[Thread 0x7fff7514a000 (LWP 4646) exited]
[Thread 0x7fff76dff000 (LWP 4619) exited]
[Thread 0x7fff781ff000 (LWP 4618) exited]
[Thread 0x7ffff7b6c000 (LWP 4614) exited]
[Thread 0x7ffcedfff000 (LWP 4654) exited]
[New process 4614]

Program terminated with signal SIGKILL, Killed.
The program no longer exists

@magikRUKKOLA

magikRUKKOLA commented Dec 5, 2025

@Panchovix

I believe you should build in Debug mode prior to running cuda-gdb:

cmake --build build --config Debug -j $(nproc)

Moreover, you'd have to run `bt full` once the error has been encountered.

@Panchovix

@magikRUKKOLA I did build with Debug now, but the log seems to be the same.

Here is the full log, omitting the tensor-to-GPU assignments, as they are too long for the log.

pancho@fedora:/run/media/pancho/MX500/ChatIAs/ik_llama.cpp/lenux/bin$ cuda-gdb --args ./llama-server   -m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.6-IQ4_XS.gguf'   -c 8192   --no-mmap   -ngl 999 -sm graph -ts 100,100,0,0,0,0   -ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14).ffn.=CUDA0"   -ot "blk.(15|16|17|18|19|20|21|22|23|24).ffn.=CUDA1"   -ot "blk.(25|26|27|28|29|30|31|32|33|34).ffn.=CUDA2" -ot "blk.(35|36|37|38|39|40|41|42|43|44).ffn.=CUDA3"   -ot "blk.(45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68).ffn.=CUDA4" -ot "blk.(69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92).ffn.=CUDA5"   -mg 0   -ub 1792
NVIDIA (R) cuda-gdb 13.0
Portions Copyright (C) 2007-2025 NVIDIA Corporation
Based on GNU gdb 14.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This CUDA-GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://forums.developer.nvidia.com/c/developer-tools/cuda-developer-tools/cuda-gdb>.
Find the CUDA-GDB manual and other documentation resources online at:
    <https://docs.nvidia.com/cuda/cuda-gdb/index.html>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./llama-server...
(No debugging symbols found in ./llama-server)
(cuda-gdb) run
Starting program: /run/media/pancho/MX500/ChatIAs/ik_llama.cpp/lenux/bin/llama-server -m /run/media/pancho/MyDrive/models_llm_2tb/GLM-4.6-IQ4_XS.gguf -c 8192 --no-mmap -ngl 999 -sm graph -ts 100,100,0,0,0,0 -ot blk.\(0\|1\|2\|3\|4\|5\|6\|7\|8\|9\|10\|11\|12\|13\|14\).ffn.=CUDA0 -ot blk.\(15\|16\|17\|18\|19\|20\|21\|22\|23\|24\).ffn.=CUDA1 -ot blk.\(25\|26\|27\|28\|29\|30\|31\|32\|33\|34\).ffn.=CUDA2 -ot blk.\(35\|36\|37\|38\|39\|40\|41\|42\|43\|44\).ffn.=CUDA3 -ot blk.\(45\|46\|47\|48\|49\|50\|51\|52\|53\|54\|55\|56\|57\|58\|59\|60\|61\|62\|63\|64\|65\|66\|67\|68\).ffn.=CUDA4 -ot blk.\(69\|70\|71\|72\|73\|74\|75\|76\|77\|78\|79\|80\|81\|82\|83\|84\|85\|86\|87\|88\|89\|90\|91\|92\).ffn.=CUDA5 -mg 0 -ub 1792
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fff781ff000 (LWP 7335)]
[New Thread 0x7fff76dff000 (LWP 7336)]
[Detaching after fork from child process 7337]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32111 MiB
  Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32111 MiB
  Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, VRAM: 24082 MiB
  Device 3: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, VRAM: 24082 MiB
  Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes, VRAM: 48541 MiB
  Device 5: NVIDIA A40, compute capability 8.6, VMM: yes, VRAM: 48541 MiB
INFO [                    main] build info | tid="140737349337088" timestamp=1764944870 build=4042 commit="b715342e"
INFO [                    main] system info | tid="140737349337088" timestamp=1764944870 n_threads=12 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
warning: Cuda Driver error detected: No CUDA context is current to the calling thread
warning: Cuda Driver error detected: Returning 201 (CUDA_ERROR_INVALID_CONTEXT) from cuCtxGetDevice_v2
[New Thread 0x7fff7514a000 (LWP 7363)]
[New Thread 0x7fff74949000 (LWP 7364)]
[New Thread 0x7ffd37fff000 (LWP 7365)]
CUDA0: using device CUDA0 - 31578 MiB free
[New Thread 0x7ffd377fe000 (LWP 7366)]
[New Thread 0x7ffd36ffd000 (LWP 7367)]
CUDA1: using device CUDA1 - 31578 MiB free
[New Thread 0x7ffd355db000 (LWP 7368)]
[New Thread 0x7ffd34dda000 (LWP 7369)]
CUDA2: using device CUDA2 - 23666 MiB free
[New Thread 0x7ffd06dde000 (LWP 7370)]
[New Thread 0x7ffcedfff000 (LWP 7371)]
CUDA3: using device CUDA3 - 23666 MiB free
[New Thread 0x7ffccdfff000 (LWP 7372)]
[New Thread 0x7ffccd7fe000 (LWP 7373)]
CUDA4: using device CUDA4 - 48252 MiB free
[New Thread 0x7ffcccffd000 (LWP 7374)]
CUDA5: using device CUDA5 - 48252 MiB free
llama_model_loader: loaded meta data with 57 key-value pairs and 1759 tensors from /run/media/pancho/MyDrive/models_llm_2tb/GLM-4.6-IQ4_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm4moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Glm-4.6
llama_model_loader: - kv   3:                            general.version str              = 4.6
llama_model_loader: - kv   4:                           general.basename str              = Glm-4.6
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 160x19B
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = GLM 4.6
llama_model_loader: - kv  11:               general.base_model.0.version str              = 4.6
llama_model_loader: - kv  12:          general.base_model.0.organization str              = Zai Org
llama_model_loader: - kv  13:              general.base_model.0.repo_url str              = https://huggingface.co/zai-org/GLM-4.6
llama_model_loader: - kv  14:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  15:                          general.languages arr[str,2]       = ["en", "zh"]
llama_model_loader: - kv  16:                        glm4moe.block_count u32              = 93
llama_model_loader: - kv  17:                     glm4moe.context_length u32              = 202752
llama_model_loader: - kv  18:                   glm4moe.embedding_length u32              = 5120
llama_model_loader: - kv  19:                glm4moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  20:               glm4moe.attention.head_count u32              = 96
llama_model_loader: - kv  21:            glm4moe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  22:                     glm4moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  23:   glm4moe.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  24:                  glm4moe.expert_used_count u32              = 8
llama_model_loader: - kv  25:               glm4moe.attention.key_length u32              = 128
llama_model_loader: - kv  26:             glm4moe.attention.value_length u32              = 128
llama_model_loader: - kv  27:               glm4moe.rope.dimension_count u32              = 64
llama_model_loader: - kv  28:                       glm4moe.expert_count u32              = 160
llama_model_loader: - kv  29:         glm4moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  30:                glm4moe.expert_shared_count u32              = 1
llama_model_loader: - kv  31:          glm4moe.leading_dense_block_count u32              = 3
llama_model_loader: - kv  32:                 glm4moe.expert_gating_func u32              = 2
llama_model_loader: - kv  33:               glm4moe.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  34:                glm4moe.expert_weights_norm bool             = true
llama_model_loader: - kv  35:               glm4moe.nextn_predict_layers u32              = 1
llama_model_loader: - kv  36:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  37:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  38:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  39:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  40:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  41:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  42:            tokenizer.ggml.padding_token_id u32              = 151330
llama_model_loader: - kv  43:                tokenizer.ggml.bos_token_id u32              = 151331
llama_model_loader: - kv  44:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  45:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  46:                tokenizer.ggml.eom_token_id u32              = 151338
llama_model_loader: - kv  47:                    tokenizer.chat_template str              = {#  Unsloth template fixes  #}[gMASK]...
llama_model_loader: - kv  48:               general.quantization_version u32              = 2
llama_model_loader: - kv  49:                          general.file_type u32              = 30
llama_model_loader: - kv  50:                      quantize.imatrix.file str              = GLM-4.6-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  51:                   quantize.imatrix.dataset str              = unsloth_calibration_GLM-4.6.txt
llama_model_loader: - kv  52:             quantize.imatrix.entries_count u32              = 1000
llama_model_loader: - kv  53:              quantize.imatrix.chunks_count u32              = 51
llama_model_loader: - kv  54:                                   split.no u16              = 0
llama_model_loader: - kv  55:                        split.tensors.count i32              = 1759
llama_model_loader: - kv  56:                                split.count u16              = 0
llama_model_loader: - type  f32:  835 tensors
llama_model_loader: - type q4_K:    8 tensors
llama_model_loader: - type q5_K:  273 tensors
llama_model_loader: - type q6_K:   91 tensors
llama_model_loader: - type iq4_xs:  552 tensors
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151329 ('<|endoftext|>')
load:   - 151336 ('<|user|>')
load:   - 151338 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = glm4moe
llm_load_print_meta: n_ctx_train      = 202752
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 93
llm_load_print_meta: n_head           = 96
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 12
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 160
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 202752
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 355B.A32B
llm_load_print_meta: model ftype      = IQ4_XS - 4.25 bpw
llm_load_print_meta: model params     = 356.786 B
llm_load_print_meta: model size       = 177.585 GiB (4.276 BPW)
llm_load_print_meta: repeating layers = 176.586 GiB (4.270 BPW, 355.234 B parameters)
llm_load_print_meta: general.name     = Glm-4.6
print_info: vocab type       = BPE
print_info: n_vocab          = 151552
print_info: n_merges         = 318088
print_info: BOS token        = 151331 '[gMASK]'
print_info: EOS token        = 151329 '<|endoftext|>'
print_info: EOT token        = 151336 '<|user|>'
print_info: EOM token        = 151338 '<|observation|>'
print_info: UNK token        = 151329 '<|endoftext|>'
print_info: PAD token        = 151330 '[MASK]'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151347 '<|code_prefix|>'
print_info: FIM SUF token    = 151349 '<|code_suffix|>'
print_info: FIM MID token    = 151348 '<|code_middle|>'
print_info: EOG token        = 151329 '<|endoftext|>'
print_info: EOG token        = 151336 '<|user|>'
print_info: EOG token        = 151338 '<|observation|>'
print_info: max token length = 1024
llm_load_tensors: ggml ctx size =   17.77 MiB
Estimated model buffer size per device:
    Device 0:   3521.96 MiB
    Device 1:   3523.24 MiB
    Device 2:      3.65 MiB
    Device 3:      3.65 MiB
    Device 4:      3.65 MiB
    Device 5:      3.65 MiB
llm_load_tensors: offloading 93 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 94/94 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   416.25 MiB
llm_load_tensors: CUDA_Split buffer size =  7061.06 MiB
llm_load_tensors:      CUDA0 buffer size = 23471.95 MiB
llm_load_tensors:      CUDA1 buffer size = 19320.90 MiB
llm_load_tensors:      CUDA2 buffer size = 19320.90 MiB
llm_load_tensors:      CUDA3 buffer size = 19320.90 MiB
llm_load_tensors:      CUDA4 buffer size = 46370.16 MiB
llm_load_tensors:      CUDA5 buffer size = 44438.07 MiB
....................................................................................................
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 1792
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: attn_max_b    = 0
llama_new_context_with_model: fused_moe     = 1
llama_new_context_with_model: grouped er    = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad    = 1
llama_new_context_with_model: rope_cache    = 0
llama_new_context_with_model: graph_reuse   = 0
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: ser           = -1, 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init: CUDA_Split KV buffer size =  2944.13 MiB
llama_kv_cache_init: KV cache size per device:
    Device 0:  1472 MiB
    Device 1:  1472 MiB
    Device 2:  0 MiB
    Device 3:  0 MiB
    Device 4:  0 MiB
    Device 5:  0 MiB
llama_new_context_with_model: KV self size  = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   586.91 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  2142.00 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA5 compute buffer size =   470.15 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    63.02 MiB
llama_new_context_with_model: graph nodes  = 6489
llama_new_context_with_model: graph splits = 587
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: Returning 910 (CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE) from cuGraphExecUpdate_v2
warning: Cuda Runtime API error detected: cudaGraphExecUpdate returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

warning: Cuda Runtime API error detected: cudaGraphExecUpdate returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

warning: Cuda Runtime API error detected: cudaGetLastError returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: Returning 910 (CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE) from cuGraphExecUpdate_v2
warning: Cuda Runtime API error detected: cudaGraphExecUpdate returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

warning: Cuda Runtime API error detected: cudaGraphExecUpdate returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

warning: Cuda Runtime API error detected: cudaGetLastError returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: Returning 910 (CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE) from cuGraphExecUpdate_v2
warning: Cuda Runtime API error detected: cudaGraphExecUpdate returned cudaErrorGraphExecUpdateFailure(CUresult=910): the graph update was not performed because it included changes which violated constraints specific to instantiated graph update

[Thread 0x7ffccd7fe000 (LWP 7373) exited]
[Thread 0x7ffccdfff000 (LWP 7372) exited]
[Thread 0x7ffcedfff000 (LWP 7371) exited]
[Thread 0x7ffd06dde000 (LWP 7370) exited]
[Thread 0x7ffd34dda000 (LWP 7369) exited]
[Thread 0x7ffd355db000 (LWP 7368) exited]
[Thread 0x7ffd36ffd000 (LWP 7367) exited]
[Thread 0x7ffd377fe000 (LWP 7366) exited]
[Thread 0x7ffd37fff000 (LWP 7365) exited]
[Thread 0x7fff74949000 (LWP 7364) exited]
[Thread 0x7fff7514a000 (LWP 7363) exited]
[Thread 0x7fff76dff000 (LWP 7336) exited]
[Thread 0x7fff781ff000 (LWP 7335) exited]
[Thread 0x7ffff7b6c000 (LWP 7331) exited]
[Thread 0x7ffcccffd000 (LWP 7374) exited]
[New process 7331]

Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(cuda-gdb) bt full
No stack.

@Panchovix

Panchovix commented Dec 5, 2025

Wait, there was a debug update 1 hour ago haha, let me update and try again.

EDIT: nope, it's the same.

@ikawrakow
Owner Author

@Panchovix

I don't see a real error in the above. Perhaps you can try

GGML_CUDA_DISABLE_GRAPHS=1 you_command_goes_here

CUDA graphs normally get disabled automatically after a few failed attempts to capture a graph, but it looks like in your case the capture attempt triggers a real error?

One cannot use CUDA graphs with split mode "graph", so I guess I need to fix the code to not even attempt to capture a graph.
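For reference, the environment-variable gate behaves roughly like this (a minimal Python sketch of the GGML_CUDA_DISABLE_GRAPHS semantics; the real check lives in the C++ CUDA backend, and the helper name here is made up):

```python
import os

# Hypothetical helper illustrating the GGML_CUDA_DISABLE_GRAPHS gate
# (assumption: any set value disables CUDA graph capture, as in upstream llama.cpp).
def cuda_graphs_enabled(env=None):
    env = os.environ if env is None else env
    return env.get("GGML_CUDA_DISABLE_GRAPHS") is None

print(cuda_graphs_enabled({"GGML_CUDA_DISABLE_GRAPHS": "1"}))  # False: capture disabled
print(cuda_graphs_enabled({}))                                 # True: capture allowed
```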

@Panchovix

@ikawrakow ran

GGML_CUDA_DISABLE_GRAPHS=1 ./llama-server -m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.6-IQ4_XS.gguf' -c 8192 --no-mmap -ngl 999 -sm graph -ts 100,100,0,0,0,0 -ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14).ffn.=CUDA0" -ot "blk.(15|16|17|18|19|20|21|22|23|24).ffn.=CUDA1" -ot "blk.(25|26|27|28|29|30|31|32|33|34).ffn.=CUDA2" -ot "blk.(35|36|37|38|39|40|41|42|43|44).ffn.=CUDA3" -ot "blk.(45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68).ffn.=CUDA4" -ot "blk.(69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92).ffn.=CUDA5" -mg 0 -ub 1792

And got the same issue I think, but it loaded way faster haha

Estimated model buffer size per device:
    Device 0:   3521.96 MiB
    Device 1:   3523.24 MiB
    Device 2:      3.65 MiB
    Device 3:      3.65 MiB
    Device 4:      3.65 MiB
    Device 5:      3.65 MiB
llm_load_tensors: offloading 93 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 94/94 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   416.25 MiB
llm_load_tensors: CUDA_Split buffer size =  7061.06 MiB
llm_load_tensors:      CUDA0 buffer size = 23471.95 MiB
llm_load_tensors:      CUDA1 buffer size = 19320.90 MiB
llm_load_tensors:      CUDA2 buffer size = 19320.90 MiB
llm_load_tensors:      CUDA3 buffer size = 19320.90 MiB
llm_load_tensors:      CUDA4 buffer size = 46370.16 MiB
llm_load_tensors:      CUDA5 buffer size = 44438.07 MiB
....................................................................................................
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 1792
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: attn_max_b    = 0
llama_new_context_with_model: fused_moe     = 1
llama_new_context_with_model: grouped er    = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad    = 1
llama_new_context_with_model: rope_cache    = 0
llama_new_context_with_model: graph_reuse   = 0
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: ser           = -1, 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init: CUDA_Split KV buffer size =  2944.13 MiB
llama_kv_cache_init: KV cache size per device:
    Device 0:  1472 MiB
    Device 1:  1472 MiB
    Device 2:  0 MiB
    Device 3:  0 MiB
    Device 4:  0 MiB
    Device 5:  0 MiB
llama_new_context_with_model: KV self size  = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   586.91 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  2142.00 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA5 compute buffer size =   470.15 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    63.02 MiB
llama_new_context_with_model: graph nodes  = 6489
llama_new_context_with_model: graph splits = 587
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
CUDA error: an illegal memory access was encountered
  current device: 1, in function ggml_backend_cuda_synchronize at /run/media/pancho/MX500/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:3504
  cudaStreamSynchronize(cuda_ctx->stream())
/run/media/pancho/MX500/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:124: CUDA error
[New LWP 9947]
[New LWP 9946]
[New LWP 9945]
[New LWP 9944]
[New LWP 9943]
[New LWP 9939]
[New LWP 9938]
[New LWP 9937]
[New LWP 9936]
[New LWP 9935]
[New LWP 9934]
[New LWP 9933]
[New LWP 9914]

@ikawrakow
Owner Author

OK, so now you can try running with cuda-gdb.

@Panchovix

@ikawrakow
Ran

GGML_CUDA_DISABLE_GRAPHS=1 cuda-gdb --args ./llama-server   -m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.6
-IQ4_XS.gguf'   -c 8192   --no-mmap   -ngl 999 -sm graph -ts 100,100,0,0,0,0   -ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14).ffn.=CUDA0"   -ot "blk.(15|16|17|18|19|20|21|22|23|24).ff
n.=CUDA1"   -ot "blk.(25|26|27|28|29|30|31|32|33|34).ffn.=CUDA2" -ot "blk.(35|36|37|38|39|40|41|42|43|44).ffn.=CUDA3"   -ot "blk.(45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|6
2|63|64|65|66|67|68).ffn.=CUDA4" -ot "blk.(69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92).ffn.=CUDA5"   -mg 0   -ub 1792

(also did an export GGML_CUDA_DISABLE_GRAPHS=1 before in any case, not sure if it has to be run like that)

Got a different error

llama_new_context_with_model: KV self size  = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   586.91 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  2142.00 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =   452.65 MiB
llama_new_context_with_model:      CUDA5 compute buffer size =   470.15 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    63.02 MiB
llama_new_context_with_model: graph nodes  = 6489
llama_new_context_with_model: graph splits = 587
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context
warning: Cuda Driver error detected: CUDA Stream does not belong to the expected context

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7ffd5b968580  void fused_rms_norm_f32<1024, float>(float const*, float const*, float*, int, float)

Thread 1 "llama-server" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 15, block (0,0,0), thread (928,0,0), device 0, sm 0, warp 31, lane 0]
0x00007ffd5b9685d0 in void fused_rms_norm_f32<1024, float>(float const*, float const*, float*, int, float)<<<(1,1,1),(1024,1,1)>>> ()
(cuda-gdb) bt full
#0  0x00007ffd5b9685d0 in void fused_rms_norm_f32<1024, float>(float const*, float const*, float*, int, float)<<<(1,1,1),(1024,1,1)>>> ()
No symbol table info available.
(cuda-gdb)

@ikawrakow
Owner Author

Thanks! I don't have a hypothesis of what could be wrong, but at least we now know the kernel where the illegal memory access occurs.

Btw., I'm noticing that you are using overrides of the type blk\.(...)\.ffn=CUDAX. This will send the shared experts to these devices as well. But my guess is that it would be better to keep the shared experts on the graph parallel GPUs along with the corresponding attention tensors and KV cache. So, I think your tensor overrides should be

 -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14)\.ffn_(up|gate|down)_exps\.weight=CUDA0"

etc.
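To illustrate the difference between the two override patterns, here is a small Python sketch (the tensor names are assumptions following llama.cpp GGUF conventions, with routed experts named `ffn_*_exps` and shared experts `ffn_*_shexp`):

```python
import re

# Hypothetical tensor names following llama.cpp GGUF naming conventions
# (assumption: routed experts use "_exps", shared experts "_shexp").
tensors = [
    "blk.30.ffn_up_exps.weight",    # routed experts
    "blk.30.ffn_up_shexp.weight",   # shared expert
    "blk.30.attn_q.weight",         # attention
]

broad  = re.compile(r"blk.(25|30).ffn.")                      # original override style
narrow = re.compile(r"blk.(25|30).ffn_(up|gate|down)_exps.")  # suggested override style

for t in tensors:
    print(f"{t}: broad={bool(broad.search(t))} narrow={bool(narrow.search(t))}")
```

The broad pattern also catches the shared-expert tensor, so it would be sent to the override device; the narrow pattern matches only the routed experts.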

@Panchovix

Should it be like that for both the CUDA0 and CUDA1 devices? And thanks for all the work!

@ikawrakow
Owner Author

ikawrakow commented Dec 5, 2025

All of your ffn overrides should be like that. If they are not, shared experts in layers 25...92 will go to CUDA2,3,4,5, but you want them to go to CUDA0 and CUDA1. You only want the routed experts in layers 25...92 to go to CUDA2-5.

@Panchovix

Something like this?

./llama-server \
  -m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.6-IQ4_XS.gguf' \
  -c 8192 \
  --no-mmap \
  -ngl 999 \
  -sm graph \
  -ts 100,100,0,0,0,0 \
  -ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14).ffn_(up|gate|down)_exps.weight=CUDA0" \
  -ot "blk.(15|16|17|18|19|20|21|22|23|24).ffn_(up|gate|down)_exps.weight=CUDA1" \
  -ot "blk.(25|26|27|28|29|30|31|32|33|34).ffn_(up|gate|down)_exps.weight=CUDA2" \
  -ot "blk.(35|36|37|38|39|40|41|42|43|44).ffn_(up|gate|down)_exps.weight=CUDA3" \
  -ot "blk.(45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68).ffn_(up|gate|down)_exps.weight=CUDA4" \
  -ot "blk.(69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92).ffn_(up|gate|down)_exps.weight=CUDA5" \
  -mg 0 \
  -ub 1792

Or did I get confused?

Would that help with default parallel processing as well?

@Panchovix

Wait, using that let me load the model!

@ikawrakow
Owner Author

That should be it.

I don't think this will be better for sm "layer". This is specific to sm "graph". But it doesn't hurt to try and see what happens.

@Panchovix

Panchovix commented Dec 5, 2025

Okay, now it starts to generate! But I got an error while generating.

Ran with

./llama-server \
  -m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.6-IQ4_XS.gguf' \
  -c 8192 \
  --no-mmap \
  -ngl 999 \
  -sm graph \
  -ts 100,100,0,0,0,0 \
  -ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13).ffn_(up|gate|down)_exps.weight=CUDA0" \
  -ot "blk.(14|15|16|17|18|19|20|21|22|23).ffn_(up|gate|down)_exps.weight=CUDA1" \
  -ot "blk.(24|25|26|27|28|29|30|31|32|33|34).ffn_(up|gate|down)_exps.weight=CUDA2" \
  -ot "blk.(35|36|37|38|39|40|41|42|43|44).ffn_(up|gate|down)_exps.weight=CUDA3" \
  -ot "blk.(45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68).ffn_(up|gate|down)_exps.weight=CUDA4" \
  -ot "blk.(69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92).ffn_(up|gate|down)_exps.weight=CUDA5" \
  -mg 0 \
  -ub 1792
llm_load_tensors:  CUDA_Host buffer size =   416.25 MiB
llm_load_tensors: CUDA_Split buffer size =  9875.02 MiB
llm_load_tensors:      CUDA0 buffer size = 21644.55 MiB
llm_load_tensors:      CUDA1 buffer size = 19125.00 MiB
llm_load_tensors:      CUDA2 buffer size = 21037.50 MiB
llm_load_tensors:      CUDA3 buffer size = 19125.00 MiB
llm_load_tensors:      CUDA4 buffer size = 45900.00 MiB
llm_load_tensors:      CUDA5 buffer size = 43987.50 MiB
....................................................................................................
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 1792
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: attn_max_b    = 0
llama_new_context_with_model: fused_moe     = 1
llama_new_context_with_model: grouped er    = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad    = 1
llama_new_context_with_model: rope_cache    = 0
llama_new_context_with_model: graph_reuse   = 0
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: ser           = -1, 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init: CUDA_Split KV buffer size =  2944.13 MiB
llama_kv_cache_init: KV cache size per device:
    Device 0:  1464 MiB
    Device 1:  1480 MiB
    Device 2:  0 MiB
    Device 3:  0 MiB
    Device 4:  0 MiB
    Device 5:  0 MiB
llama_new_context_with_model: KV self size  = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1088.50 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   462.01 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =   400.15 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =   400.15 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =   400.15 MiB
llama_new_context_with_model:      CUDA5 compute buffer size =   400.15 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    63.02 MiB
llama_new_context_with_model: graph nodes  = 7216
llama_new_context_with_model: graph splits = 710
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
INFO [                    init] initializing slots | tid="139989165223936" timestamp=1764950291 n_slots=1
INFO [                    init] new slot | tid="139989165223936" timestamp=1764950291 id_slot=0 n_ctx_slot=8192
prompt cache is enabled, size limit: 8192 MiB
use `--cache-ram 0` to disable the prompt cache
INFO [                    main] model loaded | tid="139989165223936" timestamp=1764950291
INFO [                    main] chat template | tid="139989165223936" timestamp=1764950291 chat_template="{#  Unsloth template fixes  #}[gMASK]<sop>\n{%- if tools -%}\n<|system|>\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{% for tool in tools %}\n{{ tool | tojson|string }}\n{% endfor %}\n</tools>\n\nFor each function call, output the function name and arguments within the following XML format:\n<tool_call>{function-name}\n<arg_key>{arg-key-1}</arg_key>\n<arg_value>{arg-value-1}</arg_value>\n<arg_key>{arg-key-2}</arg_key>\n<arg_value>{arg-value-2}</arg_value>\n...\n</tool_call>{%- endif -%}\n{%- macro visible_text(content) -%}\n    {%- if content is string -%}\n        {{- content }}\n    {%- elif content is iterable and content is not mapping -%}\n        {%- for item in content -%}\n            {%- if item is mapping and item.type == 'text' -%}\n                {{- item.text }}\n            {%- elif item is string -%}\n                {{- item }}\n            {%- endif -%}\n        {%- endfor -%}\n    {%- else -%}\n        {{- content }}\n    {%- endif -%}\n{%- endmacro -%}\n{%- set ns = namespace(last_user_index=-1) %}\n{%- for m in messages %}\n    {%- if m.role == 'user' %}\n        {% set ns.last_user_index = loop.index0 -%}\n    {%- endif %}\n{%- endfor %}\n{% for m in messages %}\n{%- if m.role == 'user' -%}<|user|>\n{%- set content = visible_text(m.content)|string %}{{ content }}\n{{- '/nothink' if (enable_thinking is defined and not enable_thinking and not content.endswith(\"/nothink\")) else '' -}}\n{%- elif m.role == 'assistant' -%}\n<|assistant|>\n{%- set reasoning_content = '' %}\n{%- set content = visible_text(m.content)|string %}\n{%- if m.reasoning_content is defined and m.reasoning_content is string %}\n    {%- set reasoning_content = m.reasoning_content %}\n{%- else %}\n    {#  Unsloth template fixes - must change to for loop 
since llama.cpp will error out if not #}\n    {%- set parts = content.split('</think>') %}\n    {% for part in parts %}\n        {%- if loop.index0 == 0 -%}\n            {%- set reasoning_content = (part.split(\"<think>\")|last) %}\n            {%- set reasoning_content = reasoning_content.lstrip('\\n').rstrip('\\n') -%}\n        {%- else -%}\n            {%- set content = part.lstrip('\\n') %}\n        {%- endif %}\n    {%- endfor %}\n{%- endif %}\n{%- if loop.index0 > ns.last_user_index and reasoning_content -%}\n{{ '\\n<think>' + reasoning_content.strip() +  '</think>'}}\n{%- else -%}\n{{ '\\n<think></think>' }}\n{%- endif -%}\n{%- if content.strip() -%}\n{{ '\\n' + content.strip() }}\n{%- endif -%}\n{% if m.tool_calls %}\n{% for tc in m.tool_calls %}\n{%- if tc.function %}\n    {%- set tc = tc.function %}\n{%- endif %}\n{{ '\\n<tool_call>' + tc.name }}\n{% set _args = tc.arguments %}\n{%- if _args is not mapping -%}\n    {%- set _args = {} %}\n{%- endif -%}\n{% for k, v in _args|items %}\n<arg_key>{{ k }}</arg_key>\n<arg_value>{{ v | tojson|string if v is not string else v }}</arg_value>\n{% endfor %}\n</tool_call>{% endfor %}\n{% endif %}\n{%- elif m.role == 'tool' -%}\n{%- if m.content is string -%}\n{%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n    {{- '<|observation|>' }}\n{%- endif %}\n{{- '\\n<tool_response>\\n' }}\n{{- m.content }}\n{{- '\\n</tool_response>' }}\n{%- else -%}\n<|observation|>{% for tr in m.content %}\n\n<tool_response>\n{{ tr.output if tr.output is defined else tr }}\n</tool_response>{% endfor -%}\n{% endif -%}\n{%- elif m.role == 'system' -%}\n<|system|>\n{{ visible_text(m.content)|string }}\n{%- endif -%}\n{%- endfor -%}\n{%- if add_generation_prompt -%}\n    <|assistant|>{{- '\\n<think></think>' if (enable_thinking is defined and not enable_thinking) else '' -}}\n{%- endif -%}{#  Copyright 2025-present Unsloth. Apache 2.0 License.  #}"
INFO [                    main] chat template | tid="139989165223936" timestamp=1764950291 chat_example="[gMASK]<sop><|system|>\nYou are a helpful assistant<|user|>\nHello<|assistant|>\nHi there<|user|>\nHow are you?<|assistant|>" built_in=true
INFO [                    main] HTTP server listening | tid="139989165223936" timestamp=1764950291 n_threads_http="23" port="8080" hostname="127.0.0.1"
INFO [            update_slots] all slots are idle | tid="139989165223936" timestamp=1764950291
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="139989165223936" timestamp=1764950302 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 =  0, n_past1 =  0, n_past_prompt1 = 0,  n_past2 =  0, n_past_prompt2 =  0
INFO [            update_slots] kv cache rm [p0, end) | tid="139989165223936" timestamp=1764950302 id_slot=0 id_task=0 p0=0
INFO [            update_slots] kv cache rm [p0, end) | tid="139989165223936" timestamp=1764950305 id_slot=0 id_task=0 p0=2048
INFO [            update_slots] kv cache rm [p0, end) | tid="139989165223936" timestamp=1764950308 id_slot=0 id_task=0 p0=4096
CUDA error: unspecified launch failure
  current device: 1, in function ggml_backend_cuda_synchronize at /run/media/pancho/MX500/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:3504
  cudaStreamSynchronize(cuda_ctx->stream())
/run/media/pancho/MX500/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:124: CUDA error

EDIT: nvm it was a prompt issue I think, now it works fine!

@ikawrakow
Owner Author

Cool that it works! I'm curious if it gives you a better performance than the default split mode "layer".

@Panchovix

Okay with graph I get

INFO [           print_timings] prompt eval time     =    6491.99 ms /  4410 tokens (    1.47 ms per token,   679.30 tokens per second) | tid="140009121681408" timestamp=1764950781 id_slot=0 id_task=0 t_prompt_processing=6491.99 n_prompt_tokens_processed=4410 t_token=1.4721065759637189 n_tokens_second=679.2986434051809
INFO [           print_timings] generation eval time =   73806.86 ms /  1878 runs   (   39.30 ms per token,    25.44 tokens per second) | tid="140009121681408" timestamp=1764950781 id_slot=0 id_task=0 t_token_generation=73806.86 n_decoded=1878 t_token=39.300777422790205 n_tokens_second=25.44478927839499

Running a similar command but adapted for non-graph (i.e., putting less on the other GPUs and more on CUDA0 and CUDA1), I get

INFO [           print_timings] prompt eval time     =    3837.94 ms /  4410 tokens (    0.87 ms per token,  1149.06 tokens per second) | tid="140119249375232" timestamp=1764962993 id_slot=0 id_task=0 t_prompt_processing=3837.936 n_prompt_tokens_processed=4410 t_token=0.8702802721088436 n_tokens_second=1149.055117125455
INFO [           print_timings] generation eval time =   65094.79 ms /  1689 runs   (   38.54 ms per token,    25.95 tokens per second) | tid="140119249375232" timestamp=1764962993 id_slot=0 id_task=0 t_token_generation=65094.785 n_decoded=1689 t_token=38.54042924807579 n_tokens_second=25.94677899312518

So sadly in my case it's a bit slower, but I'm not surprised either, as too many GPUs are running on chipset lanes (I hope in a few weeks to help myself a little with some switches on the CPU slots)
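As a quick sanity check, the throughput numbers in the two runs above follow directly from the reported times and token counts (values copied from the logs):

```python
# Convert total eval time in ms and token count into tokens per second,
# reproducing the t/s figures printed by llama-server above.
def tok_per_s(ms_total, n_tokens):
    return n_tokens / (ms_total / 1000.0)

print(round(tok_per_s(6491.99, 4410), 2))   # graph PP: 679.3 t/s
print(round(tok_per_s(3837.936, 4410), 2))  # layer PP: 1149.06 t/s
```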

@magikRUKKOLA

magikRUKKOLA commented Dec 6, 2025

@ikawrakow

You're doing mega work, bravo to the genius! (translated from Bulgarian)

Sorry to bother you, but what could be the next steps in case one has a bunch of RTX 3090s? Unfortunately, right now I only have a 3-GPU system, but pretty soon we will build a quad-RTX-3090 rig with water cooling and DDR5 4800 MT/s. We will have the NVLinks as well, for 96 GB of VRAM total (the Xeon QYFS does not support more than four PCIe v4 x16 links anyway). Would it be possible to use speculative decoding and offload it to a LAN-connected machine (which would be a simple two-GPU rig with no RAM whatsoever)?

@magikRUKKOLA

magikRUKKOLA commented Dec 6, 2025

@ikawrakow

I'm curious if it gives you a better performance than the default split mode "layer".

template command:

/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-sweep-bench \
    --warmup-batch \
    --model /opt/THIREUS/GLM-4.6-5.4976bpw/GLM-4.6-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01760.gguf \
    --alias THIREUS/GLM-4.6-5.4976bpw \
    --ctx-size $((64 * 1024)) \
    -b $((4 * 1024)) -ub $((4 * 1024)) \
    --mlock \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.1 \
    -ctk q6_0 \
    -ctv q6_0 \
    -khad \
    -amb 512 \
    --split-mode layer \
    -ts 100,100,0 \
    --main-gpu 2 \
    -ot "blk.(0|1|2|3|4|5).ffn_(up|gate|down)_exps.weight=CUDA0" \
    -ot "blk.(6|7|8|9).ffn_(up|gate|down)_exps.weight=CUDA1" \
    -ot "blk.(10|11|12|13|14|15|16|17).ffn_(up|gate|down)_exps.weight=CUDA2" \
    -gr \
    -ger \
    --cpu-moe \
    --merge-qkv \
    --n-gpu-layers 99 \
    --threads $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
    --host 0.0.0.0 \
    --port 8080 \
    --log-enable \
    --logdir /var/log/ \
    --jinja \
    --verbosity 1 \
    --verbose-prompt \
    --reasoning-format auto \
    --prompt-cache "$HOME/.cache/ik_llama.cpp/prompt-cache.bin" --prompt-cache-all \
    --slot-save-path "$HOME/.cache/ik_llama.cpp/slot.bin" \
    --lookup-cache-dynamic "$HOME/.cache/ik_llama.cpp/slot.bin" \
    --keep -1 \
    --slot-prompt-similarity 0.35 \
    --metrics \
    -cuda fusion=1

Note: -ts 1,1,1 with -ot "blk.(10|11|12|13|14|15).ffn_(up|gate|down)_exps.weight=CUDA2" and --main-gpu 1 was used for the layer and graph data points.
-ts 100,100,0 was used for the layer zero-split and graph zero-split data points, respectively.
--split-mode graph was used for all graph* data points.
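As a side note on how the -ts values behave, they are relative weights that get normalized into per-device fractions (a minimal sketch under the assumption that ik_llama.cpp normalizes them the same way upstream llama.cpp does; the point of this PR is that zero entries are now accepted):

```python
# Relative -ts weights normalized to per-device fractions; zero entries
# yield devices that receive no split work (assumption: values are treated
# as relative weights, as in upstream llama.cpp).
def split_fractions(ts):
    total = sum(ts)
    return [t / total for t in ts]

print(split_fractions([100, 100, 0]))  # [0.5, 0.5, 0.0]
print(split_fractions([1, 1, 1]))      # even three-way split
```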

GRAPHS

(figures prefill-combined2 and decode-combined2: prefill and decode throughput vs. context depth; images not reproduced here)

*related: #1026

data:

Details
/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-server  --version
version: 4046 (a3737f42)
built with cc (Debian 15.2.0-7) 15.2.0 for x86_64-linux-gnu

File: bench-v4036-3gpu-graph-v4046-default-split.log

[1764974374] 
[1764974374] main: n_kv_max = 65536, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 64, n_threads_batch = 64
[1764974374] 
[1764974374] |    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
[1764974374] |-------|--------|--------|----------|----------|----------|----------|
[1764974538] |  4096 |   1024 |      0 |   18.279 |   224.08 |  125.894 |     8.13 |
[1764974684] |  4096 |   1024 |   4096 |   18.506 |   221.34 |  128.038 |     8.00 |
[1764974833] |  4096 |   1024 |   8192 |   18.870 |   217.06 |  129.708 |     7.89 |
[1764974985] |  4096 |   1024 |  12288 |   19.265 |   212.61 |  132.654 |     7.72 |
[1764975140] |  4096 |   1024 |  16384 |   19.675 |   208.18 |  134.879 |     7.59 |
[1764975297] |  4096 |   1024 |  20480 |   20.073 |   204.06 |  137.259 |     7.46 |
[1764975458] |  4096 |   1024 |  24576 |   20.418 |   200.61 |  140.695 |     7.28 |
[1764975622] |  4096 |   1024 |  28672 |   20.871 |   196.25 |  143.578 |     7.13 |
[1764975790] |  4096 |   1024 |  32768 |   21.273 |   192.54 |  146.603 |     6.98 |
[1764975962] |  4096 |   1024 |  36864 |   21.723 |   188.55 |  149.909 |     6.83 |
[1764976137] |  4096 |   1024 |  40960 |   22.144 |   184.97 |  152.829 |     6.70 |
[1764976315] |  4096 |   1024 |  45056 |   22.645 |   180.87 |  155.116 |     6.60 |
[1764976496] |  4096 |   1024 |  49152 |   23.091 |   177.38 |  158.576 |     6.46 |
[1764976681] |  4096 |   1024 |  53248 |   23.544 |   173.97 |  160.983 |     6.36 |
[1764976869] |  4096 |   1024 |  57344 |   23.985 |   170.78 |  164.085 |     6.24 |
[1764977060] |  4096 |   1024 |  61440 |   24.361 |   168.13 |  167.178 |     6.13 |

File: bench-v4036-3gpu-graph-v4046-empty-splits-enabled-q6_0kv_khad.log

[1764997490] 
[1764997490] main: n_kv_max = 65536, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 64, n_threads_batch = 64
[1764997490] 
[1764997490] |    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
[1764997490] |-------|--------|--------|----------|----------|----------|----------|
[1764997646] |  4096 |   1024 |      0 |   17.344 |   236.17 |  119.963 |     8.54 |
[1764997788] |  4096 |   1024 |   4096 |   17.807 |   230.02 |  123.480 |     8.29 |
[1764997934] |  4096 |   1024 |   8192 |   18.258 |   224.33 |  128.195 |     7.99 |
[1764998085] |  4096 |   1024 |  12288 |   18.842 |   217.39 |  132.455 |     7.73 |
[1764998240] |  4096 |   1024 |  16384 |   19.339 |   211.80 |  135.702 |     7.55 |
[1764998400] |  4096 |   1024 |  20480 |   19.921 |   205.61 |  139.193 |     7.36 |
[1764998563] |  4096 |   1024 |  24576 |   20.435 |   200.44 |  142.789 |     7.17 |
[1764998731] |  4096 |   1024 |  28672 |   21.083 |   194.28 |  146.763 |     6.98 |
[1764998903] |  4096 |   1024 |  32768 |   21.617 |   189.48 |  150.603 |     6.80 |
[1764999080] |  4096 |   1024 |  36864 |   22.148 |   184.94 |  154.534 |     6.63 |
[1764999260] |  4096 |   1024 |  40960 |   22.788 |   179.74 |  157.813 |     6.49 |
[1764999445] |  4096 |   1024 |  45056 |   23.565 |   173.82 |  161.636 |     6.34 |
[1764999635] |  4096 |   1024 |  49152 |   23.901 |   171.37 |  165.573 |     6.18 |
[1764999829] |  4096 |   1024 |  53248 |   24.458 |   167.47 |  169.944 |     6.03 |
[1765000028] |  4096 |   1024 |  57344 |   24.988 |   163.92 |  173.585 |     5.90 |
[1765000230] |  4096 |   1024 |  61440 |   25.967 |   157.74 |  176.321 |     5.81 |

File: bench-v4036-3gpu-layer-v4046-default-split.log

[1764977132] 
[1764977132] main: n_kv_max = 65536, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 64, n_threads_batch = 64
[1764977132] 
[1764977132] |    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
[1764977132] |-------|--------|--------|----------|----------|----------|----------|
[1764977277] |  4096 |   1024 |      0 |   17.307 |   236.66 |  108.144 |     9.47 |
[1764977410] |  4096 |   1024 |   4096 |   18.249 |   224.45 |  115.126 |     8.89 |
[1764977553] |  4096 |   1024 |   8192 |   19.460 |   210.48 |  123.198 |     8.31 |
[1764977704] |  4096 |   1024 |  12288 |   20.626 |   198.59 |  131.122 |     7.81 |
[1764977865] |  4096 |   1024 |  16384 |   21.782 |   188.04 |  138.986 |     7.37 |
[1764978035] |  4096 |   1024 |  20480 |   23.031 |   177.85 |  146.550 |     6.99 |
[1764978213] |  4096 |   1024 |  24576 |   24.078 |   170.11 |  154.555 |     6.63 |
[1764978401] |  4096 |   1024 |  28672 |   25.495 |   160.66 |  162.088 |     6.32 |
[1764978597] |  4096 |   1024 |  32768 |   26.612 |   153.92 |  169.389 |     6.05 |
[1764978802] |  4096 |   1024 |  36864 |   27.693 |   147.91 |  177.460 |     5.77 |
[1764979017] |  4096 |   1024 |  40960 |   29.091 |   140.80 |  185.267 |     5.53 |
[1764979240] |  4096 |   1024 |  45056 |   30.622 |   133.76 |  192.732 |     5.31 |
[1764979472] |  4096 |   1024 |  49152 |   31.481 |   130.11 |  200.425 |     5.11 |
[1764979713] |  4096 |   1024 |  53248 |   32.590 |   125.68 |  208.731 |     4.91 |
[1764979963] |  4096 |   1024 |  57344 |   33.972 |   120.57 |  216.114 |     4.74 |
[1764980223] |  4096 |   1024 |  61440 |   35.522 |   115.31 |  224.076 |     4.57 |

File: bench-v4036-3gpu-layer-v4046-empty-splits-enabled-q6_0kv_khad.log

[1765001388] 
[1765001388] main: n_kv_max = 65536, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 64, n_threads_batch = 64
[1765001388] 
[1765001388] |    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
[1765001388] |-------|--------|--------|----------|----------|----------|----------|
[1765001531] |  4096 |   1024 |      0 |   16.757 |   244.43 |  107.331 |     9.54 |
[1765001664] |  4096 |   1024 |   4096 |   17.816 |   229.90 |  115.269 |     8.88 |
[1765001806] |  4096 |   1024 |   8192 |   19.030 |   215.24 |  123.314 |     8.30 |
[1765001957] |  4096 |   1024 |  12288 |   20.226 |   202.51 |  130.895 |     7.82 |
[1765002116] |  4096 |   1024 |  16384 |   21.365 |   191.72 |  137.658 |     7.44 |
[1765002284] |  4096 |   1024 |  20480 |   22.627 |   181.02 |  145.729 |     7.03 |
[1765002461] |  4096 |   1024 |  24576 |   23.584 |   173.67 |  152.946 |     6.70 |
[1765002647] |  4096 |   1024 |  28672 |   25.078 |   163.33 |  160.838 |     6.37 |
[1765002842] |  4096 |   1024 |  32768 |   26.195 |   156.37 |  168.420 |     6.08 |
[1765003044] |  4096 |   1024 |  36864 |   27.030 |   151.54 |  175.520 |     5.83 |
[1765003256] |  4096 |   1024 |  40960 |   28.695 |   142.74 |  183.296 |     5.59 |
[1765003477] |  4096 |   1024 |  45056 |   30.137 |   135.91 |  190.715 |     5.37 |
[1765003706] |  4096 |   1024 |  49152 |   30.911 |   132.51 |  197.959 |     5.17 |
[1765003944] |  4096 |   1024 |  53248 |   32.019 |   127.93 |  206.203 |     4.97 |
[1765004190] |  4096 |   1024 |  57344 |   33.136 |   123.61 |  213.206 |     4.80 |
[1765004445] |  4096 |   1024 |  61440 |   34.833 |   117.59 |  220.242 |     4.65 |
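
For quick side-by-side comparisons of two sweep-bench logs like the ones above, something along these lines can be used (a hedged helper, not part of the repo; it assumes the `|`-separated row format shown in these logs, with `N_KV` in the third data column and `S_TG` in the last):

```python
import re

def parse_sweep_rows(text):
    """Parse llama-sweep-bench table rows: | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |"""
    rows = []
    for line in text.splitlines():
        # Drop the leading "[timestamp]" cell and the trailing empty cell.
        cells = [c.strip() for c in line.split("|")[1:-1]]
        # Keep only data rows (header and divider rows have no numeric N_KV).
        if len(cells) == 7 and re.fullmatch(r"\d+", cells[2]):
            rows.append({"n_kv": int(cells[2]),
                         "s_pp": float(cells[4]),
                         "s_tg": float(cells[6])})
    return rows

def tg_speedup(base_rows, test_rows):
    """Mean S_TG ratio (test vs base) at matching N_KV points."""
    base = {r["n_kv"]: r["s_tg"] for r in base_rows}
    ratios = [r["s_tg"] / base[r["n_kv"]] for r in test_rows if r["n_kv"] in base]
    return sum(ratios) / len(ratios)
```

Feeding it the `graph` and `layer` logs above would show, e.g., graph mode trailing layer mode on TG at short context but overtaking it at long context.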

system info

Details
Detected 3 NVIDIA GPU(s)

NVOC System Report
==================
System: Linux xxx 6.16.12+deb14+1-amd64 x86_64
Driver Version: 580.105.08

System Temperatures:
  CPU: 68.0°C [OK] (Nominal: 35.0°C, Warn: 97.0°C, Crit: 100.0°C)
  RAM: 73.0°C [OK] (Nominal: 35.0°C, Warn: 81.0°C, Crit: 83.0°C)
  VR:  80.0°C [OK] (Nominal: 35.0°C, Warn: 115.0°C, Crit: 120.0°C)

GPU 0: NVIDIA GeForce RTX 3090
------------------------------------------------
  PCI Bus ID:        00000000:41:00.0
  VBIOS Version:     94.02.4B.00.0B
  Persistence Mode:  Enabled
  Core Temperature:  67°C
  Power Usage:       220W
  Current Power Limit: 400W
  Power Limits:      Default: 350W, Min: 100W, Max: 400W
  GPU Clock:         2025 MHz
  VRAM Clock:        10251 MHz
  GPU Utilization:   24%
  VRAM Utilization:  18%
  Memory Usage:      22.2 / 24.0 GB
  Applied Offsets:   GPU: 100 MHz, VRAM: 1500 MHz

GPU 1: NVIDIA GeForce RTX 3090
------------------------------------------------
  PCI Bus ID:        00000000:42:00.0
  VBIOS Version:     94.02.4B.00.0B
  Persistence Mode:  Enabled
  Core Temperature:  52°C
  Power Usage:       202W
  Current Power Limit: 400W
  Power Limits:      Default: 350W, Min: 100W, Max: 400W
  GPU Clock:         2070 MHz
  VRAM Clock:        10251 MHz
  GPU Utilization:   22%
  VRAM Utilization:  16%
  Memory Usage:      22.8 / 24.0 GB
  Applied Offsets:   GPU: 100 MHz, VRAM: 1500 MHz

GPU 2: NVIDIA GeForce RTX 3090
------------------------------------------------
  PCI Bus ID:        00000000:61:00.0
  VBIOS Version:     94.02.4B.00.0B
  Persistence Mode:  Enabled
  Core Temperature:  57°C
  Power Usage:       160W
  Current Power Limit: 400W
  Power Limits:      Default: 350W, Min: 100W, Max: 400W
  GPU Clock:         2040 MHz
  VRAM Clock:        10251 MHz
  GPU Utilization:   5%
  VRAM Utilization:  3%
  Memory Usage:      22.3 / 24.0 GB
  Applied Offsets:   GPU: 100 MHz, VRAM: 1500 MHz


Peer-to-Peer (P2P) Support Matrix:
=================================
GPU 0 -> GPU 1: Supported
GPU 0 -> GPU 2: Supported
GPU 1 -> GPU 0: Supported
GPU 1 -> GPU 2: Supported
GPU 2 -> GPU 0: Supported
GPU 2 -> GPU 1: Supported

a dodgy tool to generate the graphs:

Details

File: utils/generate_svgs.sh

#!/bin/bash

# Script to generate decode.svg and prefill.svg from benchmark logs
# Usage: ./generate_svgs.sh [layer_log] [graph_log]

LAYER_LOG="${1:-/opt/ubergarm/GLM-4.5-Air-GGUF/IQ1_KT/bench-sm-layer-f16.log}"
GRAPH_LOG="${2:-/opt/ubergarm/GLM-4.5-Air-GGUF/IQ1_KT/bench-sm-graph-f16.log}"

# Colors: red for layer, blackish for graph (as requested)
LAYER_COLOR="#ea4612"
GRAPH_COLOR="#333333"

if [ ! -f "$LAYER_LOG" ] || [ ! -f "$GRAPH_LOG" ]; then
    echo "Error: Log files not found!"
    exit 1
fi

create_chart() {
    local chart_type="$1"
    local col output_file title

    if [ "$chart_type" = "decode" ]; then
        col=8  # S_TG t/s column (after removing spaces)
        title="Decode Speed Comparison (S_TG t/s)"
        output_file="decode.svg"
    else
        col=6  # S_PP t/s column (after removing spaces)
        title="Prefill Speed Comparison (S_PP t/s)"
        output_file="prefill.svg"
    fi

    echo "Generating $output_file..."

    # Extract data - lines with numeric N_KV values in column 4
    awk -v col_num="$col" -F'|' '$4 ~ /^[[:space:]]*[0-9]/ {
        gsub(/ /, "");
        print $4, $(col_num)
    }' "$LAYER_LOG" | sort -n > "/tmp/layer_${chart_type}.dat"

    awk -v col_num="$col" -F'|' '$4 ~ /^[[:space:]]*[0-9]/ {
        gsub(/ /, "");
        print $4, $(col_num)
    }' "$GRAPH_LOG" | sort -n > "/tmp/graph_${chart_type}.dat"

    echo "  Layer data points: $(wc -l < "/tmp/layer_${chart_type}.dat")"
    echo "  Graph data points: $(wc -l < "/tmp/graph_${chart_type}.dat")"

    # Pass parameters to the Python below via environment variables.
    # Note: the Python reads lowercase names, so map the uppercase bash
    # color variables explicitly (exporting unset lowercase names would
    # silently fall back to Python's defaults).
    export chart_type title output_file
    export layer_color="$LAYER_COLOR" graph_color="$GRAPH_COLOR"

    python3 << 'PYEOF'
import sys, os
import math

title = os.environ.get('title', 'Chart')
layer_color = os.environ.get('layer_color', '#ea4612')
graph_color = os.environ.get('graph_color', '#333333')
chart_type = os.environ.get('chart_type', 'decode')
output_file = os.environ.get('output_file', 'output.svg')

# Read layer data
layer_nkv = []
layer_speed = []
with open(f"/tmp/layer_{chart_type}.dat", "r") as f:
    for line in f:
        parts = line.strip().split()
        if len(parts) >= 2:
            layer_nkv.append(float(parts[0]))
            layer_speed.append(float(parts[1]))

# Read graph data
graph_nkv = []
graph_speed = []
with open(f"/tmp/graph_{chart_type}.dat", "r") as f:
    for line in f:
        parts = line.strip().split()
        if len(parts) >= 2:
            graph_nkv.append(float(parts[0]))
            graph_speed.append(float(parts[1]))

if not layer_nkv or not graph_nkv:
    print("Error: No valid data points found", file=sys.stderr)
    sys.exit(1)

# Calculate ranges
max_kv = max(max(layer_nkv), max(graph_nkv))
min_s = min(min(layer_speed), min(graph_speed))
max_s = max(max(layer_speed), max(graph_speed))
range_s = max_s - min_s

if range_s == 0:
    range_s = max_s * 0.1

# Add small padding to the speed range
min_s -= 0.05 * range_s
max_s += 0.05 * range_s
range_s = max_s - min_s

def scale_x(nkv):
    return 60 + (nkv / max_kv) * 720

def scale_y(speed):
    if range_s == 0:
        return 190
    return 40 + ((max_s - speed) / range_s) * 300

# Generate SVG to temp file
with open(f"/tmp/{output_file}", 'w') as svg_out:
    svg_out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    svg_out.write('<svg width="800" height="400" viewBox="0 0 800 400" xmlns="http://www.w3.org/2000/svg">\n')

    # Background and title
    svg_out.write(f'  <rect width="800" height="400" fill="#f8f9fa"/>\n')
    svg_out.write(f'  <text x="400" y="25" text-anchor="middle" font-family="Arial, sans-serif" font-size="16" font-weight="bold" fill="#212529">{title}</text>\n')
    svg_out.write('  <text x="30" y="200" text-anchor="middle" font-family="Arial, sans-serif" font-size="12" fill="#495057" transform="rotate(-90 30 200)">Speed (tokens/second)</text>\n')
    svg_out.write('  <text x="400" y="385" text-anchor="middle" font-family="Arial, sans-serif" font-size="12" fill="#495057">N_KV (cache tokens)</text>\n')

    # Calculate Y-axis labels with intelligent spacing
    range_val = max_s - min_s

    # Determine appropriate step size
    if range_val <= 5:
        step = 1
    elif range_val <= 10:
        step = 2
    elif range_val <= 20:
        step = 4
    elif range_val <= 50:
        step = 8
    else:
        # For larger ranges, use steps of 16 or more
        base_step = max(1, int(range_val / 6))
        # Round to nearest nice number (10, 20, 25, 50, etc.)
        if base_step <= 5:
            step = 4
        elif base_step <= 15:
            step = 16
        else:
            step = max(8, int(base_step / 2) * 3)

    # Generate labels
    label_min = math.floor(min_s / step) * step
    label_max = math.ceil(max_s / step) * step
    labels = []

    v = label_min
    while v <= label_max:
        if min_s - step <= v <= max_s + step:  # Extended range for better coverage
            labels.append(v)
        v += step

    # Grid lines (horizontal at Y-axis label positions)
    svg_out.write('  <g stroke="#e9ecef" stroke-width="1">\n')

    # Horizontal grid lines
    for speed_val in labels:
        y_pos = scale_y(speed_val)
        if 40 <= y_pos <= 340:
            svg_out.write(f'    <line x1="60" y1="{y_pos:.1f}" x2="780" y2="{y_pos:.1f}"/>\n')

    # Vertical grid lines (keep original positions)
    svg_out.write('    <line x1="240" y1="40" x2="240" y2="340"/>\n')
    svg_out.write('    <line x1="420" y1="40" x2="420" y2="340"/>\n')
    svg_out.write('    <line x1="600" y1="40" x2="600" y2="340"/>\n')

    svg_out.write('  </g>\n')

    # Axes
    svg_out.write('  <g stroke="#212529" stroke-width="1.5">\n')
    svg_out.write('    <line x1="60" y1="340" x2="780" y2="340"/>\n')
    svg_out.write('    <line x1="60" y1="40" x2="60" y2="340"/>\n')
    svg_out.write('  </g>\n')

    # Plot border
    svg_out.write('  <rect x="60" y="40" width="720" height="300" fill="none" stroke="#6c757d" stroke-width="1.5" opacity="0.7"/>\n')

    # Axis labels
    svg_out.write('  <g font-family="Arial, sans-serif" font-size="10" fill="#495057">\n')

    # X-axis labels (N_KV values)
    svg_out.write('    <text x="60" y="360" text-anchor="middle">0</text>\n')

    if max_kv >= 10000:
        svg_out.write(f'    <text x="240" y="360" text-anchor="middle">{int(max_kv / 4 / 1000)}K</text>\n')
        svg_out.write(f'    <text x="420" y="360" text-anchor="middle">{int(max_kv / 2 / 1000)}K</text>\n')
        svg_out.write(f'    <text x="600" y="360" text-anchor="middle">{int(3 * max_kv / 4 / 1000)}K</text>\n')
        svg_out.write(f'    <text x="780" y="360" text-anchor="middle">{int(max_kv / 1000)}K</text>\n')
    else:
        svg_out.write(f'    <text x="240" y="360" text-anchor="middle">{int(max_kv / 4)}</text>\n')
        svg_out.write(f'    <text x="420" y="360" text-anchor="middle">{int(max_kv / 2)}</text>\n')
        svg_out.write(f'    <text x="600" y="360" text-anchor="middle">{int(3 * max_kv / 4)}</text>\n')
        svg_out.write(f'    <text x="780" y="360" text-anchor="middle">{int(max_kv)}</text>\n')

    # Y-axis labels
    for speed_val in labels:
        y_pos = scale_y(speed_val)
        if 40 <= y_pos <= 340:  # Only draw within plot area
            svg_out.write(f'    <text x="50" y="{y_pos:.1f}" text-anchor="end">{int(speed_val) if speed_val.is_integer() else speed_val}</text>\n')

    svg_out.write('  </g>\n')

    # Layer polyline
    points = []
    for i in range(len(layer_nkv)):
        points.append(f"{scale_x(layer_nkv[i]):.1f},{scale_y(layer_speed[i]):.1f}")
    svg_out.write(f'  <!-- layer -->\n')
    svg_out.write(f'  <polyline fill="none" stroke="{layer_color}" stroke-width="2.5" points="{" ".join(points)}"/>\n')

    # Graph polyline
    points = []
    for i in range(len(graph_nkv)):
        points.append(f"{scale_x(graph_nkv[i]):.1f},{scale_y(graph_speed[i]):.1f}")
    svg_out.write(f'  <!-- graph -->\n')
    svg_out.write(f'  <polyline fill="none" stroke="{graph_color}" stroke-width="2.5" points="{" ".join(points)}"/>\n')

    # Legend
    svg_out.write('  <g font-family="Arial, sans-serif" font-size="11">\n')
    svg_out.write(f'    <rect x="540" y="45" width="12" height="3" fill="{layer_color}" rx="1"/>\n')
    svg_out.write('    <text x="558" y="50" fill="#212529">layer</text>\n')
    svg_out.write(f'    <rect x="660" y="45" width="12" height="3" fill="{graph_color}" rx="1"/>\n')
    svg_out.write('    <text x="678" y="50" fill="#212529">graph</text>\n')
    svg_out.write('  </g>\n')

    svg_out.write('</svg>\n')
PYEOF

    if [ $? -eq 0 ]; then
        mv -f "/tmp/${output_file}" "$output_file"
        echo "  Done: $output_file ($(wc -l < /tmp/layer_${chart_type}.dat) data points, max_kv=$(tail -1 /tmp/layer_${chart_type}.dat | awk '{print $1}'))"
    fi
}

# Generate both charts
create_chart decode
create_chart prefill

rm -f /tmp/layer_*.dat /tmp/graph_*.dat 2>/dev/null
ls -lh decode.svg prefill.svg 2>/dev/null || echo "SVG files not created"
echo

@ikawrakow
Owner Author

@magikRUKKOLA

Thank you for these. So, it looks like for short contexts one gains a bit by using 2 GPUs with split mode "graph", but for long contexts having all 3 GPUs in the graph split clearly helps.

@magikRUKKOLA

magikRUKKOLA commented Dec 6, 2025

@ikawrakow

Does that mean that more GPUs give better overall performance (with the exception of short contexts of ~12k or so, as shown above)? Will it potentially scale to 4 or 8 GPUs?

Can we expect such a support for other models like DeepSeek and K2 in the future?

@ikawrakow
Owner Author

@ikawrakow

Does that mean the more the GPUs the better overall performance (with an exception of the short context of 12k or so as shown above)?

Can we expect such a support for other models like DeepSeek and K2 in the future?

It is a matter of the balance between reducing computation time through parallelism and the added latency from synchronization and copying data between GPUs. We have seen that for some people sm "graph" is already slower with 2 GPUs. In your case, 3 GPUs become faster at a context length that is short enough to be useful in practice. Where the break-even point for 4 GPUs lies will depend on the system. Btw, when you get the 4-GPU box ready, we can use tailscale for me to log in remotely and experiment when you are not using it.

Concerning the K2 and DeepSeek architecture: that is much harder, with an outcome that is not easy to predict. I wrote more about that here. The DeepSeek self-attention is designed so that performance degrades more slowly with increasing context length, but that also makes it much harder to parallelize.

I have implemented split mode "graph" for LLaMA and derivatives, Qwen3-MoE, and GLM-4.5/4.6/Air. It is not hard to add other architectures; I just need a request from a user willing to test. I cannot download and test models for every arch supported by ik_llama.cpp, that's simply too much.

@magikRUKKOLA

@ikawrakow

Btw, when you get the 4-GPU box ready, we can use tailscale for me to login remotely and experiment when you are not using it.

Okay, good to know. Yeah, I was thinking of setting up a pam.d hook so that everything dumps to storage and shuts down when you log in, etc.

But man, it will take some time to build the water cooling properly. In the meantime I have a bunch of risers lying around, so I was thinking of using them to build a temporary quad-GPU rig while we wait for the water cooling etc.

@Ph0rk0z

Ph0rk0z commented Dec 6, 2025

So, as an aside, I have revisited NCCL splits in exllama v3. One core is used per process, utilization is 99%, and the cards consume about 240W each. On Mistral 123B I get over 450 t/s prompt processing and about 16 t/s output. The transfers reach about 5-6 GB/s on all GPUs, so I am not constrained by PCIe traffic in the most likely case. I'll have to test llama-70b, see if I get prompt processing speeds similar to what you all got, and do 2x vs 4x, etc. Apples to oranges, I know, but my point is that it is possible, and that this is a software limitation.

If magik's new rig is also a single-proc/single-NUMA system, the offer to test still stands. Tailscale lets me log in with GitHub, so if you guys find a way to share that, I'm open to it whenever you get the urge to play with NUMA.

@magikRUKKOLA

magikRUKKOLA commented Dec 6, 2025

@Ph0rk0z

If magik's new rig is also single single proc/numa,

Yes, it is. It's an AMD Threadripper PRO 3995WX.

@ikawrakow
Owner Author

@Ph0rk0z

Without having a corresponding performance data point with ik_llama.cpp, the 450 t/s PP or 16 t/s TG for Mistral 123B doesn't tell us much. Could you either run Mistral 123B with ik_llama.cpp, or run exllama v3 with Llama-70B? Thanks!

@Ph0rk0z

Ph0rk0z commented Dec 6, 2025

Yep, that's the plan. I have 70b llamas I can run.

@ikawrakow
Owner Author

@Ph0rk0z

Yep, that's the plan. I have 70b llamas I can run.

I'm dying to have the results of these benchmarks...

@magikRUKKOLA

@ikawrakow

I'm dying to have the results of these benchmarks...

Air-cooled 4xRTX 3090 perhaps could be relevant in this situation? ( #1029 (comment) )

@ikawrakow
Owner Author

@magikRUKKOLA

Yes, sure, the 4x3090 box is interesting to play with. But what I'm extremely curious to know right now is the speed of ik_llama.cpp vs exllama-v3 on @Ph0rk0z' system for the same model.

The 4x3090 box is ready? Give me instructions what I need to do to log in, along with time slots when I can use it.

@magikRUKKOLA

@ikawrakow

The 4x3090 box is ready?

Well ... actually no. :) But all the hardware for it is just lying around, hence the question.
I just needed the confirmation from you, so now I can build it. It will take a couple of days max.

@Ph0rk0z

Ph0rk0z commented Dec 8, 2025

So, quick question. I have an EXL3 Qwen-72B and a GGUF Qwen-72B of similar file size. Does graph mode work for that, or only for GLM and Llama-2? There's also no sweep-bench in exllama, but I can take chats at various context lengths and reroll with both. I have ~1k, 2k-4k, and 22k conversations I can have it crunch from a fresh KV. I want to take out as many variables as possible without downloading the same model twice, or you'll be waiting a while.

@ikawrakow
Owner Author

Split mode "graph" does not work for a dense Qwen model (is it Qwen2 or Qwen3?). It doesn't take much to make it work, but you can start by comparing speed between EXL3 with TP and ik_llama.cpp with split mode "layer". We already know that on your box split mode "graph" is slower than split mode "layer". Hence, you would basically be comparing the best EXL3 performance with the best ik_llama.cpp performance. Ha, unless EXL3 with TP is also slower than without TP. Can you actually run EXL3 both with and without TP? If yes, that on its own is an interesting data point.

@Ph0rk0z

Ph0rk0z commented Dec 8, 2025

Yea, I can run it without TP and it's obviously slower there. Same thing, higher PP but lower TG. EXL3 has standard and NCCL TP as well.

(screenshot: list of EXL3 models)

It's also relatively new, and I have a lot of EXL2 models that only have the low-PP TP. The Agatha command-r tune I have both as Q4_K_M and, as of this morning, as 5.0bpw EXL3. Unless you think running L2 vs L3.1 is fine.

@ikawrakow
Owner Author

Unless you think running L2 vs L3.1 is fine.

You want to run the greatest EXL, no? Else tomorrow you will tell me "Oh, we only compared with EXL2, but EXL3 is so much better". Or you will tell me "Oh, but we only compared against slow EXL3 TP, but the fast EXL3 TP is so much faster".

@Ph0rk0z

Ph0rk0z commented Dec 8, 2025

Ok, I ran what I have and here are the results. The Kimi was their original reasoning version, so sometimes it goes schizo and starts doing that. I don't think anyone has added support for the new Cohere here, or I'd do Agatha vs Agatha for a closer test with LARGE models.

4xGPU

       CUDA_VISIBLE_DEVICES=0,1,2,3 ./bin/llama-server \
    -m GGUF/Tess-v2.5.2-Qwen2-72B-GGUF/Tess-v2.5.2-Qwen2-72B-Q4_K_M-00001-of-00002.gguf \
    -t 48 \
    -c 65536 \
    -fa on \
    -ts 25,25,25,25 \
    -sm layer \
    -ctk q8_0 \
    -ctv q8_0 \
    --no-mmap \
    -ngl 99 \
    -gr \
    -cuda fusion=1

Tess-v2.5.2-Qwen2-72B-GGUF 50.7gb


print_timings] prompt eval time     =   31398.68 ms / 20531 tokens (    1.53 ms per token,   653.88 tokens per second)
print_timings] generation eval time =   52152.76 ms /   676 runs   (   77.15 ms per token,    12.96 tokens per second)
total time =   83551.43 ms 

print_timings] prompt eval time     =    4193.06 ms /  3127 tokens (    1.34 ms per token,   745.76 tokens per second)
print_timings] generation eval time =  128732.01 ms /  2048 runs   (   62.86 ms per token,    15.91 tokens per second)
total time =  132925.07 ms

print_timings] prompt eval time     =    1446.77 ms /   990 tokens (    1.46 ms per token,   684.29 tokens per second) 
print_timings] generation eval time =    8243.42 ms /   143 runs   (   57.65 ms per token,    17.35 tokens per second) 
total time =    9690.18 ms

print_timings] prompt eval time     =    2612.31 ms /  1897 tokens (    1.38 ms per token,   726.18 tokens per second)
print_timings] generation eval time =    2411.32 ms /    42 runs   (   57.41 ms per token,    17.42 tokens per second) 
total time =    5023.63 ms 

Kimi-Dev-72B-EXL3 5.0bpw 47.3 gb

No TP

439 tokens generated in 118.71 seconds (Queue: 0.0 s, Process: 0 cached 
tokens and 20531 new tokens at 366.76 T/s, Generate: 7.0 T/s, Context: 20531 tokens) 

570 tokens generated in 74.64 seconds (Queue: 0.0 s, Process: 9 cached 
tokens and 3799 new tokens at 418.39 T/s, Generate: 8.69 T/s, Context: 3808 tokens) 

58 tokens generated in 8.83 seconds (Queue: 0.0 s, Process: 9 cached tokens 
and 990 new tokens at 388.24 T/s, Generate: 9.24 T/s, Context: 999 tokens) 

38 tokens generated in 8.76 seconds (Queue: 0.0 s, Process: 9 cached tokens 
and 1882 new tokens at 411.82 T/s, Generate: 9.06 T/s, Context: 1891 tokens) 

Native TP

386 tokens generated in 76.36 seconds (Queue: 0.0 s, Process: 0 cached 
tokens and 20531 new tokens at 395.51 T/s, Generate: 15.79 T/s, Context: 20531 tokens)

2047 tokens generated in 114.87 seconds (Queue: 0.0 s, Process: 9 cached 
tokens and 3799 new tokens at 420.24 T/s, Generate: 19.34 T/s, Context: 3808 tokens) 

2047 tokens generated in 102.78 seconds (Queue: 0.0 s, Process: 9 cached 
tokens and 990 new tokens at 436.12 T/s, Generate: 20.37 T/s, Context: 999 tokens) 

10 tokens generated in 4.99 seconds (Queue: 0.0 s, Process: 9 cached tokens 
and 1882 new tokens at 418.22 T/s, Generate: 20.23 T/s, Context: 1891 tokens) 

NCCL TP


716 tokens generated in 77.55 seconds (Queue: 0.0 s, Process: 0 cached 
tokens and 20531 new tokens at 655.52 T/s, Generate: 15.49 T/s, Context: 20531 tokens) 

2047 tokens generated in 113.65 seconds (Queue: 0.0 s, Process: 9 cached 
tokens and 3799 new tokens at 725.0 T/s, Generate: 18.88 T/s, Context: 3808 tokens) 

36 tokens generated in 3.22 seconds (Queue: 0.0 s, Process: 9 cached tokens 
and 990 new tokens at 697.18 T/s, Generate: 19.98 T/s, Context: 999 tokens) 

13 tokens generated in 3.28 seconds (Queue: 0.0 s, Process: 9 cached tokens 
and 1882 new tokens at 718.32 T/s, Generate: 19.69 T/s, Context: 1891 tokens) 

@ikawrakow
Owner Author

I see you use Q8_0 KV cache for ik_llama.cpp. Do you do the same with EXL3?

But apart from this, what I see is

  • ik_llama.cpp without TP wipes the floor with EXL3 without TP (nearly 2X faster for PP and TG)
  • With its best TP, EXL3 is about the same as ik_llama.cpp without TP for PP
  • With its best TP, EXL3 is about 15% faster for TG than ik_llama.cpp without TP; but the EXL3 model is ~7% smaller, so the difference would be more in the 10% range for the same model size, which is comparable to the TG speedup you got from split mode "graph" for Llama-3-70B

So, in other words, your system is pretty bad at TP; but since EXL3 is exceptionally slow without TP, you see a large benefit from TP there.

Or is your interpretation of these results different?
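
The ratios above can be sanity-checked against the short-context (~1k token) numbers quoted earlier in the thread (a rough back-of-the-envelope, using ik_llama.cpp's 17.35 t/s TG vs EXL3's 9.24 t/s without TP and 20.37 t/s with native TP):

```python
ik_tg = 17.35         # ik_llama.cpp TG t/s at ~1k context (split mode "layer")
exl3_no_tp_tg = 9.24  # EXL3 TG t/s without TP at ~1k context
exl3_tp_tg = 20.37    # EXL3 TG t/s with native TP at ~1k context

# "nearly 2X faster" without TP
assert ik_tg / exl3_no_tp_tg > 1.8

# "about 15% faster" with best TP, before adjusting for the ~7% smaller model
print(round(exl3_tp_tg / ik_tg - 1, 2))  # ≈ 0.17
```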

@Ph0rk0z

Ph0rk0z commented Dec 8, 2025

You could say that, but it's not true for all model architectures. I also wouldn't say my system "sucks" at TP; many backends don't support it at all, and PCIe 3.0 with fake NVLink and a Xeon only gets you so far. Apparently Cohere is supported, and this is a model that actually needs the GPUs. Should add that, yeah, I only use Q8 cache on both. Sometimes Q6, which finally made an entrance here.

    CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --interleave=all ./bin/llama-server \
    -m /models/Agatha-111B-v1-Q4_K_L/Agatha-111B-v1-Q4_K_L-00001-of-00002.gguf \
    -t 48 \
    -c 32768 \
    -ts 24,24,24,24 \
    --numa distribute \
    -ngl 99 \
    -ctk q8_0 \
    -ctv q8_0 \
    -fa 1 \
    -sm layer \
    -gr \
    --no-mmap \
    -cuda fusion=1

Agatha-111B-v1-Q4_K_L 67.9GB

print_timings] prompt eval time     =   43220.59 ms / 20576 tokens (    2.10 ms per token,   476.07 tokens per second) 
print_timings] generation eval time =   67435.15 ms /   750 runs   (   89.91 ms per token,    11.12 tokens per second)
print_timings]           total time =  110655.74 ms
    
print_timings] prompt eval time     =    8373.28 ms /  4062 tokens (    2.06 ms per token,   485.11 tokens per second) 
print_timings] generation eval time =   46668.34 ms /   555 runs   (   84.09 ms per token,    11.89 tokens per second) 
print_timings]           total time =   55041.62 ms

print_timings] prompt eval time     =    2287.41 ms /  1044 tokens (    2.19 ms per token,   456.41 tokens per second) 
print_timings] generation eval time =    3722.24 ms /    48 runs   (   77.55 ms per token,    12.90 tokens per second) 
print_timings]           total time =    6009.66 ms 

print_timings] prompt eval time     =    4263.96 ms /  2036 tokens (    2.09 ms per token,   477.49 tokens per second)
print_timings] generation eval time =     332.56 ms /     5 runs   (   66.51 ms per token,    15.03 tokens per second)
print_timings]           total time =    4596.53 ms 

Agatha-111B-v1-EXL3 5.0 BPW 76.1GB

Native TP

658 tokens generated in 83.95 seconds (Queue: 0.0 s, Process: 0 cached 
tokens and 21193 new tokens at 522.0 T/s, Generate: 15.18 T/s, Context: 21193 tokens) 

639 tokens generated in 44.1 seconds (Queue: 0.0 s, Process: 530 cached 
tokens and 4066 new tokens at 528.05 T/s, Generate: 17.55 T/s, Context: 4596 tokens) 

80 tokens generated in 6.36 seconds (Queue: 0.0 s, Process: 530 cached 
tokens and 1047 new tokens at 510.73 T/s, Generate: 18.58 T/s, Context: 1577 tokens)

6 tokens generated in 4.18 seconds (Queue: 0.0 s, Process: 530 cached tokens
and 2035 new tokens at 528.57 T/s, Generate: 18.27 T/s, Context: 2565 tokens) 

NCCL TP

788 tokens generated in 82.69 seconds (Queue: 0.0 s, Process: 0 cached 
tokens and 21193 new tokens at 705.26 T/s, Generate: 14.97 T/s, Context: 21193 tokens) 

614 tokens generated in 41.07 seconds (Queue: 0.0 s, Process: 530 cached 
tokens and 4065 new tokens at 727.19 T/s, Generate: 17.3 T/s, Context: 4595 tokens) 

59 tokens generated in 4.76 seconds (Queue: 0.0 s, Process: 530 cached 
tokens and 1047 new tokens at 684.31 T/s, Generate: 18.26 T/s, Context: 1577 tokens) 

12 tokens generated in 3.5 seconds (Queue: 0.0 s, Process: 530 cached tokens
and 2035 new tokens at 719.08 T/s, Generate: 18.05 T/s, Context: 2565 tokens)

@ikawrakow
Owner Author

Sometimes Q6 which finally made an entrance here.

Q6_0 KV cache has been available here since October 2024.

@Ph0rk0z

Ph0rk0z commented Dec 8, 2025

I didn't see anyone use it on mainline, and I remember the quirk where the key had to be at least q8_0 or the PPL would start to get bad. I think the value could be Q4_0, per the original PR.

I can say llama.cpp in general has come far in terms of speed, and especially this fork. If I had any Mistrals in GGUF I could test that too.

@ikawrakow
Owner Author

I didn't see anyone use it on mainline and remember the quirk where key had to be at least q8_0 or the ppl would start to get bad. I think value could be Q4_0 per the original PR.

This is not llama.cpp. PPL starts going bad at Q5, which is the next step mainline llama.cpp offers after Q8. ik_llama.cpp has had Q6 for a long time, and it has been the favorite KV cache type for people who pay attention and don't confuse ik_llama.cpp with llama.cpp (for instance, @Nexesenex).
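
For reference, selecting the Q6_0 KV cache uses the same flags as the commands quoted earlier in this thread (model path hypothetical; other flags as in those commands):

```shell
# Q6_0 KV cache in ik_llama.cpp; -ctk/-ctv set the K and V cache types
./bin/llama-server -m /models/model.gguf -fa 1 -ctk q6_0 -ctv q6_0 -ngl 99
```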

@Ph0rk0z

Ph0rk0z commented Dec 8, 2025

His comment is the only reason I knew it existed here.

@magikRUKKOLA

magikRUKKOLA commented Dec 8, 2025

@ikawrakow

I cannot download and test models for every arch supported by ik_llama.cpp, that's simply too much.

Are there any LLMs you would like pre-downloaded? The machine in question has 4 TB of storage.

[EDIT]: preliminary results. The 4 GPUs have been connected via risers (one GPU with two risers, for extra extension). I should probably install some Noctua industrial fans etc. Ha.
The box will likely be available (via remote access) starting tomorrow or so.

[EDIT2]:

The 4x3090 box is ready? Give me instructions what I need to do to log in, along with time slots when I can use it.

So basically it's ready.

 /opt/nvidia/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/build/p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3090, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 3090, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA GeForce RTX 3090, pciBusID: 42, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0	     1     1     1     1
     1	     1     1     1     1
     2	     1     1     1     1
     3	     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 832.00  11.16  10.63  11.10 
     1  11.31 833.78  10.48  10.77 
     2  10.09  10.25 829.79   9.85 
     3  11.11  11.18  10.67 832.89 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3 
     0 834.67  26.07  25.08  25.84 
     1  26.39 902.66  25.15  25.61 
     2  22.60  22.24 821.50  23.00 
     3  25.97  26.25  25.70 837.35 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 840.51  14.67  15.24  16.49 
     1  14.63 907.64  15.05  16.57 
     2  15.63  15.51 823.23  13.90 
     3  16.38  16.15  14.45 908.69 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 907.38  50.76  46.81  51.21 
     1  51.21 906.59  49.70  51.22 
     2  49.79  49.82 838.22  50.29 
     3  50.65  51.05  49.34 907.11 
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3 
     0   1.48  15.60  16.77  15.87 
     1  15.06   1.44  20.68  16.21 
     2  18.07  20.03   1.65  20.10 
     3  15.16  17.24  19.33   1.49 

   CPU     0      1      2      3 
     0   3.20   9.56  27.69   9.30 
     1   9.72   3.08   9.34   9.35 
     2   9.42   9.33   3.06   9.23 
     3   9.51   9.54   9.27   3.09 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3 
     0   1.47   1.11   1.16   1.16 
     1   1.12   1.45   1.16   1.14 
     2   1.83   1.29   1.74   1.84 
     3   1.22   1.34   1.13   1.48 

   CPU     0      1      2      3 
     0   3.18   2.63   2.67   2.64 
     1   2.75   3.16   2.91   2.65 
     2   2.69   2.67   3.16   2.62 
     3   2.91   2.66   2.67   3.17 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
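As a quick cross-check of the connectivity matrix above, the interconnect topology can also be inspected without building the CUDA samples. This is a sketch using standard NVIDIA driver tooling; the exact `nvidia-smi` query fields are listed in `nvidia-smi --help-query-gpu`:

```shell
# Print the GPU-to-GPU interconnect matrix (the PIX/PXB/PHB/NODE/SYS
# legend is included in the output); useful for seeing why P2P bandwidth
# differs between GPU pairs.
nvidia-smi topo -m

# Show the current PCIe link generation and width per GPU, which bounds
# the ~25 GB/s unidirectional P2P numbers seen above.
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current \
  --format=csv
```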

Let's agree on the access protocol. I suggested somewhere around here that you could use something like Tailscale. Alternatively, we can provide a VPS whose sole purpose is to provide a dedicated IP address, and from there you'd have another SSH session (tunnel) via some other means to the server in question.
Which option would you prefer?

@ikawrakow

along with time slots when I can use it.

I don't think that's necessary. I have a bunch of other boxes I can use in case you decide to use the quad-box. No worries at all. You'll have root access. In case something bothers you (something is running, etc.), just terminate it.

@Ph0rk0z

Ph0rk0z commented Dec 9, 2025

Cool to see that test. Your bandwidth is basically doubled, but our latencies are fairly comparable. There is another test, all-to-all, where the speeds fall much more.

@magikRUKKOLA

@Ph0rk0z

Your bandwidth is basically doubled but our latencies are fairly comparable.

Can you drop your results (of the p2pBandwidthLatencyTest from the CUDA Toolkit) here?

@Ph0rk0z

Ph0rk0z commented Dec 10, 2025

Man... I thought I had already pasted it in here, but maybe not? aikitoria/open-gpu-kernel-modules@7c82991#r165839881
