ggml: backend-agnostic tensor parallelism#19378

Open
JohannesGaessler wants to merge 45 commits into ggml-org:master from JohannesGaessler:ggml-meta-backend-8

Conversation

@JohannesGaessler
Contributor

@JohannesGaessler JohannesGaessler commented Feb 5, 2026

This PR adds support for backend-agnostic tensor parallelism, enabled by specifying --split-mode tensor. This is done by adding a new "meta" backend that internally wraps multiple "simple" backends but can be used in the same way as a regular ggml backend.

ggml Backend Interface Changes

This PR extends the ggml backend interface with some new functions for tensor copying:

  • set_tensor_2d_async/get_tensor_2d_async, which are equivalent to cudaMemcpy2DAsync in CUDA. These are not needed for the computation of the meta backend itself but rather for setting/getting weights or the output. Not yet implemented; as a workaround, the one-dimensional version is called in a loop.
  • shfl_tensor_async to allow two ggml backends to exchange two tensors and to synchronize on the completion of the exchange. As a fallback, cpy_tensor_async can be used, but this has higher latency because the copy in one direction can only start once the copy in the other direction has finished. Needed for a generic AllReduce between ggml backends. Implemented.
  • allreduce_tensor_async to allow ggml backends to specify a backend-specific way to perform an AllReduce operation. Intended to be used for NCCL support in cooperation with @gaugarg-nv. Not yet implemented.

@slaren please provide feedback regarding whether you agree that these operations should be in the ggml backend interface. For context, all of them are optional and can use existing operations as a fallback.
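The one-dimensional workaround for the missing 2-D copies can be sketched as follows. This is a hedged illustration in Python/NumPy standing in for device memory; set_tensor_2d_loop is a made-up name, not part of the ggml API:

```python
import numpy as np

# Sketch of the workaround described above: a 2-D strided copy
# (cudaMemcpy2DAsync-style) emulated with one 1-D copy per row.
def set_tensor_2d_loop(dst, row0, col0, src):
    rows, cols = src.shape
    for i in range(rows):
        # each iteration corresponds to one contiguous 1-D copy
        dst[row0 + i, col0:col0 + cols] = src[i]

dst = np.zeros((4, 8))
src = np.arange(6, dtype=float).reshape(2, 3)
set_tensor_2d_loop(dst, 1, 2, src)
assert (dst[1:3, 2:5] == src).all()
```

The proposed 2-D variants would replace this per-row loop with a single asynchronous call, which matters when rows are short and numerous.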

Meta Backend Implementation Details

The meta backend implements an entire ggml backend stack starting from a meta device. The meta device is created from multiple simple backend devices together with a function that determines how the data should be split across devices ("split states"). Backend buffer types, buffers, and backends are then created as usual. When ggml_backend_graph_compute is called, the code infers the split states of the nodes in the compute graph from the split states assigned to the weights/KV cache. The basic pattern is that all tensors are mirrored by default. Weight matrices are split first along dimension 1, then along dimension 0, followed by an AllReduce. For a transformer this means two AllReduce operations per layer, one after the attention and one after the FFN. The attention is effectively split along dimension 0, which equates to a split by attention head.
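The split pattern for a weight-matrix pair can be illustrated with a NumPy sketch. This is a hedged illustration of the general tensor-parallelism math, not the actual ggml code; the sizes and the tanh activation are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal((4, 8))    # activations, mirrored on both devices
w1 = rng.standard_normal((8, 16))   # split along dim 1 (output columns)
w2 = rng.standard_normal((16, 8))   # split along dim 0 (input rows)

partials = []
for i in range(2):                  # "device" i
    w1_i = w1[:, i*8:(i+1)*8]       # column slice of w1
    w2_i = w2[i*8:(i+1)*8, :]       # matching row slice of w2
    h_i = np.tanh(x @ w1_i)         # elementwise act. on disjoint columns
    partials.append(h_i @ w2_i)     # local partial result

y_tp  = partials[0] + partials[1]   # the AllReduce (sum) combines partials
y_ref = np.tanh(x @ w1) @ w2        # single-device reference
assert np.allclose(y_tp, y_ref)
```

Because the elementwise activation acts on disjoint column slices, the dim-1/dim-0 split pair needs only a single AllReduce at the end, which is why a transformer layer gets away with two of them.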

A generic AllReduce operation is performed in the meta backend by splitting the graph into subgraphs. After a subgraph is executed, shfl_tensor_async is called to make the backends exchange partial results; they then execute auxiliary graphs containing only a GGML_ADD operation to combine the results.
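For two devices the exchange-then-add scheme reduces to the following. A minimal sketch with NumPy arrays standing in for device tensors:

```python
import numpy as np

# Each "device" holds a partial result of the same shape.
partial = [np.array([1.0, 2.0]), np.array([10.0, 20.0])]

# shfl_tensor_async: the two backends exchange their partials.
recv = [partial[1].copy(), partial[0].copy()]

# Auxiliary graph with a single GGML_ADD per device combines them.
result = [partial[i] + recv[i] for i in range(2)]

# After the AllReduce both devices hold the full sum.
assert np.array_equal(result[0], np.array([11.0, 22.0]))
assert np.array_equal(result[0], result[1])
```

With the cpy_tensor_async fallback the two copies behind `recv` would have to run sequentially rather than concurrently, which is the latency difference described above.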

The memory allocation for the compute graph is rather tricky. The way I solved it is to allocate the memory for the meta backend as usual and to then transplant the calculated addresses, relative to the backend buffer base pointer, to the underlying simple backends. Because the simple tensors only require a fraction of the full memory this yields correct results, though it does result in overallocation for the compute graphs. For the weights/KV cache, the memory allocation for the meta backend is done via a new function ggml_backend_meta_alloc_ctx_tensors_from_buft to prevent duplicated weights (which are much larger). I'm not yet sure what the best approach will be long-term; I think the graph allocation code in ggml-alloc.c will need to be adjusted.
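The address-transplant idea can be sketched as follows. All names and addresses here are made up for illustration; the real code works on ggml buffer base pointers:

```python
# The meta allocator plans tensor offsets relative to its buffer base...
meta_offsets = {"ffn_out-0": 0, "attn_out-0": 4096}

# ...and each simple backend reuses the same offsets relative to its own
# (hypothetical) base address.
simple_bases = [0x10000000, 0x20000000]

addrs = [{name: base + off for name, off in meta_offsets.items()}
         for base in simple_bases]

# Each simple tensor only occupies a fraction of the planned region, so
# the layout stays valid but the compute buffers are overallocated.
assert addrs[0]["attn_out-0"] == 0x10000000 + 4096
assert addrs[1]["ffn_out-0"] == 0x20000000
```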

Current Issues/Limitations

  • Only 1 or 2 GPUs are supported. Note: 1 GPU is not actually any faster; this is only useful for testing whether the code works correctly.
  • All GPUs must have an equal share of the data; --tensor-split has no effect.
  • Only dense models are supported. The LLaMA 3 models seem to be working correctly; I have not yet tested others.
  • Support for llama_params_fit is not implemented, so the context size has to be set manually.
  • Without FlashAttention the code will probably crash because some transition between split states is not yet implemented.
  • In principle all backends should work. CUDA does in my testing; Vulkan, however, does not. I think there may be some deadlock issues between the GPUs. @jeffbolznv @0cc4m, if you could take a look it would be appreciated.
  • Memory for the ggml contexts is being overallocated.
  • Performance is (presumably) still suboptimal vs. NCCL.
  • I'm currently using tensor names to determine how to split individual tensors. I think it would be preferable to use some sort of enum instead (which we already seem to have for loading tensors). This should also be used for llama_params_fit.
  • I'm currently setting ggml_tensor::data to dummy values since that is what ggml-alloc.c checks to determine whether a tensor is considered allocated. This dummy value should never actually be used for any computations, but I don't consider this a good solution.
  • I'm not putting meta devices into the ggml backend device registry (which I think is correct).

Performance

LLaMA 3 on 2x RTX 4090
| model                 | test              | t/s -sm layer | t/s -sm row | t/s -sm tensor |
| --------------------- | ----------------- | ------------: | ----------: | -------------: |
| llama 8B Q4_0         | pp512             |      12550.97 |     2997.41 |        6305.68 |
| llama 8B Q4_0         | pp2048            |      18788.11 |     2970.83 |        6300.12 |
| llama 8B Q4_0         | tg128             |        175.43 |       67.00 |         101.98 |
| llama 8B Q4_0         | pp512 @ d32768    |       5099.65 |     2200.50 |        4137.73 |
| llama 8B Q4_0         | pp2048 @ d32768   |       7925.21 |     2242.84 |        4337.54 |
| llama 8B Q4_0         | tg128 @ d32768    |         96.78 |       49.76 |         102.81 |
| llama 8B Q4_0         | pp512 @ d65536    |       3154.69 |     1748.05 |        3139.40 |
| llama 8B Q4_0         | pp2048 @ d65536   |       4996.19 |     1806.82 |        3404.70 |
| llama 8B Q4_0         | tg128 @ d65536    |         67.11 |       40.27 |          83.62 |
| llama 8B Q4_0         | pp512 @ d131072   |       1800.72 |     1243.57 |        2152.01 |
| llama 8B Q4_0         | pp2048 @ d131072  |       2867.83 |     1294.61 |        2238.01 |
| llama 8B Q4_0         | tg128 @ d131072   |         41.66 |       29.19 |          59.53 |
| llama 8B F16          | pp512             |      10578.54 |     1591.55 |        5950.44 |
| llama 8B F16          | pp2048            |      15890.45 |     1581.04 |        6005.96 |
| llama 8B F16          | tg128             |         60.07 |       46.32 |          70.64 |
| llama 8B F16          | pp512 @ d32768    |       4745.19 |     1330.02 |        4032.19 |
| llama 8B F16          | pp2048 @ d32768   |       7329.28 |     1347.02 |        4279.02 |
| llama 8B F16          | tg128 @ d32768    |         47.03 |       38.02 |          64.63 |
| llama 8B F16          | pp512 @ d65536    |       3033.10 |     1154.54 |        3100.10 |
| llama 8B F16          | pp2048 @ d65536   |       4782.02 |     1176.88 |        3341.03 |
| llama 8B F16          | tg128 @ d65536    |         38.72 |       32.19 |          56.48 |
| llama 8B F16          | pp512 @ d131072   |       1735.63 |      905.39 |        2090.03 |
| llama 8B F16          | pp2048 @ d131072  |       2782.42 |      936.55 |        2288.65 |
| llama 8B F16          | tg128 @ d131072   |         28.60 |       24.79 |          43.32 |
| llama 70B Q3_K - Small | pp512            |       1287.50 |      582.81 |        1072.80 |
| llama 70B Q3_K - Small | pp2048           |       1954.83 |      590.45 |        1069.28 |
| llama 70B Q3_K - Small | tg128            |         27.97 |       20.36 |          29.65 |
| llama 70B Q3_K - Small | pp512 @ d32768   |        776.80 |      458.17 |         812.77 |
| llama 70B Q3_K - Small | pp2048 @ d32768  |       1185.82 |      459.43 |         824.99 |
| llama 70B Q3_K - Small | tg128 @ d32768   |         20.85 |       16.19 |          29.39 |

Generally speaking, parallelizing larger models yields better results than parallelizing smaller ones. Similarly, parallelization becomes more worthwhile as the context depth increases. This makes sense, as both result in a larger workload per GPU relative to the parallelization overhead. Token generation benefits more from parallelization than prompt processing because the amount of data that needs to be transferred between GPUs is proportional to batch size; long-term it may make sense to implement support for FP16/BF16 compute types, which would halve the inter-GPU I/O vs. FP32. For pp512 pipeline parallelism is effectively disabled, while for pp2048 it's enabled. With pipeline parallelism, -sm layer is still faster than -sm tensor even at high context depths.

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs Vulkan Issues specific to the Vulkan backend examples ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Apple Metal https://en.wikipedia.org/wiki/Metal_(API) Ascend NPU issues specific to Ascend NPUs OpenCL Issues specific to the OpenCL backend IBM zDNN issues specific to IBM zDNN Accelerator labels Feb 5, 2026
@jeffbolznv
Contributor

In principle all backends should work. CUDA does in my testing, Vulkan however does not. I think there may be some issues with deadlock between the GPUs. @jeffbolznv @0cc4m if you could take a look it would be appreciated.

I'm not seeing a deadlock, just a crash in the driver with an invalid descriptor. I ran llama-bench.exe -fa 1 -p 512 -n 0 -m c:\models\llama-2-7b.Q4_0.gguf -sm tensor

Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] | MessageID = 0xc23dafe5
vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[2].offset (18362368) is greater than or equal to buffer size (8388608).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)

-		dst_buf	{buffer=shared_ptr {buffer={m_buffer=0x0000be00000000be {...} } device_memory={m_deviceMemory=0x0000bf00000000bf {...} } ...} [0x00000003 strong refs] [make_shared] ...}	vk_subbuffer
+		buffer	shared_ptr {buffer={m_buffer=0x0000be00000000be {...} } device_memory={m_deviceMemory=0x0000bf00000000bf {...} } ...} [0x00000003 strong refs] [make_shared]	std::shared_ptr<vk_buffer_struct>
		offset	0x0000000001183000	unsigned __int64
		size	0x0000000000800000	unsigned __int64
-		dst	0x0000018f98716fd0 {type=GGML_TYPE_F32 (0x00000000) buffer=0x0000018f98385c80 {iface={free_buffer=0x00007ffbc6647c50 {ggml-vulkan.dll!ggml_backend_vk_buffer_free_buffer(ggml_backend_buffer *)} ...} ...} ...}	ggml_tensor *
		type	GGML_TYPE_F32 (0x00000000)	ggml_type
+		buffer	0x0000018f98385c80 {iface={free_buffer=0x00007ffbc6647c50 {ggml-vulkan.dll!ggml_backend_vk_buffer_free_buffer(ggml_backend_buffer *)} ...} ...}	ggml_backend_buffer *
+		ne	0x0000018f98716fe0 {0x0000000000001000, 0x0000000000000200, 0x0000000000000001, 0x0000000000000001}	__int64[0x00000004]
+		nb	0x0000018f98717000 {0x0000000000000004, 0x0000000000004000, 0x0000000000800000, 0x0000000000800000}	unsigned __int64[0x00000004]
		op	GGML_OP_ADD (0x00000002)	ggml_op
+		op_params	0x0000018f98717024 {0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, ...}	int[0x00000010]
		flags	0x00000010	int
+		src	0x0000018f98717068 {0x00000191a1b1d1b0 {type=GGML_TYPE_F32 (0x00000000) buffer=0x0000018f97d4f130 {iface=...} ...}, ...}	ggml_tensor *[0x0000000a]
-		view_src	0x00000191a1b1d1b0 {type=GGML_TYPE_F32 (0x00000000) buffer=0x0000018f97d4f130 {iface={free_buffer=0x00007ffbc6647c50 {ggml-vulkan.dll!ggml_backend_vk_buffer_free_buffer(ggml_backend_buffer *)} ...} ...} ...}	ggml_tensor *
		type	GGML_TYPE_F32 (0x00000000)	ggml_type
+		buffer	0x0000018f97d4f130 {iface={free_buffer=0x00007ffbc6647c50 {ggml-vulkan.dll!ggml_backend_vk_buffer_free_buffer(ggml_backend_buffer *)} ...} ...}	ggml_backend_buffer *
+		ne	0x00000191a1b1d1c0 {0x0000000000001000, 0x0000000000000200, 0x0000000000000001, 0x0000000000000001}	__int64[0x00000004]
+		nb	0x00000191a1b1d1e0 {0x0000000000000004, 0x0000000000004000, 0x0000000000800000, 0x0000000000800000}	unsigned __int64[0x00000004]
		op	GGML_OP_MUL_MAT (0x0000001d)	ggml_op
+		op_params	0x00000191a1b1d204 {0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, ...}	int[0x00000010]
		flags	0x00000010	int
+		src	0x00000191a1b1d248 {0x0000018f9f29d940 {type=GGML_TYPE_Q4_0 (0x00000002) buffer=0x0000018f97d4f670 {...} ...}, ...}	ggml_tensor *[0x0000000a]
+		view_src	0x0000000000000000 <NULL>	ggml_tensor *
		view_offs	0x0000000000000000	unsigned __int64
		data	0x0000000001184000	void *
+		name	0x00000191a1b1d2b0 "attn_out-0"	char[0x00000040]
		extra	0x0000000000000000	void *
+		padding	0x00000191a1b1d2f8 ""	char[0x00000008]
		view_offs	0x0000000000000000	unsigned __int64
		data	0x0000000001184000	void *
+		name	0x0000018f987170d0 "attn_out-0 (view)"	char[0x00000040]
		extra	0x0000000000000000	void *
+		padding	0x0000018f98717118 ""	char[0x00000008]

It seems like tensor->data (and tensor->view_src->data) are too large. I haven't debugged further.

@jacekpoplawski
Contributor

Works for me on 2x RTX 3090 for LLaMA 3 8B and Mistral Nemo 12B.

On Devstral I get an OOM (expected because of the model size?):
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 30720.00 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 32212254720

Thread 1 "llama-server" received signal SIGABRT, Aborted.
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
__pthread_kill_implementation (threadid=<optimized out>, signo=6, no_tid=0) at ./nptl/pthread_kill.c:44
warning: 44     ./nptl/pthread_kill.c: No such file or directory
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=6, no_tid=0) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (threadid=<optimized out>, signo=6) at ./nptl/pthread_kill.c:89
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:100
#3  0x00007fffedc4579e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fffedc288cd in __GI_abort () at ./stdlib/abort.c:73
#5  0x00007ffff7719ec5 in ggml_abort (file=0x7ffff77bdaa8 "/home/jacek/git/llama.cpp/ggml/src/ggml-backend.cpp", line=119, fmt=0x7ffff77bd88f "GGML_ASSERT(%s) failed")
    at /home/jacek/git/llama.cpp/ggml/src/ggml.c:256
#6  0x00007ffff773565b in ggml_backend_buffer_get_size (buffer=0x0) at /home/jacek/git/llama.cpp/ggml/src/ggml-backend.cpp:119
#7  0x00007ffff7743b0e in ggml_backend_meta_alloc_ctx_tensors_from_buft (ctx=0x555557dc41f0, buft=0x55555b032b18) at /home/jacek/git/llama.cpp/ggml/src/ggml-backend-meta.cpp:638
#8  0x00007ffff7734a4a in ggml_backend_alloc_ctx_tensors_from_buft (ctx=0x555557dc41f0, buft=0x55555b032b18) at /home/jacek/git/llama.cpp/ggml/src/ggml-alloc.c:1245
#9  0x00007ffff6f3d6c2 in llama_kv_cache::llama_kv_cache (this=0x555556363770, model=..., type_k=GGML_TYPE_F16, type_v=GGML_TYPE_F16, v_trans=false, offload=true, unified=true, kv_size=393216,
    n_seq_max=4, n_pad=1, n_swa=0, swa_type=LLAMA_SWA_TYPE_NONE, filter=..., reuse=...) at /home/jacek/git/llama.cpp/src/llama-kv-cache.cpp:190
#10 0x00007ffff700faec in llama_model::create_memory (this=0x555556357ea0, params=..., cparams=...) at /home/jacek/git/llama.cpp/src/llama-model.cpp:7617
#11 0x00007ffff6ed008f in llama_context::llama_context (this=0x555557a65260, model=..., params=...) at /home/jacek/git/llama.cpp/src/llama-context.cpp:274
#12 0x00007ffff6edd53c in llama_init_from_model (model=0x555556357ea0, params=...) at /home/jacek/git/llama.cpp/src/llama-context.cpp:3046
#13 0x00005555558f4e24 in common_init_result::common_init_result (this=0x55555624dc70, params=...) at /home/jacek/git/llama.cpp/common/common.cpp:1183
#14 0x00005555558f50d0 in common_init_from_params (params=...) at /home/jacek/git/llama.cpp/common/common.cpp:1215
#15 0x0000555555710999 in server_context_impl::load_model (this=0x55555636eba0, params=...) at /home/jacek/git/llama.cpp/tools/server/server-context.cpp:625
#16 0x00005555556e9ac8 in server_context::load_model (this=0x7fffffff79b0, params=...) at /home/jacek/git/llama.cpp/tools/server/server-context.cpp:2856
#17 0x00005555556177bd in main (argc=7, argv=0x7fffffffe0c8) at /home/jacek/git/llama.cpp/tools/server/server.cpp:248
(gdb)
but on Ministral 14B too:
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 20480.00 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 21474836480

Thread 1 "llama-server" received signal SIGABRT, Aborted.
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
__pthread_kill_implementation (threadid=<optimized out>, signo=6, no_tid=0) at ./nptl/pthread_kill.c:44
warning: 44     ./nptl/pthread_kill.c: No such file or directory
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=6, no_tid=0) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (threadid=<optimized out>, signo=6) at ./nptl/pthread_kill.c:89
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:100
#3  0x00007fffedc4579e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fffedc288cd in __GI_abort () at ./stdlib/abort.c:73
#5  0x00007ffff7719ec5 in ggml_abort (file=0x7ffff77bdaa8 "/home/jacek/git/llama.cpp/ggml/src/ggml-backend.cpp", line=119, fmt=0x7ffff77bd88f "GGML_ASSERT(%s) failed")
    at /home/jacek/git/llama.cpp/ggml/src/ggml.c:256
#6  0x00007ffff773565b in ggml_backend_buffer_get_size (buffer=0x0) at /home/jacek/git/llama.cpp/ggml/src/ggml-backend.cpp:119
#7  0x00007ffff7743b0e in ggml_backend_meta_alloc_ctx_tensors_from_buft (ctx=0x555557e339e0, buft=0x55555abb40b8) at /home/jacek/git/llama.cpp/ggml/src/ggml-backend-meta.cpp:638
#8  0x00007ffff7734a4a in ggml_backend_alloc_ctx_tensors_from_buft (ctx=0x555557e339e0, buft=0x55555abb40b8) at /home/jacek/git/llama.cpp/ggml/src/ggml-alloc.c:1245
#9  0x00007ffff6f3d6c2 in llama_kv_cache::llama_kv_cache (this=0x55555abb9470, model=..., type_k=GGML_TYPE_F16, type_v=GGML_TYPE_F16, v_trans=false, offload=true, unified=true, kv_size=262144,
    n_seq_max=4, n_pad=1, n_swa=0, swa_type=LLAMA_SWA_TYPE_NONE, filter=..., reuse=...) at /home/jacek/git/llama.cpp/src/llama-kv-cache.cpp:190
#10 0x00007ffff700faec in llama_model::create_memory (this=0x555556357e50, params=..., cparams=...) at /home/jacek/git/llama.cpp/src/llama-model.cpp:7617
#11 0x00007ffff6ed008f in llama_context::llama_context (this=0x555557a65140, model=..., params=...) at /home/jacek/git/llama.cpp/src/llama-context.cpp:274
#12 0x00007ffff6edd53c in llama_init_from_model (model=0x555556357e50, params=...) at /home/jacek/git/llama.cpp/src/llama-context.cpp:3046
#13 0x00005555558f4e24 in common_init_result::common_init_result (this=0x55555624dc70, params=...) at /home/jacek/git/llama.cpp/common/common.cpp:1183
#14 0x00005555558f50d0 in common_init_from_params (params=...) at /home/jacek/git/llama.cpp/common/common.cpp:1215
#15 0x0000555555710999 in server_context_impl::load_model (this=0x55555636eb30, params=...) at /home/jacek/git/llama.cpp/tools/server/server-context.cpp:625
#16 0x00005555556e9ac8 in server_context::load_model (this=0x7fffffff79b0, params=...) at /home/jacek/git/llama.cpp/tools/server/server-context.cpp:2856
#17 0x00005555556177bd in main (argc=7, argv=0x7fffffffe0c8) at /home/jacek/git/llama.cpp/tools/server/server.cpp:248
(gdb)
Qwen 4B has a different issue:
/home/jacek/git/llama.cpp/ggml/src/ggml-backend-meta.cpp:386: GGML_ASSERT(ne[split_dim] % n_simple_bufs == 0) failed

Thread 1 "llama-server" received signal SIGABRT, Aborted.
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
__pthread_kill_implementation (threadid=<optimized out>, signo=6, no_tid=0) at ./nptl/pthread_kill.c:44
warning: 44     ./nptl/pthread_kill.c: No such file or directory
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=6, no_tid=0) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (threadid=<optimized out>, signo=6) at ./nptl/pthread_kill.c:89
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:100
#3  0x00007fffedc4579e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fffedc288cd in __GI_abort () at ./stdlib/abort.c:73
#5  0x00007ffff7719ec5 in ggml_abort (file=0x7ffff77bedd8 "/home/jacek/git/llama.cpp/ggml/src/ggml-backend-meta.cpp", line=386, fmt=0x7ffff77beb9f "GGML_ASSERT(%s) failed")
    at /home/jacek/git/llama.cpp/ggml/src/ggml.c:256
#6  0x00007ffff7741e84 in ggml_backend_meta_buffer_init_tensor (buffer=0x55555a4421d0, tensor=0x55555a46df70) at /home/jacek/git/llama.cpp/ggml/src/ggml-backend-meta.cpp:386
#7  0x00007ffff7743a52 in ggml_backend_meta_alloc_ctx_tensors_from_buft (ctx=0x55555a4422f0, buft=0x55555a426188) at /home/jacek/git/llama.cpp/ggml/src/ggml-backend-meta.cpp:632
#8  0x00007ffff7734a4a in ggml_backend_alloc_ctx_tensors_from_buft (ctx=0x55555a4422f0, buft=0x55555a426188) at /home/jacek/git/llama.cpp/ggml/src/ggml-alloc.c:1245
#9  0x00007ffff6fff452 in llama_model::load_tensors (this=0x555556357e50, ml=...) at /home/jacek/git/llama.cpp/src/llama-model.cpp:7055
#10 0x00007ffff6e557fc in llama_model_load (fname="/mnt/models2/Qwen/Qwen_Qwen3-4B-Q8_0.gguf", splits=std::vector of length 0, capacity 0, model=..., params=...)
    at /home/jacek/git/llama.cpp/src/llama.cpp:876
#11 0x00007ffff6e56ce3 in llama_model_load_from_file_impl (path_model="/mnt/models2/Qwen/Qwen_Qwen3-4B-Q8_0.gguf", splits=std::vector of length 0, capacity 0, params=...)
    at /home/jacek/git/llama.cpp/src/llama.cpp:1069
#12 0x00007ffff6e56fe3 in llama_model_load_from_file (path_model=0x555556367610 "/mnt/models2/Qwen/Qwen_Qwen3-4B-Q8_0.gguf", params=...) at /home/jacek/git/llama.cpp/src/llama.cpp:1096
#13 0x00005555558f46c9 in common_init_result::common_init_result (this=0x55555624dc70, params=...) at /home/jacek/git/llama.cpp/common/common.cpp:1107
#14 0x00005555558f50d0 in common_init_from_params (params=...) at /home/jacek/git/llama.cpp/common/common.cpp:1215
#15 0x0000555555710999 in server_context_impl::load_model (this=0x55555636ead0, params=...) at /home/jacek/git/llama.cpp/tools/server/server-context.cpp:625
#16 0x00005555556e9ac8 in server_context::load_model (this=0x7fffffff79d0, params=...) at /home/jacek/git/llama.cpp/tools/server/server-context.cpp:2856
#17 0x00005555556177bd in main (argc=7, argv=0x7fffffffe0e8) at /home/jacek/git/llama.cpp/tools/server/server.cpp:248
(gdb)

@DocShotgun
Contributor

Interesting. If the CPU backend can be virtualized into multiple devices as described here, would it be possible to parallelize across multiple NUMA nodes?

@gopinath87607

Will this PR help if we use multiple RPC backends connected to a GPU together with the CPU, and when we use the CPU MoE flag?

@JohannesGaessler
Contributor Author

@jacekpoplawski the combination of --split-mode tensor and llama_params_fit is not implemented so you'll have to set the context size manually if you didn't already.

@DocShotgun longer-term I intend to also enable this code for better NUMA support, though I'm not yet sure what to do in terms of hardware for development. Originally I had intended to buy 1.5 TiB of DDR5 RAM and 2 EPYC CPUs, but at current prices that would be financially irresponsible.

@gopinath87607 I don't understand what you mean.

@FullstackSensei

@DocShotgun longer-term I intend to also enable this code for better NUMA support though I'm not yet sure what to do in terms of hardware for development. Originally I had intended to buy 1.5 TiB of DDR5 RAM and 2 EPYC CPUs but at the current prices that would be financially irresponsible of me to do.

If you're looking for a DDR4 platform, I might be able to help with that, but unfortunately, not so much with RAM for the system. I'm also in Germany.

Wouldn't mind giving you access to my systems if you want. Have two dual Xeon systems one with P40s and the other with Mi50s.

Would be very happy to help either way.

@ggerganov
Member

ggerganov commented Feb 6, 2026

Started doing some initial tests to get familiar with the changes. I'm using virtual Metal devices and things appear to be mostly working - e.g. I'm seeing graph execution on both devices, and llama-perplexity produces the same result with 2 devices.

However, the following command does not produce identical results on each run:

GGML_METAL_DEVICES=2 ./bin/llama-completion -m ~/models/llama-3.1-8b/ggml-model-f16.gguf -no-cnv -p "I believe the meaning of life is" -n 32 --top-k 1 -sm tensor

I think this either means that there could be a problem with the fallback for the missing backend API, or that I could have made an error in the implementation of the events and the async copy in Metal. Will investigate more.

For context, all of them are optional and can use existing operations as a fallback.

In the meantime, @JohannesGaessler, do you confirm that the command above has deterministic results on your end? Also, is it deterministic with the fallback calls?

@JohannesGaessler
Contributor Author

JohannesGaessler commented Feb 6, 2026

Generally speaking, you cannot expect bit-for-bit identical results if you split the computation across multiple virtual devices: the order in which floats are summed up will be different, which in turn changes the rounding error. If I run llama-perplexity using LLaMA 3 8B F16 I get 6.2560 for -sm layer and 6.2556 for -sm tensor. It's of course still possible that there are bugs on top of that; I would just say that changes to the results are expected. Long-term I think we should test this new code the same way we test ggml op fusion in test-backend-ops.
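The effect of summation order on rounding is easy to demonstrate (a minimal Python example, unrelated to the actual backends):

```python
# Same three values, different summation order, different IEEE-754 result.
a = [1e16, 1.0, -1e16]   # 1e16 + 1.0 rounds back to 1e16, so the 1.0 is lost
b = [1e16, -1e16, 1.0]   # the large terms cancel first, so the 1.0 survives
assert sum(a) == 0.0
assert sum(b) == 1.0
```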

If you use -sm tensor with only a single GPU the executed ops should be the exact same and the result should be bit-for-bit identical to -sm layer.

@JohannesGaessler
Contributor Author

Sorry, I think I misread your post. If you are saying that the results are not deterministic with 2 virtual GPUs but are with 1 GPU, then I think that is indicative of a synchronization bug.

@ggerganov
Member

Yes, I understand that 1 GPU vs. 2 GPUs will not be bit-for-bit identical. What I mean is that in my test, running the command with 2 GPUs several times produces non-deterministic results from one run to the next:

GGML_METAL_DEVICES=2 ./bin/llama-completion -m ~/models/llama-3.1-8b/ggml-model-f16.gguf -no-cnv -p "I believe the meaning of life is" -n 32 --top-k 1 -sm tensor

# run 1
I believe the meaning of life is to find your gift. The purpose of life is to give it away.
I believe that the meaning of life is to find your gift. The purpose of life

# run 2
I believe the meaning of life is to find your gift. The purpose of life is to give it away. To give it away, you have to find it. [end of text]

# run 3
I believe the meaning of life is to be happy. I believe that happiness is the only thing that matters. I believe that happiness is the only thing that matters. I believe that happiness is the

Sorry, I think I misread your post. If you are saying that the results are not deterministic with 2 virtual GPUs but they are with 1 GPU then that I think is indicative of a bug w.r.t. the synchronization.

Yes, seems like a synchronization issue. I was wondering if you observe it on your end with and without the fallback backend API; this will give me an indication of where to look for the issue.

@JohannesGaessler
Contributor Author

With the command you posted (minus the GGML_METAL_DEVICES=2) I get deterministic results on 2x RTX 4090, both with and without the fallback for shfl_tensor_async. The output of llama-perplexity is bit-for-bit identical with vs. without the fallback. So presumably either there is a bug with Metal synchronization or I made assumptions about the behavior of ggml backends that are correct for CUDA but not universally.

@JohannesGaessler
Contributor Author

I pushed a rebased version with the recent improvements by @gaugarg-nv as well as more comprehensive tests in test-llama-archs.

@digitalscream

Not sure what the etiquette is here for folks who can't contribute meaningfully to the technical side of things, but here's some testing from the AMD/Vulkan world on a pair of R9700s:

./llama-bench -m /opt/working/models.2/llama-2-7b-chat.Q4_0.gguf -ngl 99 -fa 1 -p 512,1024,2048 -n 128 -r 2 -sm layer,tensor
| model                          |       size |     params | backend    | ngl |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  layer |  1 |           pp512 |      4683.52 ± 52.71 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  layer |  1 |          pp1024 |     5778.65 ± 981.27 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  layer |  1 |          pp2048 |      7473.95 ± 18.54 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  layer |  1 |           tg128 |        117.84 ± 0.16 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 | tensor |  1 |           pp512 |       1743.54 ± 0.11 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 | tensor |  1 |          pp1024 |       1727.39 ± 0.78 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 | tensor |  1 |          pp2048 |       1708.89 ± 0.86 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 | tensor |  1 |           tg128 |         51.44 ± 0.24 |
./llama-bench -m /opt/working/models.1/Qwen3-Coder-30B-A3B-Instruct-Q4_0.gguf -ngl 99 -fa 1 -p 512,1024,2048 -n 128 -r 2 -sm layer,tensor
| model                          |       size |     params | backend    | ngl |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | Vulkan     |  99 |  layer |  1 |           pp512 |      3819.46 ± 43.47 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | Vulkan     |  99 |  layer |  1 |          pp1024 |     4891.56 ± 546.52 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | Vulkan     |  99 |  layer |  1 |          pp2048 |      5820.59 ± 19.19 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | Vulkan     |  99 |  layer |  1 |           tg128 |        150.67 ± 0.06 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | Vulkan     |  99 | tensor |  1 |           pp512 |       1692.37 ± 7.81 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | Vulkan     |  99 | tensor |  1 |          pp1024 |       1668.51 ± 9.22 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | Vulkan     |  99 | tensor |  1 |          pp2048 |       1645.70 ± 3.88 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | Vulkan     |  99 | tensor |  1 |           tg128 |         35.91 ± 0.27 |

Dense: -73% PP, -56% TG
MoE: -77% PP, -72% TG

Is an optimisation pass planned for Vulkan, or is it coming in a later PR to avoid holding this one up?

@JohannesGaessler
Contributor Author

I am one of the CUDA/ROCm maintainers so those are the backends I'm targeting. I don't know what the exact plans of the Vulkan maintainers are.

@digitalscream

I am one of the CUDA/ROCm maintainers so those are the backends I'm targeting. I don't know what the exact plans of the Vulkan maintainers are.

Fair enough - apologies, I shall step out of the way then :)

@IMbackK
Collaborator

IMbackK commented Mar 26, 2026

Vulkan can do device-to-device DMA via device groups, but Vulkan implementations have spotty support for this and the Vulkan backend always goes via the CPU.

This is probably the reason for the poor performance; @0cc4m knows the exact details.

@0cc4m
Contributor

0cc4m commented Mar 26, 2026

I don't know what it is exactly, I'd have to profile it.

ggerganov and others added 3 commits March 26, 2026 14:04
* llama : more robust logic for determining Meta devices

* cont : fix devs size check

Co-authored-by: Johannes Gäßler <[email protected]>

* cont : fix log type

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>