ggml: backend-agnostic tensor parallelism #19378
JohannesGaessler wants to merge 45 commits into ggml-org:master from
Conversation
I'm not seeing a deadlock, just a crash in the driver with an invalid descriptor. I ran It seems like tensor->data (and tensor->view_src->data) are too large. I haven't debugged further.
Works for me on 2x 3090 for LLaMA 3 8B and Mistral Nemo 12B. On Devstral I get an OOM (expected because of model size?), but on Ministral 14B too. Qwen 4B has a different issue.
Interesting. If the CPU backend is able to be virtualized into multiple devices as described here, would it be possible to allow multiple NUMA nodes to be parallelized?
Will this PR help if we use multiple RPC backends connected to GPUs together with the CPU, and when we use the CPU MoE flag?
@jacekpoplawski the combination of

@DocShotgun longer-term I intend to also enable this code for better NUMA support, though I'm not yet sure what to do in terms of hardware for development. Originally I had intended to buy 1.5 TiB of DDR5 RAM and 2 EPYC CPUs, but at the current prices that would be financially irresponsible of me to do.

@gopinath87607 I don't understand what you mean.
If you're looking for a DDR4 platform, I might be able to help with that, but unfortunately not so much with RAM for the system. I'm also in Germany. I wouldn't mind giving you access to my systems if you want; I have two dual-Xeon systems, one with P40s and the other with Mi50s. Would be very happy to help either way.
Started doing some initial tests to get familiar with the changes. I'm using virtual Metal devices and things appear to be mostly working - e.g. seeing graph execution on both devices. However, the following command does not produce identical results on each run:

GGML_METAL_DEVICES=2 ./bin/llama-completion -m ~/models/llama-3.1-8b/ggml-model-f16.gguf -no-cnv -p "I believe the meaning of life is" -n 32 --top-k 1 -sm tensor

I think this either means that there could be a problem with the fallback for the missing backend API, or that I could have made an error in the implementation of the events and the async copy in Metal. Will investigate more.

In the meantime, @JohannesGaessler do you confirm that the command above has deterministic results on your end? Also, is it deterministic with the fallback calls?
Generally speaking you cannot expect bit-for-bit identical results if you split the computation across multiple virtual devices. The order in which floats are summed up will be different, which will in turn change the rounding error. If I run

If you use
Sorry, I think I misread your post. If you are saying that the results are not deterministic with 2 virtual GPUs but they are with 1 GPU, then I think that is indicative of a bug w.r.t. the synchronization.
Yes, I understand that 1 GPU vs. 2 GPUs will not be bit-for-bit identical. What I mean is that in my test, running the command with 2 GPUs several times produces non-deterministic results from one run to the next:

GGML_METAL_DEVICES=2 ./bin/llama-completion -m ~/models/llama-3.1-8b/ggml-model-f16.gguf -no-cnv -p "I believe the meaning of life is" -n 32 --top-k 1 -sm tensor
# run 1
I believe the meaning of life is to find your gift. The purpose of life is to give it away.
I believe that the meaning of life is to find your gift. The purpose of life
# run 2
I believe the meaning of life is to find your gift. The purpose of life is to give it away. To give it away, you have to find it. [end of text]
# run 3
I believe the meaning of life is to be happy. I believe that happiness is the only thing that matters. I believe that happiness is the only thing that matters. I believe that happiness is the
Yes, seems like a synchronization issue. I was wondering if you observe it on your end with and without the fallback backend API; this will give me an indication of where to look for the issue.
With the command you posted (minus the
There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
I pushed a rebased version with the recent improvements by @gaugarg-nv as well as more comprehensive tests in
Not sure what the etiquette is here for folk who can't contribute meaningfully to the technical side of things, but here's some testing from the AMD/Vulkan world on a pair of R9700s:

Dense: -73% PP, -56% TG

Is an optimisation pass planned for Vulkan, or is it coming in a later PR to avoid holding this one up?
I am one of the CUDA/ROCm maintainers so those are the backends I'm targeting. I don't know what the exact plans of the Vulkan maintainers are. |
Fair enough - apologies, I shall step out of the way then :)
Vulkan can do device-to-device DMA via device groups, but Vulkan implementations have spotty support for this and the Vulkan backend always goes via the CPU. This is probably the reason for the poor performance; @0cc4m knows the exact details.
I don't know what it is exactly, I'd have to profile it. |
This PR adds support for backend-agnostic tensor parallelism, enabled via specifying `--split-mode tensor`. This is done by adding a new "meta" backend that internally wraps multiple "simple" backends but can be used in the same way as a regular ggml backend.

ggml Backend Interface Changes
This PR extends the ggml backend interface with some new functions for tensor copying:
- `set_tensor_2d_async`/`get_tensor_2d_async`, which are equivalent to `memcpy2DAsync` in CUDA. This is not needed for the computation of the meta backend itself but rather for setting/getting weights or the output. Currently not implemented; as a workaround the one-dimensional version is used in a loop.
- `shfl_tensor_async` to allow two ggml backends to exchange two tensors and to synchronize on the completion of the exchange. As a fallback `cpy_tensor_async` can be used, but this has a higher latency because the copy in one direction can only start once the one in the other direction has finished (see the sketch below). Needed for a generic AllReduce between ggml backends. Implemented.
- `allreduce_tensor_async` to allow ggml backends to specify a backend-specific way to do an AllReduce operation. Intended to be used for NCCL support in cooperation with @gaugarg-nv. Not yet implemented.

@slaren please provide feedback regarding whether you agree that these operations should be in the ggml backend interface. For context, all of them are optional and can use existing operations as a fallback.
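For illustration, here is a rough sketch of what the `cpy_tensor_async` fallback for an exchange amounts to, using only the existing public functions `ggml_backend_tensor_copy_async` and `ggml_backend_synchronize`. This is not the PR's actual implementation, and the function/parameter names are made up; it only shows why the serialized exchange has higher latency than a dedicated `shfl_tensor_async` hook:

```cpp
// Rough sketch of the cpy_tensor_async fallback for exchanging partial
// results between two backends: the copy in one direction can only start
// once the copy in the other direction has completed.
#include "ggml.h"
#include "ggml-backend.h"

static void exchange_partials_fallback(ggml_backend_t backend_a, struct ggml_tensor * a_send, struct ggml_tensor * a_recv,
                                       ggml_backend_t backend_b, struct ggml_tensor * b_send, struct ggml_tensor * b_recv) {
    // copy A's partial result into B's receive tensor
    ggml_backend_tensor_copy_async(backend_a, backend_b, a_send, b_recv);
    ggml_backend_synchronize(backend_a);
    ggml_backend_synchronize(backend_b);

    // only now can the copy in the opposite direction start
    ggml_backend_tensor_copy_async(backend_b, backend_a, b_send, a_recv);
    ggml_backend_synchronize(backend_a);
    ggml_backend_synchronize(backend_b);
}
```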
Meta Backend Implementation Details
The meta backend implements an entire ggml backend stack starting from a meta device. The meta device is created from multiple simple backend devices as well as a function to determine how the data should be split across devices ("split states"). Backend buffer types, buffers, and backends are created as per usual. When calling `ggml_backend_graph_compute` the code infers the split states of the nodes in the compute graph based on the split states assigned to the weights/KV cache. The basic pattern is to make all tensors mirrored by default. For the weight matrices, do a split in dimension 1, then a split in dimension 0, then an AllReduce. For a transformer this means two AllReduce operations, one after the attention and one after the FFN. The attention is effectively split by dimension 0, which equates to a split by attention head.

A generic AllReduce operation is performed in the meta backend by splitting the graph into subgraphs. After a subgraph is executed, `shfl_tensor_async` is called to make the backends exchange partial results, and they then execute auxiliary graphs that contain only a `GGML_ADD` operation to combine the results (a conceptual sketch follows below).
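To make the AllReduce pattern concrete, here is a conceptual sketch of the auxiliary `GGML_ADD` step on one backend (not the PR's actual code; `allreduce_add_step`, `partial_local`, and `partial_remote` are made-up names, and `partial_remote` is assumed to have already been filled by the preceding exchange):

```cpp
// Conceptual sketch of the auxiliary GGML_ADD step of the generic AllReduce.
// 'ctx' is assumed to be a fresh ggml context created with no_alloc = true;
// 'partial_local' and 'partial_remote' are same-shaped tensors on 'backend'.
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

static struct ggml_tensor * allreduce_add_step(ggml_backend_t backend,
                                               struct ggml_context * ctx,
                                               struct ggml_tensor * partial_local,
                                               struct ggml_tensor * partial_remote) {
    // auxiliary graph containing only a single GGML_ADD node
    struct ggml_tensor * combined = ggml_add(ctx, partial_local, partial_remote);
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, combined);

    // allocate the output of the auxiliary graph on this backend and run it
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);
    GGML_ASSERT(buf != NULL);
    ggml_backend_graph_compute(backend, gf);

    // 'combined' now holds the full (mirrored) result; the caller owns ctx/buf
    return combined;
}
```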
The memory allocation for the compute graph is rather tricky: the way I solved it is to allocate the memory for the meta backend as per usual and to then transplant the calculated addresses, relative to the backend buffer base pointer, to the underlying simple backends (loosely sketched below). Because the simple tensors only require a fraction of the full memory this yields correct results, though it does result in overallocation for the compute graphs. For the weights/KV cache the memory allocation for the meta backend is done via a new function `ggml_backend_meta_alloc_ctx_tensors_from_buft` to prevent duplicated weights (which are much larger in size). I'm not yet sure what the best approach will be long-term; I think the graph allocation code in `ggml-alloc.c` will need to be adjusted.
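As a loose illustration of the address transplanting described above (not the PR's actual code; the tensor and buffer names are placeholders), the idea is to reuse each tensor's offset within the meta buffer as its offset within the corresponding simple backend buffer:

```cpp
// Loose illustration of "transplanting" meta-buffer offsets onto a simple
// backend buffer.
#include "ggml.h"
#include "ggml-backend.h"

static void transplant_address(const struct ggml_tensor * meta_tensor,
                               ggml_backend_buffer_t       meta_buffer,
                               struct ggml_tensor *        simple_tensor,
                               ggml_backend_buffer_t       simple_buffer) {
    // offset of the tensor within the meta backend's compute buffer ...
    const size_t offset = (size_t) ((const char *) meta_tensor->data -
                                    (const char *) ggml_backend_buffer_get_base(meta_buffer));

    // ... reused as-is within the simple backend buffer; the simple tensor
    // only needs a fraction of the space, so this is safe but overallocates
    simple_tensor->data   = (char *) ggml_backend_buffer_get_base(simple_buffer) + offset;
    simple_tensor->buffer = simple_buffer;
}
```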
- `--tensor-split` has no effect.
- `llama_params_fit` is not implemented, so the context size has to be set manually.
- `llama_params_fit`.
- `ggml_tensor::data` is set to dummy values, since that is what is checked in `ggml-alloc.c` to determine whether or not a tensor is considered allocated. This dummy value should never actually be used for any computations, but I don't consider this a good solution.

Performance
LLaMA 3 on 2x RTX 4090
Generally speaking it can be observed that parallelizing larger models has better performance than parallelizing smaller models. Similarly, parallelizing the model becomes more worthwhile as the context depth increases. This makes sense as both of these result in a larger workload per GPU vs. the overhead from parallelization. Token generation benefits more from parallelization than prompt processing because the amount of data that needs to be transferred between GPUs is proportional to batch size; long-term it may make sense to implement support for FP16/BF16 compute types, which would cut the I/O in half vs. FP32 (a back-of-the-envelope estimate follows below). For pp512 pipeline parallelism is effectively disabled, while for pp2048 it's enabled. With pipeline parallelism `-sm layer` is still faster than `-sm tensor` even at high context depths.
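As a rough back-of-the-envelope estimate of the inter-GPU traffic (my own illustration, not a figure from the PR; it assumes LLaMA 3 8B with an embedding size of 4096 and 32 layers, 2 GPUs, and the two AllReduce operations per layer described above), the data each GPU sends per generated token is on the order of

$$
32 \,\text{layers} \times 2 \,\text{AllReduces} \times 4096 \times 4\,\text{B (FP32)} = 1\,\text{MiB},
$$

which an FP16/BF16 compute type would roughly halve, and which scales linearly with the batch size during prompt processing.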