But it also looks like the backend scheduler is not going to help:
* It copies mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed with attention evaluation, GPU 1 must wait for GPU 0 to finish its entire attention calculation
* Same with the FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must wait for GPU 0 to finish its entire FFN calculation before it can start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing a bespoke TP scheduler
The graph is still not being computed in parallel. Why? Because the scheduler creates graph splits where the result of the computation on one GPU becomes an input for the other split. Hence, to trigger the computation on the second GPU one needs to wait for the computation on the first GPU to finish, even though the two can be done in parallel up to the synchronization point. So, all that is left to do is to trick the scheduler into creating two splits that can be done in parallel, and then have a graph split where the results get combined.
This change tricks it into doing the right thing^TM. Still quite a bit slower than split mode layer for the 8B LlaMA model. But for the 70B LlaMA it now beats split mode layer for TG: 28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s. In comparison, split mode "row" in mainline gets 484 t/s PP and 19.3 t/s TG.
Granularity for Wq, Wo is not just the head size, but head size * gqa_ratio. Otherwise the Wk, Wv tensors end up not being a multiple of the head size when we divide the split determined for Wo by the gqa_ratio.
but no tensor overrides yet, just ngl < num_layers.
Now PP is faster than split mode layer for L3-70B.
2x regular GPU
4x Regular GPU
2xGPU - Mainline
4x GPU - - Mainline
2x GPU - TP
4x GPU - TP
With 4x GPU, I don't see a lot of b/w. Maybe 2-3 GiB/s in nvtop. GPU usage is from 27-37%, with one GPU at 46%. Most of the transfers really happen during PP anyway. I have turbo disabled (in ss, he doesn't) and PCIE3 through 2 PLX with an x16 link per PLX. I have seen higher transfer on nccl wan so I don't think it's a transfer bottleneck just yet. Must be something else?
What are the GPU's? 3090? If so, why would your split mode "layer" PP performance be 25% lower than mine? |
What a cool release, and a great read! Thanks a lot!
Please note that the first GPU's temperature is 79C and it's getting thermal-throttled, so the core clock is only 1920 MHz. The second GPU, which is placed at the bottom, is at 57C, which allows it to utilize almost all of the applied +100 MHz offset, so 1950 -> 2040 MHz. So this means that air cooling is not enough and liquid cooling is necessary. Also, I have a question regarding the BW. Is it only about 1.6 Gbit/sec? Would it make any difference if you installed the NvLink bridge (a 3-slot bridge in your case)? [EDIT]:
Aha. So the NvLink should help then. But will it help for the config with 4 GPUs where only two NvLinks are installed? |
You run with full clocks. I have it capped at 1695. I can do at least 2 cards without.
t48 is left over from CPU running. It should be using a single thread. That's how mainline has been for ages.
1.6 isn't maximal bandwidth. That's just what I see on this model. Wan sees 6-8 in NCCL. I have also done p2p tests and all that jazz. btw, 4 is 2080ti 22g. not P2P Watch your nvtop and see what you get. It only appears briefly. edit: ok, so I did some tests and it's none of those things. All I can think of is the crappy single threaded performance of xeon vs threadripper or nvidia driver/cuda disfavoring ampere. It's 580.82.09 and cuda release 12.6, V12.6.20 |
I tested the branch yesterday, after "playing games with the scheduler", and today again once you released the PR. My GPU/PCIE setup is as such:
Command: First observation: the GPU occupation is now sufficient in TG to trigger the P2 state of my GPUs, instead of playing with crappy locked overclocks. And that's great for my comfort of use and the durability of my GPUs. Second, the PP is still lower for me than layer split, by a ratio of approximately 2.5 at 200 ctx and 2 at 10,000 ctx. Maybe my PCIE bandwidth on GPUs 1 and 2 is not enough to enjoy the benefit of TP, PP wise. On the other hand, the TG is 5% to 10% higher, mainly due to better thermal management of my cards now that I don't need to lock the frequency. This, with no overclock and a higher undervolt to keep the GPUs cool. If I maintain the fixed overclock, the TG drops below split layer due to thermal constraints. Finally, there's a bug when resetting the context. Let's say I chat using llama-server, then start another chat in the same client or another, and it provokes this: I can also have this error:
When loading a Q5_K_S 70B Llama 3.x model with split graph, while it poses no problem with split layer. I patched it myself by replacing that: With that: And then, in backend.cpp, line 217, commented out an assert.
And it works. |
You do know that you can make a daemon that monitors the GPU utilization and overclocks it or undervolts it depending on load? Plus you can manage the fan speed.
That would be a very bad idea for sure. The better way is to apply the OC offsets for the GPU clock and the VRAM clock. This way the GPU itself would do the overclocking depending on the current temperatures. |
I'm not that savvy, especially under Windows 11. Nvidia is very pesky about states and frequencies, Cuda wise. So with my undervolt, I use MSI afterburner until I find the right kick to trigger P2. Now, with TP, it triggers much more easily. I'm a true amateur, Rukkola! ^^ |
The graphs below apparently show that the two-GPU config works faster than three (not surprisingly). 3 x 3090, Llama-3.3-70B-Instruct-Q4_0.gguf

Details

#!/usr/bin/env bash
export MALLOC_CONF="background_thread:true,percpu_arena:phycpu,metadata_thp:auto,dirty_decay_ms:10000,muzzy_decay_ms:60000"
export LD_PRELOAD=/usr/local/lib/libjemalloc.so
ulimit -n 9999
ulimit -l unlimited
export CUDA_VISIBLE_DEVICES="0,1,2"
/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-sweep-bench \
--warmup-batch \
-m /opt/unsloth/Llama-3.3-70B-Instruct-GGUF/Q4_0/Llama-3.3-70B-Instruct-Q4_0.gguf \
-sm graph \
-ub 1024 \
--threads $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
-c 32768 \
-mla 3 \
-fa on \
-ctk q8_0 \
-ctv q8_0 \
--no-mmap \
  -ngl 99

File: /opt/unsloth/Llama-3.3-70B-Instruct-GGUF/Q4_0/ik_llama.cpp-bench-sm-graph-f16-2gpu.log
File: /opt/unsloth/Llama-3.3-70B-Instruct-GGUF/Q4_0/ik_llama.cpp-bench-sm-layer-f16-2gpu.log
File: /opt/unsloth/Llama-3.3-70B-Instruct-GGUF/Q4_0/ik_llama.cpp-bench-sm-graph-f16-3gpu-72kctx.log
File: /opt/unsloth/Llama-3.3-70B-Instruct-GGUF/Q4_0/ik_llama.cpp-bench-sm-graph-f16.log
File: /opt/unsloth/Llama-3.3-70B-Instruct-GGUF/Q4_0/ik_llama.cpp-bench-sm-graph.log
File: /opt/unsloth/Llama-3.3-70B-Instruct-GGUF/Q4_0/ik_llama.cpp-bench-sm-layer-f16.log
File: /opt/unsloth/Llama-3.3-70B-Instruct-GGUF/Q4_0/ik_llama.cpp-bench-sm-layer.log
ik_llama.cpp/ggml/src/ggml-cuda.cu Line 3057 in bcdd303 Apparently the P2P access is never getting enabled huh? |
Thank you all for testing! @Nexesenex @Ph0rk0z Is it possible that you both don't have P2P copy enabled? If I disable P2P copy on my system (
From @magikRUKKOLA's results it looks like I need to think about limiting TP to 2 GPUs, so using GPU0,1 for the first N1 layers, GPU2,3 for the second N2 layers, etc. An odd number of GPUs is of course awkward, as things don't divide nicely into equally sized portions. This is perhaps less of an issue for the FFN part, but for sure is an issue for self-attention. For instance, Llama-70B has 8 KV heads, so we can only split as 3,3,2 with 3 GPUs.
This code is an irrelevant remnant from the split mode "row" implementation (and I guess I should remove it to avoid confusion). The copy between GPUs is now always done using this: ik_llama.cpp/ggml/src/ggml-cuda.cu Line 580 in b0818db
I have zero experience with NvLink, so I don't know the answer. The BW for the P2P copy is determined by the PCI-E BW of the two GPUs involved. It should be 30 GB/s in my case, but I observe 20-22 GB/s in practice.
OK, I just made another change that allows copying all partial results as
Some napkin math: before the last change we had 790 t/s, so 1024/790 = 1.296 seconds per batch of 1024. We now have 879 t/s, so 1024/879 = 1.165 s/batch, so we saved 0.131 seconds per batch. Before the change we were copying 7.5 GiB of data per batch, now we copy 5 GiB/batch, so 2.5 GiB less per batch. This works out to 2.5 GiB / 0.131 seconds = 19 GiB/second effectively for the P2P copy.
It's from about +15% (beginning of the ctx) down to +5% (the end of 72k ctx) -- for the 3-GPU setup.

Details

```
main: n_kv_max = 73728, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 99, n_threads = 64, n_threads_batch = 64
```
Thanks for the data point. Yes, the amount of data exchanged between the GPUs does not change with context length. With increasing context length computation becomes more expensive, so the fraction of the time spent copying data between GPUs decreases, so the impact of this optimization decreases. But looking at all these datapoints, I'm starting to think that I should add another CUDA parameter that controls the precision of the intermediate GPU results being exchanged. It might as well be that |
I was wondering if it's possible to estimate the quantization error prior to the F16 to Q8_0 conversion. That is, to use the CUDA code to first check the scale per block, etc. The mock code looks like:

#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cstdint>
// Simple Q8 block: 1 scale + 32 int8s
struct Q8Block {
half scale;
int8_t vals[32];
};
// Single decision function combining all checks
__device__ bool should_use_q8(const half* block, float* out_scale) {
// Find max and mean
float max_abs = 0.0f;
float sum_abs = 0.0f;
#pragma unroll
for (int i = 0; i < 32; i++) {
float v = fabsf(__half2float(block[i]));
max_abs = fmaxf(max_abs, v);
sum_abs += v;
}
*out_scale = max_abs / 127.0f;
float mean_abs = sum_abs / 32.0f;
// Check 1: Scale too small or too large
if (*out_scale < 1e-5f || *out_scale > 1.0f) return false;
// Check 2: Granularity (step size vs mean)
float granularity = *out_scale / (mean_abs + 1e-8f);
if (granularity > 0.02f) return false; // Max 2% steps
// Check 3: Actual max error
float max_err = 0.0f;
#pragma unroll
for (int i = 0; i < 32; i++) {
float orig = __half2float(block[i]);
int q = __float2int_rn(orig / *out_scale); // clamp in int before narrowing to avoid int8 overflow
q = max(-127, min(127, q));
float recon = (float)q * *out_scale;
float err = fabsf(orig - recon) / (fabsf(orig) + 1e-8f);
max_err = fmaxf(max_err, err);
}
return max_err < 0.01f; // Max 1% error
}
// Main kernel: activations in, mixed Q8/F16 out
__global__ void quantize_activations(
const half* __restrict__ activations, // Input activations (FP16)
Q8Block* __restrict__ q8_out, // Q8 output
half* __restrict__ f16_out, // FP16 fallback output
uint8_t* __restrict__ flags, // 1=Q8, 0=F16
int n_elements
) {
int block_idx = blockIdx.x * blockDim.x + threadIdx.x;
int offset = block_idx * 32;
if (offset >= n_elements) return;
// Load 32 elements
half vals[32];
#pragma unroll
for (int i = 0; i < 32; i++) {
vals[i] = (offset + i < n_elements) ?
activations[offset + i] : __float2half(0.0f);
}
// Decide
float scale;
if (should_use_q8(vals, &scale)) {
// Store as Q8
Q8Block block;
block.scale = __float2half(scale);
#pragma unroll
for (int i = 0; i < 32; i++) {
float v = __half2float(vals[i]);
block.vals[i] = max(-127, min(127, __float2int_rn(v / scale)));
}
q8_out[block_idx] = block;
flags[block_idx] = 1;
} else {
// Store as F16
for (int i = 0; i < 32 && offset + i < n_elements; i++) {
f16_out[offset + i] = vals[i];
}
flags[block_idx] = 0;
}
}
// Launch wrapper
void quantize_activation_tensor(
const half* d_activations,
Q8Block* d_q8_out,
half* d_f16_out,
uint8_t* d_flags,
int n_elements,
cudaStream_t stream = 0
) {
int n_blocks = (n_elements + 31) / 32;
int threads = 256;
int blocks = (n_blocks + threads - 1) / threads;
quantize_activations<<<blocks, threads, 0, stream>>>(
d_activations, d_q8_out, d_f16_out, d_flags, n_elements
);
}
|
This is of course possible, but relatively expensive. It then also becomes a nightmare to handle: one GPU decides that Overall I think it is just easier to make it a command line option. We convert activations to |
OK, the P2P copy is not disabled, but is it also supported? The easiest way to find out is to just rebuild with Btw, in your initial benchmarks, where you also have |
I've got patched driver and tested with NCCL. It is definitely supported. Transfer speeds without P2P are terrible, so is latency. Mainline 2 cards.
There's overhead with that many GPUs, and only a single thread is used at 100%. I remember getting 17.99 t/s on probably these same llama models when I had only 2x3090, some P40s and a Broadwell Xeon. CPU single-core perf and probably the driver play into it. I got my OG server in April of '23 and the upgrade board I had to fix around January of '24. Wish I had bought Epyc on an H12SSL, because they appreciated in price and I would have probably had fewer issues. Was just lazy and wanted a "full" solution. Didn't know small caveats like PLX, QPI, etc. would have such effects. Can also try to do some runs for exllama2/3 with similar sized models, but there's no sweep bench so it might be apples to oranges.
Just a small note. Well, in case you're sure that by using cudaMemcpyPeerAsync it automatically utilizes the P2P transfers (/optimizations in terms of topology etc.), that's cool (perhaps I need to read more docs etc.). But keep in mind that if you're using bidirectional transfers with the currently installed P2P-enabled 580.105.08, then your unidirectional speed via P2P PCIe will be: .... /usr/share/doc/nvidia-cuda-toolkit/examples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest

quote: *please note that the P620 Lenovo is usually about 5% slower in P2P transfers for some unknown reason than a regular motherboard with an AMD Threadripper. But that is negligible for our particular case.

So the theoretical max for P2P transfers in one direction is about 26 GB/s. You could get about 2 times more like: but that's only for the bidirectional transfers, which have no place (right?) in the sequential algo you described? That said, there is a chance that NvLink would perform somewhat better, since the NvLink specs say the speed should be about 2 times better than P2P transfers. Well, it would also allow taking the load off the CPU servicing the PCIe lanes.

In the case of an AMD Threadripper PRO it's likely pointless -- the CPU supports up to 128 PCIe lanes (the 3995wx), so that's about 4 GPUs with PCIe 4.0 at full x16, but only 3 GPUs at PCIe 4.0 x8. The rest of the lanes went somewhere else. So ideally one would strive for a 4x GPU workstation. In the case of the RTX 3090 (which are really cheap nowadays ... about 700 EUR a pop) one could get a workstation equal in terms of VRAM to an RTX 6000 Pro Blackwell for a third of the price!

Should we consider building workstations with 8 GPUs instead of 4? I am suggesting we could use NvLink for the GPUs that are connected to the CPU via PCIe 4.0 x8 (or whatever) etc., so that each NvLink would connect a GPU with x16 to a GPU which is sitting at x8.
This way we will be able to connect 8 x RTX 3090 per workstation. That's 192 GB per workstation. The next step would of course be to find the right network cards to connect two or three of them together (P2P, via a ring architecture) and reach the target space of 400 or 600 GB of pure VRAM. Systems such as these would require dedicated AC and liquid cooling of course (which would be beneficial for overclocking as well). What do you think? Is it worth it? Do you have any better ideas of what to do in the pursuit of optimal performance?
FWIW, nvlink + p2p driver doesn't work together. |
I thought about bidirectional copies, but it was difficult (difficult as in I didn't manage to do it) to convince the backend to do the right graph splits. I didn't know that for 2 GPUs bidirectional is twice as fast as one direction, so I didn't try harder. But if it is true, then for 2 GPUs doing the following will be faster
For 4 GPUs we currently have 6 copies per synchronization point (GPU-1,2,3 copy their data to GPU-0; GPU-0 computes the sum and copies the result back to GPU-1,2,3). If we change to each GPU copying its results to every other GPU, we will make 3x4 = 12 copies. If bidirectional is twice as fast, then it will take the same time as it currently does. For 8 GPUs we currently make 14 copies. Going to each GPU copying its result to every other GPU will require 7x8 = 56 copies. Hence, even if bidirectional is twice as fast, it will still be 2 times slower. Concerning building multi-GPU boxes: I'm not an expert in hardware, but from what I have seen so far with my TP implementation, having fewer, bigger (more VRAM) and faster GPUs seems better than having more GPUs with the same total amount of VRAM. I now have TP for GLM-4.5/4.6 ready, and there TG is slower than split mode layer even for 2 GPUs. I'll make another PR in a bit and give more details there.
|
Closed in favor of #1022, which includes all changes in this branch. |





This is a very rough around the edges POC for tensor parallelism (TP). It is a completely different implementation than TP via split mode "row" in mainline llama.cpp. I prefer to call it "graph parallel" rather than "tensor parallel", but I guess TP is the better known term.
I have (almost?) completely removed the split mode "row" related code inherited from mainline (but broken in `ik_llama.cpp` for a while). TP is realized when building the computation graph, instead of black magic voodoo in the CUDA code. Model tensors are split across rows or columns (depending on the tensor) to allow whole portions of the computation graph to be computed in parallel without synchronization between GPUs (so, that's why "graph parallel"). Instead of having to synchronize after every matrix multiplication, as is the case in mainline's split mode "row", there are basically two synchronization points per model layer -- one after self attention, and one after the feed-forward network -- where results from the GPUs involved get added together (instead of being concatenated as in split mode "row"). Most importantly, the KV cache for each layer is also split between GPUs, and self attention is computed in parallel, including `V * softmax(K*Q)`, which is the most computationally expensive part of the transformer architecture for long contexts.

The POC is very rough because
(`-mqkv`, `-gr`)

Nevertheless, I wanted to put it out there for visibility (and to give mainline developers more time to fully independently discover the approach /s).
To use it, the command line option is `-sm graph` or `--split-mode graph`. How much gets offloaded to what GPU can still be controlled via `-ts` or `--tensor-split` (but if not provided, the split is determined by available VRAM).

I have developed and tested on a 2x3090 system donated by @magikRUKKOLA, so many thanks again!
For the 8B LlaMA model, TP is still slower than split mode "layer", except for long context PP. But for the 70B LlaMA, this TP implementation beats split mode "layer" for PP and TG. Here are some `sweep-bench` graphs and tables for these two models quantized with `Q4_0` (so I can run the same model in `ik_llama.cpp` and `llama.cpp`). With the 70B model I can only go to a context of 16k tokens with the 2x3090 system and full offload.

LlaMA-8B
ik_llama.cpp, split mode "layer"
ik_llama.cpp, split mode "graph"
llama.cpp, split mode "layer"
llama.cpp, split mode "row"
LlaMA 70B
ik_llama.cpp, split mode "layer"
ik_llama.cpp, split mode "graph"
llama.cpp, split mode "layer"
llama.cpp, split mode "row"
Some details
Feed-forward-network (FFN)
Ignoring for simplicity biases and such, one needs to compute `D * ((U*X) . unary(G*X))`, where `*` stands for matrix multiplication, the dot indicates element-wise multiplication, `X` are the input activations, `G` is the gate tensor (`ffn_gate`), `U` is the up tensor (`ffn_up`), and `D` is the down tensor (`ffn_down`). `U` and `G` are `M x N`, `X` is `M x K`, and `D` is `N x M` (with `M` being the embedding size, `N` the FFN size, and `K` the batch size, i.e., number of tokens). The outcomes of `U*X` and `G*X` are `N x K` matrices. Let us for simplicity consider splitting these operations between 2 GPUs.

The split mode "row" approach does the following:
* `U`, `G` are split into two `M x N/2` tensors, and `D` is split into `N x M/2` (i.e., split along rows)
* `U*X` and `G*X` result in `N/2 x K` matrices
* These get concatenated into an `N x K` matrix. This requires synchronization between the GPUs
* The multiplication with the `D` halves is done in parallel
* `U`, `G` are split into two `M x N/2` tensors (i.e., split along rows), while `D` is split into two `N/2 x M` tensors (i.e., split along columns)
* Each GPU computes the full `D * ((U*X) . unary(G*X))` operation with its half of the tensors
* The per-GPU partial results get added together at the synchronization point

Self attention
Ignoring biases and intermediate normalizations for simplicity, here one needs to compute `W_O * (V * softmax(K * rope(W_Q*X)))`. Here `K` and `V` stand for the K- and V-cache, `X` are the activations as above, and `W_O` is the attention output tensor. Before computing these operations, one needs to compute `rope(W_K * X)` and `W_V * X`, and store (copy) the result into the K- and V-cache. The PR does the following:
* Split `W_K`, `W_Q` and `W_V` along rows
* Split `W_O` along columns
* Each GPU computes its portion of `rope(W_K * X)` and `W_V * X`, and can store (copy) the result into its own KV cache
* Each GPU computes its portion of `V * softmax(K * rope(W_Q * X))` (using flash attention), and then performs the matrix multiplication with its portion of `W_O`

Compute graph handling
The "split backend", which in the split mode "row" implementation is used to coordinate GPU synchronization and to split the work between the GPUs, is only used to coordinate model loading and data copy of the appropriate model tensors portions to the corresponding GPUs. It is not used during computation at all. Instead, when the compute graph is built, the self attention and FFN portions are done per GPU, which allows the
ggmlback-end to construct graph splits that can be computed on the GPUs in parallel. Theggmlbackend required a bit of help to do the right thing. I have added a parameter in theggml_addoperation. When set, the back-end starts a new graph split. Without this little hack, the backend would construct graph splits, which would prevent GPU1 to start working before GPU0 has done its work. Doing it in this way has the distinct advantage of allowing easier TP implementation for other back-ends (e.g., Vulkan), as the only thing that is required is to implement the model tensor data copying to the appropriate GPUs. Still, in retrospectI'm not 100% that this was the right decision becauseNext steps
Additional notes
Data transfer between GPUs is entirely non-negligible during PP. Here is some napkin math for LlaMA-70B
fp16, so the data actually copied is 6.25 GiB, which takes about 0.3 seconds. So the maximum possible t/s would be 1024/(0.68 + 0.3) = 1045 t/s. Hence, the observed 790 t/s is about 75% of what is theoretically achievable, which means that in addition to the data transfer bottleneck, there is also a significant synchronization overhead.