Conversation
|
I tried and got this: I also checked over my system and found SW_POWERCAP activates momentarily and goes away in ik_llama even at low utilization. It could be the undervolting/clock locking, but I observed it with that disabled and while using 2x GPU, etc. I will check other inference engines, but WAN over NCCL utilizes 90% GPU and runs the cards over 250 W for 10 minutes at a time. [I also checked at the wall.] IK shows 140-180 W. Too cold to try without risers/fewer GPUs/etc. Counters show it spent a lot of time in that state though: About the only lead I have. |
|
This assert was a typo, sorry. It should work now. |
|
This was actually a bit of an ingenious workaround. At 8192 I get decent PP.
The power cap may just be the nvidia-persistenced daemon doing its thing, though. Perhaps this is also a NUMA issue, since this model doesn't fit on one node and TG on a single node is slower for me either way. |
|
@Ph0rk0z All we need to do now is to figure out why your TG performance is so bad... If you feel like experimenting, edit to Rebuild, and run Then upload the generated file here (I think it needs to have a |
|
Heh.. that's pretty cool. I guess it will show you what ops are really laggy. |
|
Wow, this PR is such an ingenious idea! I have 2 5090s at x8/x8 PCIe 5.0 from the CPU and the rest connected via slower links. I.e. I have my devices ordered like this: 5090, 4090, 4090, 5090, A6000, A40, and I load GLM 4.5 IQ4_XS like this To make the TP part happen on the 5090s, I would have to add
If I understood correctly? Or would it be better to reorder the devices so the two 5090s come first, followed by the other GPUs? EDIT: Okay, I tried a lot, but the 2nd 5090 always gets a CUDA illegal memory access sadly, running with
-sm row or the default split mode works fine. |
|
@Panchovix It should work like that, so there is a remaining bug lurking around. If you would run the command you used with (and then |
|
@ikawrakow Okay, here is the log! |
|
I believe you should do that prior to running `cmake --build build --config Debug -j $(nproc)`. Moreover, you'd have to do the |
|
@magikRUKKOLA I did build with debug now, but the log seems to be the same. Here is the full log, omitting which tensor goes to which GPU, as it is too long for the log. |
|
Wait, there was a debug update 1 hour ago haha, let me update and try again. EDIT: nope, it's the same. |
|
I don't see a real error in the above. Perhaps you can try CUDA graphs normally get disabled after a few attempts to capture a graph, but it looks like in your case it leads to a real error? One cannot use CUDA graphs with split mode "graph", so I guess I need to fix the code to not even attempt to capture a graph. |
|
@ikawrakow ran
And got the same issue I think, but it loaded way faster haha |
|
OK, so now you can try running with |
|
@ikawrakow (also did an export GGML_CUDA_DISABLE_GRAPHS=1 before in any case, not sure if it has to be run like that) Got a different error |
|
Thanks! I don't have a hypothesis for what could be wrong, but at least we now know the kernel where the illegal memory access occurs. Btw, I'm noticing that you are using overrides of the type etc. |
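For reference, the `-ot` overrides used in the commands throughout this thread are regex→backend pairs. A minimal sketch of the matching idea (the `place` helper and its CPU default are mine, purely illustrative, not ik_llama.cpp's actual code; dots are escaped here for strictness, while the commands in this thread use unescaped dots, which match anyway):

```python
import re

# Illustrative "-ot" style overrides (pattern=backend), mirroring the form
# used in the commands elsewhere in this thread.
overrides = [
    (r"blk\.(0|1|2|3|4|5)\.ffn_(up|gate|down)_exps\.weight", "CUDA0"),
    (r"blk\.(6|7|8|9)\.ffn_(up|gate|down)_exps\.weight", "CUDA1"),
]

def place(tensor_name, overrides, default="CPU"):
    """Return the backend of the first override whose regex matches."""
    for pattern, backend in overrides:
        if re.search(pattern, tensor_name):
            return backend
    return default

print(place("blk.3.ffn_up_exps.weight", overrides))    # CUDA0
print(place("blk.7.ffn_down_exps.weight", overrides))  # CUDA1
print(place("blk.12.ffn_up_exps.weight", overrides))   # CPU (no pattern matches)
```

Note that the trailing `\.` after the layer-number alternation is what keeps `blk.1` from accidentally matching `blk.12`.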
|
Is that for both CUDA 0 and CUDA 1 devices? And thanks for all the work! |
|
All of your |
|
Something like this? Or did I get confused? Would that help with default parallel processing as well? |
|
Wait, using that let me load the model! |
|
That should be it. I don't think this will be better for sm "layer". This is specific to sm "graph". But it doesn't hurt to try and see what happens. |
|
Okay, now it starts to gen! But I got an error when generating. Ran with EDIT: nvm, it was a prompt issue I think, now it works fine! |
|
Cool that it works! I'm curious if it gives you a better performance than the default split mode "layer". |
|
Okay, with graph I get Running a similar command but adapted for non-graph (aka using less on the other GPUs and more on CUDA 0 and 1), I get So sadly in my case it's a bit slower, but I'm not surprised either, as too many GPUs are running on chipset lanes (I hope in a few weeks to help myself a little with some switches on the CPU slots). |
|
You're doing mega work, bravo, you genius! Sorry to bother, but what could be the next steps in case one has a bunch of RTX 3090s? Unfortunately, right now I only have a 3-GPU system, but pretty soon we will build the quad RTX 3090 with water cooling and DDR5 4800 MT/s. We will have the NVLinks as well. That's 96 GB of VRAM total (the Xeon QYFS does not support more than four PCIe v4 x16 anyway). Would it be possible to use speculative decoding and offload it to a LAN-connected machine (which would be like a simple two-GPU rig with no RAM whatsoever)? |
template command: /opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-sweep-bench \
--warmup-batch \
--model /opt/THIREUS/GLM-4.6-5.4976bpw/GLM-4.6-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01760.gguf \
--alias THIREUS/GLM-4.6-5.4976bpw \
--ctx-size $((64 * 1024)) \
-b $((4 * 1024)) -ub $((4 * 1024)) \
--mlock \
--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.1 \
-ctk q6_0 \
-ctv q6_0 \
-khad \
-amb 512 \
--split-mode layer \
-ts 100,100,0 \
--main-gpu 2 \
-ot "blk.(0|1|2|3|4|5).ffn_(up|gate|down)_exps.weight=CUDA0" \
-ot "blk.(6|7|8|9).ffn_(up|gate|down)_exps.weight=CUDA1" \
-ot "blk.(10|11|12|13|14|15|16|17).ffn_(up|gate|down)_exps.weight=CUDA2" \
-gr \
-ger \
--cpu-moe \
--merge-qkv \
--n-gpu-layers 99 \
--threads $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
--host 0.0.0.0 \
--port 8080 \
--log-enable \
--logdir /var/log/ \
--jinja \
--verbosity 1 \
--verbose-prompt \
--reasoning-format auto \
--prompt-cache "$HOME/.cache/ik_llama.cpp/prompt-cache.bin" --prompt-cache-all \
--slot-save-path "$HOME/.cache/ik_llama.cpp/slot.bin" \
--lookup-cache-dynamic "$HOME/.cache/ik_llama.cpp/slot.bin" \
--keep -1 \
--slot-prompt-similarity 0.35 \
--metrics \
-cuda fusion=1
Note: GRAPHS-related: #1026
Data:
File: bench-v4036-3gpu-graph-v4046-default-split.log
File: bench-v4036-3gpu-graph-v4046-empty-splits-enabled-q6_0kv_khad.log
File: bench-v4036-3gpu-layer-v4046-default-split.log
File: bench-v4036-3gpu-layer-v4046-empty-splits-enabled-q6_0kv_khad.log
System info:
Detected 3 NVIDIA GPU(s)
NVOC System Report
==================
System: Linux xxx 6.16.12+deb14+1-amd64 x86_64
Driver Version: 580.105.08
System Temperatures:
CPU: 68.0°C [OK] (Nominal: 35.0°C, Warn: 97.0°C, Crit: 100.0°C)
RAM: 73.0°C [OK] (Nominal: 35.0°C, Warn: 81.0°C, Crit: 83.0°C)
VR: 80.0°C [OK] (Nominal: 35.0°C, Warn: 115.0°C, Crit: 120.0°C)
GPU 0: NVIDIA GeForce RTX 3090
------------------------------------------------
PCI Bus ID: 00000000:41:00.0
VBIOS Version: 94.02.4B.00.0B
Persistence Mode: Enabled
Core Temperature: 67°C
Power Usage: 220W
Current Power Limit: 400W
Power Limits: Default: 350W, Min: 100W, Max: 400W
GPU Clock: 2025 MHz
VRAM Clock: 10251 MHz
GPU Utilization: 24%
VRAM Utilization: 18%
Memory Usage: 22.2 / 24.0 GB
Applied Offsets: GPU: 100 MHz, VRAM: 1500 MHz
GPU 1: NVIDIA GeForce RTX 3090
------------------------------------------------
PCI Bus ID: 00000000:42:00.0
VBIOS Version: 94.02.4B.00.0B
Persistence Mode: Enabled
Core Temperature: 52°C
Power Usage: 202W
Current Power Limit: 400W
Power Limits: Default: 350W, Min: 100W, Max: 400W
GPU Clock: 2070 MHz
VRAM Clock: 10251 MHz
GPU Utilization: 22%
VRAM Utilization: 16%
Memory Usage: 22.8 / 24.0 GB
Applied Offsets: GPU: 100 MHz, VRAM: 1500 MHz
GPU 2: NVIDIA GeForce RTX 3090
------------------------------------------------
PCI Bus ID: 00000000:61:00.0
VBIOS Version: 94.02.4B.00.0B
Persistence Mode: Enabled
Core Temperature: 57°C
Power Usage: 160W
Current Power Limit: 400W
Power Limits: Default: 350W, Min: 100W, Max: 400W
GPU Clock: 2040 MHz
VRAM Clock: 10251 MHz
GPU Utilization: 5%
VRAM Utilization: 3%
Memory Usage: 22.3 / 24.0 GB
Applied Offsets: GPU: 100 MHz, VRAM: 1500 MHz
Peer-to-Peer (P2P) Support Matrix:
=================================
GPU 0 -> GPU 1: Supported
GPU 0 -> GPU 2: Supported
GPU 1 -> GPU 0: Supported
GPU 1 -> GPU 2: Supported
GPU 2 -> GPU 0: Supported
GPU 2 -> GPU 1: Supported
A dodgy tool to generate the graphs:
File: utils/generate_svgs.sh
#!/bin/bash
# Script to generate decode.svg and prefill.svg from benchmark logs
# Usage: ./generate_svgs.sh [layer_log] [graph_log]
LAYER_LOG="${1:-/opt/ubergarm/GLM-4.5-Air-GGUF/IQ1_KT/bench-sm-layer-f16.log}"
GRAPH_LOG="${2:-/opt/ubergarm/GLM-4.5-Air-GGUF/IQ1_KT/bench-sm-graph-f16.log}"
# Colors: red for layer, blackish for graph (as requested)
LAYER_COLOR="#ea4612"
GRAPH_COLOR="#333333"
if [ ! -f "$LAYER_LOG" ] || [ ! -f "$GRAPH_LOG" ]; then
echo "Error: Log files not found!"
exit 1
fi
create_chart() {
local chart_type="$1"
local col output_file title
if [ "$chart_type" = "decode" ]; then
col=8 # S_TG t/s column (after removing spaces)
title="Decode Speed Comparison (S_TG t/s)"
output_file="decode.svg"
else
col=6 # S_PP t/s column (after removing spaces)
title="Prefill Speed Comparison (S_PP t/s)"
output_file="prefill.svg"
fi
echo "Generating $output_file..."
# Extract data - lines with numeric N_KV values in column 4
awk -v col_num="$col" -F'|' '$4 ~ /^[[:space:]]*[0-9]/ {
gsub(/ /, "");
print $4, $(col_num)
}' "$LAYER_LOG" | sort -n > "/tmp/layer_${chart_type}.dat"
awk -v col_num="$col" -F'|' '$4 ~ /^[[:space:]]*[0-9]/ {
gsub(/ /, "");
print $4, $(col_num)
}' "$GRAPH_LOG" | sort -n > "/tmp/graph_${chart_type}.dat"
echo " Layer data points: $(wc -l < "/tmp/layer_${chart_type}.dat")"
echo " Graph data points: $(wc -l < "/tmp/graph_${chart_type}.dat")"
# Use environment variables to pass parameters to the embedded Python.
# Note: the colors are stored in LAYER_COLOR/GRAPH_COLOR above; export them
# under the lowercase names the Python code reads, otherwise Python silently
# falls back to its built-in defaults.
export chart_type title output_file
export layer_color="$LAYER_COLOR" graph_color="$GRAPH_COLOR"
python3 << 'PYEOF'
import sys, os
import math
title = os.environ.get('title', 'Chart')
layer_color = os.environ.get('layer_color', '#ea4612')
graph_color = os.environ.get('graph_color', '#333333')
chart_type = os.environ.get('chart_type', 'decode')
output_file = os.environ.get('output_file', 'output.svg')
# Read layer data
layer_nkv = []
layer_speed = []
with open(f"/tmp/layer_{chart_type}.dat", "r") as f:
for line in f:
parts = line.strip().split()
if len(parts) >= 2:
layer_nkv.append(float(parts[0]))
layer_speed.append(float(parts[1]))
# Read graph data
graph_nkv = []
graph_speed = []
with open(f"/tmp/graph_{chart_type}.dat", "r") as f:
for line in f:
parts = line.strip().split()
if len(parts) >= 2:
graph_nkv.append(float(parts[0]))
graph_speed.append(float(parts[1]))
if not layer_nkv or not graph_nkv:
print("Error: No valid data points found", file=sys.stderr)
sys.exit(1)
# Calculate ranges
max_kv = max(max(layer_nkv), max(graph_nkv))
min_s = min(min(layer_speed), min(graph_speed))
max_s = max(max(layer_speed), max(graph_speed))
range_s = max_s - min_s
if range_s == 0:
range_s = max_s * 0.1
# Add small padding to the speed range
min_s -= 0.05 * range_s
max_s += 0.05 * range_s
range_s = max_s - min_s
def scale_x(nkv):
return 60 + (nkv / max_kv) * 720
def scale_y(speed):
if range_s == 0:
return 190
return 40 + ((max_s - speed) / range_s) * 300
# Generate SVG to temp file
with open(f"/tmp/{output_file}", 'w') as svg_out:
svg_out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
svg_out.write('<svg width="800" height="400" viewBox="0 0 800 400" xmlns="http://www.w3.org/2000/svg">\n')
# Background and title
svg_out.write(f' <rect width="800" height="400" fill="#f8f9fa"/>\n')
svg_out.write(f' <text x="400" y="25" text-anchor="middle" font-family="Arial, sans-serif" font-size="16" font-weight="bold" fill="#212529">{title}</text>\n')
svg_out.write(' <text x="30" y="200" text-anchor="middle" font-family="Arial, sans-serif" font-size="12" fill="#495057" transform="rotate(-90 30 200)">Speed (tokens/second)</text>\n')
svg_out.write(' <text x="400" y="385" text-anchor="middle" font-family="Arial, sans-serif" font-size="12" fill="#495057">N_KV (cache tokens)</text>\n')
# Calculate Y-axis labels with intelligent spacing
range_val = max_s - min_s
# Determine appropriate step size
if range_val <= 5:
step = 1
elif range_val <= 10:
step = 2
elif range_val <= 20:
step = 4
elif range_val <= 50:
step = 8
else:
# For larger ranges, use steps of 16 or more
base_step = max(1, int(range_val / 6))
# Round to nearest nice number (10, 20, 25, 50, etc.)
if base_step <= 5:
step = 4
elif base_step <= 15:
step = 16
else:
step = max(8, int(base_step / 2) * 3)
# Generate labels
label_min = math.floor(min_s / step) * step
label_max = math.ceil(max_s / step) * step
labels = []
v = label_min
while v <= label_max:
if min_s - step <= v <= max_s + step: # Extended range for better coverage
labels.append(v)
v += step
# Grid lines (horizontal at Y-axis label positions)
svg_out.write(' <g stroke="#e9ecef" stroke-width="1">\n')
# Horizontal grid lines
for speed_val in labels:
y_pos = scale_y(speed_val)
if 40 <= y_pos <= 340:
svg_out.write(f' <line x1="60" y1="{y_pos:.1f}" x2="780" y2="{y_pos:.1f}"/>\n')
# Vertical grid lines (keep original positions)
svg_out.write(' <line x1="240" y1="40" x2="240" y2="340"/>\n')
svg_out.write(' <line x1="420" y1="40" x2="420" y2="340"/>\n')
svg_out.write(' <line x1="600" y1="40" x2="600" y2="340"/>\n')
svg_out.write(' </g>\n')
# Axes
svg_out.write(' <g stroke="#212529" stroke-width="1.5">\n')
svg_out.write(' <line x1="60" y1="340" x2="780" y2="340"/>\n')
svg_out.write(' <line x1="60" y1="40" x2="60" y2="340"/>\n')
svg_out.write(' </g>\n')
# Plot border
svg_out.write(' <rect x="60" y="40" width="720" height="300" fill="none" stroke="#6c757d" stroke-width="1.5" opacity="0.7"/>\n')
# Axis labels
svg_out.write(' <g font-family="Arial, sans-serif" font-size="10" fill="#495057">\n')
# X-axis labels (N_KV values)
svg_out.write(' <text x="60" y="360" text-anchor="middle">0</text>\n')
if max_kv >= 10000:
svg_out.write(f' <text x="240" y="360" text-anchor="middle">{int(max_kv / 4 / 1000)}K</text>\n')
svg_out.write(f' <text x="420" y="360" text-anchor="middle">{int(max_kv / 2 / 1000)}K</text>\n')
svg_out.write(f' <text x="600" y="360" text-anchor="middle">{int(3 * max_kv / 4 / 1000)}K</text>\n')
svg_out.write(f' <text x="780" y="360" text-anchor="middle">{int(max_kv / 1000)}K</text>\n')
else:
svg_out.write(f' <text x="240" y="360" text-anchor="middle">{int(max_kv / 4)}</text>\n')
svg_out.write(f' <text x="420" y="360" text-anchor="middle">{int(max_kv / 2)}</text>\n')
svg_out.write(f' <text x="600" y="360" text-anchor="middle">{int(3 * max_kv / 4)}</text>\n')
svg_out.write(f' <text x="780" y="360" text-anchor="middle">{int(max_kv)}</text>\n')
# Y-axis labels
for speed_val in labels:
y_pos = scale_y(speed_val)
if 40 <= y_pos <= 340: # Only draw within plot area
svg_out.write(f' <text x="50" y="{y_pos:.1f}" text-anchor="end">{int(speed_val) if speed_val.is_integer() else speed_val}</text>\n')
svg_out.write(' </g>\n')
# Layer polyline
points = []
for i in range(len(layer_nkv)):
points.append(f"{scale_x(layer_nkv[i]):.1f},{scale_y(layer_speed[i]):.1f}")
svg_out.write(f' <!-- layer -->\n')
svg_out.write(f' <polyline fill="none" stroke="{layer_color}" stroke-width="2.5" points="{" ".join(points)}"/>\n')
# Graph polyline
points = []
for i in range(len(graph_nkv)):
points.append(f"{scale_x(graph_nkv[i]):.1f},{scale_y(graph_speed[i]):.1f}")
svg_out.write(f' <!-- graph -->\n')
svg_out.write(f' <polyline fill="none" stroke="{graph_color}" stroke-width="2.5" points="{" ".join(points)}"/>\n')
# Legend
svg_out.write(' <g font-family="Arial, sans-serif" font-size="11">\n')
svg_out.write(f' <rect x="540" y="45" width="12" height="3" fill="{layer_color}" rx="1"/>\n')
svg_out.write(' <text x="558" y="50" fill="#212529">layer</text>\n')
svg_out.write(f' <rect x="660" y="45" width="12" height="3" fill="{graph_color}" rx="1"/>\n')
svg_out.write(' <text x="678" y="50" fill="#212529">graph</text>\n')
svg_out.write(' </g>\n')
svg_out.write('</svg>\n')
PYEOF
if [ $? -eq 0 ]; then
mv -f "/tmp/${output_file}" "$output_file"
echo " Done: $output_file ($(wc -l < /tmp/layer_${chart_type}.dat) data points, max_kv=$(tail -1 /tmp/layer_${chart_type}.dat | awk '{print $1}'))"
fi
}
# Generate both charts
create_chart decode
create_chart prefill
rm -f /tmp/layer_*.dat /tmp/graph_*.dat 2>/dev/null
ls -lh decode.svg prefill.svg 2>/dev/null || echo "SVG files not created"
echo |
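The column indices used in the awk extraction above (field 4 = N_KV, field 6 = S_PP t/s, field 8 = S_TG t/s, after stripping spaces) can be sanity-checked against a sample pipe-separated sweep-bench table row (the numbers below are made up for illustration):

```python
# Hypothetical llama-sweep-bench table row: | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
row = "|  4096 |   1024 |   4096 |  12.34 | 331.98 |  25.67 |  39.89 |"

# Splitting on "|" yields an empty field 0, so awk's 1-indexed $4/$6/$8
# correspond to Python indices 3/5/7.
fields = [f.replace(" ", "") for f in row.split("|")]
n_kv, s_pp, s_tg = fields[3], fields[5], fields[7]
print(n_kv, s_pp, s_tg)  # 4096 331.98 39.89
```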
|
Thank you for these. So, it looks like for short context one gains a bit by using 2 GPUs for split mode "graph". But for long context having all 3 GPUs in the graph split clearly helps. |
|
Does that mean the more GPUs, the better the overall performance (with the exception of short contexts of 12k or so, as shown above)? Will it potentially scale to 4 or 8 GPUs? Can we expect such support for other models like DeepSeek and K2 in the future? |
It is a matter of balance between reducing computation time through parallelism and the added latency for synchronization and copying data between GPUs. We have seen that for some people sm "graph" is already slower with 2 GPUs. In your case, 3 GPUs become better at a context length short enough to be useful in practice. Where the break-even point for 4 GPUs will be depends on the system. Btw, when you get the 4-GPU box ready, we can use tailscale for me to log in remotely and experiment when you are not using it. Concerning the K2 and DeepSeek architecture: that is much harder, with a hard-to-predict outcome. I wrote more about that here. DeepSeek's self-attention is designed to have slower performance degradation with increasing context length, but that makes it much harder to parallelize. I have done split mode "graph" for LLaMA and derivatives, Qwen3-MoE, and GLM-4.5/4.6/Air. It is not hard to add other architectures; I just need a request from a user willing to test. I cannot download and test models for every arch supported by |
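The trade-off described above can be illustrated with a toy model (all numbers below are made up for illustration, not measurements): per-token compute time shrinks as work is divided across GPUs, while synchronization/copy overhead grows with GPU count, producing a break-even device count:

```python
def step_time(work_ms, sync_ms, n_gpus):
    """Toy model: compute is divided across GPUs, but each extra GPU adds
    a fixed synchronization/data-exchange cost."""
    return work_ms / n_gpus + sync_ms * (n_gpus - 1)

# Hypothetical numbers: 12 ms of compute per token, 1.5 ms sync per extra GPU.
times = {n: step_time(12.0, 1.5, n) for n in (1, 2, 3, 4)}
best = min(times, key=times.get)
print(times)  # {1: 12.0, 2: 7.5, 3: 7.0, 4: 7.5}
print(best)   # 3 -- adding a 4th GPU makes it slower again
```

With these particular (invented) constants the optimum is 3 GPUs; on a system with faster interconnect (smaller sync cost) the optimum shifts to more GPUs, which is why the break-even point is system-dependent.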
Okay, good to know. Yeah, I was thinking of creating some pam.d hook so everything dumps to storage and shuts down when you log in, etc. But man, it would take some time to build the water cooling properly. In the meantime I have a bunch of risers lying around; I was thinking of using them to build a temporary quad-GPU rig while we are waiting for the water cooling, etc. |
|
So, as an aside, I have revisited NCCL splits in exllama v3. One core is used per process, utilization is 99%, and the cards consume about 240 W each. On Mistral 123B, I get over 450 t/s prompt processing and about 16 t/s output. The transfers reach about 5-6 GB/s on all GPUs, so I am most likely not constrained by PCIe traffic. I'll have to test llama-70b and see if I get similar speeds to what you all got for prompt processing, and do 2x vs 4x, etc. Apples to oranges, I know, but my point is that it is possible and a SW limitation. If magik's new rig is also single-proc/single-NUMA, the offer to test still stands. Tailscale let me log in with GitHub, so if you guys find a way to share that, I'm open to it whenever you get the urge to play with NUMA. |
Yes, it is. It's an AMD Threadripper PRO 3995WX. |
|
Without having a corresponding performance data point with |
|
Yep, that's the plan. I have 70b llamas I can run. |
I'm dying to have the results of these benchmarks... |
Air-cooled 4xRTX 3090 perhaps could be relevant in this situation? ( #1029 (comment) ) |
|
Yes, sure, the 4x3090 box is interesting to play with. But what I'm extremely curious to know right now is the speed of The 4x3090 box is ready? Give me instructions on what I need to do to log in, along with time slots when I can use it. |
Well ... actually, no. :) But all the hardware for it is just lying around, hence the question. |
|
So, a quick question. I have an EXL3 Qwen-72B and a GGUF Qwen-72B of similar file size. Does graph mode work for that, or only GLM and LLaMA-2? There's also no sweep bench in exllama, but I can take chats at various context lengths and reroll with both. I have 1k, 2k-4k, and 22k conversations I can have it crunch from a fresh KV cache. I want to take out as many variables as possible without downloading the same model twice, or you'll be waiting a while. |
|
Split mode "graph" does not work for a dense Qwen model (is it Qwen2 or Qwen3?). It doesn't take much to make it work, but you can start by comparing speed between EXL3 with TP and |
You want to run the greatest EXL, no? Else tomorrow you will tell me "Oh, we only compared with EXL2, but EXL3 is so much better". Or you will tell me "Oh, but we only compared against slow EXL3 TP, but the fast EXL3 TP is so much faster". |
|
Ok, I ran what I have and here are the results. The Kimi was their original reasoning version, so sometimes it schizos and starts doing that. I don't think anyone has supported the new Cohere here, or I'd do the Agatha vs Agatha for a closer test with LARGE models. 4xGPU Tess-v2.5.2-Qwen2-72B-GGUF 50.7 GB Kimi-Dev-72B-EXL3 5.0bpw 47.3 GB No TP Native TP NCCL TP |
|
I see you use But apart from this, what I see is
So, in other words, your system is pretty bad at TP, but with EXL3 being exceptionally slow without TP, you see a large benefit from TP there. Or is your interpretation of these results different? |
|
You could say that... but it's not true for all model architectures. I also wouldn't say my system "sucks" at TP. Many backends don't support it. PCIe 3.0 with fake NVLink and a Xeon only gets you so far. Apparently Cohere is supported, and this is a model that actually needs the GPUs. I should add that, yeah, I only use Q8 cache on both. Sometimes Q6, which finally made an entrance here. Agatha-111B-v1-Q4_K_L 67.9 GB Agatha-111B-v1-EXL3 5.0 BPW 76.1 GB Native TP NCCL TP |
|
|
I didn't see anyone use it on mainline, and I remember the quirk where the key had to be at least q8_0 or the PPL would start to get bad. I think the value could be Q4_0 per the original PR. I can say llama.cpp in general has come far in terms of speed, especially this fork. If I had any Mistrals in GGUF I could test that too. |
This is not |
|
His comment is the only reason I knew it existed here. |
Are there any LLMs you want pre-downloaded? The machine in question has 4 TB of storage. [EDIT]: preliminary results. The 4 GPUs have been connected via risers (one GPU with two risers, for an extra extension). I should probably install some Noctua industrial fans etc. Ha. [EDIT2]:
So basically it's ready. Let's agree on the access protocol. I suggested somewhere around here that you could use something like Tailscale, or alternatively we can provide a VPS whose sole purpose is to provide a dedicated IP address, and from there you'll have another SSH session (tunnel) via some other means to the server in question.
I don't think it's necessary. I have a bunch of other boxes I can use in case you decide to use the quad-box. No worries at all. You'll have root access. In case something is bothering you (something is running, etc.) -- just terminate it. |
|
Cool to see that test. Your bandwidth is basically doubled, but our latencies are fairly comparable. There is another test, all-to-all, where the speeds fall much more. |
Can you drop your results (of the p2pBandwidthLatencyTest from the CUDA toolkit) here? |
|
Man.. I thought I had already pasted it in here but maybe not? aikitoria/open-gpu-kernel-modules@7c82991#r165839881 |


The main purpose of this PR is to allow empty splits when using split mode "graph". If, for instance, one has 4 GPUs and this leads to bad performance with split mode "graph", one could try using
to put all attention tensors, shared experts tensors, and the KV cache on GPUs 0 and 1, and then use GPUs 2 and 3 to offload the MoE tensors. In that case, tensor parallelism (and the corresponding synchronization plus data exchange overhead) for attention and shared experts involves only 2 GPUs, hopefully resulting in better performance.
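As an illustration of how such an empty split might look as numbers (the `split_shares` helper and the `1, 1, 0, 0` values below are hypothetical, chosen only to mirror the 4-GPU example above; they are not the actual flag value, which is not shown here):

```python
def split_shares(ts):
    """Normalize a '-ts'-style proportion list into per-GPU fractions.
    A zero entry means that GPU gets no share of the split tensors."""
    total = sum(ts)
    return [x / total for x in ts]

# Hypothetical 4-GPU case: attention/shared experts only on GPUs 0 and 1,
# GPUs 2 and 3 left empty (free for offloaded MoE tensors).
print(split_shares([1, 1, 0, 0]))  # [0.5, 0.5, 0.0, 0.0]
```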
@Ph0rk0z Can you try this? Hopefully I have not forgotten to add checks for empty splits everywhere they are needed. Perhaps it is also worth trying on @magikRUKKOLA's system with 3 GPUs.