--numa mirror: mirror model weights to every NUMA node in the system #16000

dbsanfte wants to merge 25 commits into ggml-org:master

Conversation
- Achieved 5% inference speed improvement (14.6 -> 15.3 t/s)
- Clean explicit NUMA setup during model loading
- Ultra-minimal hot path with thread-local NUMA node access
- Working NUMA mirrors for all model weights
- Performance: text generation improved, prompt processing needs optimization

Performance results (Qwen3-30B-A3B):
- Text generation: 14.6 -> 15.3 t/s (+5% improvement)
- Prompt processing: 176 -> 152 t/s (14% regression - needs investigation)

Technical implementation:
- tensor_data(): O(1) NUMA-aware access via thread-local ggml_current_numa_node
- tensor_set_data_with_numa_mirrors(): Explicit NUMA setup for model weights
- NUMA coordinator: thread binding and memory locality
- Clean separation: model loading (explicit setup) vs inference (fast access)
Physical core detection was very broken in
Pretty close. There's still some overhead - I have too many OpenMP regions and need to consolidate them. But the best I got from llama-cpp even with the mirror fix was ~85GB/s. I regularly see 165+ with Llaminar.
It still uses GGUF? I guess I'll try it when it's out.
Yup it uses gguf. I might add other formats too, it's not really that hard. It all gets repacked for AVX512-VNNI anyway.
@dbsanfte will your system be able to model expert parallelism? It doesn't seem to fit well in llama since NUMA nodes would need to be a first-class concept (addressable at the model graph level) and that doesn't seem to be the case as I understand it.
Yeah. I started with Qwen 2.5, but I'll do Qwen3 and Qwen3-MoE next. I designed it around a reusable pipeline system so it's extremely flexible.
I've been looking for further NUMA optimizations and found this thread. My setup is a dual 5th-gen Xeon server with 2 nodes per socket (768GB DDR5-5600 total) and a single RTX Pro 6000, which is somewhat similar to your setup. I had always been using … I was curious about your suggestion to use … I'm wondering if this is something you noticed as well. Perhaps the mmap-based NUMA "migration" results in a less efficient memory configuration for host->device offload during PP?
I also found this on my dual xeons. One thing that might help further is to use all your threads (ie: hyperthreads included). This gained me back some of the lost PP for
Yeah, I can't figure this out either:
I suspected that it could be an alignment problem due to the weird 6.5GB/s I can get occasionally, but alas I tried changing the stock 32-byte alignment inside of … So now, all I can think is that the … I get exactly the same problem on …
My guess is that the mmap+distribute "migrate" trick puts the weights all over the place in a way that makes it harder to do host->device offload efficiently, but that's just me throwing stuff out there lol.

The best overall setup I've gotten so far is interleave+distribute+no mmap on 2 nodes on the same socket (if the model fits in 384GB RAM). During PP the GPU can hit 100% util and I can get like 50+ GB/s across PCIe, and the TG speed is fair. It just feels bad not using the other socket lol.

With the drop_caches+mmap+distribute trick, I could only get like 8 GB/s across PCIe and GPU like 5-15% util during PP (but excellent TG!). Interleave+distribute+no mmap across all nodes is… acceptable for models that are too big for a single socket, but for smaller models it's still a performance loss in TG for me.

I asked local Minimax last night to investigate whether multi-NUMA expert parallelism would be a viable solution but I think it got lost in the weeds lol.
Mirroring is just a kludge. It would be better if local threads only accessed local memory and worked in parallel. I have even turned off NUMA in the system and it behaves identically to interleave/distribute, but the QPI still gets saturated. When calculations are moved to the GPU they might be crossing QPI and then PCIe as well. My PCIe3 all-to-all speeds are similar, around 6.5-7GB/s. I'm not sure that is a bottleneck. So what am I trying to say... that there's no way to fix this with small changes or settings.
It's not just weights either. It's the KV cache, residual buffer, workspaces for kernels, etc. All of it needs to be bound to the local NUMA node. KV cache and weights need to be sharded across sockets and you need to do TP with a cross-socket allreduce. I don't want to say that's impossible to do in llama.cpp, but it was way faster just to write all the infrastructure myself in my own engine.

Just working on CUDA and ROCm now btw. I had to design a graph orchestrator and tune the CUDA and HIP kernels. I have cross-vendor TP working (CUDA and HIP in the same build, doing P2P TP allreduce via PCIe). Getting closer.
Do either of you who are bottlenecked by the PCI-E bus have enough RAM to fit ~2x the expert weights in? I have narrowed it down to misalignment, but can't seem to get a …
This would be normal NUMA distribute + an extra copy? I can fit it, e.g. for K2 Q4_X. That is, unless the extra copy has to be all on one node; then it won't fit.
Depends on the model. I have 192GB per node but only one node has the GPUs.
So only tested on …

ggml/src/ggml-cuda/ggml-cuda.cu, Line 627 (in 9a5f577), needs changing to this:

```cpp
GGML_CALL static void ggml_backend_cuda_buffer_set_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
    ggml_backend_cuda_buffer_context * ctx = (ggml_backend_cuda_buffer_context *)buffer->context;

    ggml_cuda_set_device(ctx->device);
    if (strstr(tensor->name, "_exps") != nullptr) {
        static std::unordered_map<const void*, void*> pinned_cache;
        static std::mutex cache_mutex;

        void* pinned_copy = nullptr;
        {
            std::lock_guard<std::mutex> lock(cache_mutex);
            auto it = pinned_cache.find(data);
            if (it != pinned_cache.end()) {
                pinned_copy = it->second;
                //fprintf(stderr, "[CACHE HIT] %s: %.2f MB\n", tensor->name, size/1e6);
            } else {
                CUDA_CHECK(cudaHostAlloc(&pinned_copy, size, cudaHostAllocDefault));
                memcpy(pinned_copy, data, size);
                pinned_cache[data] = pinned_copy;
                //fprintf(stderr, "[CACHE MISS] %s: %.2f MB\n", tensor->name, size/1e6);
            }
        }
        CUDA_CHECK(cudaMemcpyAsync((char *)tensor->data + offset, pinned_copy, size, cudaMemcpyHostToDevice, cudaStreamPerThread));
    } else {
        CUDA_CHECK(cudaMemcpyAsync((char *)tensor->data + offset, data, size, cudaMemcpyHostToDevice, cudaStreamPerThread));
    }

    CUDA_CHECK(cudaStreamSynchronize(cudaStreamPerThread));
}
```

To use with … but also make it skip this: … like this:

```cpp
//if (params.only_active_experts) {
//    LLAMA_LOG_INFO("XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload\n");
//    ggml_backend_sched_set_only_active_experts(ctx->sched, true);
//}
```

(or else it's impossible to cache whole tensors!)

You should run as you do to get maximum NUMA throughput, eg:

```bash
# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null

# Drop caches (if first time)
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

./llama-server --numa distribute --threads "$(nproc)"
```

or whatever you've found works best for you...

The first batch will take ages and you'll see your memory use slowly grow to around double normal. After this you should get full PCI-E throughput until you unload the model and have to do the caching again.

It's a bit of a crappy method, but I can't seem to get it to have properly aligned memory any other way (eg: creating aligned tensors in GGUF and memory-mapping to a 64k or 2MB boundary doesn't work...).
I could probably do that with Deepseek (since that was already fitting on a single node for me), but not for Kimi. If I'm understanding correctly, this would allow us to get the speed benefits of mmap migration without killing the disaggregated prefill speed, at the cost of double the RAM use?
Yeah, I only get 4.5-4.7GB/s through the PCI-E bus during offloaded PP (and randomly 7.5GB/s about 10-15% of the time, for who knows what reason!). By doing this I get 12.7GB/s through the PCI-E bus during offloaded PP (same as the new "Direct IO" method, but sadly with that I get halved TG and PP for non-offloaded use, as everything ends up on a single node and badly laid out for NUMA threads). This essentially lets you get the best NUMA TG, the best (non-offloaded) NUMA PP, and the best offloaded PP, at the cost of double the RAM use.
Maybe it's worth me trying the next time I run a model that I have room for the double memory cost for lol; been using Kimi K2.5 Q4_X recently, which is too large for that. When using both sockets, I noticed I was getting better TG with …
Here's Kimi K2.5 on ik_llama with your patch. tl;dr - it's effective.

This should really be done with a sweep tool. Usually I'm only interested in decode-phase and the CPU side of MoE (i.e. constant time) so I don't have a go-to sweep script. But I can ensure the system is warm, runs are the same, and use unique prefixes so they don't hit KV-cache.

This is a 9115, which only has 2x CCD. I'm not sure what would be expected wrt. PCIe transfers. Improving the overhead of PCIe transfers for offload would benefit smaller batches more, so I chose 4k for smallish.

Bonus: sglang/kt-kernel INT4, which has been my running system recently since it supports K2.5 vision and has comparable perf.
Yeah, don't worry about it! It opens up the hope that there may be a way to align the single copy of the tensor though. It also proves that the QPI link between nodes isn't a bottleneck, as I feared might be the case when "Direct IO" got the full PCI-E speed but with everything on a single node anyway... I've actually turned on Sub-NUMA Clustering for my Xeon Gold 6248 CPUs (I get around 5-7% better TG with this), so the second copy is well and truly split up over nodes.
Will it double the entire weights of the model, or just the portion that is in sysram? For instance with Qwen I have only some 60GB in sysram; that I can comfortably double within the same node. And just to clarify: do we need only your code changes, or to merge the PR and then do the code changes?
Yeah, I find it hard to get stable results on NUMA, but from my quick tests yesterday running the full Kimi K2.5 (ie: hacked QAT matching …).

My machine for reference:

```bash
#!/bin/bash

host_address=192.168.1.1
port_number=8080

# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null

# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    echo "Dropping caches..."
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi

# Run the main command
export CUDA_VISIBLE_DEVICES=0,1
~/ik_llama.cpp/build/bin/llama-server \
    --host "$host_address" \
    --port "$port_number" \
    --alias "Kimi-K2.5" \
    --model ~/models/gguf/Kimi-K2.5-Q4_0_lloyd_max.gguf \
    --log-disable \
    --jinja \
    --chat-template-file ~/models/Kimi-K2.5.jinja \
    --mla-use 3 \
    --flash-attn 1 \
    --n-gpu-layers 99 \
    --tensor-split 24,37 \
    --numa distribute \
    --threads "$(nproc)" \
    --override-tensor exps=CPU \
    --ctx_size 262144 \
    --cache-type-k f16 \
    --batch-size 24576 \
    --ubatch-size 24576 \
    --attention-max-batch 2048 \
    --cuda-params offload-batch-size=64 \
    --parallel 1 \
    --no-cont-batching \
    --cache-ram 65536 \
    --temp 1.0 \
    --min-p 0.01
```

NOTE: To use the 24k batch size I'm having to use ikawrakow/ik_llama.cpp#1191 (ie: …).

Also in case anybody else wants to pick up on this, I found that …
It will just double anything that gets offloaded from RAM to VRAM that gets matched by the
No, it's a completely self-contained hack using a static:

```cpp
static std::unordered_map<const void*, void*> pinned_cache;
static std::mutex cache_mutex;
```

This is generally a horrible thing to do, but the code is really just to prove it works for now. I haven't actually tested it on …
I probably won't have time to do much more on this as my back is killing me from PC use and I need a few days' break... Feel free to pick up on this! My next steps would be:

This:

```cpp
size_t alignment = 64 * 1024; // 64KB
void *temp = mmap(NULL, file->size() + alignment, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (temp == MAP_FAILED) {
    throw std::runtime_error(format("mmap failed: %s", strerror(errno)));
}
uintptr_t aligned = ((uintptr_t)temp + alignment - 1) & ~(alignment - 1);
munmap(temp, file->size() + alignment);
addr = mmap((void*)aligned, file->size(), PROT_READ, flags | MAP_FIXED_NOREPLACE, fd, 0);
if (addr == MAP_FAILED) {
    throw std::runtime_error(format("mmap failed: %s", strerror(errno)));
}
fprintf(stderr, "***** addr: %p *****\n", addr);
```

can be used to modify Line 430 (in 5fa1c19) so the GGUF file gets mmapped to a 64k boundary, eg: (confirmed working, as 0x10000 = 64k).

There is a lot that can likely be tried like this.
Also the thread ordering of this: … might work better for page faults when, say, … The latter is likely to fault the 4k pages of the huge 3GB tensors over alternate nodes, etc... Worth investigating more IMO.
Interesting to mess with mmap, because generally I disable it to get faster speeds. I also put exp layers on the GPU in addition to sysram, and use only physical cores. Disabling HT caused worse TG performance.
Yeah, I think every setup needs lots of tweaking - I have some other older machines with the previous generation of Xeons in (dual … I've tweaked mine to the nth degree now and using …
It's actually quite usable in
Do you see a difference in your pcm-memory benchmarks? I will try this stuff when I download Stepfun. I think it's small enough, around 110-120GB.
One thing I've learned is that it's best to turn off memory thermal management in the BIOS.
In my server that doesn't exist. So I tried both patches, for affinity and tensor loading, with Stepfun; they didn't seem to make any difference on speed one way or the other. Is this tied to a specific model, or to having to do full exp=CPU? In my setups I put some layers on the GPUs to help prompt and textgen. Stepfun gives me about 400 PP and 40 t/s at 4KL.

This PR adds a new `--numa mirror` option which mirrors model weights to each NUMA node on the system, and uses a thread-local var in the OMP threadpool to select the correct mirror copy local to the thread at runtime, to eliminate cross-socket traffic.

Build instructions:
To test:
Test system is a two-socket Xeon 6238R Cascade Lake, with 768GB of DDR4-2933 (6 channels per socket).
Without `--numa mirror`:

With `--numa mirror`:

Intel PCM tool during mirror inference showing both sockets using local mem:
There's still a bit of cross-socket traffic (5%) because only model weights are mirrored, not tensors created at inference time. I'll play with that, maybe mirroring those aggressively will help too, or maybe not. Right now anything created at inference time just gets set to live on Node 0.