graph : avoid huge warm-up graphs for MoE models #14753
Force-pushed from 4feb0bf to 4c1bacb.
Force-pushed from 4c1bacb to 033b306.
src/llama-context.cpp (outdated):

```diff
 uint32_t llama_context::graph_max_nodes() const {
-    return std::max<uint32_t>(65536u, 5u*model.n_tensors());
+    return std::max<uint32_t>(1024u, 6u*model.n_tensors());
 }
```
We should probably bump this up to `8u*model.n_tensors()` just to be safe.
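For reference, a sketch of what the formula would look like after that bump, combining the 1024 floor from the diff above with the 8x factor mentioned in the follow-up commit below (the exact merged code may differ):

```cpp
// sketch: 1024 floor taken from the diff above, 8x factor from the
// follow-up commit; combining them like this is an assumption
uint32_t llama_context::graph_max_nodes() const {
    return std::max<uint32_t>(1024u, 8u*model.n_tensors());
}
```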
If I understand correctly, the motivation of this change was to ensure that all weights are loaded into memory when using mmap on a NUMA system. This would effectively revert #11571.
I think the experts are still read during warm-up (see Lines 867 to 870 in 033b306). The change only removes the summation nodes that sum together the obtained results for each expert. Those do not involve reading data from the model, but they contribute a large number of graph nodes. For reference, here is the relevant snippet (Lines 512 to 514 in 033b306).

Edit: fixed wording at the start for clarity
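To make the node-count argument concrete, here is an illustrative back-of-the-envelope calculation (the layer and expert counts are assumptions for a large MoE model, not numbers from the PR):

```cpp
// illustrative only: the sizes below are assumptions, not from the PR.
// During warm-up every expert is activated, so aggregating the per-expert
// results with a chain of GGML_OP_ADD nodes costs (n_expert - 1) extra
// graph nodes per MoE layer.
const int n_layer  = 90;                              // assumed layer count
const int n_expert = 128;                             // assumed expert count
const int extra_add_nodes = n_layer * (n_expert - 1); // ~11k nodes avoided
```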
fix regression during finetune on Llama-3.2-1B-F32: `GGML_ASSERT(cgraph->n_nodes < cgraph->size)` failed.

git bisect applying the most recent finetune (SGD) change showed that d498af3 (Georgi Gerganov, 2025-07-18 14:31:15 +0300, "graph : avoid huge warm-up graphs for MoE models" ggml-org#14753), which greatly decreased graph_max_nodes, has been responsible for finetune failing on reasonably sized models for the past two months. This partially reverts the decrease (maybe larger models still fail).

note: env LLAMA_SET_ROWS=0 is also needed, or else `GGML_ASSERT(!node->view_src || node->op == GGML_OP_CPY || node->op == GGML_OP_VIEW || node->op == GGML_OP_RESHAPE || node->op == GGML_OP_PERMUTE || node->op == GGML_OP_TRANSPOSE)` fails (the node->op in question is indeed a rows op). Unfortunately a git revert of 8a4280c (Georgi Gerganov, 2025-08-28 12:27:02 +0300, "kv-cache : remove LLAMA_SET_ROWS checks" ggml-org#15505) is not straightforward, so this branch is behind that.
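A minimal sketch of the kind of partial revert this comment describes, assuming the fix is simply to raise the floor in `graph_max_nodes()` back toward its pre-d498af3 value (the numbers here are illustrative, not the actual patch):

```cpp
// illustrative partial revert (values are assumptions, not the actual patch):
// finetuning builds forward + backward (gradient) nodes, so it needs far
// more graph headroom than inference-only warm-up graphs.
uint32_t llama_context::graph_max_nodes() const {
    return std::max<uint32_t>(65536u, 8u*model.n_tensors());
}
```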
* graph : avoid huge warm-up graphs for MoE models (ggml-ci)
* cont : bump max nodes to 8x model tensors
Just hot-loading the experts for matrix multiplication is enough to warm up the caches. No need to add extra GGML_OP_ADD nodes for aggregating the results.
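A minimal sketch of that idea using the public ggml API (the function name, tensor shapes, and parameters are assumptions for illustration; this is not the PR's actual code):

```cpp
#include "ggml.h"

// Sketch: during warm-up, touch every expert with a mul_mat node and expand
// each result directly into the graph, instead of chaining GGML_OP_ADD nodes
// to aggregate them. Names and shapes here are hypothetical.
static void build_moe_warmup(struct ggml_context * ctx, struct ggml_cgraph * gf,
                             struct ggml_tensor ** expert_w, int n_expert,
                             struct ggml_tensor * cur) {
    for (int e = 0; e < n_expert; ++e) {
        // reading the expert weights is what pages them in / heats the caches
        struct ggml_tensor * out = ggml_mul_mat(ctx, expert_w[e], cur);
        // no aggregation: each output is its own graph sink, saving
        // (n_expert - 1) GGML_OP_ADD nodes per layer in the warm-up graph
        ggml_build_forward_expand(gf, out);
    }
}
```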