graph : avoid huge warm-up graphs for MoE models #14753
Force-pushed from 4feb0bf to 4c1bacb.
Force-pushed from 4c1bacb to 033b306.
src/llama-context.cpp (outdated):

```diff
 uint32_t llama_context::graph_max_nodes() const {
-    return std::max<uint32_t>(65536u, 5u*model.n_tensors());
+    return std::max<uint32_t>(1024u, 6u*model.n_tensors());
 }
```
We should probably bump this up to `8u*model.n_tensors()` just to be safe.
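For reference, a sketch of what the formula would look like after that bump, combining the 1024 floor from the diff above with the 8x factor mentioned in the follow-up commit below (the exact merged code may differ):

```cpp
// sketch: 1024 floor taken from the diff above, 8x factor from the
// follow-up commit; combining them like this is an assumption
uint32_t llama_context::graph_max_nodes() const {
    return std::max<uint32_t>(1024u, 8u*model.n_tensors());
}
```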
If I understand correctly, the motivation of this change was to ensure that all weights are loaded into memory when using mmap on a NUMA system. This would effectively revert #11571.
I think the experts are still read during warm-up (see Lines 867 to 870 in 033b306). The change only removes the summation nodes that sum together the obtained results for each expert. Those do not involve reading data from the model, but they contribute a large number of graph nodes. For reference, here is the relevant snippet (Lines 512 to 514 in 033b306).

Edit: fixed wording at the start for clarity
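To make the node-count argument concrete, here is an illustrative back-of-the-envelope calculation (the layer and expert counts are assumptions for a large MoE model, not numbers from the PR):

```cpp
// illustrative only: the sizes below are assumptions, not from the PR.
// During warm-up every expert is activated, so aggregating the per-expert
// results with a chain of GGML_OP_ADD nodes costs (n_expert - 1) extra
// graph nodes per MoE layer.
const int n_layer  = 90;                              // assumed layer count
const int n_expert = 128;                             // assumed expert count
const int extra_add_nodes = n_layer * (n_expert - 1); // ~11k nodes avoided
```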
fix regression during finetune on Llama-3.2-1B-F32: `GGML_ASSERT(cgraph->n_nodes < cgraph->size)` failed.

git bisect applying the most recent finetune (SGD) change showed that d498af3 (Georgi Gerganov, 2025-07-18 14:31:15 +0300, "graph : avoid huge warm-up graphs for MoE models" ggml-org#14753), which greatly decreased graph_max_nodes, has been responsible for finetune failing on reasonably sized models for the past two months. This partially reverts the decrease (maybe larger models still fail).

note: env LLAMA_SET_ROWS=0 is also needed, or else `GGML_ASSERT(!node->view_src || node->op == GGML_OP_CPY || node->op == GGML_OP_VIEW || node->op == GGML_OP_RESHAPE || node->op == GGML_OP_PERMUTE || node->op == GGML_OP_TRANSPOSE)` fails (the node->op in question is indeed a rows op). Unfortunately a git revert of 8a4280c (Georgi Gerganov, 2025-08-28 12:27:02 +0300, "kv-cache : remove LLAMA_SET_ROWS checks" ggml-org#15505) is not straightforward, so this branch is behind that.
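A minimal sketch of the kind of partial revert this comment describes, assuming the fix is simply to raise the floor in `graph_max_nodes()` back toward its pre-d498af3 value (the numbers here are illustrative, not the actual patch):

```cpp
// illustrative partial revert (values are assumptions, not the actual patch):
// finetuning builds forward + backward (gradient) nodes, so it needs far
// more graph headroom than inference-only warm-up graphs.
uint32_t llama_context::graph_max_nodes() const {
    return std::max<uint32_t>(65536u, 8u*model.n_tensors());
}
```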
* graph : avoid huge warm-up graphs for MoE models (ggml-ci)
* cont : bump max nodes to 8x model tensors
Just hot-loading the experts for matrix multiplication is enough to warm up the caches. No need to add extra GGML_OP_ADD nodes for aggregating the results.
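A minimal sketch of that idea using the public ggml API (the function name, tensor shapes, and parameters are assumptions for illustration; this is not the PR's actual code):

```cpp
#include "ggml.h"

// Sketch: during warm-up, touch every expert with a mul_mat node and expand
// each result directly into the graph, instead of chaining GGML_OP_ADD nodes
// to aggregate them. Names and shapes here are hypothetical.
static void build_moe_warmup(struct ggml_context * ctx, struct ggml_cgraph * gf,
                             struct ggml_tensor ** expert_w, int n_expert,
                             struct ggml_tensor * cur) {
    for (int e = 0; e < n_expert; ++e) {
        // reading the expert weights is what pages them in / heats the caches
        struct ggml_tensor * out = ggml_mul_mat(ctx, expert_w[e], cur);
        // no aggregation: each output is its own graph sink, saving
        // (n_expert - 1) GGML_OP_ADD nodes per layer in the warm-up graph
        ggml_build_forward_expand(gf, out);
    }
}
```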