llama-fit-params: lower ctx size for multi GPU #18101

Merged
JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:llama-fp-fix-multi-gpu-ctx
Dec 16, 2025

Conversation

@JohannesGaessler
Contributor

Fixes #18097 .

For multiple GPUs the context size reduction on master seems to be too optimistic:

// for multiple devices we need to be more conservative in terms of how much context we think can fit:
//   - for dense models only whole layers can be assigned to devices
//   - for MoE models only whole tensors can be assigned to devices, which we estimate to be <= 1/3 of a layer
//   - on average we expect a waste of 0.5 layers/tensors per device
//   - use slightly more than the expected average for nd devices to be safe

@verygreen

This works, thank you very much.

The new suggestion is

-c 176384 -ngl 94 -ts 10,8,7,34,35 -ot blk\.9\.ffn_(gate|down).*=CUDA1,blk\.17\.ffn_(up|gate|down).*=CUDA2,blk\.58\.ffn_(gate|down).*=CUDA4

which fits with some space to spare. I understand the VRAM waste is unavoidable because we cannot really split things on a byte boundary.

@JohannesGaessler
Contributor Author

I didn't ping you yet because I wanted to rebase on top of the other fixes first, but glad to hear it's working regardless.

@JohannesGaessler JohannesGaessler merged commit 9dcac6c into ggml-org:master Dec 16, 2025
67 of 71 checks passed
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026

Successfully merging this pull request may close these issues.

Misc. bug: fit-params with no context sometimes overshoots VRAM
