llama-fit-params: lower ctx size for multi GPU #18101

Merged
JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:llama-fp-fix-multi-gpu-ctx
Dec 16, 2025

Conversation

@JohannesGaessler
Contributor

Fixes #18097 .

For multiple GPUs the context size reduction on master seems to be too optimistic:

// for multiple devices we need to be more conservative in terms of how much context we think can fit:
//   - for dense models only whole layers can be assigned to devices
//   - for MoE models only whole tensors can be assigned to devices, which we estimate to be <= 1/3 of a layer
//   - on average we expect a waste of 0.5 layers/tensors per device
//   - use slightly more than the expected average for nd devices to be safe

@verygreen

This works, thank you very much.

The new suggestion is

-c 176384 -ngl 94 -ts 10,8,7,34,35 -ot blk\.9\.ffn_(gate|down).*=CUDA1,blk\.17\.ffn_(up|gate|down).*=CUDA2,blk\.58\.ffn_(gate|down).*=CUDA4

which fits with some space to spare. I understand the VRAM waste is unavoidable because we cannot really split things on a byte boundary.

@JohannesGaessler
Contributor Author

I didn't ping you yet because I wanted to rebase on top of the other fixes first, but glad to hear it's working regardless.

@JohannesGaessler JohannesGaessler merged commit 9dcac6c into ggml-org:master Dec 16, 2025
67 of 71 checks passed
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026

Successfully merging this pull request may close these issues.

Misc. bug: fit-params with no context sometimes overshoots VRAM
