
llama: offload output layer to GPU first#18148

Merged
JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:llama-offload-output-first
Dec 18, 2025
Conversation

@JohannesGaessler (Contributor) commented Dec 17, 2025

Fixes #18119 .

As of right now llama.cpp first moves all repeating layers from RAM to VRAM and only then moves the non-repeating output layer. However, it seems to be better to instead move the output layer first and the repeating layers afterwards. Specifically:

  • The memory is allocated as contiguous blocks that grow monotonically as --n-gpu-layers is increased, so the backend scheduler needs fewer splits per graph evaluation.
  • It is more memory efficient to move the largest tensor (the output tensor) first because that way the inputs for the splits in the backend scheduler are smaller.
  • As of right now llama_params_fit does not correctly handle the case where there is sufficient total VRAM for a dense model but the layers need to be rebalanced relative to the initial guess that llama.cpp produces with -fit off. This case becomes much easier to handle if the output layer is offloaded first, because then there is no sudden jump in memory allocation at the last layer.
  • Presumably it will be easier to take advantage of "sampling : add support for backend sampling" (#17004) once it's been merged.
  • Long-term, offloading the output layer first will be more convenient for training with partial GPU layers.
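
To illustrate the ordering change, here is a minimal sketch (a hypothetical offload_plan helper, not the actual llama.cpp code) of which tensors land in VRAM for a given -ngl budget under the two orderings:

```cpp
// Hypothetical sketch of the two offload orderings discussed above.
// With a budget of ngl GPU layers:
//  - output last (old): repeating layers blk.0, blk.1, ... go to VRAM first;
//    the output layer is only offloaded once all repeating layers fit.
//  - output first (new): the output layer is offloaded first, then the
//    repeating layers, so VRAM use grows monotonically with --n-gpu-layers.
#include <string>
#include <vector>

std::vector<std::string> offload_plan(int n_layer, int ngl, bool output_first) {
    std::vector<std::string> vram; // tensors placed in VRAM, in offload order
    int budget = ngl;
    if (output_first && budget > 0) {
        vram.push_back("output");
        budget--;
    }
    for (int il = 0; il < n_layer && budget > 0; il++, budget--) {
        vram.push_back("blk." + std::to_string(il));
    }
    if (!output_first && budget > 0) {
        vram.push_back("output");
    }
    return vram;
}
```

For example, with -ngl 1 the old ordering offloads only blk.0, while the new ordering offloads the output tensor, which is typically the largest single tensor.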

Benchmark

Because this PR changes the memory use for a given -ngl value I collected data like this:

  • For each -ngl value, run llama-perplexity and note down the "self" memory use as the memory use for a context of size 512.
  • For each -ngl value, run llama-bench and note down the performance numbers.
  • Plot performance vs. memory use.
  • Repeat for output layer first and output layer last.
Results (plots): output_last_first_pp512, output_last_first_tg128 (performance vs. memory use)

For pp512 there is basically no change. For tg128, however, there is a sizable improvement in performance at a given VRAM use.

@ggerganov (Member) left a comment


Likely fixes #18107

@JohannesGaessler merged commit 57c1e05 into ggml-org:master Dec 18, 2025
69 of 71 checks passed
LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Dec 19, 2025
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
@aaricantto mentioned this pull request Jan 16, 2026
@NeoZhangJianyu (Contributor) commented:

@JohannesGaessler
Hi,
This PR impacts issue #9241 (comment).
It was narrowed down to this PR by @aaricantto.

Could you help check it? I guess it will impact all backends.

Thank you!

@JohannesGaessler (Contributor, Author) commented:

This PR did impact all backends, but if you cannot move the output layer to VRAM, that is a SYCL bug and needs to be fixed regardless. I don't even have any Intel hardware, so I don't know how I would be of any help with that.

@NeoZhangJianyu (Contributor) commented:

> This PR did impact all backends, but if you cannot move the output layer to VRAM, that is a SYCL bug and needs to be fixed regardless. I don't even have any Intel hardware, so I don't know how I would be of any help with that.

OK, I will check and fix it.

Thank you!


Development

Successfully merging this pull request may close these issues.

Misc. bug: Multi-GPU fit producing weird results

3 participants