
llama: offload output layer to GPU first#18148

Merged
JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:llama-offload-output-first
Dec 18, 2025
Conversation

@JohannesGaessler (Contributor) commented Dec 17, 2025

Fixes #18119 .

As of right now llama.cpp first moves all repeating layers from RAM to VRAM and only then moves the non-repeating output layer. However, it seems to be better to instead move the output layer first and the repeating layers afterwards. Specifically:

  • The memory is allocated as contiguous blocks that grow monotonically as --n-gpu-layers is increased, so the backend scheduler needs fewer splits per graph evaluation.
  • It is more memory efficient to move the largest tensor (the output tensor) first because that way the inputs for the splits in the backend scheduler are smaller.
  • As of right now llama_params_fit does not correctly handle the case where there is sufficient total VRAM for a dense model but the layers need to be rebalanced relative to the initial guess that llama.cpp produces with -fit off. This case becomes much easier to handle if the output layer is offloaded first, because then there is no sudden jump in memory allocation at the last layer.
  • Presumably it will be easier to take advantage of "sampling : add support for backend sampling" (#17004) once it's been merged.
  • Long-term, offloading the output layer first will be more convenient for training with partial GPU layers.
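
To illustrate the ordering change, here is a minimal sketch (a hypothetical offload_plan helper, not the actual llama.cpp code) of which tensors land in VRAM for a given -ngl budget under the two orderings:

```cpp
// Hypothetical sketch of the two offload orderings discussed above.
// With a budget of ngl GPU layers:
//  - output last (old): repeating layers blk.0, blk.1, ... go to VRAM first;
//    the output layer is only offloaded once all repeating layers fit.
//  - output first (new): the output layer is offloaded first, then the
//    repeating layers, so VRAM use grows monotonically with --n-gpu-layers.
#include <string>
#include <vector>

std::vector<std::string> offload_plan(int n_layer, int ngl, bool output_first) {
    std::vector<std::string> vram; // tensors placed in VRAM, in offload order
    int budget = ngl;
    if (output_first && budget > 0) {
        vram.push_back("output");
        budget--;
    }
    for (int il = 0; il < n_layer && budget > 0; il++, budget--) {
        vram.push_back("blk." + std::to_string(il));
    }
    if (!output_first && budget > 0) {
        vram.push_back("output");
    }
    return vram;
}
```

For example, with -ngl 1 the old ordering offloads only blk.0, while the new ordering offloads the output tensor, which is typically the largest single tensor.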

Benchmark

Because this PR changes the memory use for a given -ngl value I collected data like this:

  • For each -ngl value, run llama-perplexity and note down the "self" memory use as the memory use for a context of size 512.
  • For each -ngl value, run llama-bench and note down the performance numbers.
  • Plot performance vs. memory use.
  • Repeat for output layer first and output layer last.
Results (plots): output_last_first_pp512, output_last_first_tg128 (performance vs. memory use)

For pp512 there is basically no change. For tg128, however, there is a sizable improvement in performance at a given VRAM use.

@ggerganov (Member) left a comment


Likely fixes #18107

@JohannesGaessler merged commit 57c1e05 into ggml-org:master Dec 18, 2025
69 of 71 checks passed
LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Dec 19, 2025
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
@aaricantto mentioned this pull request Jan 16, 2026
@NeoZhangJianyu (Contributor) commented:

@JohannesGaessler
Hi,
This PR impacts issue #9241 (comment).
It was narrowed down to this PR by @aaricantto.

Could you help check it? I guess it will impact all backends.

Thank you!

@JohannesGaessler (Contributor, Author) commented:

This PR did impact all backends, but if you cannot move the output layer to VRAM, that is a SYCL bug and needs to be fixed regardless. I don't even have any Intel hardware, so I don't know how I would be of any help with that.

@NeoZhangJianyu (Contributor) commented:

> This PR did impact all backends, but if you cannot move the output layer to VRAM, that is a SYCL bug and needs to be fixed regardless. I don't even have any Intel hardware, so I don't know how I would be of any help with that.

OK, I will check and fix it.

Thank you!


Development

Successfully merging this pull request may close these issues.

Misc. bug: Multi-GPU fit producing weird results

3 participants