
cli: buffering info log, only show if model load failed#18236

Closed
ngxson wants to merge 3 commits into ggml-org:master from ngxson:xsn/cli_buffered_logs

Conversation

@ngxson
Contributor

@ngxson ngxson commented Dec 20, 2025

A QoL change that allows having more useful logs in case model loading fails.

Currently, the CLI uses ERROR as the default level, which makes debugging quite tricky (in some cases the INFO and WARN logs are also needed).

This PR introduces a "buffering" API to common_log that allows log lines to still be recorded, but then dropped or flushed depending on whether the model load failed or succeeded.

The behavior:

  • If the default level (ERROR) is used, buffer the INFO+WARN log lines. They will only be shown if model loading fails
  • Otherwise, if another log level is used, skip buffering (and also skip the loading animation) --> real-time logging as normal
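The behavior above could be sketched roughly like this. This is a minimal illustration only: log_buffer and its methods are hypothetical names, not the actual common_log API from this PR.

```cpp
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

enum log_level { LOG_DEBUG, LOG_INFO, LOG_WARN, LOG_ERROR };

// Hypothetical sketch of the buffering idea: when enabled, INFO/WARN lines
// are held back; they are flushed only if model loading fails, otherwise
// dropped. ERROR lines always print immediately.
struct log_buffer {
    bool enabled = false; // only buffer when the default level (ERROR) is used
    std::vector<std::pair<log_level, std::string>> held;

    void add(log_level lvl, const std::string & line) {
        if (enabled && lvl != LOG_ERROR) {
            held.push_back({lvl, line}); // hold back INFO/WARN lines
            return;
        }
        std::fputs(line.c_str(), stdout); // ERROR (or buffering off): print now
    }

    // model load failed: replay the held-back lines so the user sees context
    void flush() {
        for (const auto & e : held) {
            std::fputs(e.second.c_str(), stdout);
        }
        held.clear();
    }

    // model load succeeded: the held-back lines are no longer needed
    void drop() { held.clear(); }
};
```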

CC @JohannesGaessler this can be useful for debugging problems with -fit. Although I know the DEBUG level is preferred, it can be quite difficult to copy-paste if the log is too verbose. WDYT?

@JohannesGaessler
Contributor

Prior to calling llama_params_fit this message is printed:

common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on

Ideally a user would see this message and test -fit off. Then, depending on whether or not the issue is resolved they would report it either with -fit on --verbose or -fit off. In practice however, it seems that many users are not reading that message. Printing the held-back info/warning messages could be counter-productive as it would likely push the above message out of the console window. And if the failure only occurs on e.g. the 10th instantiation the default behavior would be to vomit a huge wall of text at the user that would discourage them from trying to interpret it. In terms of usability I think it would be preferable to colorize important messages like the one above. My opinion is that for end users we should prioritize having a high density of information in the error messages so that users are more likely to actually read them.
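The colorization suggested above might look roughly like the sketch below. This is purely an assumption about the approach, not code from this PR; the function name and color choice are illustrative.

```cpp
#include <string>

// Illustrative only: make an important message stand out with an ANSI SGR
// escape sequence, but only when writing to a terminal, so piped or
// redirected output stays clean.
std::string highlight_important(const std::string & msg, bool is_tty) {
    if (!is_tty) {
        return msg; // never emit escape codes when output goes to a file
    }
    return "\033[1;33m" + msg + "\033[0m"; // bold yellow, reset at the end
}
```

In practice the is_tty flag would come from something like isatty(fileno(stdout)).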

@JohannesGaessler
Contributor

For the specific instance where the model doesn't exist I get this output:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 7499 (fd05c51ce) with GNU 15.2.1 for Linux x86_64
main: llama backend init
main: load the model and apply lora adapter, if any
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
gguf_init_from_file: failed to open GGUF file 'models/opt/gpt_oss-20b-mxfp4.gguff'
llama_model_load: error loading model: llama_model_loader: failed to load model from models/opt/gpt_oss-20b-mxfp4.gguff
llama_model_load_from_file_impl: failed to load model
llama_params_fit: failed to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0,07 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) (0000:01:00.0) - 19684 MiB free
gguf_init_from_file: failed to open GGUF file 'models/opt/gpt_oss-20b-mxfp4.gguff'
llama_model_load: error loading model: llama_model_loader: failed to load model from models/opt/gpt_oss-20b-mxfp4.gguff
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'models/opt/gpt_oss-20b-mxfp4.gguff'

The messages describing that the model cannot be loaded are not filtered, as they go to the error log. In my opinion, in this particular instance it would not improve usability to also print the info and warning messages. Though if there was a segmentation fault somewhere it may be useful.

@ngxson
Contributor Author

ngxson commented Dec 21, 2025

My opinion is that for end users we should prioritize having a high density of information in the error messages so that users are more likely to actually read them.

Hmm yeah, that makes sense. I was a bit concerned about the CLI not showing the full log like all other examples (for aesthetic reasons), but it's true that we should show error messages more prominently, e.g. colorize them as you said.

@ngxson
Contributor Author

ngxson commented Dec 21, 2025

What about now? I included a small hack that makes sure the message about -fit always shows in the error log.

(screenshot of the updated CLI output)

@ngxson
Contributor Author

ngxson commented Dec 21, 2025

The messages describing that the model cannot be loaded are not filtered as they are going to the error log.

What I'm thinking is that llama_params_fit could return a more specific error type, for example an enum:

LLAMA_FIT_STATUS_OK
LLAMA_FIT_STATUS_FAILED  --> "expected" failures; for example, not having enough RAM
LLAMA_FIT_STATUS_UNEXPECTED --> "unexpected" failures; for example, cannot load model or create llama_ctx

If LLAMA_FIT_STATUS_UNEXPECTED is returned, we skip the rest of the common_init_result and exit
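As a sketch, the proposed enum and its caller-side handling could look like this. The enum values are taken from the comment above; the handler function is a hypothetical illustration of the dispatch, not code from the PR.

```cpp
// Proposed return type for llama_params_fit (sketch)
enum llama_fit_status {
    LLAMA_FIT_STATUS_OK         = 0,
    LLAMA_FIT_STATUS_FAILED     = 1, // "expected" failure, e.g. not enough RAM
    LLAMA_FIT_STATUS_UNEXPECTED = 2, // e.g. cannot load model or create llama_ctx
};

// Hypothetical caller-side handling inside common_init_result:
// returns true if initialization should continue, false to abort.
bool handle_fit_status(llama_fit_status status) {
    switch (status) {
        case LLAMA_FIT_STATUS_OK:
            return true;  // continue loading as normal
        case LLAMA_FIT_STATUS_FAILED:
            return true;  // warn, fall back to the user-provided params
        case LLAMA_FIT_STATUS_UNEXPECTED:
            return false; // skip the rest of common_init_result and exit
    }
    return false;
}
```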

@JohannesGaessler
Contributor

@ngxson for the screenshot you shared, you already seem to have colorized messages (for what I assume to be llama-cli). On my system with the master branch I get monocolor text. Is this OS-dependent?

}

// TODO @ngxson: we need this to have colors in log, will it have any side effects?
common_log_set_prefix(common_log_main(), true);
Contributor Author


@JohannesGaessler the master version does not have this line, so it doesn't show the color. Although technically speaking, the color is automatically enabled only if stdout is a TTY.

Not quite sure why the color is tied to this set_prefix, probably a remnant from the past.

@JohannesGaessler
Contributor

I think the current approaches with buffering logs or returning an enum have a flaw: not all errors are recoverable. If the program segfaults then there is no opportunity to print additional text to the console. I definitely share your (presumed) concern that the current state of the -fit functionality is imposing a more general maintenance burden than just for the feature itself due to additional overhead with bug reports.

We could maybe print the debug log to a file in real time and ask users to upload the log file for a bug report. Though then we would still be relying on user compliance. And IIRC we previously had a feature like that and it seems to have been removed. @ggerganov was there a reason for this or am I simply misremembering?

@ngxson
Contributor Author

ngxson commented Dec 22, 2025

I think the current approaches with buffering logs or returning an enum have a flaw: not all errors are recoverable

Hmm yeah that's correct.

Indeed, one of my motivations for this buffering mechanism was that the spinner animation in CLI mode can interfere with real-time logging. In the worst case, it can inject a loading animation frame like - / \ into a log line, which may accidentally turn a positive number into a negative one.

But the fix can be simple: we can switch to using braille dots as animation frames.
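For illustration, a braille-dot spinner could look like this. The frame set below is a commonly used one, not necessarily what the CLI would ship.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Braille pattern frames for a spinner. Unlike '-' or '/', a stray braille
// frame leaking into a log line cannot be mistaken for a minus sign, so it
// cannot turn a positive number into a negative one.
static const std::vector<std::string> kSpinnerFrames = {
    "⠋", "⠙", "⠹", "⠸", "⠼", "⠴", "⠦", "⠧", "⠇", "⠏",
};

// Pick the frame for a given animation tick (wraps around).
const std::string & spinner_frame(size_t tick) {
    return kSpinnerFrames[tick % kSpinnerFrames.size()];
}
```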


Another solution that I want to propose here: we can also add a log line before loading the model (while showing the loading animation):

[Hint: In case the CLI quits unexpectedly, restart it with -v to see more logs]

Loading model... /

@ngxson
Contributor Author

ngxson commented Dec 22, 2025

We could maybe print the debug log to a file in real time and ask users to upload the log file for a bug report.

We did have it in the past; no idea at which point it was removed. But I think the main problem is that users now usually run llama-cli in an arbitrary working directory (many users install it system-wide), and creating a log file without the user's confirmation can be a bad UX.

We could probably write the log somewhere else, maybe to the cache directory. But IMO we should still ask for the user's permission, as logging can cause wear and tear on the SSD if we write a large amount of logs each time.

@JohannesGaessler
Contributor

We could maybe add a CLI arg specifically for debugging where the user provides a path to write the debug log to. Use prefixes like "DEBUG", "INFO", "WARN", and "ERROR" to enable filtering with tools like grep. Adapt the issue template to ask specifically for this output. Though I personally like that right now we already have the console output in the issue template; it would be annoying if you had to download a file for each issue.

@ngxson
Contributor Author

ngxson commented Dec 22, 2025

We could maybe add a CLI arg specifically for debugging where the user provides a path to write the debug log to.

We already have it, it's the --log-file argument.

Use prefixes like "DEBUG", "INFO", "WARN", and "ERROR" to enable filtering with tools like grep. Adapt the issue template to ask specifically for this output.

In terms of UX, I think we should require some effort to report a bug, but not too much. Currently, the user is already required to fill out the issue form, which is a multi-step process. If we require manually filtering the log before uploading, users may skip this step.

I think most users are already familiar with copy-pasting terminal results into GH issues because it's the same method for all CLI software, so I would prefer sticking with the current method. Probably what's better is that we can have a "reporter" that formats everything into a copy-able block of text whenever we encounter an error (the block can include the build number, CLI args, etc.)

Ofc this will still have the same flaw of not covering unrecoverable cases like GGML_ASSERT, but if this system works 90% of the time, I think it's still a QoL improvement for both users and maintainers.
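The "reporter" idea might look like the sketch below. Everything here is hypothetical (function name, fields, layout); it just formats the context into one copy-able markdown block for a GitHub issue.

```cpp
#include <string>
#include <vector>

// Hypothetical error reporter: on failure, bundle build info, CLI args and
// the error log into a single fenced block the user can paste into an issue.
std::string format_bug_report(const std::string & build,
                              const std::vector<std::string> & args,
                              const std::string & error_log) {
    const std::string fence(3, '`'); // markdown code fence: ```
    std::string out;
    out += fence + "\n";
    out += "build: " + build + "\n";
    out += "args:";
    for (const auto & a : args) {
        out += " " + a;
    }
    out += "\n---- error log ----\n";
    out += error_log;
    out += fence + "\n";
    return out;
}
```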

@JohannesGaessler
Contributor

We already have it, it's the --log-file argument

Oh okay, I didn't know.

If we require manual filtering the log before uploading, users may skip this step.

I meant that users would upload the full log file that a dev could then download and filter themselves.

Are the following changes as a package acceptable to you:

  • I extend the return value of llama_params_fit to differentiate between failures of the algorithm and hard errors due to e.g. a file not existing. If the algorithm fails the corresponding message goes to the warning log, on a hard error the message goes to the error log.
  • You implement buffering of the logs; we print held-back messages if llama_params_fit encounters a hard error. I think it would also make sense to always print held-back messages prior to printing an error message, to preserve the order of messages.
  • Currently your implementation is situated in the common API; it seems very similar to what I'm currently doing for llama_params_fit, where I'm re-routing messages to the debug log. Long-term my intent is to make llama_params_fit use the API in llama.h so that in principle user code could implement its own logic. It may make sense to refactor and deduplicate our approaches at that time. For now, if common is already buffering messages, it should run llama_params_fit in such a way that it does not also re-route messages.
  • We keep observing issues for problems with the current implementation / defaults for -fit with a focus on how users are (mis-)interpreting the information presented to them and how we can reduce maintenance effort for bugs.

@ngxson
Contributor Author

ngxson commented Dec 24, 2025

  • I extend the return value of llama_params_fit to differentiate between failures of the algorithm and hard errors due to e.g. a file not existing. If the algorithm fails the corresponding message goes to the warning log, on a hard error the message goes to the error log.

Yes, it will be very useful.

  • You implement buffering of the logs, we print held-back messages if llama_params_fit encounters a hard error. I think it would also make sense to always print held-back message prior to printing an error message to preserve the order of messages.

Yes, sounds good. I think this is also useful for the -fit off case.

Long-term my intent is to make llama_params_fit use the API in llama.h so that in principle user code could implement its own logic. It may make sense to refactor and deduplicate our approaches at that time.

I think I have an opposite POV on this. Making llama_params_fit use other llama.h APIs may turn it more into a utility than an API call (quite similar to how libmtmd currently works).

Instead, I think we can allow context and model buffers to be late-allocated. I looked quickly into your llama_get_device_memory_data and here is what I think can be done to achieve that:

  1. After llama_model_load_from_file with no_alloc, we can reuse the partially-loaded model instead of freeing it. We need to add an API to specify how to load the tensors (i.e. which layers go to which backend); the fit function will provide this info.
    Note: Such an API could also be quite useful for the recently-added "sleeping" feature on the server. My idea is that we can allow unloading only the model weights (not the KV cache) from the backend while the model is in the "sleeping" state.
  2. Instead of creating a new context just to get ctx->memory_breakdown(), we can allow late-allocation or re-allocation of llama_context memory via a new API.

So on a high-level, my idea will look like this:

mparams.late_alloc = true; // allow late-allocation
llama_model * model = llama_model_load_from_file(path_model, mparams);
cparams.late_alloc = true; // allow late-allocation
llama_context * ctx = llama_init_from_model(model, cparams);

struct llama_alloc_info ainfo = llama_alloc_fit(ctx, fit_params);
llama_model_late_alloc(model, ainfo);
llama_context_late_alloc(ctx, ainfo);

WDYT about this idea?

  • We keep observing issues for problems with the current implementation / defaults for -fit with a focus on how users are (mis-)interpreting the information presented to them and how we can reduce maintenance effort for bugs.

Yes, no problem for me. I think improving log handling for -fit can also improve logging for other cases too. Feel free to tag me on issues regarding this.

@JohannesGaessler
Contributor

For llama_params_fit I added a flag no_alloc that makes it so that only dummy buffers are created and no actual model weights are being loaded. This could in principle be extended to allow loading the weights at a later time. The issue I see for llama_params_fit is that I would need to be able to change things like the context size or tensor -> backend assignments after a llama_model and the corresponding ggml contexts have already been created. I would have to look into how much complexity that would add vs. creating a completely new model/context.

@JohannesGaessler
Contributor

PR for returning an enum: #18374

@mtmcp
Collaborator

mtmcp commented Jan 1, 2026

I think this is again showing some of the issues of having tight coupling between logging and output.

Buffering is a CLI-specific problem, and yet requiring the notion of buffering to be added to the log framework doesn't feel right to me.

I am thinking it would be useful to have a simple event loop that handles multiple lightweight events (log event, output event, etc.). These would remove the coupling while at the same time allowing synchronizations like this.

Note: I'm not necessarily suggesting we make this change now, just highlighting that as we enter more complex UI interactions, it might warrant some standard GUI paradigms. Food for thought.
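The event-loop idea could be sketched very roughly like this. All names are illustrative, and a real loop would block on a condition variable rather than polling a queue; the point is just that log and output become events consumed by one consumer, which can then buffer or render them however the CLI decides.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>

enum class event_type { LOG, OUTPUT };

struct ui_event {
    event_type  type;
    std::string payload;
};

// Minimal single-threaded event loop sketch: producers post events,
// one handler consumes them in order, decoupling log from output.
struct event_loop {
    std::queue<ui_event> q;

    void post(ui_event ev) { q.push(std::move(ev)); }

    // drain the queue, dispatching each event to a single handler
    void run(const std::function<void(const ui_event &)> & handle) {
        while (!q.empty()) {
            handle(q.front());
            q.pop();
        }
    }
};
```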

@ngxson ngxson closed this Mar 26, 2026