
UPSTREAM PR #18169: presets: refactor, allow cascade presets from different sources #614

Open
loci-dev wants to merge 7 commits into main from upstream-PR18169-branch_ngxson-xsn/refactor_server_preset

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18169

Alternative to ggml-org/llama.cpp#17959

Fix ggml-org/llama.cpp#17948

Before this PR, the logic for loading models from different sources (cache / local / custom INI) was quite messy and did not allow an INI preset to take precedence over other sources.

With this PR, we unify the method for loading server models and presets:

  • preset.cpp is responsible for collecting all model sources (cache / local) and generating a base preset for each known GGUF
  • preset.cpp then loads the INI file and parses the global section ([*])
  • it is then up to downstream code (e.g. server-models.cpp) to decide how to cascade these presets
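As an illustration of the sections mentioned above, a preset INI file might look like the following. This is a hypothetical sketch: the file name, model key, and option names are illustrative and not taken from the PR.

```ini
; presets.ini (hypothetical example)

; global section: applies to every model unless overridden
[*]
ctx-size = 4096
temp     = 0.8

; model-specific section: overrides the global section for this model only
; (the model key below is illustrative)
[ggml-org/example-model-GGUF]
temp = 0.2
```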

The current cascading rules can be found in the server's docs:

  1. Command-line arguments passed to llama-server (highest priority)
  2. Model-specific options defined in the preset file (e.g. [ggml-org/MY-MODEL...])
  3. Global options defined in the preset file ([*])

@loci-review
Copy link
Copy Markdown

loci-review bot commented Dec 18, 2025

Explore the complete analysis inside the Version Insights

Pull Request #614 Performance Analysis Summary

Overview

PR #614 implements a preset configuration system refactoring for the llama-server multi-model router. The changes introduce a unified preset loading mechanism with cascading configuration support, affecting 6 files with 347 additions and 260 deletions. Performance analysis reveals no material impact on inference operations.

Key Findings

Performance Impact on Inference:
No inference-related functions were modified. The functions showing performance variations (std::vector iterators, std::chrono::duration_cast) are utility operations in configuration parsing code, not in the inference pipeline. Functions critical to tokens-per-second performance (llama_decode, llama_encode, llama_tokenize) remain unchanged with 0 ns delta in response time and throughput.

Affected Components:

  • Configuration Management: New common_preset_context class centralizes preset loading from cache, local directories, and INI files
  • Server Model Loading: Refactored server_models::load_models() implements three-tier cascading: global preset → model-specific preset → CLI arguments
  • Argument Parsing: Added boolean argument handling in common_params_to_map() for negative flags

STL Iterator Performance Variations:
Analysis identified 145-226% response time increases in std::vector iterator functions within llama-cvector-generator (199 ns vs 81 ns baseline) and corresponding improvements in llama-tts (84 ns vs 199 ns baseline). These variations stem from compiler inlining decisions in non-inference binaries. Absolute impact: 118 ns per call in configuration parsing code executed once at startup.

Power Consumption:
Binary-level analysis shows negligible changes: llama-cvector-generator (-0.03%), llama-tts (+0.028%). All other binaries including libllama.so (0.0% change) show no measurable power consumption variation.

Code Changes Analysis:
The refactoring introduces structured preset management without modifying inference kernels. New functions (load_from_cache(), load_from_models_dir(), cascade()) operate during server initialization, not during token generation. The unset_reserved_args() helper ensures router-controlled parameters are properly isolated from model-specific configurations.

Conclusion:
This refactoring improves configuration management architecture without affecting inference performance. The observed STL iterator variations are compiler artifacts in startup code paths, not runtime inference operations. Tokens-per-second performance remains unchanged as no tokenization or decoding functions were modified.


loci-dev force-pushed the main branch 18 times, most recently from c8dcfe6 to ac107ae on December 21, 2025 at 18:11.
loci-dev force-pushed the main branch 30 times, most recently from 8754d0f to 8645b59 on December 28, 2025 at 20:09.
