
UPSTREAM PR #18169: presets: refactor, allow cascade presets from different sources #614

Open
loci-dev wants to merge 7 commits into main from upstream-PR18169-branch_ngxson-xsn/refactor_server_preset

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18169

Alternative to ggml-org/llama.cpp#17959

Fix ggml-org/llama.cpp#17948

Before this PR, the logic for loading models from different sources (cache / local / custom INI) was quite messy and did not allow an INI preset to take precedence over other sources.

With this PR, we unify the method for loading server models and presets:

  • preset.cpp is responsible for collecting all model sources (cache / local) and generating a base preset for each known GGUF
  • preset.cpp then loads the INI file and parses the global section ([*])
  • it is then up to downstream code (e.g. server-models.cpp) to decide how to cascade these presets
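As an illustration of the sections mentioned above, a preset INI file might look like the following. This is a hypothetical sketch: the file name, model key, and option names are illustrative and not taken from the PR.

```ini
; presets.ini (hypothetical example)

; global section: applies to every model unless overridden
[*]
ctx-size = 4096
temp     = 0.8

; model-specific section: overrides the global section for this model only
; (the model key below is illustrative)
[ggml-org/example-model-GGUF]
temp = 0.2
```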

The current cascading rules can be found in the server's docs:

  1. Command-line arguments passed to llama-server (highest priority)
  2. Model-specific options defined in the preset file (e.g. [ggml-org/MY-MODEL...])
  3. Global options defined in the preset file ([*])

@loci-review
Copy link
Copy Markdown

loci-review bot commented Dec 18, 2025

Explore the complete analysis inside the Version Insights

Pull Request #614 Performance Analysis Summary

Overview

PR #614 implements a preset configuration system refactoring for the llama-server multi-model router. The changes introduce a unified preset loading mechanism with cascading configuration support, affecting 6 files with 347 additions and 260 deletions. Performance analysis reveals no material impact on inference operations.

Key Findings

Performance Impact on Inference:
No inference-related functions were modified. The functions showing performance variations (std::vector iterators, std::chrono::duration_cast) are utility operations in configuration parsing code, not in the inference pipeline. Functions critical to tokens-per-second performance (llama_decode, llama_encode, llama_tokenize) remain unchanged with 0 ns delta in response time and throughput.

Affected Components:

  • Configuration Management: New common_preset_context class centralizes preset loading from cache, local directories, and INI files
  • Server Model Loading: Refactored server_models::load_models() implements three-tier cascading: global preset → model-specific preset → CLI arguments
  • Argument Parsing: Added boolean argument handling in common_params_to_map() for negative flags

STL Iterator Performance Variations:
Analysis identified 145-226% response time increases in std::vector iterator functions within llama-cvector-generator (199 ns vs 81 ns baseline) and corresponding improvements in llama-tts (84 ns vs 199 ns baseline). These variations stem from compiler inlining decisions in non-inference binaries. Absolute impact: 118 ns per call in configuration parsing code executed once at startup.

Power Consumption:
Binary-level analysis shows negligible changes: llama-cvector-generator (-0.03%), llama-tts (+0.028%). All other binaries including libllama.so (0.0% change) show no measurable power consumption variation.

Code Changes Analysis:
The refactoring introduces structured preset management without modifying inference kernels. New functions (load_from_cache(), load_from_models_dir(), cascade()) operate during server initialization, not during token generation. The unset_reserved_args() helper ensures router-controlled parameters are properly isolated from model-specific configurations.

Conclusion:
This refactoring improves configuration management architecture without affecting inference performance. The observed STL iterator variations are compiler artifacts in startup code paths, not runtime inference operations. Tokens-per-second performance remains unchanged as no tokenization or decoding functions were modified.


loci-dev force-pushed the main branch 18 times, most recently from c8dcfe6 to ac107ae on December 21, 2025 at 18:11.
loci-dev force-pushed the main branch 30 times, most recently from 8754d0f to 8645b59 on December 28, 2025 at 20:09.
