UPSTREAM PR #18322: server: (preset) add unsafe-allow-api-override #673

Open
loci-dev wants to merge 1 commit into main from upstream-PR18322-branch_ngxson-xsn/server_router_overrides

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18322

Ref discussion: ggml-org/llama.cpp#18261 (comment)

@ServeurpersoCom I think we need to add a test with an INI preset at some point

Example preset for this PR:

[THUDM/glm-edge-v-5b-gguf:Q4_K_M]
no-mmap = 0
temp = 123.000
autoload = 1
unsafe-allow-api-override = no-mmap,c
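
The `unsafe-allow-api-override` key above is a comma-separated whitelist of parameters the API may override. A minimal Python sketch of reading that whitelist from the preset (the helper name is hypothetical; the preset section and keys are taken from the example above):

```python
import configparser

PRESET_TEXT = """
[THUDM/glm-edge-v-5b-gguf:Q4_K_M]
no-mmap = 0
temp = 123.000
autoload = 1
unsafe-allow-api-override = no-mmap,c
"""

def read_override_whitelist(text, model):
    # Parse the INI preset and return the set of parameter names
    # that the preset explicitly allows API requests to override.
    cp = configparser.ConfigParser()
    cp.read_string(text)
    raw = cp.get(model, "unsafe-allow-api-override", fallback="")
    return {k.strip() for k in raw.split(",") if k.strip()}

print(read_override_whitelist(PRESET_TEXT, "THUDM/glm-edge-v-5b-gguf:Q4_K_M"))
# → {'no-mmap', 'c'}
```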

And API request:

{
        "model": "THUDM/glm-edge-v-5b-gguf:Q4_K_M",
        "overrides": {"c": "512"}
}

Returns:

{
    "success": true,
    "args": [
        "........../build/bin/llama-server",
        "--host",
        "127.0.0.1",
        "--mmap",
        "--port",
        "65054",
        "--temp",
        "123.000",
        "--alias",
        "THUDM/glm-edge-v-5b-gguf:Q4_K_M",
        "--ctx-size",
        "512",
        "--hf-repo",
        "THUDM/glm-edge-v-5b-gguf:Q4_K_M"
    ]
}
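
Note how the whitelisted short key `c` in the request expands to the `--ctx-size 512` pair in the returned args. A sketch of that expansion step (the flag table here is illustrative and covers only the keys in this example; llama-server's real option registry is far larger):

```python
# Illustrative mapping from override keys to llama-server CLI flags.
FLAG_NAMES = {"c": "--ctx-size", "temp": "--temp"}

def expand_overrides(overrides):
    # Turn an override map like {"c": "512"} into CLI argument pairs,
    # e.g. ["--ctx-size", "512"], appended to the launch command.
    args = []
    for key, value in overrides.items():
        args += [FLAG_NAMES[key], value]
    return args

print(expand_overrides({"c": "512"}))  # → ['--ctx-size', '512']
```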

@loci-review

loci-review bot commented Dec 23, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

PR #673: Server Router API Override Feature

This PR introduces a security-controlled mechanism for dynamic parameter overrides via the /models/load API endpoint. The changes affect 7 files with modifications primarily in server router and preset management subsystems.

Key Findings

Performance-Critical Areas Impact:

The changes do not affect core inference functions. No modifications were made to llama_decode, llama_encode, llama_tokenize, or sampling operations. All performance changes occur in initialization and configuration paths outside the inference pipeline.

Tokens Per Second Impact:

Zero impact on inference throughput. The functions responsible for tokenization and inference remain unchanged. The reference model (smollm:135m on a 12th Gen Intel i7-1255U) shows a 7% tokens-per-second reduction when llama_decode is 2 ms slower; since this PR leaves llama_decode response time unaffected, it introduces no inference degradation.

Impacted Functions:

The function common_params_add_preset_options shows a response time increase of +21,691 ns in llama-tts and +20,680 ns in llama-cvector-generator. However, this function executes only during model initialization, not during inference, so the absolute overhead of roughly 21 microseconds is incurred once per model load operation.

New function common_preset_context::load_from_map adds 1-5 microseconds overhead when API overrides are provided, with zero overhead when the override map is empty (fast path).

Power Consumption Analysis:

Binary-level analysis shows llama-cvector-generator increased by 586 nJ (+0.23%) and llama-tts decreased by 164 nJ (-0.06%). These changes are within measurement noise and reflect initialization path modifications rather than inference loop changes. Core libraries (libggml-base.so, libggml-cpu.so, libllama.so) show zero power consumption change, confirming inference paths remain unaffected.

Code Changes:

The implementation adds whitelist-based parameter override validation, refactors preset parsing logic into load_from_map for code reuse, and extends the /models/load endpoint to accept override parameters. The security model requires explicit whitelisting via unsafe-allow-api-override preset parameter, with type validation ensuring only string values are accepted.
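
A hedged sketch of the validation model described above, in Python rather than the PR's actual C++ (function and error names are illustrative): overrides are rejected unless whitelisted by the preset, only string values pass type validation, and an empty override map takes a fast path with no work done.

```python
def validate_overrides(overrides, whitelist):
    # Fast path: no overrides supplied, nothing to validate.
    if not overrides:
        return {}
    accepted = {}
    for key, value in overrides.items():
        # Security model: only parameters explicitly whitelisted via
        # unsafe-allow-api-override may be overridden through the API.
        if key not in whitelist:
            raise ValueError(f"override '{key}' is not allowed by the preset")
        # Type validation: only string values are accepted.
        if not isinstance(value, str):
            raise TypeError(f"override '{key}' must be a string")
        accepted[key] = value
    return accepted

print(validate_overrides({"c": "512"}, {"no-mmap", "c"}))  # → {'c': '512'}
```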

@loci-dev loci-dev force-pushed the main branch 27 times, most recently from 5bb9d21 to 1946e3d on December 28, 2025 01:38
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 76fc6ba to 945c525 on January 2, 2026 11:08