UPSTREAM PR #18322: server: (preset) add unsafe-allow-api-override#673
**Performance Analysis Summary** (complete analysis available in Version Insights)

**PR #673: Server Router API Override Feature.** This PR introduces a security-controlled mechanism for dynamic parameter overrides via the `/models/load` endpoint.

**Key Findings**

**Performance-Critical Areas Impact:** The changes do not affect core inference functions. No modifications were made to `llama_decode`, `llama_encode`, `llama_tokenize`, or sampling operations. All performance changes occur in initialization and configuration paths, outside the inference pipeline.

**Tokens-Per-Second Impact:** Zero impact on inference throughput; the functions responsible for tokenization and inference remain unchanged. The reference model (smollm:135m on a 12th Gen Intel i7-1255U) shows a 7% tokens-per-second reduction when `llama_decode` is 2 ms slower; since `llama_decode` response time is unaffected here, this PR introduces no inference degradation.

**Impacted Functions:** `common_params_add_preset_options` shows a +21,691 ns response-time increase in llama-tts and +20,680 ns in llama-cvector-generator. However, this function executes only during model initialization, not during inference, so the absolute overhead of roughly 21 microseconds occurs once per model-load operation. The new function `common_preset_context::load_from_map` adds 1-5 microseconds of overhead when API overrides are provided, with zero overhead when the override map is empty (fast path).

**Power Consumption Analysis:** Binary-level analysis shows llama-cvector-generator increased by 586 nJ (+0.23%) and llama-tts decreased by 164 nJ (-0.06%). These changes are within measurement noise and reflect initialization-path modifications rather than inference-loop changes. Core libraries (libggml-base.so, libggml-cpu.so, libllama.so) show zero power-consumption change, confirming the inference paths remain unaffected.
**Code Changes:** The implementation adds whitelist-based parameter-override validation, refactors the preset-parsing logic into `load_from_map` for code reuse, and extends the `/models/load` endpoint to accept override parameters. The security model requires explicit whitelisting via the `unsafe-allow-api-override` preset parameter, with type validation ensuring only string values are accepted.
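To make the security model concrete, here is a minimal sketch of how whitelist-based override validation with an empty-map fast path could look. The function name `apply_overrides` and its signature are illustrative assumptions for this comment, not the PR's actual API; only the whitelist/fast-path behavior described above is taken from the analysis.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical sketch: merge API-supplied overrides into preset parameters.
// Only keys explicitly whitelisted (via unsafe-allow-api-override in the
// preset) may be overridden; values are strings only, matching the PR's
// type-validation rule.
std::map<std::string, std::string> apply_overrides(
        const std::map<std::string, std::string> & preset,
        const std::set<std::string>              & whitelist,
        const std::map<std::string, std::string> & overrides) {
    std::map<std::string, std::string> result = preset;
    if (overrides.empty()) {
        return result; // fast path: no overrides, zero extra work
    }
    for (const auto & [key, value] : overrides) {
        if (whitelist.count(key) == 0) {
            throw std::runtime_error("override not whitelisted: " + key);
        }
        result[key] = value;
    }
    return result;
}
```

The fast path returning immediately on an empty override map is what keeps the no-override case at zero added overhead.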
Mirrored from ggml-org/llama.cpp#18322
Ref discussion: ggml-org/llama.cpp#18261 (comment)
@ServeurpersoCom I think we need to add a test with an INI preset at some point.
Example preset for this PR:
And API request:
Returns:
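(The preset, request, and response attachments above did not survive the mirror. As a hypothetical illustration only: the file name, section layout, parameter names, and response shape below are assumptions; `unsafe-allow-api-override` and the `/models/load` endpoint are the pieces this PR actually introduces.)

A preset might look like:

```ini
; my-model.ini (hypothetical preset file)
[my-model]
model = /models/base.gguf
temperature = 0.8
; whitelist the keys the API is allowed to override
unsafe-allow-api-override = temperature
```

An API request overriding a whitelisted key (string values only):

```shell
curl -X POST http://localhost:8080/models/load \
  -d '{"model": "my-model", "override": {"temperature": "0.9"}}'
```

A non-whitelisted key in `override` would be rejected rather than applied.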