
UPSTREAM PR #17911: cli: enable jinja by default #515

Open
loci-dev wants to merge 1 commit into main from
upstream-PR17911-branch_ngxson-xsn/cli_jinja_default

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17911

Enables Jinja by default for: server and CLI.

Remains disabled by default for: mtmd-cli and llama-completion.

@loci-review

loci-review bot commented Dec 10, 2025

Explore the complete analysis inside the Version Insights

Pull Request #515 Performance Review

PR Title: cli: enable jinja by default
Change Scope: 4 files modified (8 additions, 9 deletions)

Summary

This PR changes the default value of use_jinja from false to true in the common_params structure, enabling Jinja template processing by default for chat operations across the CLI and server tools. The change removes example-specific initialization logic from common_params_parser_init() and adds explicit overrides in completion.cpp and mtmd-cli.cpp so that those tools keep their existing behavior (Jinja disabled). The modifications affect template instantiation paths in STL containers and JSON parsing operations, resulting in micro-level performance variations in non-critical utility functions.

Analysis

Code Changes:

  • common/common.h: Changed bool use_jinja = false to bool use_jinja = true (line 467)
  • common/arg.cpp: Removed 6 lines of example-specific initialization logic that set params.use_jinja = true for LLAMA_EXAMPLE_SERVER
  • tools/completion/completion.cpp: Added explicit params.use_jinja = false before parameter parsing
  • tools/mtmd/mtmd-cli.cpp: Added explicit params.use_jinja = false before parameter parsing

Performance Impact:

The observed performance variations occur in STL template instantiations and JSON operations within the llama-tts and llama-cvector-generator binaries. These functions are not part of the inference pipeline and do not affect tokenization or model execution:

  • Vector iterator operations show 60-226% throughput changes with absolute deltas of 24-135 ns
  • JSON operations show 27-174% throughput changes with absolute deltas of 80-121 ns
  • All affected functions are utility operations for parameter handling, file management, and HTTP client operations

Inference Impact:

No functions in the core inference pipeline are affected. The following critical functions show zero performance change:

  • llama_decode - unchanged
  • llama_encode - unchanged
  • llama_tokenize - unchanged
  • ggml_mul_mat - unchanged
  • llama_graph_compute - unchanged

Tokens per second impact: None. The performance variations are isolated to initialization and parameter parsing code paths that execute once at startup, not during token generation.

Power Consumption:

  • llama-tts: +0.094% (+239 nJ)
  • llama-cvector-generator: -0.064% (-160 nJ)
  • All inference libraries (libllama.so, libggml-base.so, libggml-cpu.so): 0% change

The power consumption changes are within measurement noise and reflect the cumulative effect of STL template instantiation differences during parameter initialization, not runtime inference operations.

loci-dev force-pushed the main branch 27 times, most recently from c05b224 to e70bc15 on December 14, 2025 08:10
loci-dev force-pushed the main branch 30 times, most recently from 81e654d to c785ce2 on December 18, 2025 13:19