
UPSTREAM PR #19261: spec : fix the check-rate logic of ngram-simple #1133

Open

loci-dev wants to merge 2 commits into main from loci/pr-19261-gg-spec-simple-freq-check

Conversation


@loci-dev loci-dev commented Feb 2, 2026

Note

Source pull request: ggml-org/llama.cpp#19261

fix #19231

For the spec-simple method, we don't need to track the last checked length to rate-limit the generations. We can simply use an incremental counter. This makes the speculator work with "Regenerate" on the last message, or when branching the conversation from an earlier message.


loci-review bot commented Feb 2, 2026

Overview

Analysis of commit a104cff ("spec : fix the check-rate logic of ngram-simple") across llama.cpp reveals minimal performance impact. Of 115,425 total functions, only 2 were modified (0.0017%), with no functions added or removed. The changes fix a correctness issue in n-gram speculative decoding, introducing a 29 ns throughput-time increase (+5.03%) while keeping the response-time impact to 0.11%.

Power Consumption by Binary:

  • build.bin.llama-tts: 360,886.57 nJ (+0.001%)
  • build.bin.llama-cvector-generator: 355,773.62 nJ (+0.001%)
  • build.bin.libllama.so: 249,105.58 nJ (-0.0%)
  • build.bin.libmtmd.so: 179,022.45 nJ (0.0%)
  • build.bin.llama-tokenize: 38,524.70 nJ (0.0%)
  • build.bin.llama-quantize: 43,714.74 nJ (0.0%)
  • build.bin.llama-qwen2vl-cli: 277.24 nJ (0.0%)
  • build.bin.llama-gemma3-cli: 277.24 nJ (0.0%)
  • build.bin.llama-gguf-split: 40,060.05 nJ (0.0%)
  • build.bin.llama-llava-cli: 277.24 nJ (0.0%)
  • build.bin.llama-minicpmv-cli: 277.24 nJ (0.0%)
  • build.bin.libggml.so: 5,124.39 nJ (0.0%)
  • build.bin.libggml-cpu.so: 157,685.86 nJ (0.0%)
  • build.bin.libggml-base.so: 73,208.69 nJ (0.0%)
  • build.bin.llama-bench: 60,119.52 nJ (0.0%)

Function Analysis

common_ngram_simple_draft (build.bin.llama-tts, build.bin.llama-cvector-generator):

  • Throughput: 575ns → 604ns (+29ns, +5.03%)
  • Response: 26,398ns → 26,427ns (+29ns, +0.11%)

The commit refactors the check-rate logic from position-based (idx_last_check + check_rate > cur_len) to counter-based (check_id++ >= check_rate), fixing unpredictable pattern-matching behavior in speculative decoding. The 29 ns increase comes from the added counter increment and comparison. The change improves correctness and maintainability while keeping the impact on overall inference negligible (<0.03% of a typical 10-100 ms token generation time). The identical throughput and response-time deltas confirm the impact is isolated to the function body, with no propagation to called functions.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from d9cffb7 to 1e94f5e Compare February 2, 2026 09:24
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 01000b6 to 4c1b7f6 Compare February 2, 2026 11:20

loci-review bot commented Feb 2, 2026

Overview

Analysis of 115,469 functions across 15 binaries reveals minimal performance impact from speculative decoding refactoring. Modified: 80 functions (0.07%), new: 44, removed: 55, unchanged: 115,290.

Power consumption changes:

  • build.bin.llama-tts: -0.082% (-295 nJ)
  • build.bin.llama-cvector-generator: -0.102% (-364 nJ)
  • build.bin.libmtmd.so, build.bin.libllama.so: <0.001%
  • build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli: 0%

Function Analysis

common_speculative_state_ngram_simple::draft() (both binaries): Intentional optimization adding check-rate throttling. Throughput time increased +79% (+61ns) due to added counter logic, but response time improved -1% (-284ns to -294ns) by reducing frequency of expensive O(n²) pattern matching. Net positive optimization.

std::match_results::_M_establish_failed_match (llama-tts): Throughput time improved -53% (-81ns), response time -4% (-81ns). Benefit from std::regex::optimize flag adoption in codebase, improving regex automaton efficiency.

Standard library functions (various): Mixed compiler optimization effects. Improvements include std::_Rb_tree::end() (-75% throughput), json_sax_dom_callback_parser::null() (-75% throughput), std::_Function_handler::_M_invoke (-46% throughput). Regressions include std::vector::end() (+307% throughput, +183ns), nlohmann::json::get() (+307% throughput, +183ns), std::make_shared (+207% throughput, +130ns). All changes are in non-critical paths (initialization, text preprocessing, configuration loading) with absolute impacts under 200ns.

Additional Findings

Zero impact on performance-critical inference paths: matrix operations (70-90% of inference time), attention mechanisms, KV cache, quantization kernels, and all GPU backends remain unchanged. Changes isolated to CPU-side speculative decoding utilities in common library. No GPU/ML operations affected.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 19 times, most recently from cd152fa to ab12294 Compare February 3, 2026 11:18
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 6495042 to 61b4303 Compare February 28, 2026 02:16
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 0db6c47 to 8019888 Compare March 8, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 6fa8e23 to f2637dc Compare March 15, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 5ac00d6 to 998dd7a Compare March 18, 2026 02:17