
UPSTREAM PR #19261: spec : fix the check-rate logic of ngram-simple #1133

Open

loci-dev wants to merge 2 commits into main from loci/pr-19261-gg-spec-simple-freq-check

Conversation


@loci-dev loci-dev commented Feb 2, 2026

Note

Source pull request: ggml-org/llama.cpp#19261

fix #19231

For the spec-simple method, we don't need to track the last checked length to rate-limit the generations. We can simply use an incremental counter. This makes the speculator work with "Regenerate" on the last message, or when branching the conversation from an earlier message.


loci-review bot commented Feb 2, 2026

Overview

Analysis of commit a104cff ("spec : fix the check-rate logic of ngram-simple") across llama.cpp reveals minimal performance impact. Of 115,425 total functions, only 2 were modified (0.0017%), with no functions added or removed. The changes fix a correctness issue in n-gram speculative decoding, introducing a 29 ns throughput-time increase (+5.03%) while keeping the response-time impact to 0.11%.

Power Consumption by Binary:

  • build.bin.llama-tts: 360,886.57 nJ (+0.001%)
  • build.bin.llama-cvector-generator: 355,773.62 nJ (+0.001%)
  • build.bin.libllama.so: 249,105.58 nJ (-0.0%)
  • build.bin.libmtmd.so: 179,022.45 nJ (0.0%)
  • build.bin.llama-tokenize: 38,524.70 nJ (0.0%)
  • build.bin.llama-quantize: 43,714.74 nJ (0.0%)
  • build.bin.llama-qwen2vl-cli: 277.24 nJ (0.0%)
  • build.bin.llama-gemma3-cli: 277.24 nJ (0.0%)
  • build.bin.llama-gguf-split: 40,060.05 nJ (0.0%)
  • build.bin.llama-llava-cli: 277.24 nJ (0.0%)
  • build.bin.llama-minicpmv-cli: 277.24 nJ (0.0%)
  • build.bin.libggml.so: 5,124.39 nJ (0.0%)
  • build.bin.libggml-cpu.so: 157,685.86 nJ (0.0%)
  • build.bin.libggml-base.so: 73,208.69 nJ (0.0%)
  • build.bin.llama-bench: 60,119.52 nJ (0.0%)

Function Analysis

common_ngram_simple_draft (build.bin.llama-tts, build.bin.llama-cvector-generator):

  • Throughput: 575ns → 604ns (+29ns, +5.03%)
  • Response: 26,398ns → 26,427ns (+29ns, +0.11%)

The commit refactors the check-rate logic from position-based (idx_last_check + check_rate > cur_len) to counter-based (check_id++ >= check_rate), fixing unpredictable pattern-matching behavior in speculative decoding. The 29 ns increase comes from the added counter increment and comparison. The change improves correctness and maintainability while keeping the impact on overall inference negligible (<0.03% of a typical 10-100 ms token generation time). The identical throughput and response-time deltas confirm the impact is isolated to the function body, with no propagation to called functions.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from d9cffb7 to 1e94f5e Compare February 2, 2026 09:24
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 01000b6 to 4c1b7f6 Compare February 2, 2026 11:20

loci-review bot commented Feb 2, 2026

Overview

Analysis of 115,469 functions across 15 binaries reveals minimal performance impact from speculative decoding refactoring. Modified: 80 functions (0.07%), new: 44, removed: 55, unchanged: 115,290.

Power consumption changes:

  • build.bin.llama-tts: -0.082% (-295 nJ)
  • build.bin.llama-cvector-generator: -0.102% (-364 nJ)
  • build.bin.libmtmd.so, build.bin.libllama.so: <0.001%
  • build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli: 0%

Function Analysis

common_speculative_state_ngram_simple::draft() (both binaries): Intentional optimization adding check-rate throttling. Throughput time increased +79% (+61ns) due to added counter logic, but response time improved -1% (-284ns to -294ns) by reducing frequency of expensive O(n²) pattern matching. Net positive optimization.

std::match_results::_M_establish_failed_match (llama-tts): Throughput time improved -53% (-81ns), response time -4% (-81ns). Benefit from std::regex::optimize flag adoption in codebase, improving regex automaton efficiency.

Standard library functions (various): Mixed compiler optimization effects. Improvements include std::_Rb_tree::end() (-75% throughput), json_sax_dom_callback_parser::null() (-75% throughput), std::_Function_handler::_M_invoke (-46% throughput). Regressions include std::vector::end() (+307% throughput, +183ns), nlohmann::json::get() (+307% throughput, +183ns), std::make_shared (+207% throughput, +130ns). All changes are in non-critical paths (initialization, text preprocessing, configuration loading) with absolute impacts under 200ns.

Additional Findings

Zero impact on performance-critical inference paths: matrix operations (70-90% of inference time), attention mechanisms, KV cache, quantization kernels, and all GPU backends remain unchanged. Changes isolated to CPU-side speculative decoding utilities in common library. No GPU/ML operations affected.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 19 times, most recently from cd152fa to ab12294 Compare February 3, 2026 11:18
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 6495042 to 61b4303 Compare February 28, 2026 02:16
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 0db6c47 to 8019888 Compare March 8, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 6fa8e23 to f2637dc Compare March 15, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 5ac00d6 to 998dd7a Compare March 18, 2026 02:17