
UPSTREAM PR #17263: server : fix "can batch with" bug (#207)

Open

DajanaV wants to merge 1 commit into main from upstream-PR17263-branch_ggml-org-gg/server-fix-can-batch-with

Conversation

@DajanaV DajanaV (Collaborator) commented Nov 14, 2025

Mirrored from ggml-org/llama.cpp#17263

While looking into #17260, I found an error in the logic:

The slot_batched pointer could end up referencing a slot that has already been released (for example, when the prompt does not fit into the context). The fix is to set the slot_batched pointer only after tokens have actually been queued for that slot.
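To illustrate the ordering issue, here is a minimal, hypothetical sketch (the Slot struct and function names below are simplified stand-ins, not the real server_slot code from tools/server/server.cpp): the buggy variant grabs the batch pointer before knowing whether the slot contributes any tokens, while the fixed variant assigns it only after tokens are queued.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified slot model -- not the real server_slot type.
struct Slot {
    int    id;
    bool   released = false;     // slot was released (e.g. prompt didn't fit)
    size_t n_tokens_queued = 0;
};

// Buggy ordering: the batch pointer is taken before we know whether the
// slot actually queues tokens; a released slot can still be selected.
Slot * pick_batched_buggy(std::vector<Slot> & slots) {
    Slot * slot_batched = nullptr;
    for (auto & slot : slots) {
        if (slot_batched == nullptr) {
            slot_batched = &slot;        // set too early
        }
        if (slot.released) {
            continue;                    // no tokens queued for this slot
        }
        slot.n_tokens_queued += 1;
    }
    return slot_batched;                 // may point at a released slot
}

// Fixed ordering: assign slot_batched only after tokens were queued.
Slot * pick_batched_fixed(std::vector<Slot> & slots) {
    Slot * slot_batched = nullptr;
    for (auto & slot : slots) {
        if (slot.released) {
            continue;
        }
        slot.n_tokens_queued += 1;
        if (slot_batched == nullptr) {
            slot_batched = &slot;        // set after queuing tokens
        }
    }
    return slot_batched;
}
```

With one released slot followed by one live slot, the buggy variant returns the released slot, while the fixed variant skips it and returns the live one.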

@DajanaV DajanaV force-pushed the main branch 2 times, most recently from 5c86b47 to ef7ca13 Compare November 14, 2025 13:15
@loci-review

loci-review bot commented Nov 14, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version f5949961 compared to base fcef186b reveals minimal performance changes across the llama.cpp codebase. The primary code modification involves server batching logic improvements in tools/server/server.cpp, with no direct changes to core inference functions.

Key Findings

Performance Metrics:

  • Highest Response Time change: _RegexMask constructor (+0.082%, 22.51 ns → 22.52 ns)
  • Highest Throughput change: make_unique for graph input position bucket (+0.116%, 104.33 ns → 104.45 ns)
  • Both functions are non-core utilities with negligible impact on inference performance

Core Function Impact:
No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize). The core tokenization and inference pipeline remains unaffected, indicating no impact on tokens per second performance.

Power Consumption Analysis:
Negligible power consumption changes across all binaries (< 0.001% change):

  • build.bin.libllama.so: No measurable change
  • build.bin.llama-cvector-generator: -0.0001% change
  • build.bin.llama-run: -0.0001% change
  • All other binaries show stable power consumption

Assembly and Control Flow Analysis:
CFG comparison of the _RegexMask constructor reveals identical assembly code between versions. The 0.082% performance difference stems from binary layout changes rather than functional modifications, confirming the change is within measurement noise.

GitHub Code Review Insights:
The server batching logic fix addresses a pointer management bug where slot_batched could reference released slots. Changes improve memory safety through deferred pointer assignment and early exit conditions for non-processing slots. The fix maintains API compatibility while enhancing batching reliability.
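The "early exit conditions for non-processing slots" mentioned above can be sketched as follows (a minimal illustration under assumed names; SlotState and batch_processing_slots are hypothetical, not the server's actual types): slots that are not in a processing state are skipped before any shared batching state is touched, so they can never be selected for batching.

```cpp
#include <cassert>
#include <vector>

// Hypothetical slot states -- a simplified stand-in for the server's
// slot lifecycle.
enum class SlotState { Idle, Processing };

struct ServerSlot {
    SlotState state   = SlotState::Idle;
    int       n_batched = 0;
};

// Early-exit pattern: non-processing slots are skipped up front, before
// any batching state is updated for them.
int batch_processing_slots(std::vector<ServerSlot> & slots) {
    int n_queued = 0;
    for (auto & slot : slots) {
        if (slot.state != SlotState::Processing) {
            continue;                    // early exit: nothing to batch
        }
        slot.n_batched += 1;
        n_queued       += 1;
    }
    return n_queued;
}
```

The early continue keeps the invariant simple: any state mutated later in the loop body belongs to a slot that is definitely being processed.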

Conclusion:
The performance variations are minimal and do not affect core inference capabilities. The server-side improvements provide better memory safety without impacting the primary LLM inference pipeline. No actionable performance optimizations are required as the changes represent normal compiler optimization variations rather than functional regressions.

@DajanaV DajanaV force-pushed the main branch 25 times, most recently from d9d7e55 to f333350 Compare November 18, 2025 08:11
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 92ef8cd to 7dd50b8 Compare November 26, 2025 16:10