
UPSTREAM PR #17263: server : fix "can batch with" bug (#207)

Open

DajanaV wants to merge 1 commit into main from upstream-PR17263-branch_ggml-org-gg/server-fix-can-batch-with

Conversation

@DajanaV DajanaV (Collaborator) commented Nov 14, 2025

Mirrored from ggml-org/llama.cpp#17263

While looking into #17260, I found an error in the logic:

The slot_batched pointer could end up referencing a slot that has already been released (for example, when the prompt does not fit into the context). The fix is to set the slot_batched pointer only after tokens have actually been queued for that slot.
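To illustrate the ordering issue, here is a minimal, hypothetical sketch (the Slot struct and function names below are simplified stand-ins, not the real server_slot code from tools/server/server.cpp): the buggy variant grabs the batch pointer before knowing whether the slot contributes any tokens, while the fixed variant assigns it only after tokens are queued.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified slot model -- not the real server_slot type.
struct Slot {
    int    id;
    bool   released = false;     // slot was released (e.g. prompt didn't fit)
    size_t n_tokens_queued = 0;
};

// Buggy ordering: the batch pointer is taken before we know whether the
// slot actually queues tokens; a released slot can still be selected.
Slot * pick_batched_buggy(std::vector<Slot> & slots) {
    Slot * slot_batched = nullptr;
    for (auto & slot : slots) {
        if (slot_batched == nullptr) {
            slot_batched = &slot;        // set too early
        }
        if (slot.released) {
            continue;                    // no tokens queued for this slot
        }
        slot.n_tokens_queued += 1;
    }
    return slot_batched;                 // may point at a released slot
}

// Fixed ordering: assign slot_batched only after tokens were queued.
Slot * pick_batched_fixed(std::vector<Slot> & slots) {
    Slot * slot_batched = nullptr;
    for (auto & slot : slots) {
        if (slot.released) {
            continue;
        }
        slot.n_tokens_queued += 1;
        if (slot_batched == nullptr) {
            slot_batched = &slot;        // set after queuing tokens
        }
    }
    return slot_batched;
}
```

With one released slot followed by one live slot, the buggy variant returns the released slot, while the fixed variant skips it and returns the live one.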

@DajanaV DajanaV force-pushed the main branch 2 times, most recently from 5c86b47 to ef7ca13 Compare November 14, 2025 13:15
@loci-review

loci-review bot commented Nov 14, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version f5949961 compared to base fcef186b reveals minimal performance changes across the llama.cpp codebase. The primary code modification involves server batching logic improvements in tools/server/server.cpp, with no direct changes to core inference functions.

Key Findings

Performance Metrics:

  • Highest Response Time change: _RegexMask constructor (+0.082%, 22.51 ns → 22.52 ns)
  • Highest Throughput change: make_unique for graph input position bucket (+0.116%, 104.33 ns → 104.45 ns)
  • Both functions are non-core utilities with negligible impact on inference performance

Core Function Impact:
No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize). The core tokenization and inference pipeline remains unaffected, indicating no impact on tokens per second performance.

Power Consumption Analysis:
Negligible power consumption changes across all binaries (< 0.001% change):

  • build.bin.libllama.so: No measurable change
  • build.bin.llama-cvector-generator: -0.0001% change
  • build.bin.llama-run: -0.0001% change
  • All other binaries show stable power consumption

Assembly and Control Flow Analysis:
CFG comparison of the _RegexMask constructor reveals identical assembly code between versions. The 0.082% performance difference stems from binary layout changes rather than functional modifications, confirming the change is within measurement noise.

GitHub Code Review Insights:
The server batching logic fix addresses a pointer management bug where slot_batched could reference released slots. Changes improve memory safety through deferred pointer assignment and early exit conditions for non-processing slots. The fix maintains API compatibility while enhancing batching reliability.
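The "early exit conditions for non-processing slots" mentioned above can be sketched as follows (a minimal illustration under assumed names; SlotState and batch_processing_slots are hypothetical, not the server's actual types): slots that are not in a processing state are skipped before any shared batching state is touched, so they can never be selected for batching.

```cpp
#include <cassert>
#include <vector>

// Hypothetical slot states -- a simplified stand-in for the server's
// slot lifecycle.
enum class SlotState { Idle, Processing };

struct ServerSlot {
    SlotState state   = SlotState::Idle;
    int       n_batched = 0;
};

// Early-exit pattern: non-processing slots are skipped up front, before
// any batching state is updated for them.
int batch_processing_slots(std::vector<ServerSlot> & slots) {
    int n_queued = 0;
    for (auto & slot : slots) {
        if (slot.state != SlotState::Processing) {
            continue;                    // early exit: nothing to batch
        }
        slot.n_batched += 1;
        n_queued       += 1;
    }
    return n_queued;
}
```

The early continue keeps the invariant simple: any state mutated later in the loop body belongs to a slot that is definitely being processed.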

Conclusion:
The performance variations are minimal and do not affect core inference capabilities. The server-side improvements provide better memory safety without impacting the primary LLM inference pipeline. No actionable performance optimizations are required as the changes represent normal compiler optimization variations rather than functional regressions.

@DajanaV DajanaV force-pushed the main branch 25 times, most recently from d9d7e55 to f333350 Compare November 18, 2025 08:11
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 92ef8cd to 7dd50b8 Compare November 26, 2025 16:10