UPSTREAM PR #17251: Kimi-K2-Thinking native tool calling format #202
Access the complete analysis in the LOCI Dashboard.

Performance Analysis Summary: The Kimi K2 functionality itself is well-implemented and isolated from core inference paths, with the performance issue stemming from build system changes rather than the new feature implementation.
Mirrored from ggml-org/llama.cpp#17251
The implementation might support Kimi-K2-Instruct too, but I don't have enough disk space to test it right now :(

Almost a silly copy-paste from DeepSeek V3.1 (ggml-org/llama.cpp#15533), modified according to https://github.com/MoonshotAI/Kimi-K2/blob/main/docs/tool_call_guidance.md: match on the function id instead of the plain function name.
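For illustration, here is a minimal Python sketch (not the llama.cpp code) of what "matching the function id" means: per MoonshotAI's tool_call_guidance.md, each call carries an id of the form `functions.{name}:{index}`, so the parser recovers the function name from the id rather than reading a bare name token. The helper name and the exact character class are my own assumptions.

```python
import re

# Hypothetical helper, not from common/chat.cpp: split a Kimi-K2 tool
# call id of the form "functions.{name}:{index}" into its parts.
TOOL_CALL_ID_RE = re.compile(r"functions\.(?P<name>[\w.\-]+):(?P<index>\d+)")

def parse_tool_call_id(call_id: str) -> tuple[str, int]:
    """Return (function name, call index) for a Kimi-K2 tool call id."""
    m = TOOL_CALL_ID_RE.fullmatch(call_id)
    if m is None:
        raise ValueError(f"not a Kimi-K2 tool call id: {call_id!r}")
    return m.group("name"), int(m.group("index"))

# e.g. parse_tool_call_id("functions.get_weather:0") -> ("get_weather", 0)
```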
Considerations:
- There is no `<think>` tag at the end of the template, so `thinking_forced_open` is false. Should we test it by modifying the template manually?
- The template calls `tojson(separators=(',', ':'))`. Although the value of `separators` is the same as the default, we must remove it to make the template work with minja.
- The DeepSeek parser accepts `<|tool▁calls▁begin|>` followed directly by `tool...` while ignoring a missing `<|tool▁call▁begin|>`, but I have not observed such behavior in Kimi-K2-Thinking and always get `<|tool_calls_section_begin|><|tool_call_begin|>`, therefore I'm removing the `?` in the function regex: https://github.com/ggml-org/llama.cpp/blob/c4abcb2457217198efdd67d02675f5fddb7071c2/common/chat.cpp#L1751
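On the `separators` point above: for reference, the explicit value the template passes is the compact-JSON setting, shown here with Python's standard `json` module (an illustration only; minja's `tojson` support is what actually forces removing the argument).

```python
import json

# Illustration with the stdlib json module, not the template engine:
# separators=(',', ':') produces compact JSON with no spaces after
# the delimiters, which is what tool call arguments are serialized as.
args = {"location": "Tokyo", "unit": "celsius"}
compact = json.dumps(args, separators=(",", ":"))
print(compact)  # {"location":"Tokyo","unit":"celsius"}
```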
Actually, when keeping the `?` I always get an extra `<|tool_calls_section_end|>`, but I have not been able to fix it, so I finally removed the `?`.

For maintainers: I may have a busy weekend, so feel free to edit directly if I'm not able to reply in time.
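The parsing behavior described above can be sketched as follows. This is a minimal Python approximation, not the actual regex in common/chat.cpp: the point is that `<|tool_call_begin|>` is required inside a section, with no trailing `?` making it optional.

```python
import re

# Sketch only: Kimi-K2-Thinking emits tool calls as
# <|tool_calls_section_begin|> ( <|tool_call_begin|> id
# <|tool_call_argument_begin|> args <|tool_call_end|> )+ 
# <|tool_calls_section_end|>
SECTION_RE = re.compile(
    r"<\|tool_calls_section_begin\|>"   # section opener
    r"(?P<body>.*?)"                    # the individual calls
    r"<\|tool_calls_section_end\|>",
    re.DOTALL,
)
CALL_RE = re.compile(
    r"<\|tool_call_begin\|>"            # call opener: NOT optional here
    r"(?P<id>[^<]+)"
    r"<\|tool_call_argument_begin\|>"
    r"(?P<args>.*?)"
    r"<\|tool_call_end\|>",
    re.DOTALL,
)

def extract_tool_calls(text: str) -> list[tuple[str, str]]:
    """Return (call id, raw JSON args) pairs from model output."""
    calls = []
    for section in SECTION_RE.finditer(text):
        for call in CALL_RE.finditer(section.group("body")):
            calls.append((call.group("id").strip(), call.group("args").strip()))
    return calls
```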
Closes #17155.