common: Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS) #16932

hksdpc255 · 2025-11-02T09:38:02Z

Generalized and streaming-capable XML-style tool-call parsing with grammar enforcement and automatic template fixing.

Based on PR #15904, this patch introduces a generalized implementation for almost all XML-style tool-call formats.

Grammar-constrained tool-call outputs

Tool-call messages generated by the model are now strictly validated against a defined grammar.
A new automatic grammar generator simplifies the process of creating grammars for new models.
This ensures that all tool-call outputs are well-formed, structurally consistent, and reliably parsed.

Streaming support for tool-call parsing

The parser now supports streaming parsing, enabling incremental processing of tool-call messages as they are generated.
This enhancement improves responsiveness and allows real-time interaction during model inference.

Automatic chat-template fixing

A lightweight Jinja2-based patcher has been added to automatically fix official chat templates before use.
With this change, official templates now work out of the box, eliminating the need for custom modifications.

In-context reasoning

The parser now supports multiple reasoning blocks within a single generation, even when interleaved with tool calls.
All reasoning content is preserved. No information is lost during parsing or streaming.

Additional Notes

All unit tests have passed.
Community testing is welcome! Please try it out with your model integrations.
If your OpenAI-compatible client does not support sending reasoning_content back to the server, use the option --reasoning-format none
When reporting issues, it’s recommended to add -lv 1 in the command line to enable more detailed logging.

MikeLP · 2025-11-02T11:31:33Z

I'm looking forward to get this PR merged!

@hksdpc255 Does it require a custom jinja template from the previous PR or it works good as is?

hksdpc255 · 2025-11-02T13:17:52Z

For now, I’d recommend using a custom template if you’re running more complex workloads.
As for the embedded/official template, it won’t fail at the start, but it may be missing some features that your agent requires.

Edit: The official template is now working properly. There’s no longer need for a custom template.

Edit2: Official template support for Minimax-M2 has been removed. See comment and ochafik/minja#7 (comment) for details.

ochafik · 2025-11-02T18:25:06Z

FYI I've updated (my fork of) Minja w/ support for GLM 4.6's template.
Might affect how you deal w/ the polyfills, as it should now detect GLM's tool call capability properly.

hksdpc255 · 2025-11-03T01:20:47Z

@ochafik Excellent work! Once llama.cpp syncs your changes, some parts of this PR can be safely removed.

However, there are still a few small patches needed — for example, replacing dict.items() with dict | items.

hksdpc255 · 2025-11-03T01:25:02Z

Currently, the official Minimax-M2 chat template fails to run tool calls because dict.items() and list[-1] are not supported by llama.cpp’s Jinja2 rendering engine.

ochafik · 2025-11-03T01:36:24Z

Currently, the official Minimax-M2 chat template fails to run tool calls because dict.items() and list[-1] are not supported by llama.cpp’s Jinja2 rendering engine.

@hksdpc255 Both should be supported. The confusing error you probably got was because minja implements items() on dict but not on str. It should detect whether the template expects arguments to be an object instead of a more common json string of said object (see requires_object_arguments), and adjust the inputs accordingly: now hopefully works for GLM 4.6.

As for list[-1], it's supported, but MinMax M2's template has a bug, see this comment.

And please feel free to file bugs on https://github.com/ochafik/minja, it's should be cleaner to add syntax support there than to patch things up in llama.cpp.

hksdpc255 · 2025-11-03T01:41:18Z

@ochafik Thank you for pointing that out. I’m currently applying your suggested fix in llama.cpp and will test whether it works as expected. Thanks again for the help!

hksdpc255 · 2025-11-03T01:49:53Z

Good news! The Minimax M2 tool call is now working.

I’ll push the fix later.

hksdpc255 · 2025-11-03T02:39:40Z

Screen shot for Zed editor:

Model: unsloth's UD-Q3_K_XL

emuchogu · 2025-11-03T05:47:04Z

Hi @hksdpc255 ,
I cloned your repo https://github.com/hksdpc255/llama.cpp/tree/xml_toolcall and unfortunately it's still not producing the initial think tag at least in the cli. See below.

Model: unsloth--MiniMax-M2-GGUF Q8_0

./llama-cli \
  -m /models/hub/models--unsloth--MiniMax-M2-GGUF/snapshots/*/Q8_0/MiniMax-M2-Q8_0-00001-of-00005.gguf \
  -ngl 99 \
  -sm layer \
  -ts 1,1,1,1,1,1,1,1 \
  -c 78000 \
  -t 16 \
  --jinja \
  -i

Output:

> what is the capital of france?
Okay, the user asked a straightforward question: "What is the capital of France?" This is basic geography knowledge, so the answer should be simple. I don't need to overcomplicate things. 

Hmm, maybe the user is just testing if I know basic facts, or perhaps they're new to this kind of question. Either way, the response should be clear and concise. No need for extra details unless they ask follow-ups. 

I recall that Paris is the capital of France. It's one of the most well-known capitals globally, so this should be an easy one. The user might be a student working on homework, or someone prepping for trivia. Or maybe they're just curious—either way, I should confirm it confidently. 

No signs of confusion or deeper needs here. The question is very direct. I'll just state the answer plainly. If they want more info later, like landmarks or history, they'll ask. For now, keep it simple: Paris is the capital. 

Wait, should I add that it's also a major cultural hub? Nah, overcomplicating it. Just the fact. Done.
</think>

The capital of France is **Paris**. 

Paris is not only the political center but also a major cultural, economic, and gastronomic hub, famous for landmarks like the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Champs-Élysées.

hksdpc255 · 2025-11-03T06:18:09Z

@emuchogu Sorry, I haven’t tested it with llama-cli — only with llama-server.

If you want <think> and </think> to appear in the content, append --reasoning-format none when running llama-server.

I’m not sure whether llama-cli uses the same parsing logic.

ServeurpersoCom · 2025-11-03T06:43:13Z

I’ve reverted my previous PR (reasoning-format-minimax-m2) and merged PR #16932 into my testing-branch16 for isolated testing.
I’m running llama-swap with the new XML tool-call parser to check MiniMax-M2 compatibility without any synthetic injection, using --reasoning-format none to observe the parser’s raw behavior.

sendLoadingState: true

macros:
  llama-server: >
    ../llama.cpp.pascal/build/bin/llama-server
    --port 8081
    -ngl 999
    -ctk q8_0
    -ctv q8_0
    -fa on
    --mlock
    -np 1
    --jinja
  models: /var/www/ia/models
  proxy: http://127.0.0.1:8081

  MoE-MiniMax-M2-230B-A10B:
    cmd: |
      ${llama-server}
      -m ${models}/unsloth/MiniMax-M2-GGUF/MiniMax-M2-UD-Q2_K_XL-00001-of-00002.gguf
      --temp 1.0
      --top-p 0.95
      --top-k 40
      --n-cpu-moe 50
      --ctx-size 65536
      --reasoning-format none
    proxy: ${proxy}
    filters:
      strip_params: "temperature, top_p, top_k"

Without this PR :

Streaming, no initial <think> tag in the output:

Curl without streaming no initial <think> tag in the output :

(root|~/llama.cpp.pascal) curl http://127.0.0.1:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoE-MiniMax-M2-230B-A10B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "stream": false
  }' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1192  100   973  100   219    259     58  0:00:03  0:00:03 --:--:--   317
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The user asks: \"What is the capital of France?\" The answer is Paris. This is a simple question. There's no disallowed content. So the answer is \"Paris.\" Possibly also mention that it's Paris. So answer: \"The capital of France is Paris.\" There's no reason to go beyond that. There's no conflict with policy. So final answer: \"Paris.\"\n</think>\n\nThe capital of France is **Paris**."
      }
    }
  ],
  "created": 1762152163,
  "model": "MoE-MiniMax-M2-230B-A10B",
  "system_fingerprint": "b6942-5698549e7",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 85,
    "prompt_tokens": 29,
    "total_tokens": 114
  },
  "id": "chatcmpl-gfe455eld4ThdT1D7Ji6CtuJm6md4V7W",
  "timings": {
    "cache_n": 15,
    "prompt_n": 14,
    "prompt_ms": 273.966,
    "prompt_per_token_ms": 19.569,
    "prompt_per_second": 51.1012315396801,
    "predicted_n": 85,
    "predicted_ms": 3458.452,
    "predicted_per_token_ms": 40.6876705882353,
    "predicted_per_second": 24.577469920068282
  }
}
(root|~/llama.cpp.pascal)

With this PR :

Streaming :
reasoning go inside reasoning_content :

Curl without streaming, no initial <think> tag in the output :

(root|~/llama.cpp.pascal) curl http://127.0.0.1:8081/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "MoE-MiniMax-M2-230B-A10B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "stream": false
  }' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1265  100  1046  100   219    251     52  0:00:04  0:00:04 --:--:--   304
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm looking at how to respond to the question: \"What is the capital of France?\" The user expects a straightforward answer, which is \"Paris.\" I’ll keep it simple and concise, but I might consider adding a brief note about the Eiffel Tower. However, since the user didn't ask for extra information, I’ll focus on just saying \"Paris\" to fulfill their request. I want to ensure I’m following their guidelines accurately.\n</think>\n\nParis."
      }
    }
  ],
  "created": 1762152603,
  "model": "MoE-MiniMax-M2-230B-A10B",
  "system_fingerprint": "b6943-0619a5b7d",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 92,
    "prompt_tokens": 29,
    "total_tokens": 121
  },
  "id": "chatcmpl-WqvR2S73aa7cZEyIN7lm42yuuatYZwqO",
  "timings": {
    "cache_n": 15,
    "prompt_n": 14,
    "prompt_ms": 278.533,
    "prompt_per_token_ms": 19.895214285714285,
    "prompt_per_second": 50.263344020277664,
    "predicted_n": 92,
    "predicted_ms": 3852.551,
    "predicted_per_token_ms": 41.87555434782609,
    "predicted_per_second": 23.88028088401685
  }
}
(root|~/llama.cpp.pascal)

hksdpc255 · 2025-11-03T06:50:35Z

Oh! It seems you’re using non-streaming mode. I can now reproduce your issue with stream: false.

Let me dig into what’s happening…

ServeurpersoCom · 2025-11-03T06:59:40Z

Oh! It seems you’re using non-streaming mode. I can now reproduce your issue with stream: false.

Let me dig into what’s happening…

Yes, exactly: it works correctly in streaming mode (tested through the SvelteUI, which specifically designed to be debug-friendly without needing curl -N), but not in non-streaming mode.
So the initial tag still doesn’t appear when stream: false.

ServeurpersoCom · 2025-11-03T07:04:37Z

Toolcall debug on SvelteUI with your #16932 + #16618 :)

Custom JSON :

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "simple_addition_tool",
        "description": "A dummy calculator tool used for testing multi-argument tool call streaming.",
        "parameters": {
          "type": "object",
          "properties": {
            "a": {
              "type": "number",
              "description": "The first number to add."
            },
            "b": {
              "type": "number",
              "description": "The second number to add."
            }
          },
          "required": ["a", "b"]
        }
      }
    }
  ]
}

hksdpc255 · 2025-11-03T07:13:15Z

@ServeurpersoCom The problem is that I added some code that makes it fall back to llama.cpp’s original parser when there are no tools, so the new parser is never called.

llama.cpp/common/chat.cpp

Lines 2748 to 2753 in af5216e

    
           if (!builder.syntax().parse_tool_calls) { 
        
               // MiniMax-M2 uses <think>...</think> tags for reasoning content 
        
               builder.try_parse_reasoning("<think>", "</think>"); 
        
               builder.add_content(builder.consume_rest()); 
        
               return; 
        
           }

Simply deleting the code above should fix the issue. I’ll run more tests before pushing a new commit.

ServeurpersoCom · 2025-11-03T07:25:47Z

@ServeurpersoCom The problem is that I added some code that makes it fall back to llama.cpp’s original parser when there are no tools, so the new parser is never called.

I’ve successfully tested it without these lines of code and confirmed it works as expected for streaming / non streaming / reasoning_content / toolcall

ServeurpersoCom · 2025-11-03T07:39:16Z

I just realized this, and it seems strange: shouldn’t --reasoning-format none completely bypass any parsing logic instead of still going through it? It’s meant to be the raw passthrough mode for observing the model’s native output.

The .cpp files are already becoming huge and monolithic, making them harder to touch or refactor safely. The --reasoning-format options are also poorly named and not very explicit. In the long run, a modular templating system would help avoid piling up even more C++ parsing code.

If this work is meant to unify several next-generation parsers, maybe we could add a new keyword to --reasoning-format instead? It’s important to keep none as a truly no-parsing mode, since it’s essential for debugging new models.

Also, the current "auto" mode is actually just "deepseek" in practice, so it might be clearer to rename or document it that way to avoid confusion: and your unified detection logic could be implemented directly under auto (or deepseek, since they’re basically aliases) ?

sbrnaderi · 2025-11-07T16:40:03Z

For now, I’d recommend using a custom template if you’re running more complex workloads. As for the embedded/official template, it won’t fail at the start, but it may be missing some features that your agent requires.

@hksdpc255 Could you please add your template file for GLM 4.5 that works with the PR to models/templates folder in the repo?

Thanks

ServeurpersoCom · 2025-11-07T16:53:03Z

I’ll run full tests on my side including delta.reasoning_content, tool_calls, and agentic loop behavior both on a stock Raspberry Pi 5 setup and on my main server once the PR is ready.

thomasjfox · 2025-11-07T18:34:08Z

Found an issue: the parser crashes when the LLM generates a start token inside reasoning content. I’ll work on a fix.

I think it would be best to wait until you have that fix implemented before I run my experiments again. That way we can determine whether it was the exact issue or something different. I can only crash / upgrade the server in the evenings. 😀

One thing I found out yesterday is that the crash was not related to my exact prompt. I tried a different longish prompt and it also crashed around 19000 - 20000 output tokens. The tests were conducted with the MiniMax M2 q2 and q4 quants from Unsloth.

Thank you for your work on this! It's highly appreciated.

hksdpc255 · 2025-11-08T03:57:13Z

For now, I’d recommend using a custom template if you’re running more complex workloads. As for the embedded/official template, it won’t fail at the start, but it may be missing some features that your agent requires.

@hksdpc255 Could you please add your template file for GLM 4.5 that works with the PR to models/templates folder in the repo?

Thanks

@sbrnaderi Use official template. See comment

hksdpc255 · 2025-11-08T03:59:00Z

I’ll run full tests on my side including delta.reasoning_content, tool_calls, and agentic loop behavior both on a stock Raspberry Pi 5 setup and on my main server once the PR is ready.

@ServeurpersoCom It’s already ready for GLM-4.5, GLM-4.6, and Minimax-M2. Only a few minor issues remain for potential future cases.

hksdpc255 · 2025-11-08T04:12:16Z

Found an issue: the parser crashes when the LLM generates a start token inside reasoning content. I’ll work on a fix.

I think it would be best to wait until you have that fix implemented before I run my experiments again. That way we can determine whether it was the exact issue or something different. I can only crash / upgrade the server in the evenings. 😀

One thing I found out yesterday is that the crash was not related to my exact prompt. I tried a different longish prompt and it also crashed around 19000 - 20000 output tokens. The tests were conducted with the MiniMax M2 q2 and q4 quants from Unsloth.

Thank you for your work on this! It's highly appreciated.

@thomasjfox The issue you quoted isn’t related to your crash. Tt never triggers in my tests with GLM or Minimax. And GLM and Minimax never generate something like that even with very long context length.

Could you please provide more logs so I can investigate why the crash occurred? It’ll help me identify any potential underlying bugs more quickly.

…ranch16 — unified XML tool-call parser + streaming reasoning + GLM4.5/4.6 + MiniMax-M2

ServeurpersoCom · 2025-11-08T09:25:59Z

I’ll run full tests on my side including delta.reasoning_content, tool_calls, and agentic loop behavior both on a stock Raspberry Pi 5 setup and on my main server once the PR is ready.

@ServeurpersoCom It’s already ready for GLM-4.5, GLM-4.6, and Minimax-M2. Only a few minor issues remain for potential future cases.

MiniMax-M2-230B-A10B:
https://github.com/user-attachments/assets/54e1cc54-cf8d-4a31-ba6a-5da52939ee96

GLM-4.5-Air-106B:
https://github.com/user-attachments/assets/58d588d4-7543-4395-9044-91b6206cd8e9

thomasjfox · 2025-11-08T10:44:03Z

Found an issue: the parser crashes when the LLM generates a start token inside reasoning content. I’ll work on a fix.

I think it would be best to wait until you have that fix implemented before I run my experiments again. That way we can determine whether it was the exact issue or something different. I can only crash / upgrade the server in the evenings. 😀
One thing I found out yesterday is that the crash was not related to my exact prompt. I tried a different longish prompt and it also crashed around 19000 - 20000 output tokens. The tests were conducted with the MiniMax M2 q2 and q4 quants from Unsloth.
Thank you for your work on this! It's highly appreciated.

@thomasjfox The issue you quoted isn’t related to your crash. Tt never triggers in my tests with GLM or Minimax. And GLM and Minimax never generate something like that even with very long context length.

Could you please provide more logs so I can investigate why the crash occurred? It’ll help me identify any potential underlying bugs more quickly.

I'll try to help and nail down the issue. The python script is from $dayjob, so I can't share the full log. This is the last line before the crash using the original script and unchanged code from the PR:

Regex for tool start: <minimax:tool_call>\s*<invoke name="

I tried to reproduce with a random, open source python script, but that didn't work.

Two things caught my eye in the full log:

There are 7701 lines with Regex for tool start: <minimax:tool_call>\s*<invoke name="
In between are randomly 15 lines with Partial parse: <minimax:tool_call>\s*<invoke name="

When I grep the whole log with the generated output from MiniMax M2, I don't see any "minimax:tool_call" output in there at all.

I will add debug output to the code and play around some more.

aldehir · 2025-11-08T11:16:36Z

The regex crash is because of the reverse regex generated used to find a partial match. The regex contains [\S\s]* to match all the content before the partial match. If there is a substantial amount of content then it will hit the stack limit in the regex engine.

To avoid it, you can search for a literal such as the start tag, then proceed with regex. That will move the internal position in the builder and limit the search window for regex matching.

aldehir

I'm not a fan of the generalizations done here. I understand there is a desire to converge the parsing of XML-based tool calling, but I believe this approach is problematic. To sum it up, the interface does not feel ergonomic and certain parts are very hacky.

Regardless, I left some comments to help improve it (in my opinion).

Other notes:

Tool calls should not result in any content after them. The model will emit a stop token, but also the grammar should constrain it from producing anything else.
I don't believe code should be left "in case we need it." It can be added when needed. Many times, YAGNI.

aldehir · 2025-11-08T11:24:06Z

common/chat-parser-xml-toolcall.cpp

+// make a GBNF that accept any strings except those containing any of the forbidden strings.
+std::string make_gbnf_excluding(std::vector<std::string> forbids) {
+    constexpr auto charclass_escape = [](unsigned char c) -> std::string {
+        if (c == '\\' || c == ']' || c == '^' || c == '-') {
+            std::string s = "\\";
+            s.push_back((char)c);
+            return s;
+        }
+        if (isprint(c)) {
+            return std::string(1, (char)c);
+        }
+        char buf[16];
+        snprintf(buf, 15, "\\x%02X", c);
+        return std::string(buf);
+    };
+    constexpr auto build_expr = [charclass_escape](auto self, const std::vector<std::string>& forbids, int l, int r, int depth) -> std::string {
+        std::vector<std::pair<unsigned char, std::pair<int,int>>> children;
+        int i = l;
+        while (i < r) {
+            const std::string &s = forbids[i];
+            if ((int)s.size() == depth) {
+                ++i;
+                continue;
+            }
+            unsigned char c = (unsigned char)s[depth];
+            int j = i;
+            while (j < r && (int)forbids[j].size() > depth &&
+                   (unsigned char)forbids[j][depth] == c) {
+                ++j;
+            }
+            children.push_back({c, {i,j}});
+            i = j;
+        }
+        std::vector<std::string> alts;
+        if (!children.empty()) {
+            std::string cls;
+            for (auto &ch : children) cls += charclass_escape(ch.first);
+            alts.push_back(std::string("[^") + cls + "]");
+        }
+        for (auto &ch : children) {
+            std::string childExpr = self(self, forbids, ch.second.first, ch.second.second, depth+1);
+            if (!childExpr.empty()) {
+                std::string quoted_ch = "\"";
+                if (ch.first == '\\') quoted_ch += "\\\\";
+                else if (ch.first == '"') quoted_ch += "\\\"";
+                else if (isprint(ch.first)) quoted_ch.push_back(ch.first);
+                else {
+                    char buf[16];
+                    snprintf(buf, 15, "\\x%02X", ch.first);
+                    quoted_ch += buf;
+                }
+                quoted_ch += "\"";
+                std::string branch = quoted_ch + std::string(" ") + childExpr;
+                alts.push_back(branch);
+            }
+        }
+        if (alts.empty()) return "";
+        std::ostringstream oss;
+        oss << "( ";
+        for (size_t k = 0; k < alts.size(); ++k) {
+            if (k) oss << " | ";
+            oss << alts[k];
+        }
+        oss << " )";
+        return oss.str();
+    };
+    if (forbids.empty()) return "( . )*";
+    sort(forbids.begin(), forbids.end());
+    std::string expr = build_expr(build_expr, forbids, 0, forbids.size(), 0);
+    if (expr.empty()) {
+        std::string cls;
+        for (auto &s : forbids) if (!s.empty()) cls += charclass_escape((unsigned char)s[0]);
+        expr = std::string("( [^") + cls + "] )";
+    }
+    if (forbids.size() == 1)
+        return expr + "*";
+    else
+        return std::string("( ") + expr + " )*";
+}


This is only used once for a single literal. It seems crazy to try to handle every edge case. It's a common pattern in seeing throughout this PR.

Keep it simple, only produce an expression to exclude a single literal. E.g ( [^a] | "a" [^b] | ... )*

Could you provide an example that the current implementation fails to handle? I believe all cases should already be covered.

It's not about failing certain cases. The code does way too much when you only need it to produce a grammar for a single literal. This can be reduced to a ~10 line implementation.

This code is copied from my well-tested codebase, with only a few charclass_escape deletions. Creating a completely new implementation would be risky for me at this stage. If you still insist to have a simpler implemention, I can leave a TODO comment there for future reference. Once other issues are resolved, I can revisit and implement a new version.

common/chat-parser-xml-toolcall.cpp

aldehir · 2025-11-08T11:32:16Z

common/chat-parser-xml-toolcall.cpp

+
+                std::string param_rules;
+                if (parameters.contains("properties")) {
+                    std::vector<std::string> requiredParameters;


Why not an std::unordered_set?

Hmm… I’m not sure I fully understand your point. Could you explain a bit more why we should use a more complex std::unordered_set instead of a simple std::vector?

Because you sort, remove duplicates, and perform a binary search. Sounds like you need a set, why not just use one?

Ah, I see. Initially, I wasn’t very familiar with the JSON library and just copied the code from @pwilkin. But after a few tests, I find that the code will makes all parameters as optional. So I changed the implementation. I mistakenly assumed that json.get_to could only receive a vector, so I used a vector.

I also tend to avoid set/map in C++ due to their inefficiency for small datasets. In this case, using a binary search isn’t overly complicated, and I’ve already implemented it that way. Refactoring it to std::unordered_set would actually be a regression here.

Wait, didn't I just originally use a normal std::set? This seems like a simple problem - get required parameters, check if parameter is required.

Anyways, it feels to me like overkill doing optimizations for collections of generally <5 elements (I'm not sure I've ever seen an MCP call that has like more than 10 parameters).

@pwilkin Yes, you originally used a regular std::set, but it turned out to be empty for some reason. You can easily verify this by adding a simple test to your original code. the generated grammar will show that all parameters are treated as optional.

aldehir · 2025-11-08T11:37:37Z

common/chat-parser-xml-toolcall.cpp

+        // grammar trigger for tool call
+        data.grammar_lazy = true;
+        data.grammar_triggers.push_back({ COMMON_GRAMMAR_TRIGGER_TYPE_WORD, form.scope_start + form.tool_start });


What if tool_choice = required? As it stands, that seems like it would be a nightmare to implement in a generalized fashion.

This part of the code is copied from the original chat.cpp, as many other models use it. It’s a temporary but simple solution for now.

Maybe I will implement a full grammar for handling thinking blocks, markdown content, and tool calls later. : )

aldehir · 2025-11-08T11:42:33Z

common/chat-parser-xml-toolcall.cpp

+ * Parse content uses reasoning and XML-Style tool call
+ * TODO: Note that form.allow_toolcall_in_think is not tested yet. If anyone confirms it works, this comment can be removed.
+ */
+inline void parse_msg_with_xml_tool_calls(common_chat_msg_parser & builder, const struct xml_tool_call_format & form, const std::string & start_think = "<think>", const std::string & end_think = "</think>") {


Seems like a very large function for inline.

Nonetheless, I don't think this is necessary. There is already a reasoning parsing function and it should be preferred.

This function is only used once, and cannot be used outside of this file. Inlining it shouldn’t be a problem, I think.

@aldehir We already had a discussion about this above and I'm generally of the same opinion you are.

See: comment

Once the buggy try_parse_reasoning is fixed, I'm happy to delete this.

aldehir · 2025-11-08T11:43:29Z

common/chat-parser-xml-toolcall.cpp

+        }
+    };
+    // Escape string literal to regex that match the literal
+    constexpr auto escape_regex = [](const std::string &s) {


What's wrong with the existing regex_escape() function? Also seems way over engineered...

I reviewed it earlier and found it unreliable for my use case. It would be a nightmare to debug my parser using that. So I implemented a simpler and more predictable version for the parser.

What about it is unreliable? None of the literals passed into this seem sufficiently complex to require this level of detail.

This escape_regex is used for common_regex, which only supports single-character escaping. The current regex_escape is just a simple replacement for certain characters and doesn’t handle special characters such as newline, tab, or multi-byte UTF-8 characters. This caused a crash and cost me about an hour to debug.

I saw your comment. Do you have any ideas for replacing the functionality of tool_call_start_regex? I’d be very happy to remove the regex-related code if there’s a suitable alternative.

aldehir · 2025-11-08T11:47:33Z

common/json-partial.cpp

+        // handle unclosed top-level primitive
+        if (err_loc.position != 0 && !healing_marker.empty() && err_loc.stack.empty()) {
+            std::string str(it, temptative_end);
+            const auto & magic_seed = out.healing_marker.marker = healing_marker;
+            if (can_parse(str + "\"")) {
+                // Was inside an string
+                str += (out.healing_marker.json_dump_marker = magic_seed) + "\"";
+            } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\"")) {
+                // Was inside an string after an escape
+                str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\"";
+            } else {
+                // TODO: handle more unclosed top-level primitive if the stack was empty but we got an error (e.g. "tru", "\"", etc...)
+                // fprintf(stderr, "Closing: TODO\n");
+                return false;
+            }
+            out.json = json::parse(str);
+            it = temptative_end;
+            return true;
+        }


This needs a test case.

…ranch16 — unified XML tool-call parser + streaming reasoning + GLM4.5/4.6 + MiniMax-M2

hksdpc255 · 2025-11-08T15:01:19Z

I'm not a fan of the generalizations done here. I understand there is a desire to converge the parsing of XML-based tool calling, but I believe this approach is problematic. To sum it up, the interface does not feel ergonomic and certain parts are very hacky.

Regardless, I left some comments to help improve it (in my opinion).

Other notes:
1. Tool calls should not result in any content after them. The model will emit a stop token, but also the grammar should constrain it from producing anything else.

2. I don't believe code should be left "in case we need it." It can be added when needed. Many times, YAGNI.

@aldehir Thanks for taking the time to review this in detail. Seems that you're not so happy with my PR. Just to clarify a few points:

I actually tried adding an end-of-generation (EOG) or observation token to the grammar earlier, but it caused llama-server to crash, so I had to revert that approach.
The current implementation has already been tested and supports a wider range of cases. Replacing it with something that only covers a subset of these features would likely make it less stable in practice.
Most of the core logic is copied from existing implementations for other models. I’m still getting familiar with the API provided by llama.cpp and have been learning primarily by studying those implementations.
I’ve tested my code thoroughly. The buggy sections from llama.cpp have been reimplemented locally to ensure they don’t affect other parts of the system.

I’m open to improving the structure further, but at the moment, preserving reliability across models (GLM / MiniMax) seems more important.

I’ll do some additional code cleanup later.

aldehir · 2025-11-08T17:24:32Z

I actually tried adding an end-of-generation (EOG) or observation token to the grammar earlier, but it caused llama-server to crash, so I had to revert that approach.

Sorry, I should have clarified. The grammar is ok as is. Since there are no rules after the last tag in the grammar, it will not produce additional content once reached.

You mentioned in #16932 (comment) about parsing thinking blocks between tool calls. This not necessary because (1) no model does that in a single generation, as far as I know, and (2) the constrained decoding makes it impossible.

I appreciate the work you put in this PR. My comments are focused on simplifying to make the PR easier to swallow.

Ultimately, whether I approve or not does not matter. It is at the discretion of the maintainers.

hksdpc255 · 2025-11-08T17:34:35Z

At first, I mistakenly thought that Minimax-M2’s advertised “Interleaved Thinking” meant that tool calls, reasoning, and content were interleaved within a single generation. That’s why I implemented this rather complex logic. Later, I realized it simply means that reasoning is present before each tool call.

Honestly, I really dislike this kind of marketing where existing concepts are just renamed — it’s very confusing. But there’s not much I can do at this point. Since I’ve already implemented the complex logic, I feel that rolling it back would be even more difficult, so it’s probably better to just leave it as is.

ServeurpersoCom · 2025-11-08T17:39:47Z

At first, I mistakenly thought that Minimax-M2’s advertised “Interleaved Thinking” meant that tool calls, reasoning, and content were interleaved within a single generation. That’s why I implemented this rather complex logic. Later, I realized it simply means that reasoning is present before each tool call.

Honestly, I really dislike this kind of marketing where existing concepts are just renamed — it’s very confusing. But there’s not much I can do at this point. Since I’ve already implemented the complex logic, I feel that rolling it back would be even more difficult, so it’s probably better to just leave it as is.

Many models actually do generate reasoning content before each tool call.
Isn’t it rather that the model requires all the <think>...</think> content and XML tags to be sent back in the context for each subsequent request? From what I understand, that’s what the documentation suggests.

But that doesn’t call the work in this PR into question :) since later on we could, if desired, add a client-side option to capture the reasoning content and return it in whatever format the model expects :)
(and besides, it would just be basic concatenation: a simple checkbox and two fields, no real parsing to support all similar models. Personally, I agree with you, I really dislike this kind of vague marketing terminology, and it eats up a huge amount of context on top of that)

aldehir · 2025-11-08T17:50:22Z

Honestly, I really dislike this kind of marketing where existing concepts are just renamed — it’s very confusing. But there’s not much I can do at this point. Since I’ve already implemented the complex logic, I feel that rolling it back would be even more difficult, so it’s probably better to just leave it as is.

Agreed about the renaming. In your defense, I don't think anyone had a clear definition of "interleaved thinking." But, I do feel the PR is the time to address these misunderstandings. Nonetheless, I'll concede if this is the final version you want to present.

Many models actually do generate reasoning content before each tool call.

I don't believe there is any dispute over this. The misunderstanding is that a model may emit a reasoning, a tool call, then another reasoning in the same generation. I don't believe any models do that.

ServeurpersoCom · 2025-11-08T17:55:36Z

I don't believe there is any dispute over this. The misunderstanding is that a model may emit a reasoning, a tool call, then another reasoning in the same generation. I don't believe any models do that.

To me, this has nothing to do with tool calls. The documentation explicitly says that you just need to resend the full generation history, including the reasoning:

https://huggingface.co/MiniMaxAI/MiniMax-M2
IMPORTANT: MiniMax-M2 is an interleaved thinking model. Therefore, when using it, it is important to retain the thinking content from the assistant's turns within the historical messages. In the model's output content, we use the <think>...</think> format to wrap the assistant's thinking content. When using the model, you must ensure that the historical content is passed back in its original format. Do not remove the <think>...</think> part, otherwise, the model's performance will be negatively affected.

This is to comply with the SFT phase training. But this PR doesn't have to take that into account.

aldehir · 2025-11-08T17:57:52Z

The documentation explicitly says that you just need to resend the full generation history, including the reasoning:

Because the template strips the reasoning if it is not from a tool call after the last user message (tool output responses not included).
https://huggingface.co/MiniMaxAI/MiniMax-M2/blob/main/chat_template.jinja

        {%- if reasoning_content and loop.index0 > ns.last_user_index -%}
            {{- '<think>' ~ '\n' ~ reasoning_content ~ '\n' ~ '</think>' ~ '\n\n' }}
        {%- endif -%}

ServeurpersoCom · 2025-11-08T18:01:27Z

Because the template strips the reasoning if it is not from a tool call after the last user message (tool output responses not included).

Then it makes no sense at all. I’ll stop trying to understand this model’s behavior.
EDIT : more information here https://huggingface.co/MiniMaxAI/MiniMax-M2/discussions/38

hksdpc255 · 2025-11-08T18:02:44Z

I’ll work on further cleanups and improvements to make the implementation more robust and simpler. I may push the changes tonight.

ServeurpersoCom · 2025-11-08T18:07:57Z

The official explanation from the MiniMax team actually shows that the chat template is a later compromise: it removes reasoning blocks from previous turns to save context, which inevitably degrades the model’s performance compared to the original design described in the documentation.

Actually I get it now, and you were right, @aldehir: they made sure the inner reasoning loops run at full performance (so it’s up to the orchestrator to handle that case and resend the reasoning before each tool call), but they clean things up on the next user turn. A good compromise (and off-topic for this PR: I’ll handle it in my agentic loop to test it)

ggerganov · 2025-11-08T21:12:15Z

Ultimately, whether I approve or not does not matter. It is at the discretion of the maintainers.

@aldehir Note that we don't really have an active maintainer for this part of the codebase. I can merge this if it has your approval. Otherwise, it will likely remain a separate branch until someone steps in to support this.

pwilkin · 2025-11-08T21:31:28Z

@ggerganov I was planning to actually take over the chat parser part since it's something I have some experience with, but to do that will require a proper refactoring of the codebase, not adding to the already hacky approaches. That's why I wanted to just do a simple PR for basic MiniMax M2 tool-calling support for now, but @hksdpc255 put in a lot of work into this, so I wanted to give it a chance instead.

Added @aldehir because he seems to be the most oriented from all the people who touched the parser code recently ;)

CISC · 2025-11-08T21:45:18Z

Added @aldehir because he seems to be the most oriented from all the people who touched the parser code recently ;)

If one or both of you are willing to own this that would be great.

odellus · 2025-11-08T23:08:07Z

Seems models only did thinking tokens at the beginning of the response instead of before each tool call because that's how their chat templates were constructed. I've got glm-4.5-air thinking before every tool call which you can see here because I added --reasoning-format none.

Learn something new every day.

Haven't rigorously evaluated, but this seems like desirable behavior to me as an end user

Add streaming test for both GLM 4.5 and minimax-m2. Cleanup for preserved_tokens. Cleanup for grammar rule name. Enhance the parser's stability.

hksdpc255 · 2025-11-10T01:21:16Z

This version appears to be the most stable so far. It correctly handles outputs even when the model generates content character by character.

I’ll keep an eye on any new feedback from the maintainers and make further improvements if needed.

aaronnewsome · 2025-11-10T04:01:26Z

i'll give it a try too. i've been running the version i compiled 4-5 days ago, pretty much nonstop with opencode. only a handful of crashes and those were with claude code via claude code router. it's been very stable, very few tool call fails. i'll get the latest compiled and give it a try

hksdpc255 added 2 commits November 2, 2025 08:20

Add files via upload

e816ea8

fix unit test

5a2ac74

hksdpc255 requested a review from ggerganov as a code owner November 2, 2025 09:38

hksdpc255 mentioned this pull request Nov 2, 2025

common: Yet another add GLM-4.5/GLM-4.6 tool calling support #15904

Closed

github-actions bot added the testing Everything test related label Nov 2, 2025

fix crashes for --reasoning-format=none

22fc731

hksdpc255 mentioned this pull request Nov 3, 2025

server: add minimax-m2 reasoning format override for MiniMax-M2 compatibility #16933

Draft

ochafik mentioned this pull request Nov 3, 2025

Support GLM 4.6 template ochafik/minja#5

Merged

Patch buggy official MiniMax-M2 chat template

af5216e

hksdpc255 mentioned this pull request Nov 3, 2025

Model: Minimax M2 #16831

Merged

hksdpc255 mentioned this pull request Nov 3, 2025

Model: Minimax M2 - chat support #16946

Closed

ServeurpersoCom added a commit to ServeurpersoCom/llama.cpp that referenced this pull request Nov 3, 2025

Merge PR ggml-org#16932 (xml_toolcall) into testing-branch16

861a094

ServeurpersoCom added a commit to ServeurpersoCom/llama.cpp that referenced this pull request Nov 8, 2025

Merge PR ggml-org#16932 (xml_toolcall, commit e5529dd) into testing-b…

dc74eb6

…ranch16 — unified XML tool-call parser + streaming reasoning + GLM4.5/4.6 + MiniMax-M2

aldehir reviewed Nov 8, 2025

View reviewed changes

ServeurpersoCom added a commit to ServeurpersoCom/llama.cpp that referenced this pull request Nov 8, 2025

Merge PR ggml-org#16932 (xml_toolcall, commit e5529dd) into testing-b…

150199c

…ranch16 — unified XML tool-call parser + streaming reasoning + GLM4.5/4.6 + MiniMax-M2

hksdpc255 added 2 commits November 9, 2025 11:33

add streaming tests & enhancement & cleanups

29ef8f0

Add streaming test for both GLM 4.5 and minimax-m2. Cleanup for preserved_tokens. Cleanup for grammar rule name. Enhance the parser's stability.

Merge branch 'ggml-org:master' into xml_toolcall

82345e9

common: Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS) #16932

Are you sure you want to change the base?

common: Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS) #16932

Conversation

hksdpc255 commented Nov 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Grammar-constrained tool-call outputs

Streaming support for tool-call parsing

Automatic chat-template fixing

In-context reasoning

Additional Notes

Uh oh!

MikeLP commented Nov 2, 2025

Uh oh!

hksdpc255 commented Nov 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ochafik commented Nov 2, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

ochafik commented Nov 3, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

emuchogu commented Nov 3, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

ServeurpersoCom commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Without this PR :

With this PR :

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

ServeurpersoCom commented Nov 3, 2025

Uh oh!

ServeurpersoCom commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hksdpc255 commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ServeurpersoCom commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ServeurpersoCom commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sbrnaderi commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ServeurpersoCom commented Nov 7, 2025

Uh oh!

thomasjfox commented Nov 7, 2025

Uh oh!

hksdpc255 commented Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hksdpc255 commented Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hksdpc255 commented Nov 8, 2025

Uh oh!

ServeurpersoCom commented Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thomasjfox commented Nov 8, 2025

Uh oh!

aldehir commented Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hksdpc255 commented Nov 2, 2025 •

edited

Loading

hksdpc255 commented Nov 2, 2025 •

edited

Loading

hksdpc255 commented Nov 3, 2025 •

edited

Loading

ServeurpersoCom commented Nov 3, 2025 •

edited

Loading

ServeurpersoCom commented Nov 3, 2025 •

edited

Loading

hksdpc255 commented Nov 3, 2025 •

edited

Loading

ServeurpersoCom commented Nov 3, 2025 •

edited

Loading

ServeurpersoCom commented Nov 3, 2025 •

edited

Loading

sbrnaderi commented Nov 7, 2025 •

edited

Loading

hksdpc255 commented Nov 8, 2025 •

edited

Loading

hksdpc255 commented Nov 8, 2025 •

edited

Loading

ServeurpersoCom commented Nov 8, 2025 •

edited

Loading

aldehir commented Nov 8, 2025 •

edited

Loading

aldehir left a comment •

edited

Loading

hksdpc255 Nov 8, 2025 •

edited

Loading