
Eval bug: Autoparser misplaces non-thinking content with NVIDIA-Nemotron-Nano-9B-v2 #20325

@EZForever

Description


Name and Version

Compiled from the current master branch (c96f608), since the model won't load without #20270.

C:\llama.cpp-master\build\bin\Debug>llama-cli.exe --version
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz)
load_backend: failed to find ggml_backend_init in C:\llama.cpp-master\build\bin\Debug\ggml-cpu.dll
version: 0 (unknown)
built with MSVC 19.44.35223.0 for x64

Operating systems

Windows, Linux

GGML backends

CPU, CUDA, Vulkan

Hardware

I don't think this issue is related to hardware; I've tested on three different machines with different backends, and they all have the same issue.

Models

NVIDIA-Nemotron-Nano-9B-v2 IQ2_M from bartowski: (https://huggingface.co/bartowski/nvidia_NVIDIA-Nemotron-Nano-9B-v2-GGUF/blob/main/nvidia_NVIDIA-Nemotron-Nano-9B-v2-IQ2_M.gguf). I tried both the model's built-in chat template and the one in the llama.cpp repo; both have exactly the same issue.

I originally discovered this issue while trying to implement /no_think with a custom chat template for Qwen3.5-35B-A3B. The modified chat template is here; it worked correctly before the autoparser PR.

Problem description & steps to reproduce

NVIDIA-Nemotron-Nano-9B-v2 supports both thinking and non-thinking modes in a single model, and supports switching between them in-conversation via a chat template trick, as documented in the model card. This worked correctly before; since the new autoparser PR, however, after switching into non-thinking mode with /no_think, the model correctly skips thinking, but its output is no longer treated as normal content as it should be: it is treated as thinking content instead.
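For context, the mode-switching trick can be sketched roughly as below. This is an illustrative Python rendering of the template logic, not the actual Nemotron Jinja template; the role markers are generic placeholders rather than Nemotron's real special tokens.

```python
# Toy sketch of in-conversation thinking-mode switching: the template tracks
# /think and /no_think markers in user messages and decides whether the
# assistant turn opens inside a reasoning block.

def render_prompt(messages: list[dict]) -> str:
    thinking = True  # reasoning enabled by default
    parts = []
    for msg in messages:
        if msg["role"] == "user":
            if "/no_think" in msg["content"]:
                thinking = False
            elif "/think" in msg["content"]:
                thinking = True
        parts.append(f"[{msg['role']}]\n{msg['content']}\n")
    parts.append("[assistant]\n")
    # With thinking off, the template pre-closes the block, so the model's
    # first generated tokens are ordinary content, not reasoning.
    parts.append("<think>\n" if thinking else "<think></think>\n")
    return "".join(parts)
```

The key point is that in non-thinking mode the reasoning block is already closed in the prompt, so a parser must not assume the reply begins with reasoning.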

To reproduce this issue, run llama-cli or llama-server with this model and include /no_think in your message: the former shows a [Start thinking] line, and the latter puts the entire model output into a reasoning block.

I'm not at all sure how this autoparser works (I tried playing with llama-debug-template-parser and llama-template-analysis but got no meaningful insights), but my guess is that the new parser simply assumes the model begins generation with thinking content whenever it believes the chat template supports thinking. That holds for most models, but it irrecoverably breaks any attempt to implement in-conversation thinking-mode switching.
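If that guess is right, the difference between the old and new behavior would look roughly like this. This is a toy Python sketch, not llama.cpp's actual parser code, and the tag names assume the usual `<think>…</think>` convention:

```python
# Two toy reasoning parsers, illustrating the suspected regression.

REASONING_START = "<think>"
REASONING_END = "</think>"

def parse_assume_thinking(output: str) -> dict:
    """Unconditionally treats everything before </think> as reasoning
    (the suspected autoparser behavior for thinking-capable templates)."""
    if REASONING_END in output:
        reasoning, _, content = output.partition(REASONING_END)
        return {"reasoning": reasoning.strip(), "content": content.strip()}
    # No closing tag at all: the whole reply lands in the reasoning field.
    return {"reasoning": output.strip(), "content": ""}

def parse_check_opening_tag(output: str) -> dict:
    """Only enters reasoning mode when the reply actually opens with <think>
    (matching the pre-autoparser behavior described in this report)."""
    if output.lstrip().startswith(REASONING_START):
        body = output.lstrip()[len(REASONING_START):]
        reasoning, _, content = body.partition(REASONING_END)
        return {"reasoning": reasoning.strip(), "content": content.strip()}
    return {"reasoning": "", "content": output.strip()}
```

On a /no_think reply such as "Hello! How can I assist you today?", the first parser files everything under reasoning (matching the [Start thinking] symptom above), while the second returns it as normal content.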

First Bad Commit

Build b8227 (the autoparser PR) has the issue, while the previous build, b8226, does not. Thus I'm fairly confident the issue was introduced by the autoparser.

Relevant log output

Logs

Commit c96f608 (the model "starts thinking" even though it isn't):

C:\llama.cpp-master\build\bin\Debug>llama-cli.exe -m "C:\nvidia_NVIDIA-Nemotron-Nano-9B-v2-IQ2_M.gguf" --no-repack --ctx-size 4096 -fit off --chat-template-file "C:\NVIDIA-Nemotron-Nano-v2.jinja"
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz)
load_backend: failed to find ggml_backend_init in C:\llama.cpp-master\build\bin\Debug\ggml-cpu.dll

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b0-unknown
model      : nvidia_NVIDIA-Nemotron-Nano-9B-v2-IQ2_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> /no_think hello

[Start thinking]
Hello! How can I assist you today?


[ Prompt: 0.3 t/s | Generation: 0.3 t/s ]

>

Build b8226 (output content is normal):

C:\llama-b8226-bin-win-cpu-x64>llama-cli.exe -m "C:\nvidia_NVIDIA-Nemotron-Nano-9B-v2-IQ2_M.gguf" --no-repack --ctx-size 4096 -fit off --chat-template-file "C:\NVIDIA-Nemotron-Nano-v2.jinja"
load_backend: loaded RPC backend from C:\llama-b8226-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\llama-b8226-bin-win-cpu-x64\ggml-cpu-haswell.dll

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8226-34df42f7b
model      : nvidia_NVIDIA-Nemotron-Nano-9B-v2-IQ2_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> /no_think hello

Hello! How can I assist you today? 😊


[ Prompt: 5.0 t/s | Generation: 4.1 t/s ]

>

Metadata


Labels

bug: Something isn't working
chat parser: Issues related to the chat parser and chat templates
regression: A regression introduced in a new build (something that was previously working correctly)
