Eval bug: Autoparser misplaces non-thinking content with NVIDIA-Nemotron-Nano-9B-v2 #20325
Description
Name and Version
Compiled from current master branch (c96f608), since the model won't load without #20270.
C:\llama.cpp-master\build\bin\Debug>llama-cli.exe --version
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz)
load_backend: failed to find ggml_backend_init in C:\llama.cpp-master\build\bin\Debug\ggml-cpu.dll
version: 0 (unknown)
built with MSVC 19.44.35223.0 for x64
Operating systems
Windows, Linux
GGML backends
CPU, CUDA, Vulkan
Hardware
I don't think this issue is related to hardware; I've tested on three different machines with different backends, and they all have the same issue.
Models
NVIDIA-Nemotron-Nano-9B-v2 IQ2_M from bartowski: (https://huggingface.co/bartowski/nvidia_NVIDIA-Nemotron-Nano-9B-v2-GGUF/blob/main/nvidia_NVIDIA-Nemotron-Nano-9B-v2-IQ2_M.gguf). Tried both the model's built-in chat template and the one in the llama.cpp repo; both have exactly the same issue.
I originally discovered this issue while trying to implement /no_think with a custom chat template for Qwen3.5-35B-A3B. The modified chat template is here; it worked correctly before the autoparser PR.
Problem description & steps to reproduce
NVIDIA-Nemotron-Nano-9B-v2 supports both thinking and non-thinking mode in a single model, and supports switching between them mid-conversation with a chat template trick, as documented in the model card. This worked correctly before, but since the new autoparser PR, after switching into non-thinking mode with /no_think, the model correctly skips thinking, yet its output is no longer treated as normal content as it should be; instead, it is treated as thinking content.
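To make the template trick concrete, here is a minimal, hypothetical sketch of how a chat template can toggle reasoning per message. Every detail (the `[USER]`/`[ASSISTANT]` markers, the template shape) is an assumption for illustration only, not the actual Nemotron template:

```python
# Hypothetical sketch of per-message thinking-mode switching via a
# chat template. Marker strings here are made up for illustration.

def render_generation_prompt(user_message: str) -> str:
    """Build the text appended after the user turn, before generation."""
    prompt = f"[USER]\n{user_message}\n[ASSISTANT]\n"
    if "/no_think" in user_message:
        # Pre-fill an empty think block so the model skips reasoning
        # and its first generated token is already normal content.
        prompt += "<think></think>"
    else:
        # Open a think block so the model begins with reasoning.
        prompt += "<think>\n"
    return prompt

print(render_generation_prompt("/no_think hello"))
print(render_generation_prompt("hello"))
```

The key point: after /no_think, the model's very first generated token is plain content, so any parser that assumes generation starts inside a think block will misclassify it.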
To reproduce this issue, just run llama-cli or llama-server with this model, and include /no_think in your message; the former shows a [Start thinking] line, and the latter puts the entirety of model output into a reasoning block.
I'm not at all sure how this autoparser works (I tried playing with llama-debug-template-parser and llama-template-analysis but got no meaningful insights), but my guess is that the new parser simply assumes the model begins generation with thinking content whenever it believes the chat template supports thinking. That assumption holds for most models but irrecoverably breaks any attempt to implement in-conversation thinking-mode switching.
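The hypothesized behavior can be sketched in a few lines. This is an assumption about the failure mode, not llama.cpp's actual implementation: a parser that unconditionally treats everything up to the closing marker as reasoning would reproduce exactly the symptom reported above.

```python
# Sketch of the SUSPECTED parser behavior (an assumption, not the
# real autoparser): if the template is believed to support thinking,
# treat all output up to the closing marker as reasoning content.

def split_output_naive(text: str, template_supports_thinking: bool):
    """Split raw model output into (reasoning, content)."""
    if template_supports_thinking:
        # Assumes generation always starts inside a think block.
        reasoning, sep, content = text.partition("</think>")
        if sep:
            return reasoning, content
        # No close marker ever appears in a /no_think reply, so the
        # entire answer is misclassified as reasoning.
        return text, ""
    return "", text

# Thinking reply: split correctly.
print(split_output_naive("step 1...</think>Hello!", True))
# Non-thinking reply (after /no_think): whole answer lands in the
# reasoning field, matching the bug.
print(split_output_naive("Hello! How can I assist you today?", True))
```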
First Bad Commit
Build b8227 (the autoparser PR) has the issue, while the previous build, b8226, does not. Thus I'm fairly confident that the issue is related to the autoparser.
Relevant log output
Logs
Commit c96f608 (the model "starts thinking" even though it isn't):
C:\llama.cpp-master\build\bin\Debug>llama-cli.exe -m "C:\nvidia_NVIDIA-Nemotron-Nano-9B-v2-IQ2_M.gguf" --no-repack --ctx-size 4096 -fit off --chat-template-file "C:\NVIDIA-Nemotron-Nano-v2.jinja"
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz)
load_backend: failed to find ggml_backend_init in C:\llama.cpp-master\build\bin\Debug\ggml-cpu.dll
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b0-unknown
model : nvidia_NVIDIA-Nemotron-Nano-9B-v2-IQ2_M.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
> /no_think hello
[Start thinking]
Hello! How can I assist you today?
[ Prompt: 0.3 t/s | Generation: 0.3 t/s ]
>
Build b8226 (output content is normal):
C:\llama-b8226-bin-win-cpu-x64>llama-cli.exe -m "C:\nvidia_NVIDIA-Nemotron-Nano-9B-v2-IQ2_M.gguf" --no-repack --ctx-size 4096 -fit off --chat-template-file "C:\NVIDIA-Nemotron-Nano-v2.jinja"
load_backend: loaded RPC backend from C:\llama-b8226-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\llama-b8226-bin-win-cpu-x64\ggml-cpu-haswell.dll
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b8226-34df42f7b
model : nvidia_NVIDIA-Nemotron-Nano-9B-v2-IQ2_M.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
> /no_think hello
Hello! How can I assist you today? 😊
[ Prompt: 5.0 t/s | Generation: 4.1 t/s ]
>