convert : set add_bos == True for Gemma 4 #21500
Conversation
Hmm, OK, that wasn't expected because the HF tokenizer doesn't add the BOS automatically. Just wondering, should we also enforce this in the C++ code, so that users don't need to regenerate the GGUF? For reference, we are using a dedicated tokenizer model for Gemma 4, so it should be OK to introduce such a fix, I guess.
Added in 4e19abc
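For reference, the convert-side change boils down to writing the BOS flag into the GGUF metadata for this architecture. A hedged Python sketch, assuming the standard GGUF key name `tokenizer.ggml.add_bos_token`; the helper function and the `"gemma4"` architecture string are illustrative, not the actual convert script API:

```python
# Hedged sketch: force add_bos on for Gemma 4 during conversion.
# The key name mirrors the GGUF convention "tokenizer.ggml.add_bos_token";
# write_bos_flag and the "gemma4" arch string are illustrative only.
def write_bos_flag(metadata: dict, arch: str) -> dict:
    if arch == "gemma4":  # hypothetical architecture identifier
        metadata["tokenizer.ggml.add_bos_token"] = True
    return metadata

md = write_bos_flag({}, "gemma4")
```

Models converted before this change would carry the old flag value, which is why the question above about also enforcing it in the C++ code matters.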
@taronaeo There is some issue with the runner: https://github.com/ggml-org/llama.cpp/actions/runs/24026325395/job/70065488813?pr=21500
Looks like a loose PCIe slot. I've re-seated the GPU and it looks okay now. Let me know if it gives trouble again :)

```
$ nvidia-smi
Mon Apr  6 17:43:57 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060        Off |   00000000:09:00.0  On |                  N/A |
|  0%   35C    P8             10W /  160W |      70MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2469      G   /usr/bin/gnome-shell                     57MiB |
|    0   N/A  N/A            2830      G   /usr/bin/Xwayland                         2MiB |
+-----------------------------------------------------------------------------------------+
```
@ggerganov It's a bit more complex than this on the Hugging Face side, so keep that in mind. There is no unified ecosystem for this model; there are differences between the sizes. Make sure to test this when evaluating this PR. That said, it won't hurt to have it, since this helps text completions a lot; we do the same in KoboldCpp. But we make sure the BOS can't be added twice. If you have similar checks in place, you should be OK.
llama.cpp does remove the BOS if it would be added twice
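The dedup behavior described above can be sketched at the token level. This is a hedged Python illustration of the idea, not llama.cpp's actual C++ implementation (which differs in detail):

```python
def ensure_single_bos(tokens: list[int], bos_id: int, add_bos: bool) -> list[int]:
    """Prepend a BOS token if requested, but never leave two in a row.

    Sketch of the safeguard discussed above: even if a chat template or
    the caller already supplied a BOS, the sequence starts with exactly one.
    """
    if add_bos and (not tokens or tokens[0] != bos_id):
        return [bos_id] + tokens
    # Strip accidental duplicates at the start of the sequence.
    while len(tokens) >= 2 and tokens[0] == bos_id and tokens[1] == bos_id:
        tokens = tokens[1:]
    return tokens
```

With a check like this in place, setting `add_bos` to True in the GGUF metadata is safe even for clients that already prepend a BOS themselves.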
This totally broke perplexity values for me.

```
./llama-perplexity -m /mnt/PCIE-NVME/llama.cpp.models/models/LLM/google_gemma-4-26B-A4B-it-IQ4_NL.gguf -f /home/luis/Dokumente/Llama.cpp/wikitext-2-raw-v1/wikitext-2-raw/wiki.test.raw -c 512 -ngl 99
```

With the newest build I get:

```
Final estimate: PPL = 25603.0768 +/- 511.42019
```

and with the commit (f51fd36) I get this:

```
Final estimate: PPL = 202.6017 +/- 2.96091
```

I don't know if the insanely high perplexity is expected, but it seems wrong.
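For intuition on why such numbers can differ so wildly: perplexity is the exponentiated mean negative log-likelihood over the evaluated tokens, so a uniform drop in per-token probability scales PPL multiplicatively. A toy sketch with made-up probabilities, purely for illustration:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    # PPL = exp(-(1/N) * sum(log p_i))
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Well-conditioned model (illustrative per-token probability 0.2):
print(round(perplexity([0.2] * 100), 1))    # 5.0
# Badly conditioned (e.g. missing BOS), per-token probability 0.005:
print(round(perplexity([0.005] * 100), 1))  # 200.0
```

A constant per-token probability p gives PPL = 1/p, so any systematic degradation (such as a missing or doubled BOS) blows up quickly.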
Run it in the server with --verbose and see if you get a double BOS added.
Overview
cont #21309
Without a BOS token, the base Gemma 4 models have significantly degraded quality. There was a discussion about the correct value of this flag in #21309 (comment). But from my experiments, it should be set to True. Otherwise, the non-templated endpoints (/completions) and tools like llama-completion and llama-perplexity do not work correctly.
Additional information
Requirements