
convert : set "add bos" == True for Gemma 4#21500

Merged
ggerganov merged 2 commits into master from gg/convert-gemma-4-add-bos-true on Apr 6, 2026

Conversation

@ggerganov
Member

@ggerganov ggerganov commented Apr 6, 2026

Overview

cont #21309

Without a BOS token, the base Gemma 4 models have significantly degraded quality. There was a discussion about the correct value of this flag in #21309 (comment). But from my experiments, it should be set to True. Otherwise, the non-templated endpoints (/completions) and tools like llama-completion and llama-perplexity do not work correctly.
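As a toy illustration of what this flag changes (not llama.cpp's actual tokenizer code), "add BOS" simply means the tokenizer prepends the BOS token id to every encoded prompt; the vocab and ids below are made up for the sketch:

```python
# Toy sketch of the "add BOS" behavior (illustrative ids and vocab,
# not llama.cpp's real tokenizer).
BOS_ID = 2  # illustrative; the real id comes from the model's GGUF metadata
VOCAB = {"hello": 10, "world": 11}

def tokenize(text: str, add_bos: bool = True) -> list[int]:
    ids = [VOCAB[w] for w in text.split()]
    # With add_bos enabled, the BOS id is prepended to every encoded prompt.
    return [BOS_ID] + ids if add_bos else ids

print(tokenize("hello world"))                 # [2, 10, 11]
print(tokenize("hello world", add_bos=False))  # [10, 11]
```

Non-templated endpoints such as /completions rely on this flag, since there is no chat template to inject the BOS for them.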


@ggerganov ggerganov requested a review from CISC as a code owner April 6, 2026 07:25
@ggerganov ggerganov requested review from ngxson and removed request for CISC April 6, 2026 07:25
@github-actions github-actions bot added the python python script changes label Apr 6, 2026
@ngxson
Contributor

ngxson commented Apr 6, 2026

Hmm, OK, that wasn't expected, because the HF tokenizer doesn't add the BOS automatically.

Just wondering: should we also enforce this in the C++ code, so that users don't need to regenerate the GGUF? For reference, we use a dedicated tokenizer model for Gemma 4, so it should be OK to introduce such a fix, I guess.
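For context, the convert-side fix boils down to writing the add-BOS flag into the model's GGUF metadata. A minimal sketch follows; the key name `tokenizer.ggml.add_bos_token` is the one llama.cpp reads, but the dict-based helper is hypothetical, not the actual gguf-py API:

```python
# Hypothetical sketch: force add_bos_token = True in the GGUF KV metadata
# for Gemma 4 during conversion. The key name matches what llama.cpp reads;
# the dict-based writer here is illustrative only.
def set_vocab_metadata(kv: dict, arch: str) -> dict:
    if arch == "gemma4":
        # Base Gemma 4 models degrade badly without a BOS token (see above),
        # so the converter pins the flag to True regardless of the HF config.
        kv["tokenizer.ggml.add_bos_token"] = True
    return kv

print(set_vocab_metadata({}, "gemma4"))  # {'tokenizer.ggml.add_bos_token': True}
```

Enforcing the same default in the C++ loader (as suggested above) would cover GGUFs converted before this change, so users don't have to regenerate them.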

@ggerganov
Member Author

> Just wondering, should we also enforce this on C++ code? So that users don't need to regenerate gguf?

Added in 4e19abc

@ggerganov
Member Author

@taronaeo There is some issue with the runner: https://github.com/ggml-org/llama.cpp/actions/runs/24026325395/job/70065488813?pr=21500

@taronaeo
Member

taronaeo commented Apr 6, 2026

Looks like a loose PCIe slot. I've re-seated the GPU and it looks okay now. Let me know if it gives trouble again :)

$ nvidia-smi

Mon Apr  6 17:43:57 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060        Off |   00000000:09:00.0  On |                  N/A |
|  0%   35C    P8             10W /  160W |      70MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2469      G   /usr/bin/gnome-shell                     57MiB |
|    0   N/A  N/A            2830      G   /usr/bin/Xwayland                         2MiB |
+-----------------------------------------------------------------------------------------+

@ggerganov ggerganov requested a review from CISC April 6, 2026 10:26
@ggerganov ggerganov merged commit 400ac8e into master Apr 6, 2026
55 of 57 checks passed
@henk717

henk717 commented Apr 6, 2026

@ggerganov It's a bit more complex than this on the Hugging Face side, so keep that in mind.
For E4B on Hugging Face, add-BOS is set to true, but it was previously disabled in the conversion.
For the larger models, the BOS was added by the Jinja template, and BOS addition was set to off.

There is no unified ecosystem with this model; there are differences between the sizes. Make sure to test this when evaluating this PR.

That said, it won't hurt to have it, since this helps text completions a lot; we do that in KoboldCpp as well. But we make sure it can't be added twice. If you have similar checks in place, you should be OK.

@ngxson
Contributor

ngxson commented Apr 6, 2026

llama.cpp does remove BOS if it's added twice
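A toy sketch of the kind of double-BOS guard described above (illustrative, not llama.cpp's actual implementation):

```python
BOS_ID = 2  # illustrative id

def apply_bos(ids: list[int], add_bos: bool = True) -> list[int]:
    # Prepend BOS only if the sequence doesn't already start with it,
    # so prompts whose template already injected <bos> aren't doubled.
    if add_bos and (not ids or ids[0] != BOS_ID):
        return [BOS_ID] + ids
    return ids

print(apply_bos([10, 11]))          # [2, 10, 11]
print(apply_bos([BOS_ID, 10, 11]))  # [2, 10, 11] (unchanged, no double BOS)
```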

@ggerganov ggerganov deleted the gg/convert-gemma-4-add-bos-true branch April 6, 2026 13:47
@Ki-Kolan

Ki-Kolan commented Apr 6, 2026

This totally broke perplexity values for me.
This is the command I used:

./llama-perplexity -m /mnt/PCIE-NVME/llama.cpp.models/models/LLM/google_gemma-4-26B-A4B-it-IQ4_NL.gguf  -f /home/luis/Dokumente/Llama.cpp/wikitext-2-raw-v1/wikitext-2-raw/wiki.test.raw -c 512 -ngl 99

With the newest build I get:

Final estimate: PPL = 25603.0768 +/- 511.42019

and with the commit (f51fd36) I get this:

Final estimate: PPL = 202.6017 +/- 2.96091

I don't know if the insanely high perplexity is expected, but it seems wrong.
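For context on why the gap above is so dramatic: llama-perplexity reports the exponential of the mean per-token negative log-likelihood, so even a modest rise in average NLL explodes the reported PPL. A sketch (the NLL values are made up to match the two magnitudes reported above):

```python
import math

def perplexity(nlls: list[float]) -> float:
    # PPL = exp(mean negative log-likelihood over the evaluated tokens)
    return math.exp(sum(nlls) / len(nlls))

print(round(perplexity([math.log(200)] * 4), 1))  # 200.0
# A shift of about 4.85 nats in mean NLL is the whole gap
# between PPL ~200 and PPL ~25600:
print(round(perplexity([math.log(200) + math.log(128)] * 4), 1))  # 25600.0
```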

@Ph0rk0z

Ph0rk0z commented Apr 7, 2026

Run it in the server with --verbose and see if you get a double BOS added.

