convert : set add_bos == True for Gemma 4 #21500
Conversation
Hmm, OK, that wasn't expected because the HF tokenizer doesn't add the BOS automatically. Just wondering, should we also enforce this in the C++ code, so that users don't need to regenerate the GGUF? For reference, we are using a dedicated tokenizer model for Gemma 4, so it should be OK to introduce such a fix, I guess.
Added in 4e19abc
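For reference, the convert-side change boils down to writing the BOS flag into the GGUF metadata for this architecture. A hedged Python sketch, assuming the standard GGUF key name `tokenizer.ggml.add_bos_token`; the helper function and the `"gemma4"` architecture string are illustrative, not the actual convert script API:

```python
# Hedged sketch: force add_bos on for Gemma 4 during conversion.
# The key name mirrors the GGUF convention "tokenizer.ggml.add_bos_token";
# write_bos_flag and the "gemma4" arch string are illustrative only.
def write_bos_flag(metadata: dict, arch: str) -> dict:
    if arch == "gemma4":  # hypothetical architecture identifier
        metadata["tokenizer.ggml.add_bos_token"] = True
    return metadata

md = write_bos_flag({}, "gemma4")
```

Models converted before this change would carry the old flag value, which is why the question above about also enforcing it in the C++ code matters.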
@taronaeo There is some issue with the runner: https://github.com/ggml-org/llama.cpp/actions/runs/24026325395/job/70065488813?pr=21500
Looks like a loose PCIe slot. I've re-seated the GPU and it looks okay now. Let me know if it gives trouble again :)

```
$ nvidia-smi
Mon Apr  6 17:43:57 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060        Off |   00000000:09:00.0  On |                  N/A |
|  0%   35C    P8             10W /  160W |      70MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2469      G   /usr/bin/gnome-shell                     57MiB |
|    0   N/A  N/A            2830      G   /usr/bin/Xwayland                         2MiB |
+-----------------------------------------------------------------------------------------+
```
@ggerganov It's a bit more complex than this on the Hugging Face side, so keep that in mind. There is no unified ecosystem for this model; there are differences between the sizes. Make sure to test this when evaluating this PR. That said, it won't hurt to have it, since this helps text completions a lot; we do the same in KoboldCpp. But we make sure the BOS can't be added twice. If you have similar checks in place, you should be OK.
llama.cpp does remove the BOS if it would be added twice
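The dedup behavior described above can be sketched at the token level. This is a hedged Python illustration of the idea, not llama.cpp's actual C++ implementation (which differs in detail):

```python
def ensure_single_bos(tokens: list[int], bos_id: int, add_bos: bool) -> list[int]:
    """Prepend a BOS token if requested, but never leave two in a row.

    Sketch of the safeguard discussed above: even if a chat template or
    the caller already supplied a BOS, the sequence starts with exactly one.
    """
    if add_bos and (not tokens or tokens[0] != bos_id):
        return [bos_id] + tokens
    # Strip accidental duplicates at the start of the sequence.
    while len(tokens) >= 2 and tokens[0] == bos_id and tokens[1] == bos_id:
        tokens = tokens[1:]
    return tokens
```

With a check like this in place, setting `add_bos` to True in the GGUF metadata is safe even for clients that already prepend a BOS themselves.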
This totally broke perplexity values for me.

```
./llama-perplexity -m /mnt/PCIE-NVME/llama.cpp.models/models/LLM/google_gemma-4-26B-A4B-it-IQ4_NL.gguf -f /home/luis/Dokumente/Llama.cpp/wikitext-2-raw-v1/wikitext-2-raw/wiki.test.raw -c 512 -ngl 99
```

With the newest build I get:

```
Final estimate: PPL = 25603.0768 +/- 511.42019
```

and with the commit (f51fd36) I get this:

```
Final estimate: PPL = 202.6017 +/- 2.96091
```

I don't know if the insanely high perplexity is expected, but it seems wrong.
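For intuition on why such numbers can differ so wildly: perplexity is the exponentiated mean negative log-likelihood over the evaluated tokens, so a uniform drop in per-token probability scales PPL multiplicatively. A toy sketch with made-up probabilities, purely for illustration:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    # PPL = exp(-(1/N) * sum(log p_i))
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Well-conditioned model (illustrative per-token probability 0.2):
print(round(perplexity([0.2] * 100), 1))    # 5.0
# Badly conditioned (e.g. missing BOS), per-token probability 0.005:
print(round(perplexity([0.005] * 100), 1))  # 200.0
```

A constant per-token probability p gives PPL = 1/p, so any systematic degradation (such as a missing or doubled BOS) blows up quickly.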
Run it in the server with --verbose and see if you get a double BOS added.
Overview
cont #21309
Without a BOS token, the base Gemma 4 models have significantly degraded quality. There was a discussion about the correct value of this flag in #21309 (comment). But from my experiments, it should be set to True. Otherwise, the non-templated endpoints (/completions) and tools like llama-completion and llama-perplexity do not work correctly.
Additional information
Requirements