model: support gemma 4 (vision + moe, no audio) #21309
ngxson merged 2 commits into ggml-org:master from
Conversation
Running CI on my fork for faster results: ngxson#95
lgtm
    self.gguf_writer.add_add_space_prefix(False)
    self.gguf_writer.add_add_bos_token(False)  # already added via the chat template
If already added in the chat template that means it should be True, or is it not a single token?
the chat template already has {{ bos_token }}, so add_bos_token is not necessary (though even if it is set to True here, llama.cpp has logic to avoid a double BOS)
I'm explicitly setting it to False here for clarity, though
I think you misunderstand the purpose of this field: it explicitly signals that a model requires BOS, i.e. for correct behavior when not using a chat template (completion or FIM).
According to the HF implementation, BOS is defined but not added automatically when I try tokenizer("my prompt", return_tensors="pt"), so I hope this is correct
(For context, this is to match the behavior of HF transformers when I compare the activations via llama-eval-callback)
OK, so False is probably the correct value then.
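The double-BOS guard mentioned above can be sketched roughly like this (a simplified illustration, not llama.cpp's actual code; the function name and token IDs are made up for the example):

```python
def tokenize_with_bos(token_ids, bos_id, add_bos):
    """Prepend BOS only when it is requested and not already present.

    If the prompt was rendered from a chat template that already contains
    {{ bos_token }}, the first token is BOS and we must not add a second one.
    """
    if add_bos and (not token_ids or token_ids[0] != bos_id):
        return [bos_id] + token_ids
    return token_ids


# Hypothetical BOS id for illustration.
BOS = 2
print(tokenize_with_bos([2, 10, 11], BOS, add_bos=True))   # template already added BOS
print(tokenize_with_bos([10, 11], BOS, add_bos=True))      # BOS prepended once
print(tokenize_with_bos([10, 11], BOS, add_bos=False))     # left untouched
```

Either way the model sees exactly one BOS when it needs one, which is why setting add_bos_token=True here would still be safe, just redundant.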
thank you guys
(#4790) The latest ggml-org/llama.cpp release (b8635) does not include Gemma 4 support (ggml-org/llama.cpp#21309 merged after the release was cut). This causes `llama-server` to fail with "unknown model architecture: gemma4" when loading Gemma 4 GGUFs.

Temporarily default _DEFAULT_LLAMA_TAG to "master" so all new installs build from the llama.cpp master branch, which includes Gemma 4 support. Once a new upstream release is cut with Gemma 4, this can be reverted back to "latest".

Changes:
- setup.sh: add _DEFAULT_LLAMA_TAG="master" maintainer default
- setup.ps1: add $DefaultLlamaTag="master" maintainer default
- install_llama_prebuilt.py: change DEFAULT_LLAMA_TAG fallback to "master"

Users can still override via the UNSLOTH_LLAMA_TAG env var.
ggml-org/llama.cpp b8637 includes Gemma 4 support (ggml-org/llama.cpp#21309). Revert the temporary "master" default back to a pinned release tag. This eliminates the HTTP 422 errors from the prebuilt resolver (which could not find a release matching "master"), avoids unnecessary source builds, and restores prebuilt binary downloads on all platforms.

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
no audio :(
Audio will be a follow-up PR; I am working on that
Merges ggml-org/llama.cpp upstream (d23355a..7992aa7) including:
- Gemma 4 model support (PR ggml-org#21309)
- KV cache rotation for better quantization (ggml-org#21038)
- Auto GPU memory fitting (llama_params_fit)
- Many new model architectures (Qwen3.5, Kimi K2, LFM2, etc.)

C++14/CUDA 7.5 compatibility fixes applied to merged code:
- Replaced `if constexpr` with runtime `if` across CUDA files
- Replaced `constexpr __device__` functions with macros
- Replaced structured bindings with .first/.second access
- Replaced std::string_view/std::optional with std::string
- Template specializations for ggml_cuda_cast (convert.cuh)
- BF16 flash attention guarded behind CUDART_VERSION >= 11000
- Eager CUDA context init restored for accurate VRAM reporting on non-VMM GPUs
- Jinja C++17 structured bindings fixed (caused Qwen 3.5 segfault)

Build system updates:
- Added hf-cache-stub.cpp, server-tools-stub.cpp for C++14 compat
- Added mtmd-image.cpp, httplib.cpp to the build
- convert_hf_to_gguf.py patched for PyTorch 1.13 compatibility
- gguf vocab.py fallback for old tokenizers library

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge 59 upstream commits including:
- model: support gemma 4 (vision + moe, no audio) (ggml-org#21309)
- kv-cache: do not quantize SWA KV cache (ggml-org#21277)
- Preserve RotorQuant exclusion from Hadamard rotation
Rebased onto upstream master (b8672+), which includes Gemma 4 model support (PR ggml-org#21309, ggml-org#21326, ggml-org#21418). This enables loading Gemma 4 E2B/E4B GGUF models on-device via llama.cpp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cherry-pick Gemma 4 (26B MoE + 31B dense) from upstream PR ggml-org#21309:
- ISWA dual-cache (5:1 SWA:global ratio)
- Variable head_dim (256 SWA / 512 global)
- MoE with 128 experts, top-8 + shared expert
- K=V on global layers (attention_k_eq_v)
- Gemma 4 tokenizer (byte_encode support)

Head padding: pad heads to the nearest multiple of 128 for FWHT alignment. Enables turbo quants on Phi-3 (96→128), Qwen3-0.6B (64→128), etc. Zero padding preserves inner products (Parseval's theorem).

FA VEC dispatch: add head_dim=512 instances for all turbo + q8_0 + f16 type combinations, needed for Gemma 4 global attention layers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
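The head-padding claim in the cherry-pick message above — that zero padding preserves inner products — can be illustrated with a small sketch (plain Python for clarity, not the actual CUDA/quantization code; the helper names and sizes are chosen for the example):

```python
def pad_to_multiple(vec, multiple):
    # Zero-pad a head vector so its length becomes the next multiple of
    # `multiple` (e.g. 96 -> 128 for Phi-3, 64 -> 128 for Qwen3-0.6B).
    rem = len(vec) % multiple
    if rem:
        vec = vec + [0.0] * (multiple - rem)
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key head vectors with head_dim=96.
q = [0.5] * 96
k = [0.25] * 96
qp = pad_to_multiple(q, 128)
kp = pad_to_multiple(k, 128)
assert len(qp) == len(kp) == 128

# The appended zeros contribute nothing to the inner product, so the
# attention score q·k is unchanged by the padding.
assert dot(q, k) == dot(qp, kp)
```

Since attention scores are inner products of query and key heads, padding both sides with zeros leaves every score bit-identical while aligning head_dim to the 128-wide FWHT kernels.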
Overview
Fix a bug where models with both vision and audio components could not be converted properly
Requirements