
model: support gemma 4 (vision + moe, no audio)#21309

Merged
ngxson merged 2 commits into ggml-org:master from ngxson:xsn/fix_nextg_model
Apr 2, 2026

Conversation

@ngxson
Contributor

@ngxson ngxson commented Apr 2, 2026

Overview

Fix a bug where models with both vision and audio components could not be converted properly.

Requirements

@ngxson ngxson requested review from a team, CISC and ggerganov as code owners April 2, 2026 14:47
@ngxson
Contributor Author

ngxson commented Apr 2, 2026

Running CI on my fork for faster results: ngxson#95

@ngxson ngxson requested a review from JohannesGaessler as a code owner April 2, 2026 14:56
@ngxson ngxson requested review from danbev and ggerganov and removed request for JohannesGaessler April 2, 2026 14:56
@ngxson ngxson merged commit 63f8fe0 into ggml-org:master Apr 2, 2026
49 of 50 checks passed
@osanseviero

lgtm

Comment on lines +7477 to +7478
self.gguf_writer.add_add_space_prefix(False)
self.gguf_writer.add_add_bos_token(False) # already added via the chat template
Member


If it's already added via the chat template, that means it should be True; or is it not a single token?

Contributor Author


The chat template already includes {{ bos_token }}, so add_bos_token is not necessary (and even if it were set to True here, llama.cpp has logic to avoid a double BOS).

I'm explicitly setting it to False here for clarity, though.
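(A minimal sketch of the double-BOS guard mentioned above; this is an illustration under assumed names, not the actual llama.cpp implementation:)

```python
# Hypothetical sketch of a "avoid double BOS" guard, as described above.
# BOS_ID and prepend_bos are illustrative names, not llama.cpp internals.
BOS_ID = 2  # assumed BOS token id for illustration

def prepend_bos(tokens, add_bos=True):
    """Prepend BOS only when requested and not already present."""
    if add_bos and (not tokens or tokens[0] != BOS_ID):
        return [BOS_ID] + tokens
    return tokens

# Even with add_bos=True, a prompt whose template already emitted BOS
# is not given a second one.
print(prepend_bos([BOS_ID, 10, 11]))  # [2, 10, 11]
print(prepend_bos([10, 11]))          # [2, 10, 11]
```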

Member


I think you misunderstand the purpose of this field: it's to explicitly signal that a model requires BOS, i.e., for correct behavior when not using a chat template (completion or FIM).

Contributor Author


According to the HF implementation, BOS is defined but is not added automatically when I run tokenizer("my prompt", return_tensors="pt"), so I hope this is correct.

Contributor Author


(For context, this is to match the behavior of HF transformers when comparing activations via llama-eval-callback.)

Member


OK, so False is probably the correct value then.

@johnlovesgoats

thank you guys

@ngxson ngxson changed the title from "fix gguf conversion for audio/vision mmproj" to "model: support gemma 4 (vision + moe, no audio)" Apr 2, 2026
@github-actions github-actions bot added labels: model (Model specific), testing (Everything test related), examples, python (python script changes) Apr 2, 2026
danielhanchen added a commit to unslothai/unsloth that referenced this pull request Apr 2, 2026
#4790)

The latest ggml-org/llama.cpp release (b8635) does not include Gemma 4
support (ggml-org/llama.cpp#21309 merged after the release was cut).
This causes `llama-server` to fail with "unknown model architecture:
gemma4" when loading Gemma 4 GGUFs.

Temporarily default _DEFAULT_LLAMA_TAG to "master" so all new installs
build from the llama.cpp master branch which includes Gemma 4 support.
Once a new upstream release is cut with Gemma 4, this can be reverted
back to "latest".

Changes:
- setup.sh: add _DEFAULT_LLAMA_TAG="master" maintainer default
- setup.ps1: add $DefaultLlamaTag="master" maintainer default
- install_llama_prebuilt.py: change DEFAULT_LLAMA_TAG fallback to "master"

Users can still override via UNSLOTH_LLAMA_TAG env var.
Vect0rM pushed a commit to AtomicBot-ai/atomic-llama-cpp-turboquant that referenced this pull request Apr 2, 2026

* fix gguf conversion for audio/vision mmproj

* fix test
danielhanchen added a commit to unslothai/unsloth that referenced this pull request Apr 2, 2026
ggml-org/llama.cpp b8637 includes Gemma 4 support (ggml-org/llama.cpp#21309).
Revert the temporary "master" default back to a pinned release tag.

This eliminates the HTTP 422 errors from the prebuilt resolver (which
could not find a release matching "master"), avoids unnecessary source
builds, and restores prebuilt binary downloads on all platforms.

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
@Kreijstal

no audio :(

@ngxson
Contributor Author

ngxson commented Apr 2, 2026

Audio support will come in a follow-up PR; I am working on that.

Wowfunhappy added a commit to Wowfunhappy/llama.cpp that referenced this pull request Apr 3, 2026
Merges ggml-org/llama.cpp upstream (d23355a..7992aa7) including:
- Gemma 4 model support (PR ggml-org#21309)
- KV cache rotation for better quantization (ggml-org#21038)
- Auto GPU memory fitting (llama_params_fit)
- Many new model architectures (Qwen3.5, Kimi K2, LFM2, etc.)

C++14/CUDA 7.5 compatibility fixes applied to merged code:
- Replaced if constexpr with runtime if across CUDA files
- Replaced constexpr __device__ functions with macros
- Replaced structured bindings with .first/.second access
- Replaced std::string_view/std::optional with std::string
- Template specializations for ggml_cuda_cast (convert.cuh)
- BF16 flash attention guarded behind CUDART_VERSION >= 11000
- Eager CUDA context init restored for accurate VRAM on non-VMM GPUs
- Jinja C++17 structured bindings fixed (caused Qwen 3.5 segfault)

Build system updates:
- Added hf-cache-stub.cpp, server-tools-stub.cpp for C++14 compat
- Added mtmd-image.cpp, httplib.cpp to build
- convert_hf_to_gguf.py patched for PyTorch 1.13 compatibility
- gguf vocab.py fallback for old tokenizers library

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
luvwinnie added a commit to luvwinnie/llama.cpp that referenced this pull request Apr 4, 2026
Merge 59 upstream commits including:
- model: support gemma 4 (vision + moe, no audio) (ggml-org#21309)
- kv-cache: do not quantize SWA KV cache (ggml-org#21277)
- Preserve RotorQuant exclusion from Hadamard rotation
wordingone pushed a commit to wordingone/llama-cpp-turboquant-cuda that referenced this pull request Apr 6, 2026

* fix gguf conversion for audio/vision mmproj

* fix test
kgluszczyk added a commit to Creatiwi-ai/llama.cpp that referenced this pull request Apr 6, 2026
Rebased onto upstream master (b8672+) which includes Gemma 4 model
support (PR ggml-org#21309, ggml-org#21326, ggml-org#21418). This enables loading Gemma 4
E2B/E4B GGUF models on-device via llama.cpp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
spiritbuun added a commit to spiritbuun/buun-llama-cpp that referenced this pull request Apr 6, 2026
Cherry-pick Gemma 4 (26B MoE + 31B dense) from upstream PR ggml-org#21309:
- ISWA dual-cache (5:1 SWA:global ratio)
- Variable head_dim (256 SWA / 512 global)
- MoE with 128 experts top-8 + shared expert
- K=V on global layers (attention_k_eq_v)
- Gemma 4 tokenizer (byte_encode support)

Head padding: pad heads to nearest multiple of 128 for FWHT alignment.
Enables turbo quants on Phi-3 (96→128), Qwen3-0.6B (64→128), etc.
Zero padding preserves inner products (Parseval's theorem).

FA VEC dispatch: add head_dim=512 instances for all turbo + q8_0 + f16
type combinations, needed for Gemma 4 global attention layers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
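The head-padding rule mentioned in the commit message above (pad head dimensions up to the nearest multiple of 128, e.g. 96→128 and 64→128) can be sketched as a round-up helper. This is an illustrative sketch with assumed names, not the actual code from that fork:

```python
# Hypothetical sketch of the "pad to nearest multiple of 128" rule from the
# commit above. pad_heads is an illustrative name, not the fork's real API.
def pad_heads(head_dim, multiple=128):
    """Round head_dim up to the nearest multiple.

    Zero-padding the extra positions preserves inner products
    (Parseval's theorem), as the commit message notes.
    """
    return -(-head_dim // multiple) * multiple  # ceiling division

print(pad_heads(96))   # 128 (Phi-3 case)
print(pad_heads(64))   # 128 (Qwen3-0.6B case)
print(pad_heads(256))  # 256 (already aligned)
```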