model: support gemma 4 (vision + moe, no audio) #21309
ngxson merged 2 commits into ggml-org:master from
Conversation
Running CI on my fork for faster results: ngxson#95
lgtm
    self.gguf_writer.add_add_space_prefix(False)
    self.gguf_writer.add_add_bos_token(False)  # already added via the chat template
If already added in the chat template that means it should be True, or is it not a single token?
the chat template already has {{ bos_token }}, so add_bos_token is not necessary (though even if it is set to True here, llama.cpp has logic to avoid a double BOS)
I'm explicitly setting it to False here for clarity, though
I think you misunderstand the purpose of this field: it explicitly signals that a model requires BOS, i.e. for correct behavior when not using a chat template (completion or FIM).
According to the HF implementation, BOS is defined but not added automatically when I try tokenizer("my prompt", return_tensors="pt"), so I hope this is correct
(For context, this is to match the behavior of HF transformers when I compare the activations via llama-eval-callback)
OK, so False is probably the correct value then.
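The double-BOS guard mentioned above can be sketched roughly like this (a simplified illustration, not llama.cpp's actual code; the function name and token IDs are made up for the example):

```python
def tokenize_with_bos(token_ids, bos_id, add_bos):
    """Prepend BOS only when it is requested and not already present.

    If the prompt was rendered from a chat template that already contains
    {{ bos_token }}, the first token is BOS and we must not add a second one.
    """
    if add_bos and (not token_ids or token_ids[0] != bos_id):
        return [bos_id] + token_ids
    return token_ids


# Hypothetical BOS id for illustration.
BOS = 2
print(tokenize_with_bos([2, 10, 11], BOS, add_bos=True))   # template already added BOS
print(tokenize_with_bos([10, 11], BOS, add_bos=True))      # BOS prepended once
print(tokenize_with_bos([10, 11], BOS, add_bos=False))     # left untouched
```

Either way the model sees exactly one BOS when it needs one, which is why setting add_bos_token=True here would still be safe, just redundant.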
thank you guys
(#4790) The latest ggml-org/llama.cpp release (b8635) does not include Gemma 4 support (ggml-org/llama.cpp#21309 merged after the release was cut). This causes `llama-server` to fail with "unknown model architecture: gemma4" when loading Gemma 4 GGUFs.

Temporarily default _DEFAULT_LLAMA_TAG to "master" so all new installs build from the llama.cpp master branch, which includes Gemma 4 support. Once a new upstream release is cut with Gemma 4, this can be reverted back to "latest".

Changes:
- setup.sh: add _DEFAULT_LLAMA_TAG="master" maintainer default
- setup.ps1: add $DefaultLlamaTag="master" maintainer default
- install_llama_prebuilt.py: change DEFAULT_LLAMA_TAG fallback to "master"

Users can still override via the UNSLOTH_LLAMA_TAG env var.
ggml-org/llama.cpp b8637 includes Gemma 4 support (ggml-org/llama.cpp#21309). Revert the temporary "master" default back to a pinned release tag. This eliminates the HTTP 422 errors from the prebuilt resolver (which could not find a release matching "master"), avoids unnecessary source builds, and restores prebuilt binary downloads on all platforms.

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
no audio :(
Audio will be a follow-up PR; I am working on that
Merges ggml-org/llama.cpp upstream (d23355a..7992aa7) including:
- Gemma 4 model support (PR ggml-org#21309)
- KV cache rotation for better quantization (ggml-org#21038)
- Auto GPU memory fitting (llama_params_fit)
- Many new model architectures (Qwen3.5, Kimi K2, LFM2, etc.)

C++14/CUDA 7.5 compatibility fixes applied to merged code:
- Replaced `if constexpr` with runtime `if` across CUDA files
- Replaced `constexpr __device__` functions with macros
- Replaced structured bindings with .first/.second access
- Replaced std::string_view/std::optional with std::string
- Template specializations for ggml_cuda_cast (convert.cuh)
- BF16 flash attention guarded behind CUDART_VERSION >= 11000
- Eager CUDA context init restored for accurate VRAM reporting on non-VMM GPUs
- Jinja C++17 structured bindings fixed (caused Qwen 3.5 segfault)

Build system updates:
- Added hf-cache-stub.cpp, server-tools-stub.cpp for C++14 compat
- Added mtmd-image.cpp, httplib.cpp to the build
- convert_hf_to_gguf.py patched for PyTorch 1.13 compatibility
- gguf vocab.py fallback for old tokenizers library

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge 59 upstream commits including:
- model: support gemma 4 (vision + moe, no audio) (ggml-org#21309)
- kv-cache: do not quantize SWA KV cache (ggml-org#21277)
- Preserve RotorQuant exclusion from Hadamard rotation
Rebased onto upstream master (b8672+), which includes Gemma 4 model support (PR ggml-org#21309, ggml-org#21326, ggml-org#21418). This enables loading Gemma 4 E2B/E4B GGUF models on-device via llama.cpp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cherry-pick Gemma 4 (26B MoE + 31B dense) from upstream PR ggml-org#21309:
- ISWA dual-cache (5:1 SWA:global ratio)
- Variable head_dim (256 SWA / 512 global)
- MoE with 128 experts, top-8 + shared expert
- K=V on global layers (attention_k_eq_v)
- Gemma 4 tokenizer (byte_encode support)

Head padding: pad heads to the nearest multiple of 128 for FWHT alignment. Enables turbo quants on Phi-3 (96→128), Qwen3-0.6B (64→128), etc. Zero padding preserves inner products (Parseval's theorem).

FA VEC dispatch: add head_dim=512 instances for all turbo + q8_0 + f16 type combinations, needed for Gemma 4 global attention layers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
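The head-padding claim in the cherry-pick message above — that zero padding preserves inner products — can be illustrated with a small sketch (plain Python for clarity, not the actual CUDA/quantization code; the helper names and sizes are chosen for the example):

```python
def pad_to_multiple(vec, multiple):
    # Zero-pad a head vector so its length becomes the next multiple of
    # `multiple` (e.g. 96 -> 128 for Phi-3, 64 -> 128 for Qwen3-0.6B).
    rem = len(vec) % multiple
    if rem:
        vec = vec + [0.0] * (multiple - rem)
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key head vectors with head_dim=96.
q = [0.5] * 96
k = [0.25] * 96
qp = pad_to_multiple(q, 128)
kp = pad_to_multiple(k, 128)
assert len(qp) == len(kp) == 128

# The appended zeros contribute nothing to the inner product, so the
# attention score q·k is unchanged by the padding.
assert dot(q, k) == dot(qp, kp)
```

Since attention scores are inner products of query and key heads, padding both sides with zeros leaves every score bit-identical while aligning head_dim to the 128-wide FWHT kernels.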
Overview
Fix a bug where models with both vision and audio components could not be converted properly
Requirements