Claude/add gemma3n vision encoder 01EvM4LgtBQ96zY6XMSmDR9c #17888

Closed
esemsc-ss2524 wants to merge 6 commits into ggml-org:master from
esemsc-ss2524:claude/add-gemma3n-vision-encoder-01EvM4LgtBQ96zY6XMSmDR9c

Conversation

@esemsc-ss2524

This commit implements full multimodal (image) support for the Gemma3n
model, which uses the MobileNetV5 architecture as its vision encoder
(instead of SigLIP used in Gemma3).

## Architecture Implementation

**MobileNetV5 Vision Encoder:**
- Stem convolution layer with RMSNorm2d and GELU activation
- Universal Inverted Residual blocks with expansion, depthwise conv,
  Squeeze-Excitation, and projection phases
- Multi-Scale Fusion Adapter (MSFA) for combining features at different
  resolutions
- Support for Multi-Query Attention (MQA) blocks within the CNN architecture
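
A shape-level sketch of how one Universal Inverted Residual block transforms a feature map; the function name, expansion ratio, and stride below are illustrative, not the model's actual configuration:

```python
# Shape bookkeeping for a MobileNetV5-style Universal Inverted Residual
# block: 1x1 expansion -> depthwise conv -> squeeze-excitation -> 1x1
# projection. Only (channels, height, width) are tracked, no real math.

def uir_block_shapes(c_in, h, w, c_out, expand_ratio=4, stride=1):
    c_mid = c_in * expand_ratio
    shapes = [("expand", (c_mid, h, w))]         # 1x1 conv grows channels
    h, w = h // stride, w // stride              # depthwise may downsample
    shapes.append(("depthwise", (c_mid, h, w)))  # per-channel spatial conv
    shapes.append(("se", (c_mid, h, w)))         # channel attention, no reshape
    shapes.append(("project", (c_out, h, w)))    # 1x1 conv back to c_out
    return shapes

print(uir_block_shapes(64, 32, 32, 128, expand_ratio=4, stride=2))
```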

**Key Components:**
- 2D RMS normalization for feature maps
- Squeeze-and-Excitation layers for channel attention
- Approximate GELU activation throughout
- Non-causal attention during vision processing (same as Gemma3)
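
As an illustration of the 2D RMS normalization above, here is a minimal pure-Python version that normalizes a (C, H, W) feature map over the channel axis at each spatial position (the epsilon value is an assumption):

```python
import math

def rms_norm_2d(x, eps=1e-6):
    """x: nested lists shaped (C, H, W); normalize over channels per (h, w)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for i in range(H):
        for j in range(W):
            # mean square over the channel dimension at this position
            ms = sum(x[c][i][j] ** 2 for c in range(C)) / C
            inv = 1.0 / math.sqrt(ms + eps)
            for c in range(C):
                out[c][i][j] = x[c][i][j] * inv
    return out
```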

## Changes by File

### C++ Implementation
- `tools/mtmd/clip-impl.h`: Added PROJECTOR_TYPE_GEMMA3N enum and
  mobilenetv5_block structure
- `tools/mtmd/clip.cpp`:
  - Added MobileNetV5 model weights to clip_model struct
  - Implemented build_mobilenetv5() encoder function
  - Implemented helper functions: rms_norm_2d, build_se_layer,
    build_inverted_residual_block
  - Added weight loading logic for MobileNetV5 components
  - Added clip_is_gemma3n() helper function
- `tools/mtmd/mtmd.cpp`: Updated non-causal attention logic to include
  GEMMA3N

### Python Conversion
- `convert_hf_to_gguf.py`: Added Gemma3nVisionModel converter class
  for handling MobileNetV5 architecture conversion
- `gguf-py/gguf/constants.py`: Added GEMMA3N to VISION_PROJECTOR_TYPE enum

### Documentation
- `docs/multimodal/gemma3n.md`: Comprehensive documentation for Gemma3n
  vision support
- `tools/mtmd/README.md`: Updated to list Gemma3n as supported model

## Integration Pattern

The implementation follows the same integration pattern as Gemma3:
1. Image → Vision Encoder → Feature Extraction
2. Features → RMSNorm → Soft Embedding Norm → Linear Projection
3. Projected Embeddings → Language Model (with non-causal attention)

Output: 256 soft tokens representing the image content
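
The steps above can be sketched numerically; the dimensions and uniform weights below are made up, and the two normalization steps are elided for brevity:

```python
# Flatten a 16x16 grid of vision features into 256 soft tokens and project
# each into the language model's embedding space with a plain mat-vec
# product. c_vision and n_embd are illustrative, not Gemma3n's real sizes.

def project_image_features(grid, proj):
    tokens = [vec for row in grid for vec in row]   # 16*16 -> 256 tokens
    n_embd = len(proj[0])
    return [[sum(tok[i] * proj[i][j] for i in range(len(tok)))
             for j in range(n_embd)] for tok in tokens]

c_vision, n_embd = 8, 4
grid = [[[1.0] * c_vision for _ in range(16)] for _ in range(16)]
proj = [[0.1] * n_embd for _ in range(c_vision)]
embeds = project_image_features(grid, proj)
print(len(embeds))  # 256 soft tokens
```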

## Testing Notes

This implementation provides the foundation for Gemma3n vision support.
Testing with actual Gemma3n models will be needed to:
- Verify tensor name mappings from HuggingFace models
- Tune MSFA feature aggregation parameters
- Validate block stride and skip connection logic
- Confirm image preprocessing parameters

## Architecture Differences

- **Gemma3**: SigLIP vision encoder (ViT-based) with 2x2 pooling
- **Gemma3n**: MobileNetV5 vision encoder (CNN-based) with MSFA

Both use the same projection mechanism to language model embedding space.

## ggml API Fixes

Corrected several API usage errors to match the ggml library signatures:

1. Fixed rms_norm_2d: use ggml_rms_norm() directly instead of a manual
   implementation built around ggml_add1(), which expects a tensor parameter

2. Fixed the depthwise convolution: replaced ggml_conv_depthwise_2d() with
   the correct function name, ggml_conv_2d_dw()

3. Fixed ggml_upscale: added the missing scale-mode parameter
   (GGML_SCALE_MODE_BILINEAR) to the call

4. Removed unused variables: eliminated the batch_size, feat_h, and feat_w
   variables that were declared but never used

5. Added a function declaration: declared clip_is_gemma3n() in clip.h to
   match the pattern of the other helper functions

All changes maintain the intended MobileNetV5 architecture behavior
while using the correct ggml API functions.

## Converter Registration Fixes

Updated the ModelBase.register decorators to resolve architecture name conflicts:

1. Gemma3nVisionModel: Now registered with 'Gemma3nForConditionalGeneration'
   (the actual architecture name from HuggingFace models)

2. Gemma3NModel: Changed to register with 'Gemma3nForCausalLM'
   (for text-only models without vision)

This fixes the 'Model Gemma3nForConditionalGeneration is not supported'
error when converting models with the --mmproj flag. The vision model class
is now correctly selected for multimodal Gemma3n models.
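
A toy sketch of the decorator-registry behavior this relies on; the registry and lookup-function names here are illustrative, not the converter script's actual API:

```python
# Minimal model registry: classes register under their HuggingFace
# architecture name, and conversion looks the class up by the name found
# in config.json. Mirrors the register-decorator pattern described above.

_model_classes = {}

def register(*names):
    def wrap(cls):
        for n in names:
            _model_classes[n] = cls
        return cls
    return wrap

@register("Gemma3nForConditionalGeneration")
class Gemma3nVisionModel:
    pass

@register("Gemma3nForCausalLM")
class Gemma3NModel:
    pass

def class_for_arch(arch):
    if arch not in _model_classes:
        raise ValueError(f"Model {arch} is not supported")
    return _model_classes[arch]
```

With the vision class registered under 'Gemma3nForConditionalGeneration', lookups for multimodal checkpoints resolve to it instead of raising the 'not supported' error.
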

## CNN Layer-Count Handling

MobileNetV5 is a CNN-based architecture and doesn't have the standard
transformer 'num_hidden_layers' field. Added an __init__ override in
Gemma3nVisionModel to:

1. Try alternative keys: 'num_stages', 'num_layers', 'depth'
2. Default to block_count=0 if no matching key is found (valid for a CNN)

This fixes the "'NoneType' object cannot be interpreted as an integer"
error when converting Gemma3n models with the --mmproj flag.

Set n_block_keys to an empty list as a class attribute to prevent the parent
__init__ from trying to find non-existent transformer layer keys. After the
parent init, explicitly set block_count=0 for the CNN architecture.

MobileNetV5 is a CNN whose architecture is hardcoded in arch_def strings,
not driven by configurable layer counts in the config file. The parent
MmprojModel.__init__ calls find_hparam(n_block_keys) and immediately uses
the result; when n_block_keys is empty, find_hparam returns None, causing a
TypeError when the result is passed to range().

Solution: override find_hparam to return 0 for an empty keys list, which is
appropriate for CNN architectures like MobileNetV5 that have no discrete
transformer layers to count.
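
A condensed sketch of the failure mode and the override; MmprojModel here is a stand-in for the real base class, and the hparam keys are illustrative:

```python
class MmprojModel:
    n_block_keys = ["num_hidden_layers", "n_layers"]

    def __init__(self, hparams):
        self.hparams = hparams
        self.block_count = self.find_hparam(self.n_block_keys)
        # the parent immediately consumes the result:
        # range(None) is what raised the original TypeError
        self.layers = list(range(self.block_count))

    def find_hparam(self, keys):
        return next((self.hparams[k] for k in keys if k in self.hparams), None)

class Gemma3nVisionModel(MmprojModel):
    n_block_keys = []  # CNN: no transformer layer-count key exists

    def find_hparam(self, keys):
        if not keys:
            return 0  # no discrete layers to count in a CNN
        return super().find_hparam(keys)

m = Gemma3nVisionModel({"image_size": 768})
print(m.block_count)  # 0
```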