Claude/add gemma3n vision encoder 01EvM4LgtBQ96zY6XMSmDR9c #17888
Closed
esemsc-ss2524 wants to merge 6 commits into ggml-org:master from esemsc-ss2524:claude/add-gemma3n-vision-encoder-01EvM4LgtBQ96zY6XMSmDR9c
Conversation
This commit implements full multimodal (image) support for the Gemma3n
model, which uses the MobileNetV5 architecture as its vision encoder
(in place of the SigLIP encoder used in Gemma3).
## Architecture Implementation
**MobileNetV5 Vision Encoder:**
- Stem convolution layer with RMSNorm2d and GELU activation
- Universal Inverted Residual blocks with expansion, depthwise conv,
Squeeze-Excitation, and projection phases
- Multi-Scale Fusion Adapter (MSFA) for combining features at different
resolutions
- Support for Multi-Query Attention (MQA) blocks within the CNN architecture
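The Squeeze-Excitation phase inside the inverted residual blocks above can be sketched numerically. This is a hypothetical NumPy illustration of the SE mechanism in general: the reduction ratio, the ReLU bottleneck activation, and the weight shapes are assumptions, not the PR's actual ggml code.

```python
import numpy as np

def se_layer(x, w_reduce, w_expand):
    """x: feature map (C, H, W); w_reduce: (C//r, C); w_expand: (C, C//r)."""
    # Squeeze: global average pool over the spatial dims -> (C,)
    s = x.mean(axis=(1, 2))
    # Excitation: bottleneck MLP; the activation here is an assumption
    z = np.maximum(w_reduce @ s, 0.0)
    gate = 1.0 / (1.0 + np.exp(-(w_expand @ z)))  # sigmoid gate in (0, 1)
    # Scale each channel of the input by its gate
    return x * gate[:, None, None]

C, H, W, r = 8, 4, 4, 4
x = np.ones((C, H, W))
out = se_layer(x, np.zeros((C // r, C)), np.zeros((C, C // r)))
```

With zero weights the gate is sigmoid(0) = 0.5 everywhere, so the layer halves every channel; trained weights would instead learn per-channel attention.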
**Key Components:**
- 2D RMS normalization for feature maps
- Squeeze-and-Excitation layers for channel attention
- Approximate GELU activation throughout
- Non-causal attention during vision processing (same as Gemma3)
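As a rough sketch of the 2D RMS normalization listed above: the version below normalizes each spatial position across the channel dimension of a (C, H, W) feature map. The epsilon value and the per-channel scale are assumptions based on common RMSNorm variants, not the PR's exact `rms_norm_2d` implementation.

```python
import numpy as np

def rms_norm_2d(x, weight, eps=1e-6):
    """x: feature map (C, H, W); weight: per-channel scale of shape (C,)."""
    # RMS over the channel axis at every spatial position
    rms = np.sqrt((x ** 2).mean(axis=0, keepdims=True) + eps)
    return (x / rms) * weight[:, None, None]

x = np.full((4, 2, 2), 3.0)
y = rms_norm_2d(x, np.ones(4))
# Each spatial position is normalized to approximately unit RMS across channels
```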
## Changes by File
### C++ Implementation
- `tools/mtmd/clip-impl.h`: Added PROJECTOR_TYPE_GEMMA3N enum and
mobilenetv5_block structure
- `tools/mtmd/clip.cpp`:
- Added MobileNetV5 model weights to clip_model struct
- Implemented build_mobilenetv5() encoder function
- Implemented helper functions: rms_norm_2d, build_se_layer,
build_inverted_residual_block
- Added weight loading logic for MobileNetV5 components
- Added clip_is_gemma3n() helper function
- `tools/mtmd/mtmd.cpp`: Updated non-causal attention logic to include
GEMMA3N
### Python Conversion
- `convert_hf_to_gguf.py`: Added Gemma3nVisionModel converter class
for handling MobileNetV5 architecture conversion
- `gguf-py/gguf/constants.py`: Added GEMMA3N to VISION_PROJECTOR_TYPE enum
### Documentation
- `docs/multimodal/gemma3n.md`: Comprehensive documentation for Gemma3n
vision support
- `tools/mtmd/README.md`: Updated to list Gemma3n as supported model
## Integration Pattern
The implementation follows the same integration pattern as Gemma3:
1. Image → Vision Encoder → Feature Extraction
2. Features → RMSNorm → Soft Embedding Norm → Linear Projection
3. Projected Embeddings → Language Model (with non-causal attention)
Output: 256 soft tokens representing the image content
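The three-step projection path above can be sketched with plain NumPy. The dimensions (64-d vision features, 128-d model embeddings) are made up for illustration, and treating the "soft embedding norm" as a second RMS-style normalization is an assumption; only the 256-token output count comes from the description.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize the last axis to unit RMS
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def project_image_features(feats, w_proj):
    """feats: (256, d_vision) fused encoder output; w_proj: (d_vision, d_model)."""
    h = rms_norm(feats)   # post-encoder RMSNorm
    h = rms_norm(h)       # soft embedding norm (assumed RMS-style here)
    return h @ w_proj     # linear projection into LM embedding space

feats = np.random.randn(256, 64)
tokens = project_image_features(feats, np.random.randn(64, 128))
# tokens: 256 soft tokens, one row per image token fed to the language model
```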
## Testing Notes
This implementation provides the foundation for Gemma3n vision support.
Testing with actual Gemma3n models will be needed to:
- Verify tensor name mappings from HuggingFace models
- Tune MSFA feature aggregation parameters
- Validate block stride and skip connection logic
- Confirm image preprocessing parameters
## Architecture Differences
- **Gemma3**: SigLIP vision encoder (ViT-based) with 2x2 pooling
- **Gemma3n**: MobileNetV5 vision encoder (CNN-based) with MSFA
Both use the same projection mechanism to language model embedding space.
Corrected several API usage errors to match ggml library signatures:

1. Fixed rms_norm_2d: use ggml_rms_norm() directly instead of a manual implementation with ggml_add1() that expected a tensor parameter
2. Fixed depthwise convolution: changed ggml_conv_depthwise_2d() to the correct function name ggml_conv_2d_dw()
3. Fixed ggml_upscale: added the missing mode parameter (GGML_SCALE_MODE_BILINEAR) to the call
4. Removed unused variables: eliminated batch_size, feat_h, and feat_w, which were declared but never used
5. Added function declaration: added clip_is_gemma3n() to clip.h to match the pattern of the other helper functions

All changes preserve the intended MobileNetV5 architecture behavior while using the correct ggml API functions.
Updated ModelBase.register decorators to resolve architecture name conflicts:

1. Gemma3nVisionModel: now registered with 'Gemma3nForConditionalGeneration' (the actual architecture name from HuggingFace models)
2. Gemma3NModel: changed to register with 'Gemma3nForCausalLM' (for text-only models without vision)

This fixes the 'Model Gemma3nForConditionalGeneration is not supported' error when converting models with the --mmproj flag. The vision model class will now be selected correctly for multimodal Gemma3n models.
MobileNetV5 is a CNN-based architecture and does not have the standard transformer 'num_hidden_layers' field. Added an __init__ override in Gemma3nVisionModel to:

1. Try alternative keys: 'num_stages', 'num_layers', 'depth'
2. Default to block_count=0 if no matching key is found (valid for a CNN)

This fixes the "'NoneType' object cannot be interpreted as an integer" error when converting Gemma3n models with the --mmproj flag.
Set n_block_keys to an empty list as a class attribute to prevent the parent __init__ from trying to find non-existent transformer layer keys. After parent init, explicitly set block_count=0 for the CNN architecture. MobileNetV5 is a CNN whose architecture is hardcoded by arch_def strings, not by configurable layer counts in the config file.
The parent MmprojModel.__init__ calls find_hparam(n_block_keys) and immediately uses the result. When n_block_keys is empty, find_hparam returns None, causing a TypeError when the value is used as an integer for range(). Solution: override find_hparam to return 0 for an empty keys list, which is appropriate for CNN architectures like MobileNetV5 that have no discrete transformer layers to count.
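The override described above can be illustrated with a minimal stand-in. The base class below is a stub approximating MmprojModel's behavior, so the class names, attribute names, and call sites are assumptions about the conversion script, not its actual code.

```python
class MmprojModelStub:
    """Stand-in for MmprojModel: looks up the block count in __init__."""
    n_block_keys = ["num_hidden_layers", "num_layers"]

    def __init__(self, hparams):
        self.hparams = hparams
        # The parent immediately uses this value, e.g. for range(block_count)
        self.block_count = self.find_hparam(self.n_block_keys)

    def find_hparam(self, keys):
        for k in keys:
            if k in self.hparams:
                return self.hparams[k]
        return None  # would cause TypeError if used as an int


class Gemma3nVisionModelStub(MmprojModelStub):
    # CNN architecture: no transformer layer-count key exists in the config
    n_block_keys = []

    def find_hparam(self, keys):
        if not keys:
            return 0  # no discrete transformer layers to count
        return super().find_hparam(keys)


m = Gemma3nVisionModelStub({"image_size": 768})
# m.block_count is 0, so range(m.block_count) is valid and raises no TypeError
```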