Claude/add gemma3n vision encoder 01EvM4LgtBQ96zY6XMSmDR9c #17888

Closed
esemsc-ss2524 wants to merge 6 commits into ggml-org:master from
esemsc-ss2524:claude/add-gemma3n-vision-encoder-01EvM4LgtBQ96zY6XMSmDR9c

Conversation

@esemsc-ss2524

This commit implements full multimodal (image) support for the Gemma3n
model, which uses the MobileNetV5 architecture as its vision encoder
(instead of SigLIP used in Gemma3).

## Architecture Implementation

**MobileNetV5 Vision Encoder:**
- Stem convolution layer with RMSNorm2d and GELU activation
- Universal Inverted Residual blocks with expansion, depthwise conv,
  Squeeze-Excitation, and projection phases
- Multi-Scale Fusion Adapter (MSFA) for combining features at different
  resolutions
- Support for Multi-Query Attention (MQA) blocks within the CNN architecture
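
A shape-level sketch of how one Universal Inverted Residual block transforms a feature map; the function name, expansion ratio, and stride below are illustrative, not the model's actual configuration:

```python
# Shape bookkeeping for a MobileNetV5-style Universal Inverted Residual
# block: 1x1 expansion -> depthwise conv -> squeeze-excitation -> 1x1
# projection. Only (channels, height, width) are tracked, no real math.

def uir_block_shapes(c_in, h, w, c_out, expand_ratio=4, stride=1):
    c_mid = c_in * expand_ratio
    shapes = [("expand", (c_mid, h, w))]         # 1x1 conv grows channels
    h, w = h // stride, w // stride              # depthwise may downsample
    shapes.append(("depthwise", (c_mid, h, w)))  # per-channel spatial conv
    shapes.append(("se", (c_mid, h, w)))         # channel attention, no reshape
    shapes.append(("project", (c_out, h, w)))    # 1x1 conv back to c_out
    return shapes

print(uir_block_shapes(64, 32, 32, 128, expand_ratio=4, stride=2))
```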

**Key Components:**
- 2D RMS normalization for feature maps
- Squeeze-and-Excitation layers for channel attention
- Approximate GELU activation throughout
- Non-causal attention during vision processing (same as Gemma3)
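
As an illustration of the 2D RMS normalization above, here is a minimal pure-Python version that normalizes a (C, H, W) feature map over the channel axis at each spatial position (the epsilon value is an assumption):

```python
import math

def rms_norm_2d(x, eps=1e-6):
    """x: nested lists shaped (C, H, W); normalize over channels per (h, w)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for i in range(H):
        for j in range(W):
            # mean square over the channel dimension at this position
            ms = sum(x[c][i][j] ** 2 for c in range(C)) / C
            inv = 1.0 / math.sqrt(ms + eps)
            for c in range(C):
                out[c][i][j] = x[c][i][j] * inv
    return out
```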

## Changes by File

### C++ Implementation
- `tools/mtmd/clip-impl.h`: Added PROJECTOR_TYPE_GEMMA3N enum and
  mobilenetv5_block structure
- `tools/mtmd/clip.cpp`:
  - Added MobileNetV5 model weights to clip_model struct
  - Implemented build_mobilenetv5() encoder function
  - Implemented helper functions: rms_norm_2d, build_se_layer,
    build_inverted_residual_block
  - Added weight loading logic for MobileNetV5 components
  - Added clip_is_gemma3n() helper function
- `tools/mtmd/mtmd.cpp`: Updated non-causal attention logic to include
  GEMMA3N

### Python Conversion
- `convert_hf_to_gguf.py`: Added Gemma3nVisionModel converter class
  for handling MobileNetV5 architecture conversion
- `gguf-py/gguf/constants.py`: Added GEMMA3N to VISION_PROJECTOR_TYPE enum

### Documentation
- `docs/multimodal/gemma3n.md`: Comprehensive documentation for Gemma3n
  vision support
- `tools/mtmd/README.md`: Updated to list Gemma3n as supported model

## Integration Pattern

The implementation follows the same integration pattern as Gemma3:
1. Image → Vision Encoder → Feature Extraction
2. Features → RMSNorm → Soft Embedding Norm → Linear Projection
3. Projected Embeddings → Language Model (with non-causal attention)

Output: 256 soft tokens representing the image content
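
The steps above can be sketched numerically; the dimensions and uniform weights below are made up, and the two normalization steps are elided for brevity:

```python
# Flatten a 16x16 grid of vision features into 256 soft tokens and project
# each into the language model's embedding space with a plain mat-vec
# product. c_vision and n_embd are illustrative, not Gemma3n's real sizes.

def project_image_features(grid, proj):
    tokens = [vec for row in grid for vec in row]   # 16*16 -> 256 tokens
    n_embd = len(proj[0])
    return [[sum(tok[i] * proj[i][j] for i in range(len(tok)))
             for j in range(n_embd)] for tok in tokens]

c_vision, n_embd = 8, 4
grid = [[[1.0] * c_vision for _ in range(16)] for _ in range(16)]
proj = [[0.1] * n_embd for _ in range(c_vision)]
embeds = project_image_features(grid, proj)
print(len(embeds))  # 256 soft tokens
```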

## Testing Notes

This implementation provides the foundation for Gemma3n vision support.
Testing with actual Gemma3n models will be needed to:
- Verify tensor name mappings from HuggingFace models
- Tune MSFA feature aggregation parameters
- Validate block stride and skip connection logic
- Confirm image preprocessing parameters

## Architecture Differences

- **Gemma3**: SigLIP vision encoder (ViT-based) with 2x2 pooling
- **Gemma3n**: MobileNetV5 vision encoder (CNN-based) with MSFA

Both use the same projection mechanism to language model embedding space.

## ggml API Fixes

Corrected several API usage errors to match the ggml library signatures:

1. Fixed rms_norm_2d: use ggml_rms_norm() directly instead of a manual
   implementation built around ggml_add1(), which expects a tensor parameter

2. Fixed the depthwise convolution: replaced ggml_conv_depthwise_2d() with
   the correct function name, ggml_conv_2d_dw()

3. Fixed ggml_upscale: added the missing scale-mode parameter
   (GGML_SCALE_MODE_BILINEAR) to the call

4. Removed unused variables: eliminated the batch_size, feat_h, and feat_w
   variables that were declared but never used

5. Added a function declaration: declared clip_is_gemma3n() in clip.h to
   match the pattern of the other helper functions

All changes maintain the intended MobileNetV5 architecture behavior
while using the correct ggml API functions.

## Converter Registration Fixes

Updated the ModelBase.register decorators to resolve architecture name conflicts:

1. Gemma3nVisionModel: Now registered with 'Gemma3nForConditionalGeneration'
   (the actual architecture name from HuggingFace models)

2. Gemma3NModel: Changed to register with 'Gemma3nForCausalLM'
   (for text-only models without vision)

This fixes the 'Model Gemma3nForConditionalGeneration is not supported'
error when converting models with the --mmproj flag. The vision model class
is now correctly selected for multimodal Gemma3n models.
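
A toy sketch of the decorator-registry behavior this relies on; the registry and lookup-function names here are illustrative, not the converter script's actual API:

```python
# Minimal model registry: classes register under their HuggingFace
# architecture name, and conversion looks the class up by the name found
# in config.json. Mirrors the register-decorator pattern described above.

_model_classes = {}

def register(*names):
    def wrap(cls):
        for n in names:
            _model_classes[n] = cls
        return cls
    return wrap

@register("Gemma3nForConditionalGeneration")
class Gemma3nVisionModel:
    pass

@register("Gemma3nForCausalLM")
class Gemma3NModel:
    pass

def class_for_arch(arch):
    if arch not in _model_classes:
        raise ValueError(f"Model {arch} is not supported")
    return _model_classes[arch]
```

With the vision class registered under 'Gemma3nForConditionalGeneration', lookups for multimodal checkpoints resolve to it instead of raising the 'not supported' error.
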

## CNN Layer-Count Handling

MobileNetV5 is a CNN-based architecture and doesn't have the standard
transformer 'num_hidden_layers' field. Added an __init__ override in
Gemma3nVisionModel to:

1. Try alternative keys: 'num_stages', 'num_layers', 'depth'
2. Default to block_count=0 if no matching key is found (valid for a CNN)

This fixes the "'NoneType' object cannot be interpreted as an integer"
error when converting Gemma3n models with the --mmproj flag.

Set n_block_keys to an empty list as a class attribute to prevent the parent
__init__ from trying to find non-existent transformer layer keys. After the
parent init, explicitly set block_count=0 for the CNN architecture.

MobileNetV5 is a CNN whose architecture is hardcoded in arch_def strings,
not driven by configurable layer counts in the config file. The parent
MmprojModel.__init__ calls find_hparam(n_block_keys) and immediately uses
the result; when n_block_keys is empty, find_hparam returns None, causing a
TypeError when the result is passed to range().

Solution: override find_hparam to return 0 for an empty keys list, which is
appropriate for CNN architectures like MobileNetV5 that have no discrete
transformer layers to count.
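
A condensed sketch of the failure mode and the override; MmprojModel here is a stand-in for the real base class, and the hparam keys are illustrative:

```python
class MmprojModel:
    n_block_keys = ["num_hidden_layers", "n_layers"]

    def __init__(self, hparams):
        self.hparams = hparams
        self.block_count = self.find_hparam(self.n_block_keys)
        # the parent immediately consumes the result:
        # range(None) is what raised the original TypeError
        self.layers = list(range(self.block_count))

    def find_hparam(self, keys):
        return next((self.hparams[k] for k in keys if k in self.hparams), None)

class Gemma3nVisionModel(MmprojModel):
    n_block_keys = []  # CNN: no transformer layer-count key exists

    def find_hparam(self, keys):
        if not keys:
            return 0  # no discrete layers to count in a CNN
        return super().find_hparam(keys)

m = Gemma3nVisionModel({"image_size": 768})
print(m.block_count)  # 0
```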