[Model] Add Gemma3 GGUF multimodal support #27772
Conversation
Code Review
This pull request adds support for Gemma3 GGUF multimodal models, which is a significant feature enhancement. The implementation correctly handles the detection of multimodal GGUF files and extends the model loader to accommodate vision tower weights from separate mmproj.gguf files. The changes are well-isolated to minimize impact on existing models. However, there are several areas for improvement. A critical issue was found where a function is missing a return statement, which could lead to runtime errors. Additionally, there are multiple instances of hardcoded configurations for the vision tower and image processor, as well as code duplication for model detection logic. These reduce maintainability and make the implementation brittle to future changes. I've also identified a bug in the runai_safetensors_weights_iterator that will cause an incorrect progress bar. Addressing these points will improve the robustness and maintainability of this new feature.
This pull request has merge conflicts that must be resolved before it can be merged.
Isotr0py
left a comment
I think this PR introduces too much gemma3-mm GGUF-specific hardcoded code.
Besides this, I'm a bit confused why we need to use GGUFQuantMethod for a totally unquantized mm_proj checkpoint as well.
Isotr0py
left a comment
Unfortunately, I think this PR is still poorly designed, because there is too much gemma3-specific code outside the model implementation...
Signed-off-by: Isotr0py <[email protected]>
This commit enhances GGUF model loading with two key improvements:
1. Dynamic vocab_size extraction and override:
- Added extract_vocab_size_from_gguf() to read vocab_size from
token_embd.weight tensor shape
- Updated maybe_patch_hf_config_from_gguf() to automatically override
vocab_size from GGUF file for all GGUF models
- Enables loading models with extended vocabularies (e.g., Unsloth)
without manual config editing
2. Improved automatic weight mapping for multimodal models:
- Enhanced find_hf_name_in_tensor_map() to handle naming convention
differences between HF and gguf-py:
* Strips 'language_model.' prefix for multimodal models
* Converts '_weight' suffix to '.weight' (Gemma3 compatibility)
- Removed hardcoded Gemma3-specific weight mappings in favor of
automatic mapping via gguf-py's vision_name_map
- Removed redundant prefix fallback logic
This will:
- add automatic compatibility with models using extended vocabularies
- make the code more maintainable, with no model-specific hardcoded mappings
- introduce robust handling of HF/gguf-py naming convention mismatches
- work for all multimodal GGUF models, not just Gemma3
Technical details:
- GGUF embedding tensor format: [hidden_size, vocab_size]
- vocab_size extracted from shape[1] as source of truth
- Uses get_text_config() to handle both regular and multimodal configs
- Graceful fallback if vocab_size extraction fails
Signed-off-by: Luciano Martins <[email protected]>
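A minimal sketch of the HF/gguf-py naming-convention normalization described in this commit (strip the 'language_model.' prefix, convert a trailing '_weight' suffix to '.weight'). The helper name is hypothetical; the PR folds this logic into find_hf_name_in_tensor_map():

```python
def normalize_hf_name_for_gguf_lookup(hf_name: str) -> str:
    """Normalize an HF weight name before looking it up in gguf-py's tensor map.

    Hypothetical helper for illustration only.
    """
    # Multimodal HF checkpoints prefix text-model weights with "language_model.",
    # while gguf-py's tensor map uses the bare text-model names.
    if hf_name.startswith("language_model."):
        hf_name = hf_name[len("language_model."):]
    # Some Gemma3 parameters end in "_weight" rather than ".weight"; normalize
    # the suffix so the tensor-map lookup matches.
    if hf_name.endswith("_weight"):
        hf_name = hf_name[: -len("_weight")] + ".weight"
    return hf_name
```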
- Change the gemma3-transformers test to use model_impl='auto' instead of 'transformers' to avoid the missing generate_attention_masks() method in TransformersMultiModalForCausalLM
- Change the quantized-models test to use dtype='bfloat16' instead of 'half', since gemma3_text models don't support float16 due to numerical instability
These are test configuration fixes for pre-existing bugs unrelated to the GGUF changes.
Signed-off-by: Luciano Martins <[email protected]>
Remove extract_vocab_size_from_gguf() and vocab_size override logic as transformers now handles this internally via modeling_gguf_pytorch_utils. Also fixed the tests/models/multimodal/generation/test_common.py to use HuggingFace implementation for Gemma3 testing. Signed-off-by: Luciano Martins <[email protected]>
Prevent AttributeError when calling generate_attention_masks() on models that don't implement this method (e.g., TransformersMultiModalForCausalLM). This fixes the errors in the multi-modal-models CI test. The condition now checks that:
1. the uses_custom_attention_masks flag is set
2. multimodal features are present
3. the model has a generate_attention_masks method
This ensures the method is only called on models that support custom attention masks (e.g., Gemma3ForConditionalGeneration).
Signed-off-by: Luciano Martins <[email protected]>
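A minimal sketch of the guard described in this commit. The wrapper function and argument names are assumptions for illustration; the PR applies the check inline in the runner:

```python
def maybe_generate_attention_masks(model, model_config, mm_features):
    """Return custom attention masks only when the model supports them.

    Hypothetical free function; `model_config` and `mm_features` stand in for
    the runner's configuration and the batch's multimodal features.
    """
    if (
        getattr(model_config, "uses_custom_attention_masks", False)
        and mm_features  # multimodal features present in this batch
        and hasattr(model, "generate_attention_masks")  # e.g. Gemma3ForConditionalGeneration
    ):
        return model.generate_attention_masks(mm_features)
    return None
```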
Hey @Isotr0py - all tests are passing, with 2 exceptions not related to Gemma3 or the PR work:
One is failing with an access error for the test image it needs. The other is failing for EAGLE-LLAMA3.1-Instruct-8B (before even getting to any Gemma3 test), and due to GPU OOM errors. Can we proceed with the merge?
Yeah, they are not related, let's merge.
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Luciano Martins <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Luciano Martins <[email protected]> Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: Luciano Martins <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Luciano Martins <[email protected]> Co-authored-by: Isotr0py <[email protected]> Signed-off-by: jiang1.li <[email protected]>
Due to the latest changes from upstream, gemma3 is failing to compile on HPU (vllm-project/vllm#27772, vllm-project/vllm#28842).
- replace unfold with view/reshape
- replace text embedding to avoid dynamic shapes
- remove the merge_multimodal replacement since the masked_scatter issue is fixed
- re-enable the gemma3 model test
Signed-off-by: Jimin Ha <[email protected]>
[Model] Add Gemma3 GGUF multimodal support
Summary
This PR enables full multimodal inference support for Gemma3 models in GGUF format, allowing users to run quantized Gemma3 multimodal models with both text-only and image+text prompts. The implementation adds automatic detection and loading of mmproj.gguf files for vision tower weights while maintaining complete isolation to ensure zero impact on existing Gemma3 HuggingFace models or other model architectures.
Resolves: #22497
New Functionality
1. Automatic GGUF Multimodal Detection
The model configuration now automatically detects when a Gemma3 GGUF model has an accompanying mmproj.gguf file and switches to the multimodal architecture:
- vllm/config/model.py: _detect_gguf_multimodal_gemma3() checks for mmproj-*.gguf in the model directory
- Sets model_cls = "Gemma3ForConditionalGeneration" when a multimodal projector is found
- Falls back to Gemma3ForCausalLM for text-only GGUF models
Example usage:
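A minimal sketch; the local file paths and the prompt template are illustrative assumptions, not part of this PR:

```python
from PIL import Image

from vllm import LLM, SamplingParams

# Assumed local path: the mmproj-*.gguf file is expected to sit next to the
# model GGUF so the multimodal architecture is auto-detected.
llm = LLM(
    model="/models/gemma-3-4b-it-q4_0.gguf",
    tokenizer="google/gemma-3-4b-it",
)

image = Image.open("example.jpg")
outputs = llm.generate(
    {
        # Prompt template is an assumption; use the Gemma3 chat format you use elsewhere.
        "prompt": "<start_of_image> Describe this image.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```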
2. GGUF Vision Tower Weight Loading
Enhanced the GGUF loader to handle multimodal weights from separate mmproj.gguf files (vllm/model_executor/model_loader/gguf_loader.py, weight_utils.py):
- v.*.attn.{q,k,v,out} → vision attention weights
- v.*.mlp.{fc1,fc2} → vision FFN weights
- v.post_ln.{weight,bias} → post-layernorm weights
- mm.0.{weight,bias} → multimodal projector weights
- Vision weights are kept in float16/bfloat16 (GGUF spec)
- Loaded weights are tagged with is_gguf_weight = True
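An illustrative sketch of the kind of GGUF-to-HF name mapping involved. The actual loader resolves names through gguf-py's vision_name_map; the exact patterns and HF target names below are assumptions:

```python
import re

# Illustrative rules only; real mapping goes through gguf-py's vision_name_map.
_VISION_NAME_RULES: list[tuple[str, str]] = [
    (r"^v\.(\d+)\.attn\.(q|k|v|out)\.(weight|bias)$",
     r"vision_tower.encoder.layers.\1.self_attn.\2_proj.\3"),
    (r"^v\.(\d+)\.mlp\.(fc1|fc2)\.(weight|bias)$",
     r"vision_tower.encoder.layers.\1.mlp.\2.\3"),
    (r"^v\.post_ln\.(weight|bias)$", r"vision_tower.post_layernorm.\1"),
    (r"^mm\.0\.(weight|bias)$", r"multi_modal_projector.linear.\1"),
]


def map_mmproj_name(gguf_name: str) -> str | None:
    """Map a GGUF mmproj tensor name to an HF-style name (None if unmatched)."""
    for pattern, replacement in _VISION_NAME_RULES:
        if re.match(pattern, gguf_name):
            return re.sub(pattern, replacement, gguf_name)
    return None
```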
3. GGUF Processor Loading
Processors now correctly load from the tokenizer path for GGUF models:
- vllm/transformers_utils/processor.py: detects .gguf file paths and loads the processor from the tokenizer path instead
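A minimal sketch of the fallback, assuming a hypothetical helper name; the PR implements this inside vllm/transformers_utils/processor.py:

```python
from transformers import AutoProcessor


def load_processor(model_path: str, tokenizer_path: str):
    """Hypothetical helper: load the HF processor, falling back for GGUF files."""
    # AutoProcessor cannot read a bare .gguf file, so point it at the
    # tokenizer repo/directory instead.
    if model_path.endswith(".gguf"):
        return AutoProcessor.from_pretrained(tokenizer_path)
    return AutoProcessor.from_pretrained(model_path)
```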
4. V1 Engine Multimodal Support
Added multimodal embedding gathering in the V1 GPU model runner:
- vllm/v1/worker/gpu_model_runner.py: _gather_mm_embeddings() extracts vision embeddings after encoder execution
- The gathered embeddings are passed through get_multimodal_embeddings() for proper merging
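A conceptual sketch of how gathered vision embeddings end up merged into the text embedding sequence at image-placeholder positions. Shapes, names, and the placeholder token id are assumptions; the PR routes this through the model's get_multimodal_embeddings() path:

```python
import torch


def merge_mm_embeddings(
    input_ids: torch.Tensor,      # [num_tokens]
    text_embeds: torch.Tensor,    # [num_tokens, hidden]
    vision_embeds: torch.Tensor,  # [num_image_tokens, hidden]
    image_token_id: int,
) -> torch.Tensor:
    """Illustrative merge: fill image-placeholder rows with vision embeddings."""
    is_image = (input_ids == image_token_id).unsqueeze(-1)  # broadcast over hidden dim
    # masked_scatter consumes vision_embeds row by row, in placeholder order.
    return text_embeds.masked_scatter(is_image, vision_embeds.to(text_embeds.dtype))
```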
Isolation & Safety Guarantees
All changes are strictly scoped to Gemma3 GGUF multimodal models only. No other models or formats are affected.
Isolation Verification
All new code paths are gated behind GGUF-specific checks:
- if quantization == "gguf"
- if quantization in ("gguf", "inc")
- if self.model.endswith('.gguf')
- if self._is_gguf
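A sketch of the gating pattern these checks build up to. The function below is illustrative; the real check is _detect_gguf_multimodal_gemma3() in vllm/config/model.py, whose exact signature is not reproduced here:

```python
import glob
import os


def is_gguf_multimodal_gemma3(model_path: str) -> bool:
    """Illustrative gate: only GGUF checkpoints with a sibling mmproj file qualify."""
    if not model_path.endswith(".gguf"):
        return False  # non-GGUF checkpoints never enter this code path
    model_dir = os.path.dirname(os.path.abspath(model_path))
    # A sibling mmproj-*.gguf file signals that vision tower weights are available.
    return bool(glob.glob(os.path.join(model_dir, "mmproj-*.gguf")))
```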
Non-Regression Tests
Comprehensive testing confirms zero impact on other model types:
No changes to:
Testing
Test Environment
Models tested:
- google/gemma-3-1b-it (HF text-only)
- google/gemma-3-4b-it (HF text-only & multimodal)
- google/gemma-3-4b-it-q4_0-gguf (GGUF text-only & multimodal)
Backward Compatibility
100% backward compatible - no breaking changes to existing functionality.
Checklist
[Model] Add Gemma3 GGUF multimodal support
Additional Context
This PR builds upon the recently merged text-only GGUF support for Gemma3 and extends it to support multimodal inference. The implementation carefully preserves all upstream bugfixes, including the recent newline token ordering fix (#27538).
Tested on: vLLM main branch (commit b5d90f740, Oct 29 2025)
Signed-off-by: Luciano Martins [email protected]