[Model] Restore Gemma3 GGUF multimodal support with GGUF-only guards #29198
Conversation
Code Review
This pull request effectively restores multimodal support for Gemma3 GGUF models by introducing robust file-format-based guards. The approach is sound and the defense-in-depth mechanism is a good practice to prevent regressions on HuggingFace models. My review focuses on performance optimizations within the newly restored generate_attention_masks method, suggesting more idiomatic and efficient PyTorch constructs to improve performance on this critical path.
Hi @Isotr0py / @DarkLight1337, this is a quick one - reintroducing #27772, but now with guardrails to avoid the problems that caused the PR to be reverted via #28995. It is pretty much all reviewed (not much changed since #27772) and ready to go :)
hey @DarkLight1337, the failing test is not related to my PR - can we move forward?
This pull request has merge conflicts that must be resolved before it can be merged.
Can you update the GGUF multimodal test to align with the HF-format gemma3 test's input for validation? I observed it failing with the custom attention mask before:

```python
"gemma3": VLMTestInfo(
    models=["google/gemma-3-4b-it"],
    test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE),
    prompt_formatter=lambda img_prompt: f"<bos><start_of_turn>user\n{img_prompt}<end_of_turn>\n<start_of_turn>model\n",  # noqa: E501
    single_image_prompts=IMAGE_ASSETS.prompts(
        {
            "stop_sign": "<start_of_image>What's the content in the center of the image?",  # noqa: E501
            "cherry_blossom": "<start_of_image>What is the season?",
        }
    ),
    multi_image_prompt="<start_of_image><start_of_image>Describe the two images in detail.",  # noqa: E501
    max_model_len=4096,
    max_num_seqs=2,
    auto_cls=AutoModelForImageTextToText,
    vllm_runner_kwargs={"mm_processor_kwargs": {"do_pan_and_scan": True}},
    patch_hf_runner=model_utils.gemma3_patch_hf_runner,
    num_logprobs=10,
    image_size_factors=[(0.25, 0.5, 1.0)],
),
```
hey @Isotr0py - I've added standardized GGUF multimodal tests following the changes in … The test compares GGUF vLLM output against HF safetensors vLLM output using …
hey @Isotr0py, can you re-run the …?
Hmmm, we need to compare against the HF (transformers) output of the HF safetensors model instead of vLLM output.
hey @Isotr0py - do you mean running against HF models with …? Just double checking, because I modeled the test after the existing … Should I update the …?
Yes, I meant updating …
done, @Isotr0py - now the whole …
@Isotr0py - do you know the memory limits of the containers running the tests? My multimodal test failed with a segmentation fault when trying to load the …, with the stack pointing to line 121 from … Also, it is suspicious that I have a larger GPU on my test environment and this test ran with no errors.
just for reference: the … and the …
@lucianommartins I just updated the GGUF multimodal test with … BTW, if I disabled the custom attention mask with …, … Can you please check the custom attention mask implementation?
Hi @Isotr0py - there was a bug in the crop manipulation when dealing with batched requests (that is how your scenario caught the issue). It is fixed now and the …

fyi @Isotr0py - …
Restores custom attention mask generation for Gemma3 GGUF multimodal models that was partially reverted in vllm-project#28995. Implements robust GGUF-only guards to ensure the feature only applies to GGUF models and does not affect HF models.

Changes:
- Add uses_custom_attention_masks() utility with GGUF file format check
- Add uses_custom_attention_masks property to ModelConfig
- Initialize uses_custom_attention_masks in GPUModelRunner
- Restore generate_attention_masks() method to Gemma3ForConditionalGeneration
- Implement 3-layer defense-in-depth guard mechanism

The implementation uses check_gguf_file() to guarantee that custom attention mask logic only triggers for GGUF files, preventing the issue that caused the original revert where HF models incorrectly triggered the custom logic.

Tested with GGUF models (1B, 4B, 270M) for both text-only and multimodal inference. HF model compatibility verified via the pytest multimodal test suite.

Signed-off-by: Luciano Martins <[email protected]>
Fixes seven critical issues in GGUF multimodal inference:

1. Attention scaling parameter bug (gemma3.py):
   - Fix F.scaled_dot_product_attention to use named parameters
   - Changed positional args to attn_mask=attn_mask, scale=self.scaling
   - Prevents incorrect dropout application (was 6.25% instead of 0%)

2. Custom attention mask persistence (gpu_model_runner.py):
   - Store custom_model_kwargs after mask generation
   - Merge custom_model_kwargs in _dummy_run
   - Prevents loss of attention masks during CUDA graph re-initialization

3. Pan-and-scan attention pattern (gemma3_mm.py):
   - Detect pan-and-scan mode via multimodal_config.do_pan_and_scan
   - Prevents crop isolation artifacts in sequential processing

4. GGUF unquantized weight loading (weight_utils.py):
   - Add proper dtype conversion for BF16/F16/F32 stored as uint8
   - Handle byte-to-dtype conversion (BF16: 2 bytes, F16: 2 bytes, F32: 4 bytes)
   - Add fallback handling for unexpected dtype/type combinations
   - Fixes weight loading for unquantized GGUF multimodal projector weights

5. Fix GGUF pan-and-scan attention and token handling:
   - Unified bidirectional attention for all image tokens (fixes random output)
   - Fixed pan-and-scan token splits via position discontinuity detection
   - Corrected num_image_patches to reflect actual crop count from pixel_values

6. Add GGUF multimodal integration tests (test_multimodal_gguf.py):
   - Compare GGUF (vLLM) output against native HuggingFace (HfRunner)
   - Test regular multimodal inference with QAT Q4_0 GGUF
   - Test pan-and-scan multimodal with unquantized BF16 GGUF
   - Use @create_new_process_for_each_test for subprocess isolation

7. Robust sequence detection for batched multimodal (gemma3_mm.py):
   - Replace the heuristic with query_start_loc-based sequence detection
   - Pass query_start_loc from gpu_model_runner.py to generate_attention_masks()
   - Fixes multi-scale batching failures where num_computed_tokens > 0
   - Ensures correct attention mask generation for all batch configurations

Signed-off-by: Luciano Martins <[email protected]>
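To make item 1 concrete, here is a minimal, self-contained sketch (not the actual gemma3.py diff) of the positional-vs-named argument issue; the tensor shapes and the 0.0625 scale value are placeholder assumptions:

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 16, 256)   # (batch, heads, seq, head_dim) - placeholder shapes
k = torch.randn(1, 8, 16, 256)
v = torch.randn(1, 8, 16, 256)
attn_mask = torch.zeros(1, 1, 16, 16)   # additive mask; zeros = no masking
scaling = 0.0625                         # Gemma3-style query scaling (1/16), assumed

# Buggy form: passing the mask and scale positionally slots `scaling` into the
# dropout_p argument, silently applying 6.25% dropout during inference.
# out = F.scaled_dot_product_attention(q, k, v, attn_mask, scaling)

# Fixed form: named parameters route each value to the intended argument.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, scale=scaling)
```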
Summary
This PR restores custom attention mask generation for Gemma3 GGUF multimodal models that was partially reverted in #28995. The implementation uses robust GGUF-only file format guards to ensure the feature exclusively applies to GGUF models and does not affect HuggingFace models.
Resolves: #28995 (HF model regression)
Restores functionality from: #27772
Background
PR #27772 initially added Gemma3 GGUF multimodal support, enabling users to run quantized Gemma3 multimodal models with both text-only and image+text prompts. However, it was partially reverted in #28995 because the custom attention mask logic incorrectly triggered for HuggingFace models, causing test failures.
Root cause of #28995: The original implementation lacked file format guards, causing the custom attention mask generation to activate for both GGUF and HF models.
Solution
This PR addresses the regression by implementing a 3-layer defense-in-depth guard mechanism:
Layer 1: Model Format Check (Primary Guard)
Layer 2: Multimodal Feature Check
Layer 3: Method Existence Check
Result: HF models never have uses_custom_attention_masks = True, preventing the issue that caused #28995 (a hedged sketch of the guard logic follows below).
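For illustration, a minimal sketch of how such a guard could be structured; uses_custom_attention_masks() and check_gguf_file() are the names this PR introduces, but the suffix-based file check and the model-type tuple below are simplifications, not the actual vLLM implementation:

```python
from pathlib import Path

# Assumption: only Gemma3 currently needs custom multimodal attention masks.
_CUSTOM_ATTN_MASK_MODEL_TYPES = ("gemma3",)


def check_gguf_file(model: str | Path) -> bool:
    """Layer 1 (primary guard): simplified GGUF file-format check.
    The real check_gguf_file() in vLLM is more thorough; this stand-in
    only looks at the file suffix."""
    path = Path(model)
    return path.is_file() and path.suffix == ".gguf"


def uses_custom_attention_masks(model: str | Path, model_type: str,
                                is_multimodal: bool) -> bool:
    """Return True only for GGUF + multimodal + supported model type."""
    return (
        check_gguf_file(model)                      # Layer 1: GGUF file format
        and is_multimodal                           # Layer 2: multimodal feature
        and model_type in _CUSTOM_ATTN_MASK_MODEL_TYPES
    )


# Because HF checkpoints are never .gguf files, Layer 1 short-circuits the
# whole expression, so HF models can never end up with the flag set to True.
```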
Changes
Files Modified (4)
- vllm/transformers_utils/config.py: add the uses_custom_attention_masks() utility function built on check_gguf_file()
- vllm/config/model.py: add the uses_custom_attention_masks property to ModelConfig
- vllm/v1/worker/gpu_model_runner.py: initialize the uses_custom_attention_masks attribute in GPUModelRunner
- vllm/model_executor/models/gemma3_mm.py: restore the generate_attention_masks() method

Test Plan
GGUF Model Validation
Tested with multiple quantized Gemma3 GGUF models to ensure functionality across different model sizes:
Text-Only Inference:
Multimodal Inference:
- Image+text prompts with the mmproj.gguf vision tower (usage sketch below)
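As a usage illustration only (not taken from this PR), a sketch of offline multimodal inference with a local GGUF checkpoint through vLLM's LLM API; the file path, tokenizer repo, prompt template, and image are assumptions, and how the mmproj.gguf vision tower is discovered is not shown here:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Assumptions: a local Q4_0 Gemma3 GGUF checkpoint plus the matching HF tokenizer.
llm = LLM(
    model="./gemma-3-4b-it-q4_0.gguf",   # hypothetical local GGUF path
    tokenizer="google/gemma-3-4b-it",    # GGUF checkpoints reuse the HF tokenizer
    max_model_len=4096,
)

prompt = (
    "<bos><start_of_turn>user\n"
    "<start_of_image>What's the content in the center of the image?"
    "<end_of_turn>\n<start_of_turn>model\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": Image.open("stop_sign.jpg")}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```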
HuggingFace Model Regression Testing
Executed the full vLLM multimodal test suite to verify zero impact on HF models:

```
pytest -s -v tests/models/multimodal/generation/test_common.py -k "gemma3-test"
```

This ensures the GGUF guards prevent any unintended activation of custom attention mask logic for HuggingFace models.
Test Results
GGUF Model Results (All Pass)
Multimodal Output Example:
HuggingFace Model Regression Test (All Pass)
```
pytest -s -v tests/models/multimodal/generation/test_common.py -k gemma3-test
# Result: 8 passed, 335 deselected, 23 warnings in 915.69s (15m 15s)
```

Test Coverage:
Verification of Fix for #28995
The failing test from #28995 (pytest … gemma3-test) now passes completely.

Why it works now: uses_custom_attention_masks returns False for HF models (no .gguf file detected).

Isolation & Safety Guarantees
How HF Models Are Protected:
- File Format Check: check_gguf_file() must detect a GGUF file
- Short-Circuit Logic
- Runtime Guard: generate_attention_masks() is only called when the flag is set and the method exists (see the sketch below)
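A rough sketch of what the runtime guard amounts to; the wrapper function below is hypothetical, and only the uses_custom_attention_masks flag and generate_attention_masks() method names come from this PR:

```python
from typing import Any


def maybe_generate_attention_masks(runner: Any, model: Any, **mask_kwargs) -> Any | None:
    """Layer 3: only call into the model when the config flag is set AND the
    loaded model actually implements generate_attention_masks()."""
    if getattr(runner, "uses_custom_attention_masks", False) and hasattr(
            model, "generate_attention_masks"):
        return model.generate_attention_masks(**mask_kwargs)
    # HF models (or models without the method) take the normal code path.
    return None
```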
What Changed from #27772:
Code Quality
- Lint and type checks pass (ruff check, ruff format, mypy)

Backward Compatibility
Documentation
No user-facing documentation changes required. The feature is transparent to users - GGUF Gemma3 multimodal models work automatically without configuration.
Release Notes
This fix should be included in release notes as:
Checklist
- PR title follows the [Model] <description> convention
- Commits are signed off (git commit -s)

Related PRs:
- #27772 (original Gemma3 GGUF multimodal support)
- #28995 (partial revert)