
Commit df47c7a

lucianommartins and Chenyaaang authored and committed
[Model] Revert PR vllm-project#26715: Restore custom PaliGemma and Gemma3-MM impl… (vllm-project#27309)
Signed-off-by: Luciano Martins <[email protected]>
Co-authored-by: Luciano Martins <[email protected]>
1 parent 5f8c69a commit df47c7a

File tree: 12 files changed (+1219 additions, −54 deletions)


docs/models/hardware_supported_models/tpu.md

Lines changed: 2 additions & 2 deletions
@@ -16,8 +16,8 @@
 | meta-llama/Llama-4-* | Llama4ForConditionalGeneration ||
 | microsoft/Phi-3-mini-128k-instruct | Phi3ForCausalLM | 🟨 |
 | microsoft/phi-4 | Phi3ForCausalLM ||
-| google/gemma-3-27b-it | TransformersForMultimodalLM | 🟨 |
-| google/gemma-3-4b-it | TransformersForMultimodalLM ||
+| google/gemma-3-27b-it | Gemma3ForConditionalGeneration | 🟨 |
+| google/gemma-3-4b-it | Gemma3ForConditionalGeneration ||
 | deepseek-ai/DeepSeek-R1 | DeepseekV3ForCausalLM ||
 | deepseek-ai/DeepSeek-V3 | DeepseekV3ForCausalLM ||
 | RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 | LlamaForCausalLM ||

docs/models/supported_models.md

Lines changed: 20 additions & 3 deletions
@@ -641,6 +641,7 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
 | `DeepseekVLV2ForCausalLM`<sup>^</sup> | DeepSeek-VL2 | T + I<sup>+</sup> | `deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2`, etc. | | ✅︎ |
 | `Ernie4_5_VLMoeForConditionalGeneration` | Ernie4.5-VL | T + I<sup>+</sup>/ V<sup>+</sup> | `baidu/ERNIE-4.5-VL-28B-A3B-PT`, `baidu/ERNIE-4.5-VL-424B-A47B-PT` | | ✅︎ |
 | `FuyuForCausalLM` | Fuyu | T + I | `adept/fuyu-8b`, etc. | | ✅︎ |
+| `Gemma3ForConditionalGeneration` | Gemma 3 | T + I<sup>+</sup> | `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc. | ✅︎ | ✅︎ |
 | `Gemma3nForConditionalGeneration` | Gemma 3n | T + I + A | `google/gemma-3n-E2B-it`, `google/gemma-3n-E4B-it`, etc. | | |
 | `GLM4VForCausalLM`<sup>^</sup> | GLM-4V | T + I | `zai-org/glm-4v-9b`, `zai-org/cogagent-9b-20241220`, etc. | ✅︎ | ✅︎ |
 | `Glm4vForConditionalGeneration` | GLM-4.1V-Thinking | T + I<sup>E+</sup> + V<sup>E+</sup> | `zai-org/GLM-4.1V-9B-Thinking`, etc. | ✅︎ | ✅︎ |
@@ -670,6 +671,7 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
 | `NVLM_D_Model` | NVLM-D 1.0 | T + I<sup>+</sup> | `nvidia/NVLM-D-72B`, etc. | | ✅︎ |
 | `Ovis` | Ovis2, Ovis1.6 | T + I<sup>+</sup> | `AIDC-AI/Ovis2-1B`, `AIDC-AI/Ovis1.6-Llama3.2-3B`, etc. | | ✅︎ |
 | `Ovis2_5` | Ovis2.5 | T + I<sup>+</sup> + V | `AIDC-AI/Ovis2.5-9B`, etc. | | |
+| `PaliGemmaForConditionalGeneration` | PaliGemma, PaliGemma 2 | T + I<sup>E</sup> | `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc. | | ✅︎ |
 | `Phi3VForCausalLM` | Phi-3-Vision, Phi-3.5-Vision | T + I<sup>E+</sup> | `microsoft/Phi-3-vision-128k-instruct`, `microsoft/Phi-3.5-vision-instruct`, etc. | | ✅︎ |
 | `Phi4MMForCausalLM` | Phi-4-multimodal | T + I<sup>+</sup> / T + A<sup>+</sup> / I<sup>+</sup> + A<sup>+</sup> | `microsoft/Phi-4-multimodal-instruct`, etc. | ✅︎ | ✅︎ |
 | `Phi4MultimodalForCausalLM` | Phi-4-multimodal (HF Transformers) | T + I<sup>+</sup> / T + A<sup>+</sup> / I<sup>+</sup> + A<sup>+</sup> | `microsoft/Phi-4-multimodal-instruct` (with revision `refs/pr/70`), etc. | ✅︎ | ✅︎ |
@@ -694,8 +696,6 @@ Some models are supported only via the [Transformers backend](#transformers). Th
 | Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
 |--------------|--------|--------|-------------------|-----------------------------|-----------------------------------------|
 | `Emu3ForConditionalGeneration` | Emu3 | T + I | `BAAI/Emu3-Chat-hf` | ✅︎ | ✅︎ |
-| `Gemma3ForConditionalGeneration` | Gemma 3 | T + I<sup>+</sup> | `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc. | ✅︎ | ✅︎ |
-| `PaliGemmaForConditionalGeneration` | PaliGemma, PaliGemma 2 | T + I<sup>E</sup> | `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc. | | ✅︎ |
 
 <sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
 &nbsp;&nbsp;&nbsp;&nbsp;• For example, to use DeepSeek-VL2 series models:
@@ -704,7 +704,21 @@ Some models are supported only via the [Transformers backend](#transformers). Th
 <sup>+</sup> Multiple items can be inputted per text prompt for this modality.
 
 !!! warning
-    For `Gemma3ForConditionalGeneration`, `{"do_pan_and_scan": true}` is not supported in Transformers backend yet.
+    Both V0 and V1 support `Gemma3ForConditionalGeneration` for text-only inputs.
+    However, there are differences in how they handle text + image inputs:
+
+    V0 correctly implements the model's attention pattern:
+        - Uses bidirectional attention between the image tokens corresponding to the same image
+        - Uses causal attention for other tokens
+        - Implemented via (naive) PyTorch SDPA with masking tensors
+        - Note: May use significant memory for long prompts with image
+
+    V1 currently uses a simplified attention pattern:
+        - Uses causal attention for all tokens, including image tokens
+        - Generates reasonable outputs but does not match the original model's attention for text + image inputs, especially when `{"do_pan_and_scan": true}`
+        - Will be updated in the future to support the correct behavior
+
+    This limitation exists because the model's mixed attention pattern (bidirectional for images, causal otherwise) is not yet supported by vLLM's attention backends.
 
 !!! note
     `Gemma3nForConditionalGeneration` is only supported on V1 due to shared KV caching and it depends on `timm>=1.0.17` to make use of its
@@ -756,6 +770,9 @@ Some models are supported only via the [Transformers backend](#transformers). Th
     The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
     For more details, please see: <https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630>
 
+!!! warning
+    Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
+
 !!! note
     For Qwen2.5-Omni and Qwen3-Omni, reading audio from video pre-processing (`--mm-processor-kwargs '{"use_audio_in_video": true}'`) is currently work in progress and not yet supported.

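The restored warning above describes a mixed attention pattern: bidirectional attention among tokens of the same image, causal attention everywhere else. As a rough illustration only (not the vLLM backend code), a boolean mask with that shape can be sketched in PyTorch, assuming a per-position `image_ids` tensor that holds -1 for text tokens and the image index for image tokens:

import torch

def build_mixed_attention_mask(image_ids: torch.Tensor) -> torch.Tensor:
    """Return a [seq, seq] boolean mask; True means attention is allowed."""
    seq_len = image_ids.shape[0]
    # Causal base: each token attends to itself and earlier positions.
    allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Bidirectional attention between tokens that belong to the same image.
    is_image = image_ids >= 0
    same_image = (
        (image_ids.unsqueeze(0) == image_ids.unsqueeze(1))
        & is_image.unsqueeze(0)
        & is_image.unsqueeze(1)
    )
    return allowed | same_image

# Two text tokens, three tokens of image 0, then two more text tokens.
ids = torch.tensor([-1, -1, 0, 0, 0, -1, -1])
mask = build_mixed_attention_mask(ids)
assert mask[2, 4]       # image tokens of the same image see each other
assert not mask[0, 1]   # text tokens stay strictly causal

Per the warning, V0 applies this kind of mask through naive SDPA with masking tensors, while V1 currently falls back to a purely causal mask.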
examples/offline_inference/vision_language.py

Lines changed: 1 addition & 2 deletions
@@ -275,8 +275,7 @@ def run_gemma3(questions: list[str], modality: str) -> ModelRequestData:
         model=model_name,
         max_model_len=2048,
         max_num_seqs=2,
-        # TODO: Support this in transformers backend
-        # mm_processor_kwargs={"do_pan_and_scan": True},
+        mm_processor_kwargs={"do_pan_and_scan": True},
         limit_mm_per_prompt={modality: 1},
     )

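For context, the example above re-enables pan-and-scan through `mm_processor_kwargs`. A minimal standalone sketch of the same idea with vLLM's offline `LLM` API follows; the image path is a placeholder and the chat prompt format is taken from the Gemma 3 prompts used elsewhere in this commit:

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-4b-it",
    max_model_len=2048,
    max_num_seqs=2,
    # Pan-and-scan preprocessing is forwarded to the HF processor.
    mm_processor_kwargs={"do_pan_and_scan": True},
    limit_mm_per_prompt={"image": 1},
)

prompt = (
    "<bos><start_of_turn>user\n"
    "<start_of_image>What's the content in the center of the image?<end_of_turn>\n"
    "<start_of_turn>model\n"
)
image = Image.open("stop_sign.jpg")  # placeholder path, not part of the repo

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)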
tests/models/language/generation/test_gemma.py

Lines changed: 11 additions & 5 deletions
@@ -3,7 +3,7 @@
 import numpy as np
 import pytest
 
-MODELS = ["google/gemma-2b", "google/gemma-2-2b"]
+MODELS = ["google/gemma-2b", "google/gemma-2-2b", "google/gemma-3-4b-it"]
 
 
 @pytest.mark.parametrize("model", MODELS)
@@ -14,8 +14,14 @@ def test_dummy_loader(vllm_runner, monkeypatch, model: str) -> None:
             model,
             load_format="dummy",
         ) as llm:
-            normalizers = llm.apply_model(
-                lambda model: model.model.normalizer.cpu().item()
-            )
-            config = llm.llm.llm_engine.model_config.hf_config
+            if model == "google/gemma-3-4b-it":
+                normalizers = llm.llm.collective_rpc(
+                    lambda self: self.model_runner.model.language_model.model.normalizer.cpu().item()  # noqa: E501
+                )
+                config = llm.llm.llm_engine.model_config.hf_config.text_config
+            else:
+                normalizers = llm.llm.collective_rpc(
+                    lambda self: self.model_runner.model.model.normalizer.cpu().item()
+                )
+                config = llm.llm.llm_engine.model_config.hf_config
             assert np.allclose(normalizers, config.hidden_size**0.5, rtol=2e-3)

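The updated test reaches into worker state through `collective_rpc`. A hedged sketch of the same pattern outside the test harness, reusing the attribute paths from the diff (the multimodal Gemma 3 checkpoint nests its text stack under `language_model` and its config under `hf_config.text_config`):

from vllm import LLM

# Dummy weights are enough here; we only read a model attribute.
llm = LLM(model="google/gemma-3-4b-it", load_format="dummy")

# Depending on the vLLM version, sending a lambda over collective_rpc may
# require extra serialization settings, as in the test environment.
normalizers = llm.collective_rpc(
    lambda self: self.model_runner.model.language_model.model.normalizer.cpu().item()
)
hidden_size = llm.llm_engine.model_config.hf_config.text_config.hidden_size

# The Gemma embedding normalizer should be ~sqrt(hidden_size) on every worker.
print(normalizers, hidden_size**0.5)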
tests/models/multimodal/generation/test_common.py

Lines changed: 40 additions & 34 deletions
@@ -113,6 +113,25 @@
         dtype="bfloat16" if current_platform.is_cpu() else "auto",
         marks=[pytest.mark.core_model, pytest.mark.cpu_model],
     ),
+    "paligemma": VLMTestInfo(
+        models=["google/paligemma-3b-mix-224"],
+        test_type=VLMTestType.IMAGE,
+        prompt_formatter=identity,
+        img_idx_to_prompt=lambda idx: "",
+        # Paligemma uses its own sample prompts because the default one fails
+        single_image_prompts=IMAGE_ASSETS.prompts(
+            {
+                "stop_sign": "caption es",
+                "cherry_blossom": "What is in the picture?",
+            }
+        ),
+        auto_cls=AutoModelForImageTextToText,
+        vllm_output_post_proc=model_utils.paligemma_vllm_to_hf_output,
+        dtype="bfloat16",
+        marks=[
+            pytest.mark.skip(reason="vLLM does not support PrefixLM attention mask")
+        ],
+    ),
     "qwen2_5_vl": VLMTestInfo(
         models=["Qwen/Qwen2.5-VL-3B-Instruct"],
         test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE, VLMTestType.VIDEO),
@@ -177,24 +196,14 @@
     # Gemma3 has bidirectional mask on images
     "gemma3-transformers": VLMTestInfo(
         models=["google/gemma-3-4b-it"],
-        test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE),
-        prompt_formatter=lambda img_prompt: f"<bos><start_of_turn>user\n{img_prompt}<end_of_turn>\n<start_of_turn>model\n",  # noqa: E501
-        single_image_prompts=IMAGE_ASSETS.prompts(
-            {
-                "stop_sign": "<start_of_image>What's the content in the center of the image?",  # noqa: E501
-                "cherry_blossom": "<start_of_image>What is the season?",
-            }
-        ),
-        multi_image_prompt="<start_of_image><start_of_image>Describe the two images in detail.",  # noqa: E501
-        max_model_len=8192,
+        test_type=VLMTestType.IMAGE,
+        prompt_formatter=lambda vid_prompt: f"<'<bos><start_of_turn>user\n{vid_prompt}<start_of_image><end_of_turn>\n<start_of_turn>model\n",  # noqa: E501
+        max_model_len=4096,
         auto_cls=AutoModelForImageTextToText,
-        # TODO: Support `do_pan_and_scan` in transformers backend
-        # patch_hf_runner=model_utils.gemma3_patch_hf_runner,
         vllm_output_post_proc=model_utils.gemma3_vllm_to_hf_output,
         image_size_factors=[(0.25, 0.5, 1.0)],
         vllm_runner_kwargs={
             "model_impl": "transformers",
-            # "mm_processor_kwargs": {"do_pan_and_scan": True},
         },
         marks=[pytest.mark.core_model],
     ),
@@ -213,27 +222,6 @@
         },
         marks=[pytest.mark.core_model],
     ),
-    # PaliGemma has PrefixLM attention
-    "paligemma-transformers": VLMTestInfo(
-        models=["google/paligemma-3b-mix-224"],
-        test_type=VLMTestType.IMAGE,
-        prompt_formatter=identity,
-        img_idx_to_prompt=lambda idx: "",
-        # PaliGemma uses its own sample prompts because the default one fails
-        single_image_prompts=IMAGE_ASSETS.prompts(
-            {
-                "stop_sign": "caption es",
-                "cherry_blossom": "What is in the picture?",
-            }
-        ),
-        auto_cls=AutoModelForImageTextToText,
-        vllm_output_post_proc=model_utils.paligemma_vllm_to_hf_output,
-        image_size_factors=[(0.25, 0.5, 1.0)],
-        vllm_runner_kwargs={
-            "model_impl": "transformers",
-        },
-        marks=[pytest.mark.core_model],
-    ),
     # Pixel values from processor are not 4D or 5D arrays
     "qwen2_5_vl-transformers": VLMTestInfo(
         models=["Qwen/Qwen2.5-VL-3B-Instruct"],
@@ -360,6 +348,24 @@
         image_size_factors=[(), (0.25,), (0.25, 0.25, 0.25), (0.25, 0.2, 0.15)],
         marks=[large_gpu_mark(min_gb=32)],
     ),
+    "gemma3": VLMTestInfo(
+        models=["google/gemma-3-4b-it"],
+        test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE),
+        prompt_formatter=lambda img_prompt: f"<bos><start_of_turn>user\n{img_prompt}<end_of_turn>\n<start_of_turn>model\n",  # noqa: E501
+        single_image_prompts=IMAGE_ASSETS.prompts(
+            {
+                "stop_sign": "<start_of_image>What's the content in the center of the image?",  # noqa: E501
+                "cherry_blossom": "<start_of_image>What is the season?",
+            }
+        ),
+        multi_image_prompt="<start_of_image><start_of_image>Describe the two images in detail.",  # noqa: E501
+        max_model_len=4096,
+        max_num_seqs=2,
+        auto_cls=AutoModelForImageTextToText,
+        vllm_runner_kwargs={"mm_processor_kwargs": {"do_pan_and_scan": True}},
+        patch_hf_runner=model_utils.gemma3_patch_hf_runner,
+        num_logprobs=10,
+    ),
     "glm4v": VLMTestInfo(
         models=["zai-org/glm-4v-9b"],
         test_type=VLMTestType.IMAGE,

tests/models/multimodal/generation/vlm_utils/model_utils.py

Lines changed: 10 additions & 0 deletions
@@ -328,6 +328,16 @@ def processor(*args, **kwargs):
 
     hf_model.processor = processor
 
+    orig_generate = hf_model.model.generate
+
+    def _generate(self, *args, **kwargs):
+        # FIXME: https://github.com/huggingface/transformers/issues/38333
+        kwargs["disable_compile"] = True
+
+        return orig_generate(*args, **kwargs)
+
+    hf_model.model.generate = types.MethodType(_generate, hf_model.model)
+
     return hf_model

tests/models/multimodal/processing/test_common.py

Lines changed: 4 additions & 0 deletions
@@ -222,6 +222,7 @@ def _to_dummy_options(modality: str, count: int) -> BaseDummyOptions:
 _ADD_SPECIAL_TOKENS_OVERRIDES = {
     "ovis": False,
     "ovis2_5": False,
+    "paligemma": False,
     "ultravox": False,
     "whisper": False,
 }
@@ -333,6 +334,7 @@ def _test_processing_correctness_one(
     "deepseek-ai/deepseek-vl2-tiny",
     "baidu/ERNIE-4.5-VL-28B-A3B-PT",
     "adept/fuyu-8b",
+    "google/gemma-3-4b-it",
     "google/gemma-3n-E2B-it",
     "zai-org/glm-4v-9b",
     "zai-org/GLM-4.1V-9B-Thinking",
@@ -369,6 +371,8 @@ def _test_processing_correctness_one(
     "AIDC-AI/Ovis1.6-Llama3.2-3B",
     "AIDC-AI/Ovis2-1B",
     "AIDC-AI/Ovis2.5-2B",
+    "google/paligemma-3b-mix-224",
+    "google/paligemma2-3b-ft-docci-448",
     "microsoft/Phi-3.5-vision-instruct",
     "microsoft/Phi-4-multimodal-instruct",
     "mistralai/Pixtral-12B-2409",

tests/models/multimodal/processing/test_tensor_schema.py

Lines changed: 1 addition & 0 deletions
@@ -48,6 +48,7 @@
     "Idefics3ForConditionalGeneration",
     "LlavaForConditionalGeneration",
     "MiniCPMV",
+    "PaliGemmaForConditionalGeneration",
 ]
 REPO_ID_TO_SKIP = {
     "nm-testing/pixtral-12b-FP8-dynamic": "duplicated test",
