43054: Add Siglip2Tokenizer to enforce training-time text preprocessing defaults #43101
Conversation
Here are the test results, including the one that I added in this PR.
I verified locally that SigLIP2 text embeddings behave as expected once the training-time preprocessing is applied (lowercasing + fixed-length padding with max_length=64). This PR intentionally avoids changing processor defaults and instead documents the correct usage and adds a small test to prevent regressions.
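For reference, the verification was roughly along these lines (a minimal sketch, not part of the PR diff; the checkpoint name comes from the links later in this thread, and exact outputs are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "google/siglip2-base-patch16-224"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

texts = ["a photo of 2 CATS", "a photo of a dog"]
# Training-time text preprocessing for SigLIP2: lowercase + fixed-length padding to 64 tokens.
inputs = tokenizer(
    [t.lower() for t in texts],
    padding="max_length",
    truncation=True,
    max_length=64,
    return_tensors="pt",
)
with torch.no_grad():
    text_embeds = model.get_text_features(**inputs)
print(text_embeds.shape)  # (2, hidden_dim)
```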
fancyerii left a comment
It's ok.
I think it should be done at the tokenizer level. The Siglip model, for example, loads a SiglipTokenizer class which lowercases inputs before tokenizing.
@itazap can we use SiglipTokenizer here as well or maybe add a special Siglip2Tokenizer class?
@zucchini-nlp Thanks for reviewing this PR.
This sounds like the most reasonable solution, no?
Thanks a lot for reviewing the PR @ArthurZucker.
Thanks for the guidance @ArthurZucker, this is now addressed. I implemented automatic lowercasing and the SigLIP2 default text padding/truncation directly in SiglipProcessor.__call__, gated to SigLIP2-branded checkpoints. This avoids changing GemmaTokenizerFast globally or introducing a new tokenizer class, while ensuring SigLIP2 users get the correct preprocessing by default via AutoProcessor. I have also updated the SigLIP2 docs to clarify the preprocessing expectations, noted that these defaults are now handled automatically by the processor, and added regression tests covering this behavior. Please let me know if you'd prefer this logic to live elsewhere, but this seemed like the least intrusive fix.
Looking for your inputs here as well @ArthurZucker.
def _is_siglip2_checkpoint(processor) -> bool:
    """
    SigLIP2 checkpoints currently ship a SiglipConfig (model_type='siglip') and thus load SiglipProcessor.
    We detect SigLIP2 primarily via the image processor type/module, with a tokenizer
    name/path fallback.
    """
    # SigLIP2 uses Siglip2ImageProcessor / Siglip2ImageProcessorFast
    image_processor = getattr(processor, "image_processor", None)
    if image_processor is not None:
        mod = getattr(image_processor, "__module__", "") or ""
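The hunk above is truncated by the diff view; a plausible completion of the helper, purely for illustration (the exact strings checked here are assumptions, not necessarily what the PR does), would be:

```python
def _is_siglip2_checkpoint(processor) -> bool:
    """Heuristic: detect SigLIP2 checkpoints via the image processor, with a tokenizer fallback."""
    image_processor = getattr(processor, "image_processor", None)
    if image_processor is not None:
        mod = getattr(image_processor, "__module__", "") or ""
        name = type(image_processor).__name__
        if "siglip2" in mod or name.startswith("Siglip2ImageProcessor"):
            return True
    # Fallback: look for "siglip2" in the tokenizer's name_or_path.
    tokenizer = getattr(processor, "tokenizer", None)
    name_or_path = (getattr(tokenizer, "name_or_path", "") or "").lower()
    return "siglip2" in name_or_path
```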
I think we need a Siglip2Tokenizer class in that case because it's not guaranteed that users load a whole processor when they only want to encode text samples
Good point @zucchini-nlp, processor-only fixes won’t cover text-only usage via AutoTokenizer.
SigLIP2 checkpoints ship:
• a SiglipConfig (model_type="siglip") → resolves to SiglipProcessor
• tokenizer_class="GemmaTokenizer" in tokenizer_config.json → resolves to GemmaTokenizer
siglip2 tokenizer_config.json (excerpt):
"model_max_length": 1000000000000000019884624838656,
"pad_token": "<pad>",
"padding_side": "right",
"processor_class": "SiglipProcessor",
"sp_model_kwargs": {},
"spaces_between_special_tokens": false,
"tokenizer_class": "GemmaTokenizer",
"unk_token": "<unk>",
"use_default_system_prompt": false
That means SigLIP2-specific text behavior (lowercasing + fixed-length padding to 64) is currently only applied when users go through the processor path. Text-only flows currently end up using Gemma’s tokenizer, which is expected and correct for Gemma models, but not sufficient for SigLIP2.
We shouldn’t bake SigLIP2 behavior into GemmaTokenizer, since that would silently affect all Gemma-based checkpoints. The clean fix is a dedicated Siglip2Tokenizer wrapper subclassing GemmaTokenizer that applies SigLIP2-specific defaults, and can be explicitly selected (or wired via metadata later).
I’ll add a Siglip2Tokenizer + unit test validating lowercasing and default padding/truncation. I believe that AutoTokenizer won’t pick it up automatically until model metadata is updated, but this PR adds the correct tokenizer abstraction on the library side.
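For concreteness, the wrapper being proposed here would look roughly like the sketch below, pieced together from the diff fragments quoted later in this thread (the _lowercase_text helper and exact signature are assumptions, and the reviewers ultimately push back on this inheritance approach):

```python
from transformers import GemmaTokenizer


def _lowercase_text(x):
    # Assumed helper: lowercase strings or (nested) lists of strings, pass everything else through.
    if isinstance(x, str):
        return x.lower()
    if isinstance(x, (list, tuple)):
        return [_lowercase_text(t) for t in x]
    return x


class Siglip2Tokenizer(GemmaTokenizer):
    """GemmaTokenizer with SigLIP2 training-time text defaults baked in."""

    def __call__(self, text=None, text_pair=None, **kwargs):
        text = _lowercase_text(text)
        text_pair = _lowercase_text(text_pair)
        # SigLIP2 text encoder was trained with fixed padding/truncation to 64 tokens;
        # only set defaults the caller did not specify.
        kwargs.setdefault("padding", "max_length")
        kwargs.setdefault("truncation", True)
        kwargs.setdefault("max_length", 64)
        return super().__call__(text=text, text_pair=text_pair, **kwargs)
```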
@zucchini-nlp I’ve updated the PR accordingly.
I’ve added a Siglip2Tokenizer wrapper (subclassing GemmaTokenizer) to explicitly encode the SigLIP2 text-training assumptions at the tokenizer level as well (lowercasing + default padding/truncation to length 64), and added unit tests covering both processor and tokenizer behavior.
Also for completeness: SigLIP2 model repos currently declare "tokenizer_class": "GemmaTokenizer" in tokenizer_config.json (https://huggingface.co/google/siglip2-base-patch16-224/blob/main/tokenizer_config.json#L2017), so AutoTokenizer.from_pretrained() will continue to return GemmaTokenizer unless that metadata is updated. This PR adds the correct tokenizer implementation on the library side, but switching the config to Siglip2Tokenizer would be required for the wrapper to be picked up automatically.
Looking for your feedback.
> AutoTokenizer.from_pretrained() will continue to return GemmaTokenizer

This is no longer true: AutoTokenizer will check whether the serialized class matches tokenizer_mapping[model_type], and since it does not, we can enforce whatever we want.
Got it, thanks for clarifying. I was assuming tokenizer_class in the checkpoint would always “win” and keep routing SigLIP2 to GemmaTokenizer, but I see now that AutoTokenizer checks the serialized class against the tokenizer_auto.py mapping for the model type.
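In other words, the decision happens against the model-type table in models/auto/tokenization_auto.py. A simplified illustration of the registration pattern (the siglip2 entry below is hypothetical, and the real table gates classes on optional dependencies):

```python
from collections import OrderedDict

TOKENIZER_MAPPING_NAMES = OrderedDict(
    [
        ("siglip", ("SiglipTokenizer", None)),
        # Hypothetical entry: route model_type "siglip2" to the new tokenizer class.
        ("siglip2", ("Siglip2Tokenizer", None)),
    ]
)
```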
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, siglip, siglip2
ArthurZucker left a comment
Hey! I really don't think we should do any of the fixes here, but rather on the hub.
We can add a siglip2 tokenizer, but it should be explicit that it uses a lowercase normalizer, and it should apply to all Siglip2 / siglip_2 tokenizers. If it's just for 1-2 checkpoints then it should be updated on the hub directly!
def __init__(self, image_processor, tokenizer):
    super().__init__(image_processor, tokenizer)

def __call__(self, images=None, text=None, **kwargs):
this should not be done here
    return x


class Siglip2Tokenizer(GemmaTokenizer):
we never use inheritance in transformers like this. Lowercasing is a normalizer from normalizers.Lowercase that just needs to be added to the sequence of normalizers for siglip models.
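Concretely, with the tokenizers library that would be something like the following (a minimal sketch, assuming the checkpoint resolves to a fast tokenizer backend):

```python
from tokenizers import normalizers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/siglip2-base-patch16-224")

# Prepend a Lowercase normalizer to whatever normalizer the checkpoint already ships.
existing = tok.backend_tokenizer.normalizer
steps = [normalizers.Lowercase()] + ([existing] if existing is not None else [])
tok.backend_tokenizer.normalizer = normalizers.Sequence(steps)

print(tok.tokenize("A Photo of 2 CATS"))  # tokens of the lowercased string
```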
        text = _lowercase_text(text)

        # SigLIP2 text encoder trained with padding to 64 tokens
        # Only set defaults if the user didn't specify them.
        kwargs.setdefault("padding", "max_length")
        kwargs.setdefault("truncation", True)
        kwargs.setdefault("max_length", 64)
is this needed now? I think these are saved in tokenizer/processor config, except for lowercase
@@ -0,0 +1,58 @@
# Copyright 2024 The HuggingFace Inc. team.
nit: 2025
    return x


class Siglip2Tokenizer(GemmaTokenizer):
if we are to inherit from an existing tokenizer class, we'd better place it in modular. Modular will take care of copying the code and making sure that Siglip2 doesn't depend on Gemma at run-time
        # SigLIP2 training: fixed padding/truncation to 64
        kwargs.setdefault("padding", "max_length")
        kwargs.setdefault("truncation", True)
        kwargs.setdefault("max_length", 64)
not sure about this. It also has to be in saved config and we can also add default values in ModelProcessorKwargs
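That would look something like the existing ProcessingKwargs pattern, e.g. (a sketch; the class name and exact defaults mirror what this PR proposes rather than anything already merged):

```python
from transformers.processing_utils import ProcessingKwargs


class Siglip2ProcessorKwargs(ProcessingKwargs, total=False):
    # Defaults the processor applies unless the caller overrides them explicitly.
    _defaults = {
        "text_kwargs": {
            "padding": "max_length",
            "truncation": True,
            "max_length": 64,
        },
    }
```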
        text = _lowercase_text(text)
        text_pair = _lowercase_text(text_pair)
nit: I see that configs on the hub have a field do_lower_case, imo we need to lower case if that field is set to True
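In code, that gating could look like the sketch below (how do_lower_case reaches the tokenizer is an assumption for illustration):

```python
def maybe_lowercase(text, do_lower_case: bool):
    """Lowercase only when the checkpoint's tokenizer_config.json sets do_lower_case=True."""
    if not do_lower_case or text is None:
        return text
    if isinstance(text, str):
        return text.lower()
    return [maybe_lowercase(t, do_lower_case) for t in text]
```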
@require_torch
@require_tokenizers
@slow
why is it slow?
@require_torch
@require_tokenizers
@slow
class Siglip2ProcessorTest(unittest.TestCase):
let's inherit from ProcessorTesterMixin to get common tests
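For reference, the usual shape of such a test class is sketched below (which processor class and checkpoint to wire into setUp here is an assumption):

```python
import tempfile
import unittest

from transformers import SiglipProcessor
from transformers.testing_utils import require_vision

# Assumes the test file lives under tests/models/siglip2/ so the relative import resolves.
from ...test_processing_common import ProcessorTesterMixin


@require_vision
class Siglip2ProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    # The mixin's shared tests are driven by this attribute plus components saved in setUp.
    processor_class = SiglipProcessor

    def setUp(self):
        self.tmpdirname = tempfile.mkdtemp()
        processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-224")
        processor.save_pretrained(self.tmpdirname)
```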
@require_torch
@require_tokenizers
@slow
class Siglip2TokenizerTest(unittest.TestCase):
    def test_lowercasing_and_padding_defaults(self):
same, not slow and inherit from TokenizerTesterMixin to get common tests. Also tokenizer tests are usually in their own separate file
Thanks for the detailed review @zucchini-nlp @ArthurZucker.
What does this PR do?
This PR makes SigLIP2 text preprocessing explicit and consistent with how the model was trained.
It introduces a model-specific tokenizer (Siglip2Tokenizer) that wraps the existing GemmaTokenizer while enforcing SigLIP2’s training-time defaults (lowercasing and fixed padding/truncation to 64 tokens). AutoTokenizer and AutoProcessor are updated to use this tokenizer for SigLIP2 checkpoints, so users get the correct behavior automatically without relying on implicit processor logic.
The underlying tokenization and vocabulary remain unchanged. This is a lightweight wrapper that improves correctness, reproducibility, and clarity, especially for text embedding and retrieval use cases.
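Assuming the auto-mapping is wired up as described (and, per the discussion above, hub metadata eventually points at the new class), the intended usage looks like this (illustrative, not a guarantee of current hub behavior):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/siglip2-base-patch16-224")

enc = tok(["A photo of 2 CATS", "a photo of a dog"])
# With Siglip2Tokenizer, text is lowercased and padded/truncated to 64 tokens by default,
# matching the SigLIP2 training setup, so no manual kwargs are needed.
print(len(enc["input_ids"][0]))  # expected: 64
```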
This addresses the behavior reported in #43054.
Fixes #43054
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@ArthurZucker @itazap
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.