Conversation

@vaibhav-research (Contributor) commented Jan 4, 2026

What does this PR do?

This PR makes SigLIP2 text preprocessing explicit and consistent with how the model was trained.

It introduces a model-specific tokenizer (Siglip2Tokenizer) that wraps the existing GemmaTokenizer while enforcing SigLIP2's training-time defaults (lowercasing and fixed padding/truncation to 64 tokens). AutoTokenizer and AutoProcessor are updated to use this tokenizer for SigLIP2 checkpoints, so users get the correct behavior automatically instead of relying on implicit processor logic.

The underlying tokenization and vocabulary remain unchanged. This is a lightweight wrapper that improves correctness, reproducibility, and clarity, especially for text embedding and retrieval use cases.
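For illustration, a minimal sketch of the resulting behavior (the checkpoint name is a public SigLIP2 repo; the Siglip2Tokenizer routing is what this PR proposes, so treat this as the intended behavior, not current main):

```python
from transformers import AutoTokenizer

# With this PR, SigLIP2 checkpoints resolve to Siglip2Tokenizer, which
# lowercases input and pads/truncates to 64 tokens by default.
tok = AutoTokenizer.from_pretrained("google/siglip2-base-patch16-224")
enc = tok("A Photo of a CAT")
print(len(enc.input_ids))  # 64: fixed-length padding applied by default
```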

This addresses the behavior reported in #43054.

Fixes #43054

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @itazap
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@vaibhav-research (Contributor Author)

Here are the test results, including the one that I added in this PR:

pytest -q tests/models/siglip2 -k processing

tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_call_numpy PASSED         [  4%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_call_numpy_4_channels SKIPPED [  8%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_call_pil PASSED           [ 13%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_call_pytorch PASSED       [ 17%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_can_compile_fast_image_processor SKIPPED [ 21%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_cast_dtype_device PASSED  [ 26%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_fast_is_faster_than_slow SKIPPED [ 30%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_image_processor_from_and_save_pretrained PASSED [ 34%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_image_processor_from_dict_with_kwargs PASSED [ 39%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_image_processor_preprocess_arguments PASSED [ 43%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_image_processor_properties PASSED [ 47%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_image_processor_save_load_with_autoimageprocessor PASSED [ 52%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_image_processor_to_json_file PASSED [ 56%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_image_processor_to_json_string PASSED [ 60%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_init_without_params PASSED [ 65%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_is_fast PASSED            [ 69%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_new_models_require_fast_image_processor PASSED [ 73%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_override_instance_attributes_does_not_affect_other_instances SKIPPED [ 78%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_save_load_fast_slow PASSED [ 82%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_save_load_fast_slow_auto PASSED [ 86%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_slow_fast_equivalence PASSED [ 91%]
tests/models/siglip2/test_image_processing_siglip2.py::Siglip2ImageProcessingTest::test_slow_fast_equivalence_batched PASSED [ 95%]
tests/models/siglip2/test_processing_siglip2.py::Siglip2ProcessorTest::test_siglip2_text_padding_length_64 PASSED [100%]

=================================================== warnings summary ====================================================
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=============================== 19 passed, 4 skipped, 561 deselected, 2 warnings in 9.15s ===============================

@vaibhav-research (Contributor Author)

I verified locally that SigLIP2 text embeddings behave as expected once the training-time preprocessing is applied (lowercasing + fixed-length padding with max_length=64).

This PR intentionally avoids changing processor defaults and instead documents the correct usage and adds a small test to prevent regressions.
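The check was along these lines (illustrative, not the exact script; assumes get_text_features on the SigLIP2 model, mirroring SiglipModel's API):

```python
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "google/siglip2-base-patch16-224"
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

texts = ["a photo of 2 cats", "a photo of a dog"]
# Training-time preprocessing applied manually: lowercase + pad to 64 tokens.
inputs = tok([t.lower() for t in texts], padding="max_length",
             truncation=True, max_length=64, return_tensors="pt")
with torch.no_grad():
    text_embeds = model.get_text_features(**inputs)
print(text_embeds.shape)  # (2, hidden_size)
```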

@fancyerii left a comment

It's ok.

@zucchini-nlp (Member) left a comment

I think it should be done at the tokenizer level. The Siglip model, for example, loads a SiglipTokenizer class which lowercases inputs before tokenizing.

@itazap can we use SiglipTokenizer here as well, or maybe add a special Siglip2Tokenizer class?

@vaibhav-research (Contributor Author)

> I think it should be done at the tokenizer level. The Siglip model, for example, loads a SiglipTokenizer class which lowercases inputs before tokenizing.
>
> @itazap can we use SiglipTokenizer here as well, or maybe add a special Siglip2Tokenizer class?

@zucchini-nlp Thanks for reviewing this PR.
SigLIP2 currently uses GemmaTokenizerFast for the text branch (see Siglip2Processor and convert_siglip2_to_hf.py), so IMHO we can't directly reuse SiglipTokenizer (different tokenization backend).
To avoid relying on user-side lowercasing, I can add automatic lowercasing in Siglip2Processor.__call__ before passing text to the tokenizer (covers the main multimodal usage); the processor-level option is sketched below. If we strongly prefer a tokenizer-level fix like SigLIP v1's, we'd likely need a Siglip2TokenizerFast wrapper around GemmaTokenizerFast plus AutoTokenizer mapping changes. Let me know which direction you'd prefer.
cc @ArthurZucker @itazap
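For concreteness, a rough sketch of the processor-level option (method body illustrative, not the final patch):

```python
# Inside Siglip2Processor (illustrative):
def __call__(self, images=None, text=None, **kwargs):
    # Lowercase text before it reaches the tokenizer, matching
    # SigLIP2's training-time preprocessing.
    if isinstance(text, str):
        text = text.lower()
    elif text is not None:
        text = [t.lower() for t in text]
    # ... then continue with the existing tokenizer / image-processor logic.
```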

@ArthurZucker (Collaborator)

> I can add automatic lowercasing in Siglip2Processor.__call__ before passing text to the tokenizer (covers the main multimodal usage)

This sounds like the most reasonable solution, no?


@vaibhav-research (Contributor Author) commented Jan 5, 2026

> I can add automatic lowercasing in Siglip2Processor.__call__ before passing text to the tokenizer (covers the main multimodal usage)

> This sounds like the most reasonable solution, no?

Thanks a lot for reviewing the PR @ArthurZucker. Yeah, agreed. Since SigLIP2 currently uses GemmaTokenizerFast, adding lowercasing at the tokenizer level would either affect Gemma globally or require a new tokenizer class plus a mapping change. Implementing it in Siglip2Processor.__call__ keeps the behavior SigLIP2-specific and covers the main multimodal usage. I'll push a patch plus a regression test.

Thanks for the guidance @ArthurZucker, this is now addressed.

I implemented automatic lowercasing and the SigLIP2 default text padding/truncation directly in SiglipProcessor.__call__, gated to SigLIP2-branded checkpoints. This avoids changing GemmaTokenizerFast globally or introducing a new tokenizer class, while ensuring SigLIP2 users get the correct preprocessing by default via AutoProcessor.

I have also updated the SigLIP2 docs to clarify the preprocessing expectations, noting that these defaults are now handled automatically by the processor, and added regression tests covering the following (sketched below):
• identical outputs for upper- and lower-case inputs
• default padding to length 64

Please let me know if you'd prefer this logic to live elsewhere; this seemed like the least intrusive fix.
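For reference, a sketch of what the regression tests assert (names illustrative; the actual tests live in the diff):

```python
import torch
from transformers import AutoProcessor


def test_siglip2_text_defaults():
    proc = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")
    upper = proc(text=["A Photo of a CAT"], return_tensors="pt")
    lower = proc(text=["a photo of a cat"], return_tensors="pt")
    # Mixed-case and lower-case inputs must tokenize identically.
    assert torch.equal(upper.input_ids, lower.input_ids)
    # Default padding must yield fixed-length 64-token sequences.
    assert upper.input_ids.shape[-1] == 64
```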

@vaibhav-research (Contributor Author)

Looking for your inputs here as well, @ArthurZucker. TIA.

@vaibhav-research vaibhav-research changed the title 43054: siglip2 text embedding test and document update 43054: siglip2 text embedding updated to default to lowercase Jan 7, 2026
Comment on lines +22 to +31
def _is_siglip2_checkpoint(processor) -> bool:
    """
    SigLIP2 checkpoints currently ship a SiglipConfig (model_type='siglip') and thus load SiglipProcessor.
    We detect SigLIP2 primarily via the image processor type/module, with a tokenizer
    name/path fallback.
    """
    # SigLIP2 uses Siglip2ImageProcessor / Siglip2ImageProcessorFast
    image_processor = getattr(processor, "image_processor", None)
    if image_processor is not None:
        mod = getattr(image_processor, "__module__", "") or ""
@zucchini-nlp (Member)

I think we need a Siglip2Tokenizer class in that case because it's not guaranteed that users load a whole processor when they only want to encode text samples

@vaibhav-research (Contributor Author)

Good point @zucchini-nlp, processor-only fixes won’t cover text-only usage via AutoTokenizer.

SigLIP2 checkpoints ship:
• a SiglipConfig (model_type="siglip") → resolves to SiglipProcessor
• tokenizer_class="GemmaTokenizer" in tokenizer_config.json → resolves to GemmaTokenizer

From the siglip2 tokenizer_config.json:

  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<pad>",
  "padding_side": "right",
  "processor_class": "SiglipProcessor",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "GemmaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false

That means SigLIP2-specific text behavior (lowercasing + fixed-length padding to 64) is currently only applied when users go through the processor path. Text-only flows end up using Gemma's tokenizer, which is expected and correct for Gemma models but not sufficient for SigLIP2.

We shouldn't bake SigLIP2 behavior into GemmaTokenizer, since that would silently affect all Gemma-based checkpoints. The clean fix is a dedicated Siglip2Tokenizer wrapper subclassing GemmaTokenizer that applies SigLIP2-specific defaults and can be explicitly selected (or wired via metadata later).

I'll add a Siglip2Tokenizer plus a unit test validating lowercasing and default padding/truncation. I believe AutoTokenizer won't pick it up automatically until the model metadata is updated, but this PR adds the correct tokenizer abstraction on the library side.
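For reviewers, a condensed sketch of the wrapper being proposed, pieced together from the diff fragments quoted later in this thread (not the verbatim patch):

```python
from transformers.models.gemma.tokenization_gemma import GemmaTokenizer


def _lowercase_text(x):
    # Accept str, list[str], or None, mirroring tokenizer input types.
    if isinstance(x, str):
        return x.lower()
    if isinstance(x, (list, tuple)):
        return [_lowercase_text(t) for t in x]
    return x


class Siglip2Tokenizer(GemmaTokenizer):
    def __call__(self, text=None, text_pair=None, **kwargs):
        text = _lowercase_text(text)
        text_pair = _lowercase_text(text_pair)
        # SigLIP2 training: fixed padding/truncation to 64 tokens,
        # applied only when the caller doesn't specify otherwise.
        kwargs.setdefault("padding", "max_length")
        kwargs.setdefault("truncation", True)
        kwargs.setdefault("max_length", 64)
        return super().__call__(text=text, text_pair=text_pair, **kwargs)
```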

@vaibhav-research (Contributor Author) commented Jan 8, 2026

@zucchini-nlp I’ve updated the PR accordingly.

I’ve added a Siglip2Tokenizer wrapper (subclassing GemmaTokenizer) to explicitly encode the SigLIP2 text-training assumptions at the tokenizer level as well (lowercasing + default padding/truncation to length 64), and added unit tests covering both processor and tokenizer behavior.

Also for completeness: SigLIP2 model repos currently declare "tokenizer_class": "GemmaTokenizer" in tokenizer_config.json (https://huggingface.co/google/siglip2-base-patch16-224/blob/main/tokenizer_config.json#L2017), so AutoTokenizer.from_pretrained() will continue to return GemmaTokenizer unless that metadata is updated. This PR adds the correct tokenizer implementation on the library side, but switching the config to Siglip2Tokenizer would be required for the wrapper to be picked up automatically.

Looking forward to your feedback.

@ArthurZucker (Collaborator)

> AutoTokenizer.from_pretrained() will continue to return GemmaTokenizer

This is no longer true: AutoTokenizer checks whether the serialized class matches tokenizer_mapping[model_type], and since it does not, we can enforce whatever we want.
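For context, that mapping lives in src/transformers/models/auto/tokenization_auto.py; a hypothetical siglip2 entry, by analogy with the existing siglip one, could look like:

```python
from collections import OrderedDict

TOKENIZER_MAPPING_NAMES = OrderedDict(
    [
        # ...
        ("siglip", ("SiglipTokenizer", None)),
        ("siglip2", ("Siglip2Tokenizer", None)),  # hypothetical new entry
        # ...
    ]
)
```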

@vaibhav-research (Contributor Author)

Got it, thanks for clarifying. I was assuming tokenizer_class in the checkpoint would always "win" and keep routing SigLIP2 to GemmaTokenizer, but I see now that AutoTokenizer checks the serialized class against the tokenization_auto.py mapping for the model type.

@github-actions bot commented Jan 8, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, siglip, siglip2

@vaibhav-research vaibhav-research changed the title 43054: siglip2 text embedding updated to default to lowercase 43054: siglip2 text embedding default to lowercase Jan 9, 2026
@vaibhav-research vaibhav-research changed the title 43054: siglip2 text embedding default to lowercase 43054: Add Siglip2Tokenizer to enforce training-time text preprocessing defaults Jan 9, 2026
@ArthurZucker (Collaborator) left a comment

Hey! I really don't think we should do any of these fixes here, but rather on the Hub.
We can add a siglip2 tokenizer, but it should be explicit that it uses a lowercase normalizer, and it should apply to all Siglip2 / siglip_2 tokenizers. If it's just for 1-2 checkpoints, they should be updated on the Hub directly!


def __init__(self, image_processor, tokenizer):
    super().__init__(image_processor, tokenizer)

def __call__(self, images=None, text=None, **kwargs):
Collaborator:

this should not be done here

return x


class Siglip2Tokenizer(GemmaTokenizer):
Collaborator:

We never use inheritance in transformers like this. Lowercasing is a normalizer (normalizers.Lowercase) that just needs to be added to the sequence of normalizers for SigLIP models.
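A minimal sketch of that suggestion using the tokenizers normalizer API (whether the checkpoint already ships a normalizer is an assumption, hence the branch):

```python
from tokenizers import normalizers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/siglip2-base-patch16-224")

# Prepend Lowercase to whatever normalizer the fast backend already has.
existing = tok.backend_tokenizer.normalizer
if existing is not None:
    tok.backend_tokenizer.normalizer = normalizers.Sequence(
        [normalizers.Lowercase(), existing]
    )
else:
    tok.backend_tokenizer.normalizer = normalizers.Lowercase()

print(tok.tokenize("A Photo of a CAT"))  # tokens come out lowercased
```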

Comment on lines +72 to +78
text = _lowercase_text(text)

# SigLIP2 text encoder trained with padding to 64 tokens
# Only set defaults if the user didn't specify them.
kwargs.setdefault("padding", "max_length")
kwargs.setdefault("truncation", True)
kwargs.setdefault("max_length", 64)
Member:

is this needed now? I think these are saved in tokenizer/processor config, except for lowercase

@@ -0,0 +1,58 @@
# Copyright 2024 The HuggingFace Inc. team.
Member:

nit: 2025

return x


class Siglip2Tokenizer(GemmaTokenizer):
Member:

If we are to inherit from an existing tokenizer class, we had better place it in modular. Modular will take care of copying the code and making sure that Siglip2 doesn't depend on Gemma at run-time.
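A hedged sketch of the modular variant (file and class names follow the repo's modular_<model>.py convention; contents illustrative):

```python
# modular_siglip2.py -- the modular converter expands this into a standalone
# tokenization_siglip2.py with the inherited code copied in, so SigLIP2 has
# no run-time dependency on Gemma.
from transformers.models.gemma.tokenization_gemma import GemmaTokenizer


class Siglip2Tokenizer(GemmaTokenizer):
    # SigLIP2-specific overrides (lowercasing, 64-token padding) go here.
    pass
```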

Comment on lines +50 to +53
# SigLIP2 training: fixed padding/truncation to 64
kwargs.setdefault("padding", "max_length")
kwargs.setdefault("truncation", True)
kwargs.setdefault("max_length", 64)
Member:

Not sure about this. It also has to be in the saved config, and we can also add default values in ModelProcessorKwargs.

Comment on lines +47 to +48
text = _lowercase_text(text)
text_pair = _lowercase_text(text_pair)
Member:

nit: I see that configs on the Hub have a do_lower_case field; imo we should lowercase only if that field is set to True (see the sketch below).
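A possible shape for that gating (helper name hypothetical):

```python
def _maybe_lowercase(self, text):
    # Only lowercase when the checkpoint's tokenizer_config opted in
    # via do_lower_case=True.
    if text is None or not getattr(self, "do_lower_case", False):
        return text
    if isinstance(text, str):
        return text.lower()
    return [t.lower() for t in text]
```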


@require_torch
@require_tokenizers
@slow
Member:

why is it slow?

@require_torch
@require_tokenizers
@slow
class Siglip2ProcessorTest(unittest.TestCase):
Member:

let's inherit from ProcessorTesterMixin to get common tests
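A skeleton of the suggested layout, assuming the repo's common ProcessorTesterMixin (the relative import matches the tests/models/siglip2 location; model-specific assertions would be added on top of the inherited common tests):

```python
import unittest

from transformers import Siglip2Processor
from transformers.testing_utils import require_tokenizers

from ...test_processing_common import ProcessorTesterMixin


@require_tokenizers
class Siglip2ProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    processor_class = Siglip2Processor
```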

Comment on lines +39 to +43
@require_torch
@require_tokenizers
@slow
class Siglip2TokenizerTest(unittest.TestCase):
    def test_lowercasing_and_padding_defaults(self):
Member:

Same: not slow, and inherit from TokenizerTesterMixin to get common tests. Also, tokenizer tests usually live in their own separate file.
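The tokenizer counterpart would live in its own file, e.g. test_tokenization_siglip2.py; a sketch assuming the common TokenizerTesterMixin and this PR's proposed Siglip2Tokenizer (attribute values illustrative):

```python
import unittest

from transformers import Siglip2Tokenizer  # proposed in this PR

from ...test_tokenization_common import TokenizerTesterMixin


class Siglip2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    from_pretrained_id = "google/siglip2-base-patch16-224"
    tokenizer_class = Siglip2Tokenizer
```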

@vaibhav-research (Contributor Author)

Thanks for the detailed review @zucchini-nlp @ArthurZucker.
I got a lot of pointers to fix :)
Will address each of these and confirm once done.
Thanks again!
