43054: Add Siglip2Tokenizer to enforce training-time text preprocessing defaults #43101
Conversation
Here are the test results, including the one that I added in this PR.
I verified locally that SigLIP2 text embeddings behave as expected once the training-time preprocessing is applied (lowercasing + fixed-length padding with max_length=64). This PR intentionally avoids changing processor defaults and instead documents the correct usage and adds a small test to prevent regressions.
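For reference, the verification was roughly along these lines (a minimal sketch, not part of the PR diff; the checkpoint name comes from the links later in this thread, and exact outputs are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "google/siglip2-base-patch16-224"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

texts = ["a photo of 2 CATS", "a photo of a dog"]
# Training-time text preprocessing for SigLIP2: lowercase + fixed-length padding to 64 tokens.
inputs = tokenizer(
    [t.lower() for t in texts],
    padding="max_length",
    truncation=True,
    max_length=64,
    return_tensors="pt",
)
with torch.no_grad():
    text_embeds = model.get_text_features(**inputs)
print(text_embeds.shape)  # (2, hidden_dim)
```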
fancyerii left a comment
It's ok.
I think it should be done at the tokenizer level. The Siglip model, for example, loads a SiglipTokenizer class which lowercases inputs before tokenizing.
@itazap can we use SiglipTokenizer here as well or maybe add a special Siglip2Tokenizer class?
@zucchini-nlp Thanks for reviewing this PR.
This sounds like the most reasonable solution, no?
Thanks a lot for reviewing the PR @ArthurZucker.
Thanks for the guidance @ArthurZucker, this is now addressed. I implemented automatic lowercasing and the SigLIP2 default text padding/truncation directly in SiglipProcessor.__call__, gated to SigLIP2-branded checkpoints. This avoids changing GemmaTokenizerFast globally or introducing a new tokenizer class, while ensuring SigLIP2 users get the correct preprocessing by default via AutoProcessor. I have also updated the SigLIP2 docs to clarify the preprocessing expectations, noted that these defaults are now handled automatically by the processor, and added regression tests covering this behavior. Please let me know if you'd prefer this logic to live elsewhere, but this seemed like the least intrusive fix.
Looking for your inputs here as well @ArthurZucker.
def _is_siglip2_checkpoint(processor) -> bool:
    """
    SigLIP2 checkpoints currently ship a SiglipConfig (model_type='siglip') and thus load SiglipProcessor.
    We detect SigLIP2 primarily via the image processor type/module, with a tokenizer
    name/path fallback.
    """
    # SigLIP2 uses Siglip2ImageProcessor / Siglip2ImageProcessorFast
    image_processor = getattr(processor, "image_processor", None)
    if image_processor is not None:
        mod = getattr(image_processor, "__module__", "") or ""
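The hunk above is truncated by the diff view; a plausible completion of the helper, purely for illustration (the exact strings checked here are assumptions, not necessarily what the PR does), would be:

```python
def _is_siglip2_checkpoint(processor) -> bool:
    """Heuristic: detect SigLIP2 checkpoints via the image processor, with a tokenizer fallback."""
    image_processor = getattr(processor, "image_processor", None)
    if image_processor is not None:
        mod = getattr(image_processor, "__module__", "") or ""
        name = type(image_processor).__name__
        if "siglip2" in mod or name.startswith("Siglip2ImageProcessor"):
            return True
    # Fallback: look for "siglip2" in the tokenizer's name_or_path.
    tokenizer = getattr(processor, "tokenizer", None)
    name_or_path = (getattr(tokenizer, "name_or_path", "") or "").lower()
    return "siglip2" in name_or_path
```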
I think we need a Siglip2Tokenizer class in that case because it's not guaranteed that users load a whole processor when they only want to encode text samples
Good point @zucchini-nlp, processor-only fixes won’t cover text-only usage via AutoTokenizer.
SigLIP2 checkpoints ship:
• a SiglipConfig (model_type="siglip") → resolves to SiglipProcessor
• tokenizer_class="GemmaTokenizer" in tokenizer_config.json → resolves to GemmaTokenizer
siglip2 tokenizer_config.json (excerpt):
"model_max_length": 1000000000000000019884624838656,
"pad_token": "<pad>",
"padding_side": "right",
"processor_class": "SiglipProcessor",
"sp_model_kwargs": {},
"spaces_between_special_tokens": false,
"tokenizer_class": "GemmaTokenizer",
"unk_token": "<unk>",
"use_default_system_prompt": false
That means SigLIP2-specific text behavior (lowercasing + fixed-length padding to 64) is currently only applied when users go through the processor path. Text-only flows currently end up using Gemma’s tokenizer, which is expected and correct for Gemma models, but not sufficient for SigLIP2.
We shouldn’t bake SigLIP2 behavior into GemmaTokenizer, since that would silently affect all Gemma-based checkpoints. The clean fix is a dedicated Siglip2Tokenizer wrapper subclassing GemmaTokenizer that applies SigLIP2-specific defaults, and can be explicitly selected (or wired via metadata later).
I’ll add a Siglip2Tokenizer + unit test validating lowercasing and default padding/truncation. I believe that AutoTokenizer won’t pick it up automatically until model metadata is updated, but this PR adds the correct tokenizer abstraction on the library side.
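For concreteness, the wrapper being proposed here would look roughly like the sketch below, pieced together from the diff fragments quoted later in this thread (the _lowercase_text helper and exact signature are assumptions, and the reviewers ultimately push back on this inheritance approach):

```python
from transformers import GemmaTokenizer


def _lowercase_text(x):
    # Assumed helper: lowercase strings or (nested) lists of strings, pass everything else through.
    if isinstance(x, str):
        return x.lower()
    if isinstance(x, (list, tuple)):
        return [_lowercase_text(t) for t in x]
    return x


class Siglip2Tokenizer(GemmaTokenizer):
    """GemmaTokenizer with SigLIP2 training-time text defaults baked in."""

    def __call__(self, text=None, text_pair=None, **kwargs):
        text = _lowercase_text(text)
        text_pair = _lowercase_text(text_pair)
        # SigLIP2 text encoder was trained with fixed padding/truncation to 64 tokens;
        # only set defaults the caller did not specify.
        kwargs.setdefault("padding", "max_length")
        kwargs.setdefault("truncation", True)
        kwargs.setdefault("max_length", 64)
        return super().__call__(text=text, text_pair=text_pair, **kwargs)
```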
@zucchini-nlp I’ve updated the PR accordingly.
I’ve added a Siglip2Tokenizer wrapper (subclassing GemmaTokenizer) to explicitly encode the SigLIP2 text-training assumptions at the tokenizer level as well (lowercasing + default padding/truncation to length 64), and added unit tests covering both processor and tokenizer behavior.
Also for completeness: SigLIP2 model repos currently declare "tokenizer_class": "GemmaTokenizer" in tokenizer_config.json (https://huggingface.co/google/siglip2-base-patch16-224/blob/main/tokenizer_config.json#L2017), so AutoTokenizer.from_pretrained() will continue to return GemmaTokenizer unless that metadata is updated. This PR adds the correct tokenizer implementation on the library side, but switching the config to Siglip2Tokenizer would be required for the wrapper to be picked up automatically.
Looking for your feedback.
> AutoTokenizer.from_pretrained() will continue to return GemmaTokenizer

This is no longer true: AutoTokenizer will check whether the serialized class matches tokenizer_mapping[model_type], and since it does not, we can enforce whatever we want.
Got it, thanks for clarifying. I was assuming tokenizer_class in the checkpoint would always “win” and keep routing SigLIP2 to GemmaTokenizer, but I see now that AutoTokenizer checks the serialized class against the tokenizer_auto.py mapping for the model type.
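In other words, the decision happens against the model-type table in models/auto/tokenization_auto.py. A simplified illustration of the registration pattern (the siglip2 entry below is hypothetical, and the real table gates classes on optional dependencies):

```python
from collections import OrderedDict

TOKENIZER_MAPPING_NAMES = OrderedDict(
    [
        ("siglip", ("SiglipTokenizer", None)),
        # Hypothetical entry: route model_type "siglip2" to the new tokenizer class.
        ("siglip2", ("Siglip2Tokenizer", None)),
    ]
)
```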
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, siglip, siglip2
ArthurZucker left a comment
Hey! I really don't think we should do any of the fixes here, but rather on the hub.
We can add a siglip2 tokenizer, but it should be explicit that it uses a lowercase normalizer, and it should apply to all Siglip2 / siglip_2 tokenizers. If it's just for 1-2 checkpoints then it should be updated on the hub directly!
def __init__(self, image_processor, tokenizer):
    super().__init__(image_processor, tokenizer)

def __call__(self, images=None, text=None, **kwargs):
this should not be done here
    return x


class Siglip2Tokenizer(GemmaTokenizer):
we never use inheritance in transformers like this. Lowercasing is a normalizer from normalizers.Lowercase that just needs to be added to the sequence of normalizers for siglip models.
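Concretely, with the tokenizers library that would be something like the following (a minimal sketch, assuming the checkpoint resolves to a fast tokenizer backend):

```python
from tokenizers import normalizers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/siglip2-base-patch16-224")

# Prepend a Lowercase normalizer to whatever normalizer the checkpoint already ships.
existing = tok.backend_tokenizer.normalizer
steps = [normalizers.Lowercase()] + ([existing] if existing is not None else [])
tok.backend_tokenizer.normalizer = normalizers.Sequence(steps)

print(tok.tokenize("A Photo of 2 CATS"))  # tokens of the lowercased string
```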
        text = _lowercase_text(text)

        # SigLIP2 text encoder trained with padding to 64 tokens
        # Only set defaults if the user didn't specify them.
        kwargs.setdefault("padding", "max_length")
        kwargs.setdefault("truncation", True)
        kwargs.setdefault("max_length", 64)
is this needed now? I think these are saved in tokenizer/processor config, except for lowercase
@@ -0,0 +1,58 @@
# Copyright 2024 The HuggingFace Inc. team.
nit: 2025
    return x


class Siglip2Tokenizer(GemmaTokenizer):
if we are to inherit from an existing tokenizer class, we'd better place it in modular. Modular will take care of copying the code and making sure that Siglip2 doesn't depend on Gemma at run-time
        # SigLIP2 training: fixed padding/truncation to 64
        kwargs.setdefault("padding", "max_length")
        kwargs.setdefault("truncation", True)
        kwargs.setdefault("max_length", 64)
not sure about this. It also has to be in saved config and we can also add default values in ModelProcessorKwargs
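That would look something like the existing ProcessingKwargs pattern, e.g. (a sketch; the class name and exact defaults mirror what this PR proposes rather than anything already merged):

```python
from transformers.processing_utils import ProcessingKwargs


class Siglip2ProcessorKwargs(ProcessingKwargs, total=False):
    # Defaults the processor applies unless the caller overrides them explicitly.
    _defaults = {
        "text_kwargs": {
            "padding": "max_length",
            "truncation": True,
            "max_length": 64,
        },
    }
```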
        text = _lowercase_text(text)
        text_pair = _lowercase_text(text_pair)
nit: I see that configs on the hub have a field do_lower_case, imo we need to lower case if that field is set to True
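In code, that gating could look like the sketch below (how do_lower_case reaches the tokenizer is an assumption for illustration):

```python
def maybe_lowercase(text, do_lower_case: bool):
    """Lowercase only when the checkpoint's tokenizer_config.json sets do_lower_case=True."""
    if not do_lower_case or text is None:
        return text
    if isinstance(text, str):
        return text.lower()
    return [maybe_lowercase(t, do_lower_case) for t in text]
```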
@require_torch
@require_tokenizers
@slow
why is it slow?
@require_torch
@require_tokenizers
@slow
class Siglip2ProcessorTest(unittest.TestCase):
let's inherit from ProcessorTesterMixin to get common tests
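For reference, the usual shape of such a test class is sketched below (which processor class and checkpoint to wire into setUp here is an assumption):

```python
import tempfile
import unittest

from transformers import SiglipProcessor
from transformers.testing_utils import require_vision

# Assumes the test file lives under tests/models/siglip2/ so the relative import resolves.
from ...test_processing_common import ProcessorTesterMixin


@require_vision
class Siglip2ProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    # The mixin's shared tests are driven by this attribute plus components saved in setUp.
    processor_class = SiglipProcessor

    def setUp(self):
        self.tmpdirname = tempfile.mkdtemp()
        processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-224")
        processor.save_pretrained(self.tmpdirname)
```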
@require_torch
@require_tokenizers
@slow
class Siglip2TokenizerTest(unittest.TestCase):
    def test_lowercasing_and_padding_defaults(self):
same, not slow and inherit from TokenizerTesterMixin to get common tests. Also tokenizer tests are usually in their own separate file
Thanks for the detailed review @zucchini-nlp @ArthurZucker.
What does this PR do?
This PR makes SigLIP2 text preprocessing explicit and consistent with how the model was trained.
It introduces a model-specific tokenizer (Siglip2Tokenizer) that wraps the existing GemmaTokenizer while enforcing SigLIP2’s training-time defaults (lowercasing and fixed padding/truncation to 64 tokens). AutoTokenizer and AutoProcessor are updated to use this tokenizer for SigLIP2 checkpoints, so users get the correct behavior automatically without relying on implicit processor logic.
The underlying tokenization and vocabulary remain unchanged. This is a lightweight wrapper that improves correctness, reproducibility, and clarity, especially for text embedding and retrieval use cases.
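Assuming the auto-mapping is wired up as described (and, per the discussion above, hub metadata eventually points at the new class), the intended usage looks like this (illustrative, not a guarantee of current hub behavior):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/siglip2-base-patch16-224")

enc = tok(["A photo of 2 CATS", "a photo of a dog"])
# With Siglip2Tokenizer, text is lowercased and padded/truncated to 64 tokens by default,
# matching the SigLIP2 training setup, so no manual kwargs are needed.
print(len(enc["input_ids"][0]))  # expected: 64
```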
This addresses the behavior reported in #43054.
Fixes #43054
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@ArthurZucker @itazap
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.