Support tokenization_kwargs override #29794
Conversation
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Code Review
This pull request adds support for overriding tokenization behavior by introducing a tokenization_kwargs parameter to the embedding, classification, and scoring APIs. The changes are well-implemented by plumbing this new parameter through the call stack. However, I've found a critical issue in the test utility function get_inputs where the passed processor_kwargs dictionary is modified in-place within a loop, which could lead to incorrect test behavior. I've provided a code suggestion to fix this.
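As an illustration of the flagged pattern and its fix, here is a minimal hypothetical sketch (not the actual test helper):

```python
# Hypothetical reconstruction of the get_inputs hazard: mutating the
# caller's processor_kwargs inside the loop leaks state across iterations
# and back to the caller. Copying per item keeps the input dict pristine.
def get_inputs(prompts, processor_kwargs):
    inputs = []
    for prompt in prompts:
        kwargs = dict(processor_kwargs)  # copy; never modify the argument in place
        kwargs["text"] = prompt          # safe: only the copy is touched
        inputs.append(kwargs)
    return inputs
```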
/gemini review
Code Review
This pull request adds support for overriding tokenization behavior by introducing a tokenization_kwargs parameter in various API methods. The changes are well-implemented for the most part, correctly plumbing the new parameter through the call stack. The tests have also been updated to cover the new functionality. However, I've identified a potential issue in _cross_encoding_score where tokenization_kwargs is used in two different places in the call chain, which could lead to confusion and bugs. My review comment details this concern.
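As a self-contained toy illustration of that concern (hypothetical names; the real call chain differs):

```python
# Toy example of why consuming the same tokenization_kwargs at two levels
# of the call chain is error-prone: max_length truncation is applied twice.
def tokenize(text, tokenization_kwargs=None):
    kwargs = tokenization_kwargs or {}
    if kwargs.get("truncation"):
        text = text[: kwargs.get("max_length", len(text))]
    return text

def cross_encoding_score(query, doc, tokenization_kwargs=None):
    # first use: kwargs applied while tokenizing each text separately
    prompt = tokenize(query, tokenization_kwargs) + " " + tokenize(doc, tokenization_kwargs)
    # second use: the same kwargs applied again to the joint prompt,
    # silently truncating the document half away
    return tokenize(prompt, tokenization_kwargs)

print(cross_encoding_score("q" * 8, "relevant doc",
                           {"truncation": True, "max_length": 8}))
# -> 'qqqqqqqq'  (the document text is dropped by the second pass)
```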
/gemini review
Code Review
This pull request adds support for tokenization_kwargs to customize tokenization behavior in embedding, classification, and scoring APIs. The changes correctly plumb the new parameter through the different layers of the application. The tests have also been updated to cover the new functionality. I've found a couple of high-severity issues where the user-provided tokenization_kwargs dictionary is modified in-place, which can lead to unexpected side effects. I've provided suggestions to fix this by creating a copy of the dictionary.
/gemini review
Code Review
This pull request adds support for overriding tokenization behavior via tokenization_kwargs. The changes are well-implemented across the embedding and classification APIs. However, I've identified a few critical issues where passing conflicting keys in tokenization_kwargs could lead to a TypeError. I've also noted a high-severity issue where multimodal arguments could be dropped during cross-encoder scoring.
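For illustration, a toy reproduction of that failure mode (names are hypothetical; the real code paths differ):

```python
# Passing a keyword both explicitly and via **tokenization_kwargs raises
# TypeError; merging defaults first lets the user's override win instead.
def encode(text, truncation=True, **kwargs):
    return text, truncation, kwargs

tokenization_kwargs = {"truncation": False}

# encode("hi", truncation=True, **tokenization_kwargs)
# TypeError: encode() got multiple values for keyword argument 'truncation'

merged = {"truncation": True, **tokenization_kwargs}
print(encode("hi", **merged))  # ('hi', False, {})
```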
Additionally, the public score() method was not updated to accept tokenization_kwargs, which makes this new feature inaccessible for scoring tasks. Please consider adding this parameter to the score() method and passing it to the internal _embedding_score and _cross_encoding_score methods to complete the feature.
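A simplified sketch of what completing the feature could look like, with heavily abbreviated signatures (the actual vLLM methods take many more parameters):

```python
class Scorer:
    """Stand-in for the pooling entrypoint; the internal methods are stubs."""

    def __init__(self, is_cross_encoder: bool):
        self._is_cross_encoder = is_cross_encoder

    def score(self, text_1, text_2, tokenization_kwargs=None):
        # forward the override to whichever internal path serves the model
        if self._is_cross_encoder:
            return self._cross_encoding_score(text_1, text_2, tokenization_kwargs)
        return self._embedding_score(text_1, text_2, tokenization_kwargs)

    def _cross_encoding_score(self, text_1, text_2, tokenization_kwargs):
        return ("cross-encoder", tokenization_kwargs)

    def _embedding_score(self, text_1, text_2, tokenization_kwargs):
        return ("embedding", tokenization_kwargs)
```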
All tests still pass locally after implementing the suggested fixes. Please review it, thanks! @DarkLight1337
Force-pushed from 2caeb7c to bdfb80f
DarkLight1337 left a comment:
Thanks and sorry for the delay!
Purpose
Add support for a tokenization_kwargs parameter in the embedding, classification, and scoring APIs to allow users to override tokenization behavior during inference.
This enhancement enables users to customize tokenization parameters (e.g., padding, max_length, truncation) when calling embedding-related methods (embed(), classify(), score()), providing more flexibility for handling different input scenarios.
Fix #27566 (comment)
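For illustration, a hedged usage sketch of the behavior described above (the model name is an assumption, not part of this PR):

```python
from vllm import LLM

# Assumed example checkpoint; any embedding/pooling model would do.
llm = LLM(model="BAAI/bge-base-en-v1.5", runner="pooling")

# Override tokenization per request, mirroring the values used in the tests.
outputs = llm.embed(
    ["vLLM supports overriding tokenization per request."],
    tokenization_kwargs={"padding": "max_length", "max_length": 64},
)
print(len(outputs[0].outputs.embedding))
```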
Test Plan
Run the SiGLIP multimodal pooling tests with custom tokenization parameters:
pytest tests/models/multimodal/pooling/test_siglip.py

The test suite now validates the scenarios listed under Test Result below.
Test Result
All SiGLIP pooling tests pass successfully with the following configurations:
✅ test_models_text - validates custom tokenization_kwargs (padding="max_length", max_length=64)
✅ test_models_image - validates default behavior with empty tokenization_kwargs