
@piood (Contributor) commented Dec 1, 2025

Purpose

Add a tokenization_kwargs parameter to the embedding, classification, and scoring APIs so that users can override tokenization behavior during inference. This lets callers customize tokenization settings (e.g., padding, max_length, truncation) when invoking embed(), classify(), or score(), giving more flexibility for handling different input scenarios.
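A minimal usage sketch (the model and engine arguments here are illustrative, not taken from this PR; the tokenization_kwargs keys are forwarded to the underlying Hugging Face tokenizer):

```python
from vllm import LLM

# Illustrative setup; any pooling-capable model works the same way.
llm = LLM(model="google/siglip-base-patch16-224", runner="pooling")

# Override tokenization per call instead of relying on engine-wide defaults.
outputs = llm.embed(
    ["a photo of a cat", "a photo of a dog"],
    tokenization_kwargs={
        "padding": "max_length",
        "max_length": 64,
        "truncation": True,
    },
)
print(len(outputs[0].outputs.embedding))
```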

Fix #27566 (comment)

Test Plan

Run the SiGLIP multimodal pooling tests with custom tokenization parameters:

pytest tests/models/multimodal/pooling/test_siglip.py

The test suite now validates:

  • Text-only inputs with custom padding and max_length settings
  • Image-only inputs with default tokenization behavior

Test Result

All SiGLIP pooling tests pass successfully with the following configurations:
✅ test_models_text - validates custom tokenization_kwargs (padding="max_length", max_length=64)
✅ test_models_image - validates default behavior with empty tokenization_kwargs

@mergify bot added the frontend and multi-modality labels on Dec 1, 2025
@piood mentioned this pull request on Dec 1, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request adds support for overriding tokenization behavior by introducing a tokenization_kwargs parameter to the embedding, classification, and scoring APIs. The changes are well-implemented by plumbing this new parameter through the call stack. However, I've found a critical issue in the test utility function get_inputs where the passed processor_kwargs dictionary is modified in-place within a loop, which could lead to incorrect test behavior. I've provided a code suggestion to fix this.
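The pattern the bot is flagging, in isolation (get_inputs and processor_kwargs are named in the review; the surrounding code is a reconstruction, not the actual test utility):

```python
def get_inputs(prompts, images, processor_kwargs=None):
    """Build per-prompt inputs without mutating the caller's kwargs."""
    all_inputs = []
    for prompt, image in zip(prompts, images):
        # The buggy version wrote into processor_kwargs directly here,
        # so state leaked across loop iterations; copy per iteration instead.
        kwargs = dict(processor_kwargs or {})
        if image is not None:
            kwargs["images"] = image
        all_inputs.append({"prompt": prompt, **kwargs})
    return all_inputs
```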

@piood (Contributor, Author) commented Dec 2, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request adds support for overriding tokenization behavior by introducing a tokenization_kwargs parameter in various API methods. The changes are well-implemented for the most part, correctly plumbing the new parameter through the call stack. The tests have also been updated to cover the new functionality. However, I've identified a potential issue in _cross_encoding_score where tokenization_kwargs is used in two different places in the call chain, which could lead to confusion and bugs. My review comment details this concern.

@piood (Contributor, Author) commented Dec 2, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request adds support for tokenization_kwargs to customize tokenization behavior in embedding, classification, and scoring APIs. The changes correctly plumb the new parameter through the different layers of the application. The tests have also been updated to cover the new functionality. I've found a couple of high-severity issues where the user-provided tokenization_kwargs dictionary is modified in-place, which can lead to unexpected side effects. I've provided suggestions to fix this by creating a copy of the dictionary.
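The suggested fix pattern as a runnable sketch (merge_tokenization_kwargs is a hypothetical helper, not the PR's actual code; the point is the defensive copy before injecting defaults):

```python
def merge_tokenization_kwargs(user_kwargs, defaults):
    """Merge engine defaults into user overrides without mutating either dict."""
    merged = dict(user_kwargs or {})   # fresh copy of the caller's dict
    for key, value in defaults.items():
        merged.setdefault(key, value)  # user-provided keys win
    return merged

# The caller's dict is left untouched:
user = {"padding": "max_length", "max_length": 64}
merged = merge_tokenization_kwargs(user, {"truncation": True})
assert "truncation" not in user and merged["truncation"] is True
```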

@piood (Contributor, Author) commented Dec 2, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request adds support for overriding tokenization behavior via tokenization_kwargs. The changes are well-implemented across the embedding and classification APIs. However, I've identified a few critical issues where passing conflicting keys in tokenization_kwargs could lead to a TypeError. I've also noted a high-severity issue where multimodal arguments could be dropped during cross-encoder scoring.

Additionally, the public score() method was not updated to accept tokenization_kwargs, which makes this new feature inaccessible for scoring tasks. Please consider adding this parameter to the score() method and passing it to the internal _embedding_score and _cross_encoding_score methods to complete the feature.
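A sketch of the requested plumbing (ScorerSketch and is_cross_encoder are hypothetical stand-ins; only score, _embedding_score, and _cross_encoding_score are named in the review):

```python
class ScorerSketch:
    """Hypothetical stand-in for the class that owns score()."""

    def __init__(self, is_cross_encoder: bool):
        self.is_cross_encoder = is_cross_encoder

    def score(self, data_1, data_2, *, tokenization_kwargs=None):
        # Forward the override to whichever internal path handles the
        # request, so the feature is reachable from the public API.
        if self.is_cross_encoder:
            return self._cross_encoding_score(
                data_1, data_2, tokenization_kwargs=tokenization_kwargs
            )
        return self._embedding_score(
            data_1, data_2, tokenization_kwargs=tokenization_kwargs
        )

    def _cross_encoding_score(self, data_1, data_2, *, tokenization_kwargs=None):
        return ("cross", data_1, data_2, tokenization_kwargs)

    def _embedding_score(self, data_1, data_2, *, tokenization_kwargs=None):
        return ("embed", data_1, data_2, tokenization_kwargs)
```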

@piood (Contributor, Author) commented Dec 2, 2025

All tests still pass locally after I implemented the suggested fixes. Please review, thanks! @DarkLight1337

@piood force-pushed the add_override_tokenization_kwargs branch from 2caeb7c to bdfb80f on December 6, 2025
@DarkLight1337 (Member) left a comment

Thanks and sorry for the delay!

@DarkLight1337 enabled auto-merge (squash) on December 6, 2025
@DarkLight1337 added the ready label (only add when the PR is ready to merge / full CI is needed) on Dec 6, 2025
@DarkLight1337 merged commit 43e7593 into vllm-project:main on Dec 6, 2025
48 checks passed