Respect tokenizer.json padding settings in HuggingFaceEmbedder#36071
Merged
glebashnik merged 1 commit intovespa-engine:masterfrom Mar 10, 2026
Merged
Respect tokenizer.json padding settings in HuggingFaceEmbedder#36071glebashnik merged 1 commit intovespa-engine:masterfrom
glebashnik merged 1 commit intovespa-engine:masterfrom
Conversation
Fix vespa-engine#35600: Replace hardcoded .setPadding(false) with .setPadding(info.padding() != DO_NOT_PAD), which comes from the tokenizer's own metadata via ModelInfo.
cff4ecb to
eec5c32
Compare
Contributor
Author
|
Testing note: All 8 existing tests in A dedicated test could look like: @Test
void testPaddingRespectedFromTokenizer() {
var tokenizerPath = Paths.get("src/test/models/onnx/transformer/real_tokenizer.json");
var info = HuggingFaceTokenizer.getModelInfo(tokenizerPath);
assertEquals(ModelInfo.PaddingStrategy.DO_NOT_PAD, info.padding());
} |
This was referenced Mar 11, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix #35600
Summary
HuggingFaceEmbedder hardcoded
.setPadding(false), ignoring padding configuration fromtokenizer.json. Models likesiglip2-base-patch16-384that require fixed-length padding produced incorrect embeddings.Replace hardcoded
falsewithinfo.padding() != DO_NOT_PAD, which comes from the tokenizer's own metadata viaModelInfo.Backwards compatible: existing tokenizers without padding config have
padding == DO_NOT_PAD, producing the samesetPadding(false)as before.Changes
HuggingFaceEmbedder.java:.setPadding(false)→.setPadding(info.padding() != DO_NOT_PAD)I've also included a test (
testPadding()inHuggingFaceEmbedderTest) and a test fixture (tokenizer_with_padding.json) to verify the fix. I can adjust or remove these if you prefer a different testing approach.Test plan
testPadding()verifies padded tokenizer produces 10 tokens, unpadded produces 6Claude Code has been used to assist with this PR.