
Respect tokenizer.json padding settings in HuggingFaceEmbedder #36071

Merged
glebashnik merged 1 commit into vespa-engine:master from s-smits:fix/respect-tokenizer-padding on Mar 10, 2026
Conversation


s-smits (Contributor) commented on Mar 2, 2026

Fix #35600

Summary

HuggingFaceEmbedder hardcoded .setPadding(false), ignoring the padding configuration from tokenizer.json. Models like siglip2-base-patch16-384, which require fixed-length padding, therefore produced incorrect embeddings.

Replace hardcoded false with info.padding() != DO_NOT_PAD, which comes from the tokenizer's own metadata via ModelInfo.

Backwards compatible: existing tokenizers without a padding config have padding == DO_NOT_PAD, producing the same setPadding(false) behavior as before.

Changes

  • HuggingFaceEmbedder.java: .setPadding(false) → .setPadding(info.padding() != DO_NOT_PAD)
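
The boolean condition behind this one-line change can be sketched in isolation. Note that PaddingStrategy and shouldPad below are hypothetical stand-ins, not the actual Vespa or tokenizer classes; only the decision logic mirrors the PR description:

```java
// Minimal sketch of the padding decision described above.
// PaddingStrategy is a stand-in for the value exposed by the
// tokenizer's ModelInfo; FIXED_LENGTH represents any strategy
// other than "no padding" found in tokenizer.json.
public class PaddingCheck {

    enum PaddingStrategy { DO_NOT_PAD, FIXED_LENGTH }

    static boolean shouldPad(PaddingStrategy padding) {
        // Old behavior: always false.
        // New behavior: pad whenever tokenizer.json asks for it.
        return padding != PaddingStrategy.DO_NOT_PAD;
    }

    public static void main(String[] args) {
        // Tokenizers without a padding config behave exactly as before.
        if (shouldPad(PaddingStrategy.DO_NOT_PAD))
            throw new AssertionError("DO_NOT_PAD must not pad");
        // Models requiring fixed-length padding now get it.
        if (!shouldPad(PaddingStrategy.FIXED_LENGTH))
            throw new AssertionError("FIXED_LENGTH must pad");
        System.out.println("padding respected");
    }
}
```

This keeps the change backwards compatible by construction: the only input that maps to false is DO_NOT_PAD, which is exactly what tokenizers without a padding section report.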

I've also included a test (testPadding() in HuggingFaceEmbedderTest) and a test fixture (tokenizer_with_padding.json) to verify the fix. I can adjust or remove these if you prefer a different testing approach.

Test plan

  • New testPadding() verifies padded tokenizer produces 10 tokens, unpadded produces 6
  • All 20 existing embedder tests pass (HuggingFace, ColBert, Splade)

Claude Code has been used to assist with this PR.

Fix vespa-engine#35600: Replace hardcoded .setPadding(false) with
.setPadding(info.padding() != DO_NOT_PAD), which comes from the
tokenizer's own metadata via ModelInfo.
@s-smits force-pushed the fix/respect-tokenizer-padding branch from cff4ecb to eec5c32 on March 2, 2026 at 15:04
s-smits (Contributor, Author) commented on Mar 2, 2026

Testing note: all 8 existing tests in HuggingFaceEmbedderTest pass unchanged. real_tokenizer.json has padding: null (DO_NOT_PAD), so the fix produces behavior identical to the old hardcoded value.

A dedicated test could look like:

@Test
void testPaddingRespectedFromTokenizer() {
    var tokenizerPath = Paths.get("src/test/models/onnx/transformer/real_tokenizer.json");
    var info = HuggingFaceTokenizer.getModelInfo(tokenizerPath);

    assertEquals(ModelInfo.PaddingStrategy.DO_NOT_PAD, info.padding());
}

glebashnik (Member) left a comment

Looks good!

@glebashnik glebashnik merged commit f8f5d41 into vespa-engine:master Mar 10, 2026
2 of 3 checks passed
bjorncs added a commit that referenced this pull request Mar 11, 2026


Development

Successfully merging this pull request may close these issues.

Hugging Face embedder does not respect tokenizer padding settings

3 participants