Providing byte level offsets for effective alignment in Cross-Tokenizer On-Policy Distillation #1880

JqzChandler · 2025-10-30T13:59:29Z

Our team tried several alignment approaches while implementing our in-house on-policy distillation method, including trl's existing implementation. That implementation calls decode() repeatedly, produces incorrect results in some special cases, and carries high computational overhead.

We later found a better method: a single call to encode() returns byte-level offsets for every token, which sidesteps BPE's complexity and is also compatible with every other type of tokenizer. In addition, when distilling between two BPE tokenizers, skipping the string as an intermediate representation gives more accurate alignment.
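To make the idea concrete, here is a minimal sketch of aligning two tokenizations through their byte offsets alone. It is not the code from this PR or from trl; the function name `align_by_byte_offsets` and the span lists are hypothetical, and it assumes both tokenizers encoded the same text, that spans are half-open byte ranges in increasing order, and that zero-length spans (for example special tokens) were filtered out beforehand.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open byte range [start, end)


def align_by_byte_offsets(
    spans_a: List[Span], spans_b: List[Span]
) -> List[Tuple[List[int], List[int]]]:
    """Group token indices from two tokenizations that cover the same bytes.

    Each group is the smallest run of tokens on both sides that ends on a
    shared byte boundary, so no decoding to strings is needed.
    """
    groups: List[Tuple[List[int], List[int]]] = []
    i = j = 0
    while i < len(spans_a) and j < len(spans_b):
        group_a, group_b = [i], [j]
        end_a, end_b = spans_a[i][1], spans_b[j][1]
        # Extend whichever side ends earlier until both end on the same byte.
        while end_a != end_b:
            if end_a < end_b:
                i += 1
                if i == len(spans_a):
                    raise ValueError("tokenizations do not end on a common byte")
                group_a.append(i)
                end_a = spans_a[i][1]
            else:
                j += 1
                if j == len(spans_b):
                    raise ValueError("tokenizations do not end on a common byte")
                group_b.append(j)
                end_b = spans_b[j][1]
        groups.append((group_a, group_b))
        i += 1
        j += 1
    return groups


# "héllo" is 6 bytes in UTF-8 (é takes 2 bytes).
spans_a = [(0, 1), (1, 3), (3, 6)]  # "h", "é", "llo"
spans_b = [(0, 3), (3, 4), (4, 6)]  # "hé", "l", "lo"
groups = align_by_byte_offsets(spans_a, spans_b)
```

Because the comparison works on byte boundaries, the same loop also handles a byte-level BPE token that ends in the middle of a multi-byte character: it simply keeps extending both groups until the boundaries meet again.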

We would therefore like to merge this small patch, which exposes the byte-level offset calculation that the Rust code already supports so that the Python classes can use it.

More description at:
huggingface/trl#4393

Copilot

Pull Request Overview

This PR adds an offset_type parameter to the encode and encode_batch methods, allowing users to choose between character-based offsets ("char"), byte-based offsets ("byte"), or no offsets ("none") for faster encoding. The default is "char" to maintain backward compatibility.

Adds offset_type parameter to both Rust and Python encoding methods
Routes to appropriate internal methods based on offset type selection
Provides input validation with helpful error messages
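For readers unsure how the "char" and "byte" offset types differ, the sketch below converts character offsets into UTF-8 byte offsets in pure Python. It is an illustration only, not code from this PR; the helper name `char_to_byte_offsets` is hypothetical, and it assumes offsets are half-open ranges indexed by Unicode code point, as in a Python `str`.

```python
from itertools import accumulate
from typing import List, Tuple


def char_to_byte_offsets(
    text: str, char_offsets: List[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Map half-open character ranges onto UTF-8 byte ranges of `text`."""
    # prefix[k] is the number of UTF-8 bytes in text[:k].
    prefix = [0, *accumulate(len(ch.encode("utf-8")) for ch in text)]
    return [(prefix[start], prefix[end]) for start, end in char_offsets]


text = "héllo 世界"
char_offsets = [(0, 5), (6, 8)]  # "héllo", "世界"
byte_offsets = char_to_byte_offsets(text, char_offsets)
```

A conversion like this can only place boundaries between whole characters. It cannot express a token boundary that falls inside a multi-byte character, which is one reason to obtain byte offsets from the tokenizer directly.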

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
bindings/python/src/tokenizer.rs	Adds `offset_type` parameter to `encode` and `encode_batch` Rust methods with validation and routing logic
bindings/python/py_src/tokenizers/implementations/base_tokenizer.py	Adds `offset_type` parameter to Python wrapper methods with documentation


Copilot · 2025-10-31T12:34:41Z

bindings/python/src/tokenizer.rs

+                    .into_pyerr::<exceptions::PyValueError>())
+                }
+            };
+


Trailing whitespace should be removed from this empty line for consistency with code style.

Suggested change

enable_getting_encoding_offsets_at_diff_lvl

452978c

This was referenced Oct 30, 2025

Problems with Cross-Tokenizer Alignment in Correctness and Efficiency huggingface/trl#4393

Open

Provide byte-level offsets for effective alignment in Cross-Tokenizer On-Policy Distillation #1881

Open

kashif requested review from ArthurZucker and Copilot October 31, 2025 12:32

Copilot AI reviewed Oct 31, 2025
