JqzChandler commented Oct 30, 2025

Issue: #1881

While implementing our in-house on-policy distillation method, our team tried several token-alignment approaches, including trl's existing implementation. That implementation repeatedly calls decode(), which produces incorrect results in some special cases and carries high computational overhead.

We later found a better method: a single call to encode() that returns byte-level offsets for all tokens. This sidesteps the complexity of BPE, and byte-level offsets are also compatible with every other tokenizer type. In addition, when distilling between two BPE tokenizers, aligning directly on bytes is more accurate because it skips the string as an intermediate representation.

We would therefore like to merge this small patch, which exposes to the Python classes the byte-level offset calculation that the Rust code already supports.
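To illustrate the alignment idea described above, here is a minimal sketch in pure Python. It is not the code from this patch or from trl: the helper names `byte_spans` and `align_by_byte_offsets` are hypothetical, and the token pieces are hand-written bytes standing in for the byte offsets that a tokenizer would return. It assumes both tokenizations cover the same byte string with contiguous, non-empty spans.

```python
def byte_spans(pieces):
    """Turn a list of byte pieces into contiguous (start, end) byte offsets.

    Hypothetical stand-in for the byte-level offsets a tokenizer would report.
    """
    spans, pos = [], 0
    for piece in pieces:
        spans.append((pos, pos + len(piece)))
        pos += len(piece)
    return spans


def align_by_byte_offsets(spans_a, spans_b):
    """Group token indices from two tokenizations into minimal chunks
    that cover exactly the same byte range.

    Assumes both span lists are contiguous, non-empty per token, and
    cover the same total number of bytes.
    """
    total_a = spans_a[-1][1] if spans_a else 0
    total_b = spans_b[-1][1] if spans_b else 0
    if total_a != total_b:
        raise ValueError("both tokenizations must cover the same bytes")

    groups = []
    i = j = 0
    while i < len(spans_a) and j < len(spans_b):
        group_a, group_b = [i], [j]
        end_a, end_b = spans_a[i][1], spans_b[j][1]
        # Extend whichever side ends earlier until both end on the same byte.
        while end_a != end_b:
            if end_a < end_b:
                i += 1
                group_a.append(i)
                end_a = spans_a[i][1]
            else:
                j += 1
                group_b.append(j)
                end_b = spans_b[j][1]
        groups.append((group_a, group_b))
        i += 1
        j += 1
    return groups


# "naïve" in UTF-8; tokenization A splits the two bytes of "ï" across tokens,
# which cannot be expressed with string pieces or character offsets.
pieces_a = [b"na", b"\xc3", b"\xaf", b"ve"]
pieces_b = ["naï".encode("utf-8"), b"ve"]

groups = align_by_byte_offsets(byte_spans(pieces_a), byte_spans(pieces_b))
print(groups)  # [([0, 1, 2], [0]), ([3], [1])]
```

Because the comparison happens on byte positions only, nothing is decoded back to a string, so a token boundary that falls inside a multi-byte character does not need special handling.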

A fuller description is available at huggingface/trl#4393.


Copilot AI left a comment


Pull Request Overview

This PR adds an offset_type parameter to the encode and encode_batch methods, allowing users to choose between character-based offsets ("char"), byte-based offsets ("byte"), or no offsets ("none") for faster encoding. The default is "char" to maintain backward compatibility.

  • Adds offset_type parameter to both Rust and Python encoding methods
  • Routes to appropriate internal methods based on offset type selection
  • Provides input validation with helpful error messages
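The difference between the "char" and "byte" settings, and the kind of validation listed above, can be sketched in pure Python. This is only an illustration under stated assumptions: the real routing lives in the Rust bindings and is not shown in this conversation, `convert_offsets` and `VALID_OFFSET_TYPES` are hypothetical names, and returning `None` for "none" is a choice made for this example rather than the patch's documented behavior.

```python
VALID_OFFSET_TYPES = ("char", "byte", "none")  # values named in the overview


def convert_offsets(text, char_spans, offset_type="char"):
    """Return offsets for `text` in the requested unit.

    `char_spans` are (start, end) offsets counted in Unicode code points.
    Hypothetical helper: it mimics the validate-then-route shape only.
    """
    if offset_type not in VALID_OFFSET_TYPES:
        raise ValueError(
            f"Invalid offset_type {offset_type!r}; "
            f"expected one of {VALID_OFFSET_TYPES}"
        )
    if offset_type == "none":
        return None  # assumption for this sketch: no offsets are produced
    if offset_type == "char":
        return list(char_spans)

    # offset_type == "byte": map each character index to its UTF-8 byte index.
    byte_pos = [0]
    for ch in text:
        byte_pos.append(byte_pos[-1] + len(ch.encode("utf-8")))
    return [(byte_pos[start], byte_pos[end]) for start, end in char_spans]


# "é" is one character but two UTF-8 bytes, so the two units diverge after it.
spans = [(0, 2), (2, 5)]
print(convert_offsets("héllo", spans, "char"))  # [(0, 2), (2, 5)]
print(convert_offsets("héllo", spans, "byte"))  # [(0, 3), (3, 6)]
```

For pure ASCII input the two units coincide, so the distinction only matters once the text contains multi-byte characters.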

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| bindings/python/src/tokenizer.rs | Adds offset_type parameter to encode and encode_batch Rust methods with validation and routing logic |
| bindings/python/py_src/tokenizers/implementations/base_tokenizer.py | Adds offset_type parameter to Python wrapper methods with documentation |


.into_pyerr::<exceptions::PyValueError>())
}
};


Copilot AI Oct 31, 2025


Trailing whitespace should be removed from this empty line for consistency with code style.
