Skip to content

Conversation

@Coqueue
Copy link
Contributor

@Coqueue Coqueue commented Apr 21, 2025

Hi, this change is to correct type notations of merges in Python binding of BPE as the counterpart in Rust implementation expects a file name string or Vec<(String, String)> instead of a mapping from a tuple of integers to another.

This is a second PR with the same change as PR#1685, which got closed with replies to questions from an reviewer that got ignored. I'll include the reasoning for this changes below.

Taking the Python SentencePieceBPETokenizer class as an example, the merges annotated as Optional[Union[str, Dict[Tuple[int, int], Tuple[int, int]]]] is only used to initialize the BPE class:

if vocab is not None and merges is not None:
        tokenizer = Tokenizer(BPE(vocab, merges, dropout=dropout, unk_token=unk_token, fuse_unk=fuse_unk))

where the BPE class is generated from the Rust implementation mentioned in the original post.

BPE class also contains below comment:

class BPE(Model):
    """
    An implementation of the BPE (Byte-Pair Encoding) algorithm

    Args:
        vocab (:obj:`Dict[str, int]`, `optional`):
            A dictionary of string keys and their ids :obj:`{"am": 0,...}`

        merges (:obj:`List[Tuple[str, str]]`, `optional`):
            A list of pairs of tokens (:obj:`Tuple[str, str]`) :obj:`[("a", "b"),...]`

This comment also gives a different type annotation for merges.

I encountered an issue initializing SentencePieceBPETokenizer with the given type annotation, but succeeded following the type annotation given in the comment of BPE class.

The type annotation of merges being Dict[Tuple[int, int], Tuple[int, int]]]] should actually be for MergesMap in Rust, as an alias of HashMap<Pair, (u32, u32)>, which is built internally in Rust from merges.

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! These files (in implementations) are kinda references, but are also not super maintained! Thanks 🤗

@ArthurZucker ArthurZucker merged commit cc01186 into huggingface:main May 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants