Running AutoTokenizer.from_pretrained with Mistral V3 is actually loading LlamaTokenizer #31375

@matheus-prandini

Description

System Info

  • Ubuntu 22.04
  • Python 3.10.12
  • Transformers Version: from source

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm trying to load the Mistral tokenizer via AutoTokenizer for a Mistral model with the following code snippet:

from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
auto_tokenizer = AutoTokenizer.from_pretrained(model_id)

When I inspect the auto_tokenizer variable, I get a LlamaTokenizerFast instead:

LlamaTokenizerFast(name_or_path='mistralai/Mistral-7B-Instruct-v0.3', vocab_size=32768, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	...
}

Unless I'm missing something, this loads a different tokenizer than I expected.

Expected behavior

IMHO it should instantiate a MistralTokenizer.v3() tokenizer as implemented in mistral-common. I checked the TOKENIZER_MAPPING object, and Mistral isn't even listed there.
