System Info
- Ubuntu 22.04
- Python 3.10.12
- Transformers Version: from source
Who can help?
@ArthurZucker
Information
Tasks
Reproduction
I'm trying to load the tokenizer for a Mistral model using AutoTokenizer in the following code snippet:
from transformers import AutoTokenizer
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
auto_tokenizer = AutoTokenizer.from_pretrained(model_id)
When I inspect the auto_tokenizer variable, I get a LlamaTokenizerFast:
LlamaTokenizerFast(name_or_path='mistralai/Mistral-7B-Instruct-v0.3', vocab_size=32768, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
...
}
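For context, here is a toy sketch of how AutoTokenizer appears to resolve the tokenizer class (my simplified reading of the resolution order, not the actual transformers implementation; the function name, the mapping contents, and the "Fast"-suffix rule are all assumptions for illustration):

```python
# Toy stand-in for transformers' TOKENIZER_MAPPING (not the real mapping).
TOKENIZER_MAPPING_NAMES = {
    "llama": ("LlamaTokenizer", "LlamaTokenizerFast"),
}

def resolve_tokenizer_class(tokenizer_config, model_type, use_fast=True):
    """Hypothetical sketch of AutoTokenizer's class resolution.

    1. If the repo's tokenizer_config.json pins "tokenizer_class", use it
       (upgrading to the "Fast" variant when use_fast=True).
    2. Otherwise fall back to the model_type entry in the mapping.
    """
    explicit = tokenizer_config.get("tokenizer_class")
    if explicit:
        if use_fast and not explicit.endswith("Fast"):
            explicit += "Fast"
        return explicit
    slow, fast = TOKENIZER_MAPPING_NAMES[model_type]
    return fast if use_fast else slow

# If the hub repo's tokenizer_config.json declares "LlamaTokenizer", this
# yields LlamaTokenizerFast regardless of whether "mistral" is in the mapping:
print(resolve_tokenizer_class({"tokenizer_class": "LlamaTokenizer"}, "mistral"))
# → LlamaTokenizerFast
```

If that sketch is roughly right, the LlamaTokenizerFast above could come from the class pinned in the repo's tokenizer_config.json rather than from a "mistral" entry in TOKENIZER_MAPPING.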
I don't know if I'm missing something, but it is loading a different tokenizer than I expected.
Expected behavior
IMHO it should instantiate a MistralTokenizer.v3() tokenizer as implemented in mistral-common. I checked the TOKENIZER_MAPPING object, and Mistral isn't even listed there.
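For comparison, this is roughly what I'd expect to get, using mistral-common directly (a sketch based on the mistral-common README, requires `pip install mistral-common`; the exact import paths may differ between versions):

```python
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load the v3 tokenizer bundled with mistral-common.
tokenizer = MistralTokenizer.v3()

# Tokenize a chat request the way the Mistral tooling does.
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[UserMessage(content="Hello!")])
)
print(tokenized.tokens)
</imports>
</imports>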