Hello,
I am trying to use tiktoken to tokenize text that contains math symbols such as ∩, ⊆, and A⊇B. However, tiktoken fails to decode a symbol like ∩ as a single token.
Code:
import tiktoken
encoding = tiktoken.encoding_for_model('gpt-4')
encoded_message = encoding.encode("∩")
decoded_tokens = [encoding.decode_single_token_bytes(token).decode("utf-8") for token in encoded_message]
print(decoded_tokens)
I tried the encoding.decode() method and it works well, but it gives me the full text, and I need a list of decoded tokens instead.
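To illustrate what seems to be happening, here is a minimal sketch that does not call tiktoken at all. It assumes the symbol's UTF-8 bytes end up split across more than one token, so that no single token holds a complete character. The helper name decode_token_bytes and the particular byte split are hypothetical and chosen only for illustration; the real split depends on the vocabulary.

```python
def decode_token_bytes(chunks):
    """Decode each token's raw bytes on its own.

    Bytes that do not form a complete UTF-8 sequence are replaced with
    U+FFFD instead of raising UnicodeDecodeError.
    """
    return [chunk.decode("utf-8", errors="replace") for chunk in chunks]


# "∩" occupies three bytes in UTF-8.
raw = "∩".encode("utf-8")

# Hypothetical split across two tokens (an assumption, not tiktoken's actual output).
chunks = [raw[:2], raw[2:]]

# Each piece on its own is not valid UTF-8, so it decodes to replacement characters.
pieces = decode_token_bytes(chunks)
print(pieces)

# Joining the bytes first and decoding once recovers the original symbol.
print(b"".join(chunks).decode("utf-8"))
```

With tiktoken, the chunks would presumably come from encoding.decode_single_token_bytes(token) for each token in encoded_message, as in the code above. The trade-off is that tokens holding only part of a character show up as replacement characters rather than as the symbol itself.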
Any help?