Dedicated tokenizer for byte-level transformers #36202

### Feature request

There are alternative transformer architectures that handle bytes directly:
- [Byte Latent Transformer](https://arxiv.org/abs/2412.09871) by Meta
- [ByT5](https://arxiv.org/abs/2105.13626) by Google
- [tokun](https://huggingface.co/blog/apehex/this-title-is-already-tokenized) by me 👾

Instead of mapping text to IDs from a learned vocabulary, the tokenizer would return the raw bytes of the text in a given encoding (UTF-8, for example).
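To make the request concrete, here is a minimal sketch of what "getting the raw encoding bytes" could look like in plain Python. The function names `encode_bytes` and `decode_bytes` are hypothetical and not part of any existing library API; the default of UTF-8 is also an assumption, since other encodings would work the same way.

```python
def encode_bytes(text: str, encoding: str = "utf-8") -> list[int]:
    """Return the text as a list of byte values (0..255). Hypothetical helper."""
    return list(text.encode(encoding))


def decode_bytes(ids: list[int], encoding: str = "utf-8") -> str:
    """Inverse of encode_bytes; invalid byte sequences are replaced, not raised."""
    return bytes(ids).decode(encoding, errors="replace")


ids = encode_bytes("héllo")
print(ids)                # [104, 195, 169, 108, 108, 111]
print(decode_bytes(ids))  # héllo
```

Note that the non-ASCII character `é` becomes two bytes, so the sequence length depends on the encoding rather than on the number of characters.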

### Motivation

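Combinations of bytes are more expressive than a flat vocabulary, and they avoid the ~100k-wide dimensions that such vocabularies impose on the first (embedding) and last (output) layers.
A patch of 4 bytes can take 256^4 = 4294967296 distinct values, i.e. it can represent that many distinct tokens of length 4.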
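The arithmetic above can be checked with a short sketch that groups a byte sequence into fixed-size patches. The helper `to_patches`, the zero padding value, and the default patch size of 4 are assumptions made for illustration; the actual draft may group or pad differently.

```python
def to_patches(data: bytes, patch_size: int = 4, pad: int = 0) -> list[tuple[int, ...]]:
    """Split bytes into fixed-size patches, padding the last one. Hypothetical helper."""
    missing = -len(data) % patch_size          # bytes needed to fill the last patch
    padded = data + bytes([pad]) * missing
    return [tuple(padded[i:i + patch_size]) for i in range(0, len(padded), patch_size)]


patch_size = 4
patches = to_patches("hello".encode("utf-8"), patch_size)
print(patches)             # [(104, 101, 108, 108), (111, 0, 0, 0)]

# Number of distinct patches of 4 bytes, each byte having 256 possible values.
print(256 ** patch_size)   # 4294967296
```

Each position in a patch only ever needs to distinguish 256 values, which is where the saving over a ~100k-entry vocabulary comes from.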

### Your contribution

I have a draft that I will PR shortly!
