Feature request
There are alternative transformer architectures that handle bytes directly:
Instead of tokenizing text against a fixed vocabulary, the model would consume the raw bytes of the encoded text (for example, its UTF-8 bytes).
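To make the idea concrete, here is a minimal sketch of what "tokenizing" to raw bytes could look like. It assumes UTF-8 as the encoding, and the helper names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical, not part of any existing API:

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as a list of byte values in the range 0..255.

    Assumption: UTF-8 is used as the encoding.
    """
    return list(text.encode("utf-8"))


def byte_ids_to_text(byte_ids: list[int]) -> str:
    """Inverse of text_to_byte_ids: rebuild the string from byte values."""
    return bytes(byte_ids).decode("utf-8")


ids = text_to_byte_ids("héllo")
print(ids)                    # [104, 195, 169, 108, 108, 111]
print(byte_ids_to_text(ids))  # héllo
```

Note that the non-ASCII character "é" becomes two byte ids, so the sequence is longer than the string, but no vocabulary file is needed and every input is representable.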
Motivation
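Combinations of bytes have more expressive power than a flat vocabulary, and they avoid the dimensions of around 100k that such a vocabulary imposes on the first (embedding) and last (output) layers, since a single byte has only 256 possible values.
A patch of 4 bytes can take 256^4 = 4294967296 distinct values, so it can represent that many different tokens of length 4.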
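The arithmetic above can be checked with a short sketch that also shows one way of grouping byte ids into fixed-size patches. The helper names `num_patch_values` and `to_patches`, and the choice of padding the last patch with 0, are assumptions for illustration only, not the design of the planned PR:

```python
BYTE_VOCAB = 256   # number of distinct values a single byte can take
PATCH_SIZE = 4     # bytes per patch, as in the example above


def num_patch_values(patch_size: int = PATCH_SIZE) -> int:
    """Number of distinct byte combinations a patch of this size can hold."""
    return BYTE_VOCAB ** patch_size


def to_patches(byte_ids: list[int], patch_size: int = PATCH_SIZE,
               pad_id: int = 0) -> list[list[int]]:
    """Split byte ids into consecutive patches of length patch_size.

    Assumption: the final patch is right-padded with pad_id if needed.
    """
    padding = [pad_id] * (-len(byte_ids) % patch_size)
    padded = list(byte_ids) + padding
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


print(num_patch_values())               # 4294967296
print(to_patches([1, 2, 3, 4, 5, 6]))   # [[1, 2, 3, 4], [5, 6, 0, 0]]
```

A real model would likely use a dedicated padding value or an attention mask rather than byte 0, which is itself a valid byte.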
Your contribution
I have a draft that I will open as a PR shortly!