
Dedicated tokenizer for byte level transformers #36202

@apehex

Feature request

There are alternative transformer architectures that handle bytes directly, such as ByT5 or the Byte Latent Transformer (BLT).

Instead of tokenizing against a fixed vocabulary, the idea would be to feed the model the raw bytes of the text encoding (e.g., UTF-8) directly.
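For concreteness, here is a minimal sketch of the behaviour such a tokenizer could expose (the `encode`/`decode` names are illustrative, not the draft's actual API):

```python
def encode(text: str) -> list[int]:
    """Return the raw UTF-8 bytes of the text as IDs in [0, 255]."""
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    """Rebuild the text from its byte IDs."""
    return bytes(ids).decode("utf-8", errors="replace")

print(encode("héllo"))                     # [104, 195, 169, 108, 108, 111]
assert decode(encode("héllo")) == "héllo"
```

Note that the "vocabulary" is fixed at 256 symbols, so no vocab file or merge table is needed.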

Motivation

Combinations of bytes are more expressive than a flat vocabulary, and they avoid embedding and output layers with ~100k-dimensional vocabularies.
A patch of 4 bytes can represent 256**4 = 4,294,967,296 distinct tokens of length 4, while each byte position only ever needs 256 embeddings.
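A rough sketch of that patching idea, assuming a fixed 4-byte patch size and zero padding of the tail (the padding scheme is my assumption, not something specified in this issue):

```python
PATCH_SIZE = 4  # matches the 4-byte example above

def to_patches(text: str) -> list[tuple[int, ...]]:
    """Split the raw UTF-8 bytes into fixed-size patches, zero-padding the tail."""
    data = text.encode("utf-8")
    data += bytes(-len(data) % PATCH_SIZE)  # pad to a multiple of PATCH_SIZE
    return [tuple(data[i:i + PATCH_SIZE]) for i in range(0, len(data), PATCH_SIZE)]

print(to_patches("byte level"))  # 3 patches of 4 byte IDs each
print(256 ** PATCH_SIZE)         # 4294967296 distinct 4-byte patches
```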

Your contribution

I have a draft that I will PR shortly!
