tokenizers_intrinsic_benchmark

Code for the paper Greed is All You Need: An Evaluation of Tokenizer Inference Methods

Requirements

Python packages are listed in requirements.txt. This code does not require a GPU or TPU.

Notes

The benchmark supports tokenizers serialized in the HuggingFace JSON format. In addition, we've added support for some custom inference methods (greedy longest suffix, greedy longest token, etc.). The JSON files we used for the paper will be added soon as examples.
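To make the idea of a custom inference method concrete, here is a minimal sketch of greedy longest-suffix segmentation. It is an illustration only, not the code used in this repository: the function name `greedy_longest_suffix`, the toy vocabulary, and the single-character fallback for unknown material are all assumptions, and special handling such as continuation markers is ignored.

```python
def greedy_longest_suffix(word, vocab):
    """Segment `word` right to left, always taking the longest suffix found in `vocab`.

    Hypothetical helper for illustration; falls back to a single character
    when no vocabulary entry matches.
    """
    tokens = []
    end = len(word)
    while end > 0:
        # Candidates are tried from longest (start=0) to shortest (start=end-1).
        for start in range(end):
            candidate = word[start:end]
            if candidate in vocab or start == end - 1:
                tokens.append(candidate)
                end = start
                break
    # Tokens were collected from the end of the word, so restore reading order.
    return tokens[::-1]


# Toy vocabulary, made up for this example.
toy_vocab = {"un", "break", "able", "breakable"}
print(greedy_longest_suffix("unbreakable", toy_vocab))
```

With this toy vocabulary the suffix "breakable" is preferred over "able" because it is longer, which is the defining property of the greedy strategy.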

Resources

The resources we've used for the evaluation are in the resources folder.

Resource	Reference
LADEC	paper
MorphoLex	paper
MorphyNet	paper
DagoBert	paper
UniMorph	paper
UnBlend	paper
CompoundPiece	paper
Cognitive data	paper
tokenization-scorer	paper

Execution

Run main.py from the directory that contains it.

Arguments:

	--tokenizers: path to a text file listing the paths of the tokenizer config files (JSON format). Defaults to tokenizers.txt in the working directory.
	--compare: a boolean argument for comparing the segmentation difference between inference methods. Defaults to False. If enabled, make sure the default segmentation is the first path in the tokenizers paths file, and that all tokenizers share the same vocabulary.

Example:

python main.py \
        --tokenizers tokenizers.txt
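The sketch below illustrates the two arguments under stated assumptions; it is not taken from main.py. It assumes the paths file holds one tokenizer path per line, and it uses the share of words segmented differently from the default (first) tokenizer as one plausible reading of "segmentation difference". The names `read_tokenizer_paths` and `segmentation_difference` are hypothetical.

```python
from pathlib import Path


def read_tokenizer_paths(list_file):
    """Return the non-empty lines of `list_file` as paths (assumes one path per line)."""
    lines = Path(list_file).read_text(encoding="utf-8").splitlines()
    return [Path(line.strip()) for line in lines if line.strip()]


def segmentation_difference(default_segs, other_segs):
    """Fraction of words whose segmentation differs from the default one.

    Both arguments are lists of token lists, aligned word by word.
    """
    if len(default_segs) != len(other_segs):
        raise ValueError("both lists must cover the same words")
    if not default_segs:
        return 0.0
    differing = sum(a != b for a, b in zip(default_segs, other_segs))
    return differing / len(default_segs)


# Made-up segmentations of the same two words under two inference methods.
default = [["un", "breakable"], ["token", "izer"]]
other = [["un", "break", "able"], ["token", "izer"]]
print(segmentation_difference(default, other))
```

Comparing token lists word by word only makes sense when every tokenizer draws from the same vocabulary, which is why the shared-vocabulary condition on --compare matters.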

Citation

@inproceedings{uzan-etal-2024-greed,
    title = "Greed is All You Need: An Evaluation of Tokenizer Inference Methods",
    author = "Uzan, Omri  and
      Schmidt, Craig W.  and
      Tanner, Chris  and
      Pinter, Yuval",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-short.73",
    doi = "10.18653/v1/2024.acl-short.73",
    pages = "813--822",
    abstract = "While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed. We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary sizes, performed on a novel intrinsic evaluation suite we curated for English, combining measures rooted in morphology, cognition, and information theory. We show that for the most commonly used tokenizers, greedy inference performs surprisingly well; and that SaGe, a recently-introduced contextually-informed tokenizer, outperforms all others on morphological alignment.",
}
