This repository contains code to run sentence-transformer feature extractors faster using quantization and ONNX optimization. Just run your model much faster, while using less memory. There is not much to it!
Philipp Schmid: "We successfully quantized our vanilla Transformers model with Hugging Face and managed to accelerate our model latency from 25.6ms to 12.3ms or 2.09x while keeping 100% of the accuracy on the stsb dataset. But I have to say that this isn't a plug and play process you can transfer to any Transformers model, task or dataset."
```shell
pip install fast-sentence-transformers
```

Or, for GPU support:

```shell
pip install fast-sentence-transformers[gpu]
```

```python
from fast_sentence_transformers import FastSentenceTransformer as SentenceTransformer

# use any sentence-transformer
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")

encoder.encode("Hello hello, hey, hello hello")
encoder.encode(["Life is too short to eat bad food!"] * 2)
```

Non-exact, indicative benchmark for speed and memory usage with a smaller and a larger model on sentence-transformers:
| model | type | default | ONNX | ONNX+quantized | ONNX+GPU |
|---|---|---|---|---|---|
| paraphrase-albert-small-v2 | memory | 1x | 1x | 1x | 1x |
| paraphrase-albert-small-v2 | speed | 1x | 2x | 5x | 20x |
| paraphrase-multilingual-mpnet-base-v2 | memory | 1x | 1x | 4x | 4x |
| paraphrase-multilingual-mpnet-base-v2 | speed | 1x | 2x | 5x | 20x |
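The embeddings produced by `encode` are plain NumPy vectors, so downstream similarity search works exactly as with vanilla sentence-transformers. A minimal sketch of comparing embeddings with cosine similarity, using placeholder vectors in place of real encoder output (the helper `cosine_sim` is illustrative, not part of this package):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # cosine similarity: dot product of the two vectors divided by
    # the product of their L2 norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# placeholder vectors standing in for encoder.encode(...) output
emb_a = np.array([1.0, 0.0, 1.0])
emb_b = np.array([1.0, 0.0, 1.0])
emb_c = np.array([0.0, 1.0, 0.0])

print(cosine_sim(emb_a, emb_b))  # identical vectors -> 1.0
print(cosine_sim(emb_a, emb_c))  # orthogonal vectors -> 0.0
```

Since the optimized model produces the same embeddings (within quantization tolerance), any existing similarity or clustering pipeline can be kept unchanged.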
This package heavily leans on https://www.philschmid.de/optimize-sentence-transformers.