# Benchmarks

This section is dedicated to the benchmarks done with the library, by maintainers, contributors, and users alike. These
benchmarks will help keep track of the performance improvements brought to our models across versions.

## Benchmarking all models for inference

As of version 2.1, we have benchmarked all models for inference across many different settings: using PyTorch, with
and without TorchScript, and using TensorFlow, with and without XLA. All of those tests were run on CPUs (except for
TensorFlow XLA) and on GPUs.

The approach is detailed in the [following blog post](https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2).

The results are available [here](https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing).

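To give a concrete sense of the kind of measurement involved, here is a minimal sketch timing a model in eager PyTorch against its traced TorchScript counterpart. The model, input sentence, and repetition counts are illustrative choices, not the settings of the benchmark itself:

```python
import timeit

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# torchscript=True configures the model to return tuples so it can be traced
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
model.eval()

input_ids = torch.tensor([tokenizer.encode("Hello, benchmarking world!")])

with torch.no_grad():
    # Eager PyTorch: best of 3 rounds of 10 forward passes
    eager_time = min(timeit.repeat(lambda: model(input_ids), repeat=3, number=10))

    # TorchScript: trace once, then time the compiled module the same way
    traced_model = torch.jit.trace(model, input_ids)
    traced_time = min(timeit.repeat(lambda: traced_model(input_ids), repeat=3, number=10))

print(f"eager: {eager_time:.3f}s, torchscript: {traced_time:.3f}s (10 forward passes per round)")
```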

## TF2 with mixed precision, XLA, Distribution (@tlkh)

This work was done by [Timothy Liu](https://github.com/tlkh).

There are very positive results to be gained from the various TensorFlow 2.0 features (a usage sketch follows the list):

- Automatic Mixed Precision (AMP)
- XLA compiler
- Distribution strategies (multi-GPU)

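As a rough illustration, all three features can be switched on along the following lines. This is a minimal sketch, not the benchmark script itself; the tiny Keras classifier stands in for the BERT fine-tuning the actual script performs:

```python
import tensorflow as tf

# XLA: JIT-compile TensorFlow graphs where possible
tf.config.optimizer.set_jit(True)

# AMP: graph rewrite that runs eligible ops in float16
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})

# Distribution: replicate the model across all visible GPUs
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Any Keras model works here; a tiny classifier stands in for BERT
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Note that `MirroredStrategy` splits the global batch across replicas, which is why the distributed runs below use 4x the single-GPU batch size.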

The benefits are listed here (tested on CoLA, MRPC, SST-2):

- AMP: between 1.4x and 1.6x decrease in overall time, with no change in batch size
- AMP + XLA: up to 2.5x decrease in overall time on SST-2 (the larger dataset)
- Distribution: between 1.4x and 3.4x decrease in overall time on 4x V100
- Combined: up to 5.7x decrease in overall training time, or a 9.1x increase in training throughput

The model quality (measured by the validation accuracy) fluctuates slightly. Taking an average of 4 training runs
on a single GPU gives the following results:

- CoLA: AMP results in slightly lower accuracy (0.820 vs 0.824)
- MRPC: AMP results in lower accuracy (0.823 vs 0.835)
- SST-2: AMP results in slightly lower accuracy (0.918 vs 0.922)

However, in a distributed setting with 4x V100 (4x batch size), AMP can yield better results:

- CoLA: AMP results in higher accuracy (0.828 vs 0.812)
- MRPC: AMP results in lower accuracy (0.817 vs 0.827)
- SST-2: AMP results in slightly lower accuracy (0.926 vs 0.929)

The benchmark script is available [here](https://github.com/NVAITC/benchmarking/blob/master/tf2/bert_dist.py).

Note: on some tasks (e.g. MRPC), the dataset is too small for these optimizations to pay off. The overhead of model
compilation with XLA, as well as of the distribution strategy setup, cancels out the gains. XLA compile time is also
why, although throughput can increase substantially (e.g. 2.7x on a single GPU), the overall (end-to-end) training
speed-up is smaller (as low as 1.4x).

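This one-off compilation cost is easy to observe in isolation: the first call to an XLA-compiled function pays for tracing and compilation, while later calls reuse the cached executable. A minimal sketch (the `jit_compile` argument exists in recent TensorFlow versions, earlier releases named it `experimental_compile`; the function and matrix size are arbitrary):

```python
import time

import tensorflow as tf


@tf.function(jit_compile=True)  # ask XLA to compile this function
def step(x):
    return tf.matmul(x, x)


x = tf.random.normal((1024, 1024))

start = time.perf_counter()
step(x)  # first call: traces the graph and compiles it with XLA
print(f"first call (incl. compile): {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
step(x)  # subsequent calls run the cached executable
print(f"second call: {time.perf_counter() - start:.4f}s")
```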

The benefits, as seen on SST-2 (the larger dataset), are much clearer.

All results can be seen on this [Google Sheet](https://docs.google.com/spreadsheets/d/1538MN224EzjbRL239sqSiUy6YY-rAjHyXhTzz_Zptls/edit#gid=960868445).