
Commit bf2c36a

Merge pull request #1 from huggingface/master
update
2 parents fd97761 + 82f6abd commit bf2c36a

File tree

4 files changed, +537 −0 lines changed
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
---
name: "\U0001F5A5 New Benchmark"
about: You benchmarked a part of this library and would like to share your results
title: "[Benchmark]"
labels: ''
assignees: ''

---

# Benchmarking Transformers

## Benchmark

Which part of Transformers did you benchmark?

## Set-up

What did you run your benchmarks on? Please include details such as CPU and GPU. If you used multiple GPUs, which parallelization strategy did you use?

## Results

Put your results here!

docs/source/benchmarks.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
# Benchmarks

This section is dedicated to the benchmarks done on the library by maintainers, contributors and users alike. These
benchmarks help keep track of the performance improvements brought to our models across versions.

## Benchmarking all models for inference

As of version 2.1 we have benchmarked all models for inference, across many different settings: using PyTorch, with
and without TorchScript; using TensorFlow, with and without XLA. All of those tests were run on both CPUs and GPUs
(except for TensorFlow XLA, which was only run on GPU).

The approach is detailed in the [following blogpost](https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2).

The results are available [here](https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing).
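
As a rough illustration of the kind of measurement involved (this is not the library's official benchmarking script), the sketch below times a BERT forward pass in eager PyTorch and through TorchScript tracing; the model name, batch size, sequence length and number of runs are arbitrary assumptions:

```python
# Illustrative inference-timing sketch (not the official benchmark script).
# Model name, batch size, sequence length and run count are arbitrary assumptions.
import time

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
model.eval()

# Dummy input: a batch of 8 sequences of length 128 drawn from BERT's vocabulary.
input_ids = torch.randint(0, 30522, (8, 128), dtype=torch.long)

def average_time(fn, n_runs=30):
    fn()  # warm-up run, excluded from the measurement
    start = time.time()
    for _ in range(n_runs):
        fn()
    return (time.time() - start) / n_runs

with torch.no_grad():
    eager_ms = average_time(lambda: model(input_ids)) * 1000
    traced = torch.jit.trace(model, (input_ids,))
    traced_ms = average_time(lambda: traced(input_ids)) * 1000

print(f"eager PyTorch: {eager_ms:.1f} ms/batch")
print(f"TorchScript:   {traced_ms:.1f} ms/batch")
```

On GPU, the same sketch would additionally move the model and inputs to CUDA and call `torch.cuda.synchronize()` before reading the clock.
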

## TF2 with mixed precision, XLA, Distribution (@tlkh)

This work was done by [Timothy Liu](https://github.com/tlkh).

There are very positive results to be gained from the various TensorFlow 2.0 features:

- Automatic Mixed Precision (AMP)
- XLA compiler
- Distribution strategies (multi-GPU)
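
A minimal sketch of how these three features are typically switched on in TF 2.x is shown below; it is not the benchmark script linked further down, and the model, optimizer and hyperparameters are assumptions:

```python
# Illustrative TF 2.x setup for AMP, XLA and multi-GPU distribution
# (not the benchmark script itself; model and hyperparameters are assumptions).
import tensorflow as tf
from transformers import TFBertForSequenceClassification

# Automatic Mixed Precision: run eligible ops in float16 while keeping float32 variables.
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})

# XLA: JIT-compile the graph; pays a one-off compile cost for higher steady-state throughput.
tf.config.optimizer.set_jit(True)

# Distribution: replicate the model on every visible GPU and split each batch across them.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# train_dataset and valid_dataset are assumed to be tf.data.Dataset objects of (features, label) pairs:
# model.fit(train_dataset, validation_data=valid_dataset, epochs=3)
```

With `MirroredStrategy` the global batch size is the per-replica batch size times the number of GPUs, which is what the "4x batch size" mentioned below refers to.
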

The benefits are listed here (tested on CoLA, MRPC, SST-2):

- AMP: Between 1.4x and 1.6x decrease in overall time without any change in batch size
- AMP+XLA: Up to 2.5x decrease in overall time on SST-2 (the larger dataset)
- Distribution: Between 1.4x and 3.4x decrease in overall time on 4xV100
- Combined: Up to 5.7x decrease in overall training time, or 9.1x training throughput

The model quality (measured by the validation accuracy) fluctuates slightly. Taking an average of 4 training runs
on a single GPU gives the following results:

- CoLA: AMP results in slightly lower accuracy (0.820 vs 0.824)
- MRPC: AMP results in lower accuracy (0.823 vs 0.835)
- SST-2: AMP results in slightly lower accuracy (0.918 vs 0.922)

However, in a distributed setting with 4xV100 (4x batch size), AMP can yield better results:

- CoLA: AMP results in higher accuracy (0.828 vs 0.812)
- MRPC: AMP results in lower accuracy (0.817 vs 0.827)
- SST-2: AMP results in slightly lower accuracy (0.926 vs 0.929)

The benchmark script is available [here](https://github.com/NVAITC/benchmarking/blob/master/tf2/bert_dist.py).

Note: on some tasks (e.g. MRPC), the dataset is too small, so the overhead of the XLA compilation and of setting up
the distribution strategy outweighs the gains. The XLA compile time is also why, although throughput can increase
a lot (e.g. 2.7x on a single GPU), the overall (end-to-end) training speed-up is not as large (as low as 1.4x).

The benefits, as seen on SST-2 (the larger dataset), are much clearer.

All results can be seen on this [Google Sheet](https://docs.google.com/spreadsheets/d/1538MN224EzjbRL239sqSiUy6YY-rAjHyXhTzz_Zptls/edit#gid=960868445).

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
   bertology
   torchscript
   multilingual
+  benchmarks

.. toctree::
   :maxdepth: 2
