---
language: id
tags:
- indobert
- indolem
license: mit
inference: false
datasets:
- 220M words (IndoWiki, IndoWC, News)
---

## About

[IndoBERT](https://arxiv.org/pdf/2011.00677.pdf) is the Indonesian version of the BERT model. We trained the model on over 220M words, aggregated from three main sources:
* Indonesian Wikipedia (74M words)
* news articles from Kompas, Tempo (Tala et al., 2003), and Liputan6 (55M words in total)
* an Indonesian Web Corpus (Medved and Suchomel, 2017) (90M words)

We trained the model for 2.4M steps (180 epochs), reaching a final perplexity over the development set of <b>3.97</b> (similar to English BERT-base).
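
Since IndoBERT is trained with the masked-language-modelling objective, it can be queried directly for mask filling. Below is a minimal, illustrative sketch; the example sentence is our own, not from the paper:

```python
from transformers import pipeline

# Load a fill-mask pipeline backed by IndoBERT; the model predicts the
# token hidden behind the tokenizer's mask token ([MASK] for BERT models).
fill_mask = pipeline(
    "fill-mask",
    model="indolem/indobert-base-uncased",
    tokenizer="indolem/indobert-base-uncased",
)

# "ibu kota indonesia adalah [MASK]." = "the capital of indonesia is [MASK]."
masked = f"ibu kota indonesia adalah {fill_mask.tokenizer.mask_token}."
for prediction in fill_mask(masked):
    print(prediction["sequence"], prediction["score"])
```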

This <b>IndoBERT</b> was used to evaluate <b>IndoLEM</b>, an Indonesian benchmark that comprises seven tasks spanning morpho-syntax, semantics, and discourse (a minimal fine-tuning sketch follows the table):

| Task | Metric | Bi-LSTM | mBERT | MalayBERT | IndoBERT |
| ---- | ---- | ---- | ---- | ---- | ---- |
| POS Tagging | Acc | 95.4 | <b>96.8</b> | <b>96.8</b> | <b>96.8</b> |
| NER UGM | F1 | 70.9 | 71.6 | 73.2 | <b>74.9</b> |
| NER UI | F1 | 82.2 | 82.2 | 87.4 | <b>90.1</b> |
| Dep. Parsing (UD-Indo-GSD) | UAS/LAS | 85.25/80.35 | 86.85/81.78 | 86.99/81.87 | <b>87.12</b>/<b>82.32</b> |
| Dep. Parsing (UD-Indo-PUD) | UAS/LAS | 84.04/79.01 | <b>90.58</b>/<b>85.44</b> | 88.91/83.56 | 89.23/83.95 |
| Sentiment Analysis | F1 | 71.62 | 76.58 | 82.02 | <b>84.13</b> |
| Summarization | R1/R2/RL | 67.96/61.65/67.24 | 68.40/61.66/67.67 | 68.44/61.38/67.71 | <b>69.93</b>/<b>62.86</b>/<b>69.21</b> |
| Next Tweet Prediction | Acc | 73.6 | 92.4 | 93.1 | <b>93.7</b> |
| Tweet Ordering | Spearman corr. | 0.45 | 0.53 | 0.51 | <b>0.59</b> |
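
For the classification-style tasks above (e.g. sentiment analysis), the usual recipe is to fine-tune IndoBERT with a task-specific head. A minimal sketch, not the exact setup from the paper; the labels, example sentences, and hyperparameters are all made up:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
# Adds a randomly initialised classification head on top of IndoBERT.
model = AutoModelForSequenceClassification.from_pretrained(
    "indolem/indobert-base-uncased", num_labels=2
)

# Toy data: 1 = positive, 0 = negative (an assumed labelling scheme).
texts = ["filmnya bagus sekali", "pelayanannya sangat mengecewakan"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# A single gradient step; a real run loops over a labelled dataset.
model.train()
outputs = model(**batch, labels=labels, return_dict=True)
outputs.loss.backward()
optimizer.step()
```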

The paper was published at COLING 2020 (the 28th International Conference on Computational Linguistics). Please refer to https://indolem.github.io for more details about the benchmarks.

## How to use

### Load model and tokenizer (tested with transformers==3.5.1)

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModel.from_pretrained("indolem/indobert-base-uncased")
```
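
Once loaded, the model can be used as a sentence encoder. A small usage sketch; the input sentence is arbitrary:

```python
import torch

text = "selamat pagi, apa kabar?"  # "good morning, how are you?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

# Take the hidden state of the [CLS] token as a crude sentence embedding.
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)  # torch.Size([1, 768])
```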

## Citation
If you use our work, please cite:

```bibtex
@inproceedings{koto2020indolem,
  title={IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP},
  author={Fajri Koto and Afshin Rahimi and Jey Han Lau and Timothy Baldwin},
  booktitle={Proceedings of the 28th COLING},
  year={2020}
}
```