---
language: en
license: mit
datasets:
- AI4Bharat IndicNLP Corpora
---

# IndicBERT

IndicBERT is a multilingual ALBERT model pretrained exclusively on 12 major Indian languages. It is pre-trained on our novel monolingual corpus of around 9 billion tokens and subsequently evaluated on a diverse set of tasks. IndicBERT has far fewer parameters than other multilingual models (mBERT, XLM-R, etc.) while achieving performance on par with or better than these models.

The 12 languages covered by IndicBERT are: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.

The code can be found [here](https://github.com/divkakwani/indic-bert). For more information, check out our [project page](https://indicnlp.ai4bharat.org/) or our [paper](https://indicnlp.ai4bharat.org/papers/arxiv2020_indicnlp_corpus.pdf).



## Pretraining Corpus

We pre-trained IndicBERT on AI4Bharat's monolingual corpus. The corpus has the following distribution of languages:

| Language          | as     | bn     | en     | gu     | hi     | kn     |         |
| ----------------- | ------ | ------ | ------ | ------ | ------ | ------ | ------- |
| **No. of Tokens** | 36.9M  | 815M   | 1.34B  | 724M   | 1.84B  | 712M   |         |
| **Language**      | **ml** | **mr** | **or** | **pa** | **ta** | **te** | **all** |
| **No. of Tokens** | 767M   | 560M   | 104M   | 814M   | 549M   | 671M   | 8.9B    |


## Evaluation Results

IndicBERT is evaluated on IndicGLUE and some additional tasks. The results are summarized below. For more details about the tasks, refer to our [official repo](https://github.com/divkakwani/indic-bert).

#### IndicGLUE

Task | mBERT | XLM-R | IndicBERT
-----| ----- | ----- | ------
News Article Headline Prediction | 89.58 | 95.52 | **95.87**
Wikipedia Section Title Prediction | **73.66** | 66.33 | 73.31
Cloze-style multiple-choice QA | 39.16 | 27.98 | **41.87**
Article Genre Classification | 90.63 | 97.03 | **97.34**
Named Entity Recognition (F1-score) | **73.24** | 65.93 | 64.47
Cross-Lingual Sentence Retrieval Task | 21.46 | 13.74 | **27.12**
Average | 64.62 | 61.09 | **66.66**

#### Additional Tasks

Task | Task Type | mBERT | XLM-R | IndicBERT
-----| ----- | ----- | ------ | -----
BBC News Classification | Genre Classification | 60.55 | **75.52** | 74.60
IIT Product Reviews | Sentiment Analysis | 74.57 | **78.97** | 71.32
IITP Movie Reviews | Sentiment Analysis | 56.77 | **61.61** | 59.03
Soham News Article | Genre Classification | 80.23 | **87.6** | 78.45
Midas Discourse | Discourse Analysis | 71.20 | **79.94** | 78.44
iNLTK Headlines Classification | Genre Classification | 87.95 | 93.38 | **94.52**
ACTSA Sentiment Analysis | Sentiment Analysis | 48.53 | 59.33 | **61.18**
Winograd NLI | Natural Language Inference | 56.34 | 55.87 | **56.34**
Choice of Plausible Alternative (COPA) | Natural Language Inference | 54.92 | 51.13 | **58.33**
Amrita Exact Paraphrase | Paraphrase Detection | **93.81** | 93.02 | 93.75
Amrita Rough Paraphrase | Paraphrase Detection | 83.38 | 82.20 | **84.33**
Average | | 69.84 | **74.42** | 73.66

\* Note: all models were restricted to a max_seq_length of 128.
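
To reproduce this setting, inputs can be truncated and padded to 128 tokens at tokenization time. The sketch below is illustrative only, assuming the Hugging Face `transformers` and `sentencepiece` packages are installed; the three-way label count and the example sentence are hypothetical placeholders, not part of the official evaluation code.

```python
# Illustrative sketch (not the official evaluation script): preparing a
# classification input under the max_seq_length=128 setting noted above.
# Assumptions: `transformers`, `sentencepiece`, and PyTorch are installed;
# num_labels=3 is a hypothetical placeholder for a 3-way classification task.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModelForSequenceClassification.from_pretrained(
    "ai4bharat/indic-bert", num_labels=3
)

# Truncate/pad every example to 128 tokens, mirroring the evaluation setup.
inputs = tokenizer(
    "यह एक उदाहरण वाक्य है।",  # hypothetical Hindi example sentence
    truncation=True,
    padding="max_length",
    max_length=128,
    return_tensors="pt",
)
logits = model(**inputs).logits  # shape: (1, 3), one score per label
```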


## Downloads

The model can be downloaded [here](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/models/indic-bert-v1.tar.gz). Both TensorFlow checkpoints and PyTorch binaries are included in the archive. Alternatively, you can also download it from [Hugging Face](https://huggingface.co/ai4bharat/indic-bert).
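
As a quick sanity check, the model can also be loaded directly from the Hugging Face Hub. Below is a minimal sketch, assuming the `transformers` and `sentencepiece` packages and PyTorch are installed; the example sentence is arbitrary.

```python
# Minimal sketch: load IndicBERT from the Hugging Face Hub and extract
# contextual embeddings for a sentence.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

inputs = tokenizer("भारत एक विशाल देश है।", return_tensors="pt")
outputs = model(**inputs)

# One contextual vector per input token: (batch, seq_len, hidden_size).
print(outputs.last_hidden_state.shape)
```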


## Citing

If you are using any of the resources, please cite the following paper:

```
@inproceedings{kakwani2020indicnlpsuite,
  title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
  author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
  year={2020},
  booktitle={Findings of EMNLP},
}
```

We would like to hear from you if:

- You are using our resources. Please let us know how you are putting them to use.
- You have any feedback on these resources.


## License

The IndicBERT code and models are released under the MIT License.

## Contributors

- Divyanshu Kakwani
- Anoop Kunchukuttan
- Gokul NC
- Satish Golla
- Avik Bhattacharyya
- Mitesh Khapra
- Pratyush Kumar

This work is the outcome of a volunteer effort as part of the [AI4Bharat initiative](https://ai4bharat.org).


## Contact

- Anoop Kunchukuttan ([[email protected]](mailto:[email protected]))