Commit 35fd3d6 (parent 38f01df): Add model card for ai4bharat/indic-bert (#8464). 1 file changed, 118 additions, 0 deletions.
---
language: en
license: mit
datasets:
- AI4Bharat IndicNLP Corpora
---

# IndicBERT

IndicBERT is a multilingual ALBERT model pretrained exclusively on 12 major Indian languages. It is pre-trained on our novel monolingual corpus of around 9 billion tokens and subsequently evaluated on a diverse set of tasks. IndicBERT has far fewer parameters than other multilingual models (mBERT, XLM-R, etc.) while achieving performance on par with or better than these models.

The 12 languages covered by IndicBERT are: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.

The code can be found [here](https://github.com/divkakwani/indic-bert). For more information, check out our [project page](https://indicnlp.ai4bharat.org/) or our [paper](https://indicnlp.ai4bharat.org/papers/arxiv2020_indicnlp_corpus.pdf).

## Pretraining Corpus

We pre-trained indic-bert on AI4Bharat's monolingual corpus. The corpus has the following distribution of languages:

| Language          | as     | bn     | en     | gu     | hi     | kn     |         |
| ----------------- | ------ | ------ | ------ | ------ | ------ | ------ | ------- |
| **No. of Tokens** | 36.9M  | 815M   | 1.34B  | 724M   | 1.84B  | 712M   |         |
| **Language**      | **ml** | **mr** | **or** | **pa** | **ta** | **te** | **all** |
| **No. of Tokens** | 767M   | 560M   | 104M   | 814M   | 549M   | 671M   | 8.9B    |

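As a quick sanity check, the per-language counts in the table above add up to the stated 8.9B total (and the "around 9 billion tokens" mentioned in the introduction). The figures below are transcribed from the table, in millions of tokens:

```python
# Per-language token counts from the corpus table, in millions.
counts_millions = {
    "as": 36.9, "bn": 815, "en": 1340, "gu": 724, "hi": 1840, "kn": 712,
    "ml": 767, "mr": 560, "or": 104, "pa": 814, "ta": 549, "te": 671,
}

# Sum and convert millions -> billions.
total_billions = sum(counts_millions.values()) / 1000
print(f"total: {total_billions:.1f}B tokens")  # total: 8.9B tokens
```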
## Evaluation Results

IndicBERT is evaluated on IndicGLUE and some additional tasks. The results are summarized below. For more details about the tasks, refer to our [official repo](https://github.com/divkakwani/indic-bert).

#### IndicGLUE

Task | mBERT | XLM-R | IndicBERT
-----| ----- | ----- | ------
News Article Headline Prediction | 89.58 | 95.52 | **95.87**
Wikipedia Section Title Prediction | **73.66** | 66.33 | 73.31
Cloze-style multiple-choice QA | 39.16 | 27.98 | **41.87**
Article Genre Classification | 90.63 | 97.03 | **97.34**
Named Entity Recognition (F1-score) | **73.24** | 65.93 | 64.47
Cross-Lingual Sentence Retrieval Task | 21.46 | 13.74 | **27.12**
Average | 64.62 | 61.09 | **66.66**

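The Average row can be reproduced directly from the six per-task scores; a short check with the figures transcribed from the IndicGLUE table:

```python
# Per-task IndicGLUE scores, in table order (headline prediction, section
# title prediction, cloze QA, genre classification, NER, retrieval).
scores = {
    "mBERT":     [89.58, 73.66, 39.16, 90.63, 73.24, 21.46],
    "XLM-R":     [95.52, 66.33, 27.98, 97.03, 65.93, 13.74],
    "IndicBERT": [95.87, 73.31, 41.87, 97.34, 64.47, 27.12],
}

# Unweighted mean over the six tasks, rounded as in the table.
averages = {m: round(sum(v) / len(v), 2) for m, v in scores.items()}
print(averages)  # {'mBERT': 64.62, 'XLM-R': 61.09, 'IndicBERT': 66.66}
```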
#### Additional Tasks

Task | Task Type | mBERT | XLM-R | IndicBERT
-----| ----- | ----- | ------ | -----
BBC News Classification | Genre Classification | 60.55 | **75.52** | 74.60
IIT Product Reviews | Sentiment Analysis | 74.57 | **78.97** | 71.32
IITP Movie Reviews | Sentiment Analysis | 56.77 | **61.61** | 59.03
Soham News Article | Genre Classification | 80.23 | **87.6** | 78.45
Midas Discourse | Discourse Analysis | 71.20 | **79.94** | 78.44
iNLTK Headlines Classification | Genre Classification | 87.95 | 93.38 | **94.52**
ACTSA Sentiment Analysis | Sentiment Analysis | 48.53 | 59.33 | **61.18**
Winograd NLI | Natural Language Inference | 56.34 | 55.87 | **56.34**
Choice of Plausible Alternative (COPA) | Natural Language Inference | 54.92 | 51.13 | **58.33**
Amrita Exact Paraphrase | Paraphrase Detection | **93.81** | 93.02 | 93.75
Amrita Rough Paraphrase | Paraphrase Detection | 83.38 | 82.20 | **84.33**
Average | | 69.84 | **74.42** | 73.66

\* Note: all models have been restricted to a max_seq_length of 128.

## Downloads

The model can be downloaded [here](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/models/indic-bert-v1.tar.gz). Both TensorFlow checkpoints and PyTorch binaries are included in the archive. Alternatively, you can also download it from [Hugging Face](https://huggingface.co/ai4bharat/indic-bert).

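A minimal sketch of loading the model through the Hugging Face `transformers` library, assuming `transformers` and `sentencepiece` are installed (the import is done lazily inside the function so the sketch stays importable without them):

```python
MODEL_ID = "ai4bharat/indic-bert"

def load_indic_bert(model_id: str = MODEL_ID):
    """Download (or read from the local cache) the IndicBERT tokenizer
    and encoder from the Hugging Face hub."""
    # Lazy import: requires `pip install transformers sentencepiece`.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model
```

Calling `tokenizer, model = load_indic_bert()` fetches the weights on first use; subsequent calls read from the local cache.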
## Citing

If you are using any of the resources, please cite the following article:

```
@inproceedings{kakwani2020indicnlpsuite,
    title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
    author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    booktitle={Findings of EMNLP},
}
```

We would like to hear from you if:

- You are using our resources. Please let us know how you are putting these resources to use.
- You have any feedback on these resources.

## License

The IndicBERT code and models are released under the MIT License.

## Contributors

- Divyanshu Kakwani
- Anoop Kunchukuttan
- Gokul NC
- Satish Golla
- Avik Bhattacharyya
- Mitesh Khapra
- Pratyush Kumar

This work is the outcome of a volunteer effort as part of the [AI4Bharat initiative](https://ai4bharat.org).

## Contact

- Anoop Kunchukuttan ([[email protected]](mailto:[email protected]))
- Mitesh Khapra ([[email protected]](mailto:[email protected]))
- Pratyush Kumar ([[email protected]](mailto:[email protected]))
