23 changes: 20 additions & 3 deletions datasets/cc_news/README.md
@@ -63,7 +63,7 @@ It represents a small portion of the English language subset of the CC-News data

### Supported Tasks and Leaderboards

[N/A]
CC-News has been mostly used for language model training.

### Languages

@@ -113,7 +113,9 @@ CC-News dataset has only the training set, i.e. it has to be loaded with `train`

## Dataset Creation

CC-News has been mostly used for language model training.
### Curation Rationale

[More Information Needed]

### Source Data

@@ -176,4 +178,19 @@ The purpose of this dataset is to help language model researchers develop better

### Citation Information

[More Information Needed]
```
@InProceedings{Hamborg2017,
author = {Hamborg, Felix and Meuschke, Norman and Breitinger, Corinna and Gipp, Bela},
title = {news-please: A Generic News Crawler and Extractor},
year = {2017},
booktitle = {Proceedings of the 15th International Symposium of Information Science},
location = {Berlin},
doi = {10.5281/zenodo.4120316},
pages = {218--223},
month = {March}
}
```

### Contributions

Thanks to [@vblagoje](https://github.com/vblagoje) for adding this dataset.
10 changes: 9 additions & 1 deletion datasets/conllpp/README.md
@@ -186,7 +186,15 @@ The data fields are the same among all splits.

### Citation Information

[More Information Needed]
```
@inproceedings{wang2019crossweigh,
title={CrossWeigh: Training Named Entity Tagger from Imperfect Annotations},
author={Wang, Zihan and Shang, Jingbo and Liu, Liyuan and Lu, Lihao and Liu, Jiacheng and Han, Jiawei},
booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
pages={5157--5166},
year={2019}
}
```

### Contributions

2 changes: 1 addition & 1 deletion datasets/datacommons_factcheck/README.md
@@ -163,7 +163,7 @@ All fact checked items are released under a `CC-BY-NC-4.0` License.

### Citation Information

[More Information Needed]
Data Commons 2020, Fact Checks, electronic dataset, Data Commons, viewed 16 Dec 2020, <https://datacommons.org>.

### Contributions

14 changes: 13 additions & 1 deletion datasets/gnad10/README.md
@@ -150,7 +150,19 @@ This dataset is licensed under the Creative Commons Attribution-NonCommercial-Sh

### Citation Information

[More Information Needed]
Please consider citing the authors of the "One Million Post Corpus" if you use the dataset:
```
@InProceedings{Schabus2017,
Author = {Dietmar Schabus and Marcin Skowron and Martin Trapp},
Title = {One Million Posts: A Data Set of German Online Discussions},
Booktitle = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)},
Pages = {1241--1244},
Year = {2017},
Address = {Tokyo, Japan},
Doi = {10.1145/3077136.3080711},
Month = aug
}
```

### Contributions

9 changes: 8 additions & 1 deletion datasets/id_panl_bppt/README.md
@@ -161,7 +161,14 @@ The dataset is splitted in to train, validation and test sets.

### Citation Information

[More Information Needed]
```
@inproceedings{id_panl_bppt,
author = {PAN Localization - BPPT},
title = {Parallel Text Corpora, English Indonesian},
year = {2009},
url = {http://digilib.bppt.go.id/sampul/p92-budiono.pdf},
}
```

### Contributions

2 changes: 1 addition & 1 deletion datasets/jigsaw_toxicity_pred/README.md
@@ -202,7 +202,7 @@ The "Toxic Comment Classification" dataset is released under [CC0], with the und

### Citation Information

[More Information Needed]
No citation information.

### Contributions

9 changes: 8 additions & 1 deletion datasets/kinnews_kirnews/README.md
@@ -183,7 +183,14 @@ Lang| Train | Test |

### Citation Information

[More Information Needed]
```
@article{niyongabo2020kinnews,
title={KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi},
author={Niyongabo, Rubungo Andre and Qu, Hong and Kreutzer, Julia and Huang, Li},
journal={arXiv preprint arXiv:2010.12174},
year={2020}
}
```

### Contributions

2 changes: 1 addition & 1 deletion datasets/kor_sarcasm/README.md
@@ -143,7 +143,7 @@ This dataset is licensed under the MIT License.

### Citation Information

[More Information Needed]
Unknown citation information: https://github.com/SpellOnYou/korean-sarcasm

### Contributions

2 changes: 1 addition & 1 deletion datasets/makhzan/README.md
@@ -222,7 +222,7 @@ Zeerak Ahmed

### Citation Information

[More Information Needed]
No citation information.

### Contributions

10 changes: 8 additions & 2 deletions datasets/reasoning_bg/README.md
@@ -170,8 +170,14 @@ Data has been sourced from the matriculation exams and online quizzes.

### Citation Information

[Needs More Information]

```
@article{hardalov2019beyond,
title={Beyond English-Only Reading Comprehension: Experiments in Zero-Shot Multilingual Transfer for Bulgarian},
author={Hardalov, Momchil and Koychev, Ivan and Nakov, Preslav},
journal={arXiv preprint arXiv:1908.01519},
year={2019}
}
```

### Contributions

2 changes: 1 addition & 1 deletion datasets/ro_sts/README.md
@@ -145,7 +145,7 @@ CC BY-SA 4.0 License

### Citation Information

[Needs More Information]
Reference coming soon: https://github.com/dumitrescustefan/RO-STS#citation

### Contributions

2 changes: 1 addition & 1 deletion datasets/ro_sts_parallel/README.md
@@ -142,7 +142,7 @@ CC BY-SA 4.0 License

### Citation Information

[Needs More Information]
Reference coming soon: https://github.com/dumitrescustefan/RO-STS#citation

### Contributions

9 changes: 8 additions & 1 deletion datasets/sanskrit_classic/README.md
@@ -141,7 +141,14 @@ Sanskrit

### Citation Information

[More Information Needed]
```
@Misc{johnsonetal2014,
author = {Johnson, Kyle P. and Patrick Burns and John Stewart and Todd Cook},
title = {CLTK: The Classical Language Toolkit},
url = {https://github.com/cltk/cltk},
year = {2014--2020},
}
```

### Contributions

8 changes: 7 additions & 1 deletion datasets/telugu_news/README.md
@@ -151,7 +151,13 @@ Sudalai Rajkumar, Anusha Motamarri

### Citation Information

[More Information Needed]
```
@InProceedings{kaggle:dataset,
title = {Telugu News - Natural Language Processing for Indian Languages},
author = {Sudalai Rajkumar and Anusha Motamarri},
year={2019}
}
```

### Contributions

4 changes: 3 additions & 1 deletion datasets/thaiqa_squad/README.md
@@ -162,7 +162,9 @@ CC-BY-NC-SA 3.0

### Citation Information

[More Information Needed]
No clear citation guidelines from source: https://aiforthai.in.th/corpus.php

SQuAD version: https://github.com/PyThaiNLP/thaiqa_squad

### Contributions

10 changes: 9 additions & 1 deletion datasets/wiki_movies/README.md
@@ -156,7 +156,15 @@ WikiMovies was built with the following goals in mind: (i) machine learning tech

### Citation Information

[More Information Needed]
```
@misc{miller2016keyvalue,
title={Key-Value Memory Networks for Directly Reading Documents},
author={Alexander Miller and Adam Fisch and Jesse Dodge and Amir-Hossein Karimi and Antoine Bordes and Jason Weston},
year={2016},
eprint={1606.03126},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

### Contributions
