huggingface · albertvillanova · Aug 24, 2022 · Aug 23, 2022
diff --git a/datasets/cc_news/README.md b/datasets/cc_news/README.md
@@ -63,7 +63,7 @@ It represents a small portion of the English language subset of the CC-News data
 
 ### Supported Tasks and Leaderboards
 
-[N/A]
+CC-News has been mostly used for language model training.
 
 ### Languages
 
@@ -113,7 +113,9 @@ CC-News dataset has only the training set, i.e. it has to be loaded with `train`
 
 ## Dataset Creation
 
-CC-News has been mostly used for language model training.
+### Curation Rationale
+
+[More Information Needed]
 
 ### Source Data
 
@@ -176,4 +178,19 @@ The purpose of this dataset is to help language model researchers develop better
 
 ### Citation Information
 
-[More Information Needed]
+```
+@InProceedings{Hamborg2017,
+  author     = {Hamborg, Felix and Meuschke, Norman and Breitinger, Corinna and Gipp, Bela},
+  title      = {news-please: A Generic News Crawler and Extractor},
+  year       = {2017},
+  booktitle  = {Proceedings of the 15th International Symposium of Information Science},
+  location   = {Berlin},
+  doi        = {10.5281/zenodo.4120316},
+  pages      = {218--223},
+  month      = {March}
+}
+```
+
+### Contributions
+
+Thanks to [@vblagoje](https://github.com/vblagoje) for adding this dataset.
diff --git a/datasets/conllpp/README.md b/datasets/conllpp/README.md
@@ -186,7 +186,15 @@ The data fields are the same among all splits.
 
 ### Citation Information
 
-[More Information Needed]
+```
+@inproceedings{wang2019crossweigh,
+  title={CrossWeigh: Training Named Entity Tagger from Imperfect Annotations},
+  author={Wang, Zihan and Shang, Jingbo and Liu, Liyuan and Lu, Lihao and Liu, Jiacheng and Han, Jiawei},
+  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
+  pages={5157--5166},
+  year={2019}
+}
+```
 
 ### Contributions
 

diff --git a/datasets/datacommons_factcheck/README.md b/datasets/datacommons_factcheck/README.md
@@ -163,7 +163,7 @@ All fact checked items are released under a `CC-BY-NC-4.0` License.
 
 ### Citation Information
 
-[More Information Needed]
+Data Commons 2020, Fact Checks, electronic dataset, Data Commons, viewed 16 Dec 2020, <https://datacommons.org>.
 
 ### Contributions
 

diff --git a/datasets/gnad10/README.md b/datasets/gnad10/README.md
@@ -150,7 +150,19 @@ This dataset is licensed under the Creative Commons Attribution-NonCommercial-Sh
 
 ### Citation Information
 
-[More Information Needed]
+Please consider citing the authors of the "One Million Posts Corpus" if you use the dataset:
+```
+@InProceedings{Schabus2017,
+  Author    = {Dietmar Schabus and Marcin Skowron and Martin Trapp},
+  Title     = {One Million Posts: A Data Set of German Online Discussions},
+  Booktitle = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)},
+  Pages     = {1241--1244},
+  Year      = {2017},
+  Address   = {Tokyo, Japan},
+  Doi       = {10.1145/3077136.3080711},
+  Month     = aug
+}
+```
 
 ### Contributions
 

diff --git a/datasets/id_panl_bppt/README.md b/datasets/id_panl_bppt/README.md
@@ -161,7 +161,14 @@ The dataset is splitted in to train, validation and test sets.
 
 ### Citation Information
 
-[More Information Needed]
+```
+@inproceedings{id_panl_bppt,
+  author    = {PAN Localization - BPPT},
+  title     = {Parallel Text Corpora, English Indonesian},
+  year      = {2009},
+  url       = {http://digilib.bppt.go.id/sampul/p92-budiono.pdf},
+}
+```
 
 ### Contributions
 

diff --git a/datasets/jigsaw_toxicity_pred/README.md b/datasets/jigsaw_toxicity_pred/README.md
@@ -202,7 +202,7 @@ The "Toxic Comment Classification" dataset is released under [CC0], with the und
 
 ### Citation Information
 
-[More Information Needed]
+No citation information.
 
 ### Contributions
 

diff --git a/datasets/kinnews_kirnews/README.md b/datasets/kinnews_kirnews/README.md
@@ -183,7 +183,14 @@ Lang| Train | Test |
 
 ### Citation Information
 
-[More Information Needed]
+```
+@article{niyongabo2020kinnews,
+  title={KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi},
+  author={Niyongabo, Rubungo Andre and Qu, Hong and Kreutzer, Julia and Huang, Li},
+  journal={arXiv preprint arXiv:2010.12174},
+  year={2020}
+}
+```
 
 ### Contributions
 

diff --git a/datasets/kor_sarcasm/README.md b/datasets/kor_sarcasm/README.md
@@ -143,7 +143,7 @@ This dataset is licensed under the MIT License.
 
 ### Citation Information
 
-[More Information Needed]
+Unknown citation information: https://github.com/SpellOnYou/korean-sarcasm
 
 ### Contributions
 

diff --git a/datasets/makhzan/README.md b/datasets/makhzan/README.md
@@ -222,7 +222,7 @@ Zeerak Ahmed
 
 ### Citation Information
 
-[More Information Needed]
+No citation information.
 
 ### Contributions
 

diff --git a/datasets/reasoning_bg/README.md b/datasets/reasoning_bg/README.md
@@ -170,8 +170,14 @@ Data has been sourced from the matriculation exams and online quizzes.
 
 ### Citation Information
 
-[Needs More Information]
-
+```
+@article{hardalov2019beyond,
+  title={Beyond English-only reading comprehension: Experiments in zero-shot multilingual transfer for Bulgarian},
+  author={Hardalov, Momchil and Koychev, Ivan and Nakov, Preslav},
+  journal={arXiv preprint arXiv:1908.01519},
+  year={2019}
+}
+```
 
 ### Contributions
 

diff --git a/datasets/ro_sts/README.md b/datasets/ro_sts/README.md
@@ -145,7 +145,7 @@ CC BY-SA 4.0 License
 
 ### Citation Information
 
-[Needs More Information]
+Reference coming soon: https://github.com/dumitrescustefan/RO-STS#citation
 
 ### Contributions
 

diff --git a/datasets/ro_sts_parallel/README.md b/datasets/ro_sts_parallel/README.md
@@ -142,7 +142,7 @@ CC BY-SA 4.0 License
 
 ### Citation Information
 
-[Needs More Information]
+Reference coming soon: https://github.com/dumitrescustefan/RO-STS#citation
 
 ### Contributions
 

diff --git a/datasets/sanskrit_classic/README.md b/datasets/sanskrit_classic/README.md
@@ -141,7 +141,14 @@ Sanskrit
 
 ### Citation Information
 
-[More Information Needed]
+```
+@Misc{johnsonetal2014,
+ author = {Johnson, Kyle P. and Patrick Burns and John Stewart and Todd Cook},
+ title = {CLTK: The Classical Language Toolkit},
+ url = {https://github.com/cltk/cltk},
+ year = {2014--2020},
+}
+```
 
 ### Contributions
 

diff --git a/datasets/telugu_news/README.md b/datasets/telugu_news/README.md
@@ -151,7 +151,13 @@ Sudalai Rajkumar, Anusha Motamarri
 
 ### Citation Information
 
-[More Information Needed]
+```
+@InProceedings{kaggle:dataset,
+title = {Telugu News - Natural Language Processing for Indian Languages},
+author={Sudalai Rajkumar and Anusha Motamarri},
+year={2019}
+}
+```
 
 ### Contributions
 

diff --git a/datasets/thaiqa_squad/README.md b/datasets/thaiqa_squad/README.md
@@ -162,7 +162,9 @@ CC-BY-NC-SA 3.0
 
 ### Citation Information
 
-[More Information Needed]
+No clear citation guidelines from source: https://aiforthai.in.th/corpus.php
+
+SQuAD version: https://github.com/PyThaiNLP/thaiqa_squad
 
 ### Contributions
 

diff --git a/datasets/wiki_movies/README.md b/datasets/wiki_movies/README.md
@@ -156,7 +156,16 @@ WikiMovies was built with the following goals in mind: (i) machine learning tech
 
 ### Citation Information
 
-[More Information Needed]
+```
+@misc{miller2016keyvalue,
+      title={Key-Value Memory Networks for Directly Reading Documents},
+      author={Alexander Miller and Adam Fisch and Jesse Dodge and Amir-Hossein Karimi and Antoine Bordes and Jason Weston},
+      year={2016},
+      eprint={1606.03126},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```
 
 ### Contributions