From 935abf3d04b4973bf10900a727db59626b426ea4 Mon Sep 17 00:00:00 2001
From: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue, 23 Aug 2022 19:07:18 +0200
Subject: [PATCH] Fix Citation Information section in dataset cards

---
 datasets/cc_news/README.md               | 23 ++++++++++++++++++++---
 datasets/conllpp/README.md               | 10 +++++++++-
 datasets/datacommons_factcheck/README.md |  2 +-
 datasets/gnad10/README.md                | 14 +++++++++++++-
 datasets/id_panl_bppt/README.md          |  9 ++++++++-
 datasets/jigsaw_toxicity_pred/README.md  |  2 +-
 datasets/kinnews_kirnews/README.md       |  9 ++++++++-
 datasets/kor_sarcasm/README.md           |  2 +-
 datasets/makhzan/README.md               |  2 +-
 datasets/reasoning_bg/README.md          | 10 ++++++++--
 datasets/ro_sts/README.md                |  2 +-
 datasets/ro_sts_parallel/README.md       |  2 +-
 datasets/sanskrit_classic/README.md      |  9 ++++++++-
 datasets/telugu_news/README.md           |  8 +++++++-
 datasets/thaiqa_squad/README.md          |  4 +++-
 datasets/wiki_movies/README.md           | 10 +++++++++-
 16 files changed, 99 insertions(+), 19 deletions(-)

diff --git a/datasets/cc_news/README.md b/datasets/cc_news/README.md
index e49cf585923..7aad22d1766 100644
--- a/datasets/cc_news/README.md
+++ b/datasets/cc_news/README.md
@@ -63,7 +63,7 @@ It represents a small portion of the English language subset of the CC-News data

 ### Supported Tasks and Leaderboards

-[N/A]
+CC-News has been mostly used for language model training.

 ### Languages

@@ -113,7 +113,9 @@ CC-News dataset has only the training set, i.e. it has to be loaded with `train`

 ## Dataset Creation

-CC-News has been mostly used for language model training.
+### Curation Rationale
+
+[More Information Needed]

 ### Source Data

@@ -176,4 +178,19 @@ The purpose of this dataset is to help language model researchers develop better

 ### Citation Information

-[More Information Needed]
+```
+@InProceedings{Hamborg2017,
+  author = {Hamborg, Felix and Meuschke, Norman and Breitinger, Corinna and Gipp, Bela},
+  title = {news-please: A Generic News Crawler and Extractor},
+  year = {2017},
+  booktitle = {Proceedings of the 15th International Symposium of Information Science},
+  location = {Berlin},
+  doi = {10.5281/zenodo.4120316},
+  pages = {218--223},
+  month = {March}
+}
+```
+
+### Contributions
+
+Thanks to [@vblagoje](https://github.com/vblagoje) for adding this dataset.
diff --git a/datasets/conllpp/README.md b/datasets/conllpp/README.md
index 4c59fe19848..e1c46daea70 100644
--- a/datasets/conllpp/README.md
+++ b/datasets/conllpp/README.md
@@ -186,7 +186,15 @@ The data fields are the same among all splits.

 ### Citation Information

-[More Information Needed]
+```
+@inproceedings{wang2019crossweigh,
+  title={CrossWeigh: Training Named Entity Tagger from Imperfect Annotations},
+  author={Wang, Zihan and Shang, Jingbo and Liu, Liyuan and Lu, Lihao and Liu, Jiacheng and Han, Jiawei},
+  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
+  pages={5157--5166},
+  year={2019}
+}
+```

 ### Contributions

diff --git a/datasets/datacommons_factcheck/README.md b/datasets/datacommons_factcheck/README.md
index 99bbcca0652..0858a6bf03f 100644
--- a/datasets/datacommons_factcheck/README.md
+++ b/datasets/datacommons_factcheck/README.md
@@ -163,7 +163,7 @@ All fact checked items are released under a `CC-BY-NC-4.0` License.

 ### Citation Information

-[More Information Needed]
+Data Commons 2020, Fact Checks, electronic dataset, Data Commons, viewed 16 Dec 2020, .

 ### Contributions

diff --git a/datasets/gnad10/README.md b/datasets/gnad10/README.md
index 092dbd493ac..1b865fdd54c 100644
--- a/datasets/gnad10/README.md
+++ b/datasets/gnad10/README.md
@@ -150,7 +150,19 @@ This dataset is licensed under the Creative Commons Attribution-NonCommercial-Sh

 ### Citation Information

-[More Information Needed]
+Please consider citing the authors of the "One Million Post Corpus" if you use the dataset.:
+```
+@InProceedings{Schabus2017,
+  Author = {Dietmar Schabus and Marcin Skowron and Martin Trapp},
+  Title = {One Million Posts: A Data Set of German Online Discussions},
+  Booktitle = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)},
+  Pages = {1241--1244},
+  Year = {2017},
+  Address = {Tokyo, Japan},
+  Doi = {10.1145/3077136.3080711},
+  Month = aug
+}
+```

 ### Contributions

diff --git a/datasets/id_panl_bppt/README.md b/datasets/id_panl_bppt/README.md
index 7b4258f085e..3e62358e513 100644
--- a/datasets/id_panl_bppt/README.md
+++ b/datasets/id_panl_bppt/README.md
@@ -161,7 +161,14 @@ The dataset is splitted in to train, validation and test sets.

 ### Citation Information

-[More Information Needed]
+```
+@inproceedings{id_panl_bppt,
+  author = {PAN Localization - BPPT},
+  title = {Parallel Text Corpora, English Indonesian},
+  year = {2009},
+  url = {http://digilib.bppt.go.id/sampul/p92-budiono.pdf},
+}
+```

 ### Contributions

diff --git a/datasets/jigsaw_toxicity_pred/README.md b/datasets/jigsaw_toxicity_pred/README.md
index 9d2ab0489f4..55d87a02df8 100644
--- a/datasets/jigsaw_toxicity_pred/README.md
+++ b/datasets/jigsaw_toxicity_pred/README.md
@@ -202,7 +202,7 @@ The "Toxic Comment Classification" dataset is released under [CC0], with the und

 ### Citation Information

-[More Information Needed]
+No citation information.

 ### Contributions

diff --git a/datasets/kinnews_kirnews/README.md b/datasets/kinnews_kirnews/README.md
index 4dcfc281526..3c69501f06d 100644
--- a/datasets/kinnews_kirnews/README.md
+++ b/datasets/kinnews_kirnews/README.md
@@ -183,7 +183,14 @@ Lang| Train | Test |

 ### Citation Information

-[More Information Needed]
+```
+@article{niyongabo2020kinnews,
+  title={KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi},
+  author={Niyongabo, Rubungo Andre and Qu, Hong and Kreutzer, Julia and Huang, Li},
+  journal={arXiv preprint arXiv:2010.12174},
+  year={2020}
+}
+```

 ### Contributions

diff --git a/datasets/kor_sarcasm/README.md b/datasets/kor_sarcasm/README.md
index 7e541e82c6c..b40a2740d60 100644
--- a/datasets/kor_sarcasm/README.md
+++ b/datasets/kor_sarcasm/README.md
@@ -143,7 +143,7 @@ This dataset is licensed under the MIT License.

 ### Citation Information

-[More Information Needed]
+Unknown citation information: https://github.com/SpellOnYou/korean-sarcasm

 ### Contributions

diff --git a/datasets/makhzan/README.md b/datasets/makhzan/README.md
index 2ffbaa53d57..ab84102a366 100644
--- a/datasets/makhzan/README.md
+++ b/datasets/makhzan/README.md
@@ -222,7 +222,7 @@ Zeerak Ahmed

 ### Citation Information

-[More Information Needed]
+No citation information.

 ### Contributions

diff --git a/datasets/reasoning_bg/README.md b/datasets/reasoning_bg/README.md
index 7dcf8ac451d..2f3200a403d 100644
--- a/datasets/reasoning_bg/README.md
+++ b/datasets/reasoning_bg/README.md
@@ -170,8 +170,14 @@ Data has been sourced from the matriculation exams and online quizzes.

 ### Citation Information

-[Needs More Information]
-
+```
+@article{hardalov2019beyond,
+  title={Beyond english-only reading comprehension: Experiments in zero-shot multilingual transfer for bulgarian},
+  author={Hardalov, Momchil and Koychev, Ivan and Nakov, Preslav},
+  journal={arXiv preprint arXiv:1908.01519},
+  year={2019}
+}
+```

 ### Contributions

diff --git a/datasets/ro_sts/README.md b/datasets/ro_sts/README.md
index 58b1e9da5dd..c89e5ce9efa 100644
--- a/datasets/ro_sts/README.md
+++ b/datasets/ro_sts/README.md
@@ -145,7 +145,7 @@ CC BY-SA 4.0 License

 ### Citation Information

-[Needs More Information]
+Reference coming soon: https://github.com/dumitrescustefan/RO-STS#citation

 ### Contributions

diff --git a/datasets/ro_sts_parallel/README.md b/datasets/ro_sts_parallel/README.md
index 847eae77c16..a50a2a771ff 100755
--- a/datasets/ro_sts_parallel/README.md
+++ b/datasets/ro_sts_parallel/README.md
@@ -142,7 +142,7 @@ CC BY-SA 4.0 License

 ### Citation Information

-[Needs More Information]
+Reference coming soon: https://github.com/dumitrescustefan/RO-STS#citation

 ### Contributions

diff --git a/datasets/sanskrit_classic/README.md b/datasets/sanskrit_classic/README.md
index 3619b96a9c0..625b3a3e2e2 100644
--- a/datasets/sanskrit_classic/README.md
+++ b/datasets/sanskrit_classic/README.md
@@ -141,7 +141,14 @@ Sanskrit

 ### Citation Information

-[More Information Needed]
+```
+@Misc{johnsonetal2014,
+  author = {Johnson, Kyle P. and Patrick Burns and John Stewart and Todd Cook},
+  title = {CLTK: The Classical Language Toolkit},
+  url = {https://github.com/cltk/cltk},
+  year = {2014--2020},
+}
+```

 ### Contributions

diff --git a/datasets/telugu_news/README.md b/datasets/telugu_news/README.md
index af7853480ea..4fb62781a02 100644
--- a/datasets/telugu_news/README.md
+++ b/datasets/telugu_news/README.md
@@ -151,7 +151,13 @@ Sudalai Rajkumar, Anusha Motamarri

 ### Citation Information

-[More Information Needed]
+```
+@InProceedings{kaggle:dataset,
+title = {Telugu News - Natural Language Processing for Indian Languages},
+authors={Sudalai Rajkumar, Anusha Motamarri},
+year={2019}
+}
+```

 ### Contributions

diff --git a/datasets/thaiqa_squad/README.md b/datasets/thaiqa_squad/README.md
index 4f38c9b0f28..c59b6b97a40 100644
--- a/datasets/thaiqa_squad/README.md
+++ b/datasets/thaiqa_squad/README.md
@@ -162,7 +162,9 @@ CC-BY-NC-SA 3.0

 ### Citation Information

-[More Information Needed]
+No clear citation guidelines from source: https://aiforthai.in.th/corpus.php
+
+SQuAD version: https://github.com/PyThaiNLP/thaiqa_squad

 ### Contributions

diff --git a/datasets/wiki_movies/README.md b/datasets/wiki_movies/README.md
index 33e5196d540..56c03e777ea 100644
--- a/datasets/wiki_movies/README.md
+++ b/datasets/wiki_movies/README.md
@@ -156,7 +156,15 @@ WikiMovies was built with the following goals in mind: (i) machine learning tech

 ### Citation Information

-[More Information Needed]
+```
+@misc{miller2016keyvalue,
+  title={Key-Value Memory Networks for Directly Reading Documents},
+  author={Alexander Miller and Adam Fisch and Jesse Dodge and Amir-Hossein Karimi and Antoine Bordes and Jason Weston},
+  year={2016},
+  eprint={1606.03126},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
+```

 ### Contributions