Commit adfbc76

Merge branch 'main' into dataset_infos-in-yaml

2 parents: 3c940ca + 50bc312


68 files changed: 2485 additions, 742 deletions

.github/hub/update_hub_repositories.py

Lines changed: 2 additions & 7 deletions
@@ -194,13 +194,8 @@ def __call__(self, dataset_name: str) -> bool:
         commit_args += (f"-m Commit from {DATASETS_LIB_COMMIT_URL.format(hexsha=current_commit.hexsha)}",)
         commit_args += (f"--author={author_name} <{author_email}>",)
 
-        for _tag in datasets_lib_repo.tags:
-            # Add a new tag if this is a `datasets` release
-            if _tag.commit == current_commit and re.match(r"^[0-9]+\.[0-9]+\.[0-9]+$", _tag.name):
-                new_tag = _tag
-                break
-        else:
-            new_tag = None
+        # we don't add a new tag as we used to when there's a release
+        new_tag = None
 
         changed_files_since_last_commit = [
             path
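The deleted block walked the library's tags looking for one that points at the current commit and is named like a bare semantic version. A standalone sketch of that release-detection logic, using simple namedtuples in place of GitPython tag refs (an assumption for illustration; the regex is the one from the removed code):

```python
import re
from collections import namedtuple

# Simplified stand-in for a GitPython tag ref (illustrative assumption).
Tag = namedtuple("Tag", ["name", "hexsha"])

# A tag counts as a `datasets` release only if its name is a bare X.Y.Z
# version: no "v" prefix, no "rc" suffix.
RELEASE_TAG_RE = re.compile(r"^[0-9]+\.[0-9]+\.[0-9]+$")

def find_release_tag(tags, current_hexsha):
    """Return the first tag on the current commit whose name is X.Y.Z, else None."""
    for tag in tags:
        if tag.hexsha == current_hexsha and RELEASE_TAG_RE.match(tag.name):
            return tag
    return None
```

The `for`/`else` in the original code expressed the same "first match or None" search; after this commit the hub-update script simply never tags.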

.github/workflows/ci.yml

Lines changed: 10 additions & 0 deletions
@@ -72,3 +72,13 @@ jobs:
       - name: Test with pytest
         run: |
           python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/
+      - name: Install dependencies to test torchaudio>=0.12 on Ubuntu
+        if: ${{ matrix.os == 'ubuntu-latest' }}
+        run: |
+          pip uninstall -y torchaudio torch
+          pip install "torchaudio>=0.12"
+          sudo apt-get -y install ffmpeg
+      - name: Test torchaudio>=0.12 on Ubuntu
+        if: ${{ matrix.os == 'ubuntu-latest' }}
+        run: |
+          python -m pytest -rfExX -m torchaudio_latest -n 2 --dist loadfile -sv ./tests/features/test_audio.py
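The new CI step selects tests with `pytest -m torchaudio_latest`. A hypothetical sketch of how a test in `tests/features/test_audio.py` would opt into that marker (the test name and body are illustrative assumptions, not quoted from the repo):

```python
import pytest

# Tests carrying this marker are the only ones collected by
# `pytest -m torchaudio_latest` in the new CI step.
@pytest.mark.torchaudio_latest
def test_mp3_decoding_with_latest_torchaudio():
    # importorskip skips (rather than fails) the test when torchaudio>=0.12
    # is not installed, so the suite stays green on other platforms.
    torchaudio = pytest.importorskip("torchaudio", minversion="0.12")
    assert hasattr(torchaudio, "load")
```

To avoid "unknown marker" warnings, the marker would also be registered under `markers` in the project's pytest configuration.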

datasets/aeslc/README.md

Lines changed: 29 additions & 15 deletions
@@ -1,15 +1,27 @@
 ---
+annotations_creators:
+- crowdsourced
 language:
 - en
-paperswithcode_id: aeslc
-pretty_name: AESLC
+language_creators:
+- found
+license:
+- unknown
+multilinguality:
+- monolingual
+pretty_name: "AESLC: Annotated Enron Subject Line Corpus"
+size_categories:
+- 10K<n<100K
+source_datasets:
+- original
 task_categories:
 - summarization
 task_ids:
 - summarization-other-email-headline-generation
 - summarization-other-conversations-summarization
 - summarization-other-multi-document-summarization
 - summarization-other-aspect-based-summarization
+paperswithcode_id: aeslc
 ---
 
 # Dataset Card for "aeslc"

@@ -40,9 +52,9 @@ task_ids:
 
 ## Dataset Description
 
-- **Homepage:** [https://github.com/ryanzhumich/AESLC](https://github.com/ryanzhumich/AESLC)
-- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
-- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+- **Homepage:**
+- **Repository:** https://github.com/ryanzhumich/AESLC
+- **Paper:** [This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation](https://arxiv.org/abs/1906.03497)
 - **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
 - **Size of downloaded dataset files:** 11.10 MB
 - **Size of the generated dataset:** 14.26 MB

@@ -153,19 +165,21 @@ The data fields are the same among all splits.
 ### Citation Information
 
 ```
-
-@misc{zhang2019email,
-    title={This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation},
-    author={Rui Zhang and Joel Tetreault},
-    year={2019},
-    eprint={1906.03497},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
+@inproceedings{zhang-tetreault-2019-email,
+    title = "This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation",
+    author = "Zhang, Rui and
+      Tetreault, Joel",
+    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
+    month = jul,
+    year = "2019",
+    address = "Florence, Italy",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/P19-1043",
+    doi = "10.18653/v1/P19-1043",
+    pages = "446--456",
 }
-
 ```
 
-
 ### Contributions
 
 Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun) for adding this dataset.
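The commit adds the same set of tag fields to each card's YAML front matter. A minimal stdlib-only sketch of checking a card for those fields (a real validator would use a YAML parser; this relies only on the flat `key:` / `- value` layout these cards use, and the field list is taken from this diff):

```python
# Top-level tag fields this commit adds to the dataset cards.
REQUIRED = {
    "annotations_creators", "language", "language_creators", "license",
    "multilinguality", "pretty_name", "size_categories", "source_datasets",
    "task_categories", "task_ids",
}

def frontmatter_keys(readme_text):
    """Return the set of top-level keys between the opening and closing `---`."""
    lines = readme_text.splitlines()
    assert lines[0].strip() == "---", "card must start with YAML front matter"
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":  # end of front matter
            break
        # Top-level keys are unindented `key:` lines; skip `- value` items.
        if line and not line.startswith((" ", "-", "#")):
            keys.add(line.split(":", 1)[0].strip())
    return keys

def missing_fields(readme_text):
    return REQUIRED - frontmatter_keys(readme_text)
```

Running `missing_fields` on a card before this commit would report most of the new tags as absent; after it, the set is empty.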

datasets/amazon_us_reviews/README.md

Lines changed: 47 additions & 6 deletions
@@ -1,8 +1,32 @@
 ---
+annotations_creators:
+- no-annotation
 language:
 - en
+language_creators:
+- found
+license:
+- other
+multilinguality:
+- monolingual
+pretty_name: Amazon US Reviews
+size_categories:
+- 100M<n<1B
+source_datasets:
+- original
+task_categories:
+- summarization
+- text-generation
+- fill-mask
+- text-classification
+task_ids:
+- text-scoring
+- language-modeling
+- masked-language-modeling
+- sentiment-classification
+- sentiment-scoring
+- topic-classification
 paperswithcode_id: null
-pretty_name: AmazonUsReviews
 ---
 
 # Dataset Card for "amazon_us_reviews"

@@ -407,14 +431,31 @@ The data fields are the same among all splits.
 
 ### Licensing Information
 
-[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+https://s3.amazonaws.com/amazon-reviews-pds/LICENSE.txt
+
+By accessing the Amazon Customer Reviews Library ("Reviews Library"), you agree that the
+Reviews Library is an Amazon Service subject to the [Amazon.com Conditions of Use](https://www.amazon.com/gp/help/customer/display.html/ref=footer_cou?ie=UTF8&nodeId=508088)
+and you agree to be bound by them, with the following additional conditions:
+
+In addition to the license rights granted under the Conditions of Use,
+Amazon or its content providers grant you a limited, non-exclusive, non-transferable,
+non-sublicensable, revocable license to access and use the Reviews Library
+for purposes of academic research.
+You may not resell, republish, or make any commercial use of the Reviews Library
+or its contents, including use of the Reviews Library for commercial research,
+such as research related to a funding or consultancy contract, internship, or
+other relationship in which the results are provided for a fee or delivered
+to a for-profit organization. You may not (a) link or associate content
+in the Reviews Library with any personal information (including Amazon customer accounts),
+or (b) attempt to determine the identity of the author of any content in the
+Reviews Library.
+If you violate any of the foregoing conditions, your license to access and use the
+Reviews Library will automatically terminate without prejudice to any of the
+other rights or remedies Amazon may have.
 
 ### Citation Information
 
-```
-
-```
-
+No citation information.
 
 ### Contributions
 
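Tags like `100M<n<1B` in `size_categories` encode a range for the number of examples. A hedged sketch of turning such a tag into numeric bounds (the K/M/B suffix multipliers are an assumption based on how these tags read, not defined in this diff):

```python
import re

# Suffix multipliers assumed from common usage of these tags.
_MULT = {"": 1, "K": 10**3, "M": 10**6, "B": 10**9}
_BOUND = re.compile(r"^(\d+)([KMB]?)$")

def _to_int(bound):
    """Convert a suffixed bound like '100M' to an int."""
    m = _BOUND.match(bound)
    if not m:
        raise ValueError(f"bad bound: {bound!r}")
    return int(m.group(1)) * _MULT[m.group(2)]

def parse_size_category(tag):
    """Parse a tag of the form '<lower><n<<upper>' into a pair of ints."""
    lower, upper = tag.split("<n<")
    return _to_int(lower), _to_int(upper)
```

So `100M<n<1B` parses to bounds of one hundred million and one billion examples, consistent with a corpus of Amazon reviews.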
datasets/art/README.md

Lines changed: 30 additions & 19 deletions
@@ -1,8 +1,26 @@
 ---
+annotations_creators:
+- crowdsourced
 language:
 - en
-paperswithcode_id: art-dataset
+language_creators:
+- found
+license:
+- unknown
+multilinguality:
+- monolingual
 pretty_name: Abductive Reasoning in narrative Text
+size_categories:
+- 100K<n<1M
+source_datasets:
+- original
+task_categories:
+- multiple-choice
+- text-classification
+task_ids:
+- natural-language-inference
+- text-classification-other-abductive-natural-language-inference
+paperswithcode_id: art-dataset
 ---
 
 # Dataset Card for "art"

@@ -34,16 +52,18 @@ pretty_name: Abductive Reasoning in narrative Text
 ## Dataset Description
 
 - **Homepage:** [https://leaderboard.allenai.org/anli/submissions/get-started](https://leaderboard.allenai.org/anli/submissions/get-started)
-- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
-- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+- **Repository:** https://github.com/allenai/abductive-commonsense-reasoning
+- **Paper:** [Abductive Commonsense Reasoning](https://arxiv.org/abs/1908.05739)
 - **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
 - **Size of downloaded dataset files:** 4.88 MB
 - **Size of the generated dataset:** 32.77 MB
 - **Total amount of disk used:** 37.65 MB
 
 ### Dataset Summary
 
-the Abductive Natural Language Inference Dataset from AI2
+ART consists of over 20k commonsense narrative contexts and 200k explanations.
+
+The Abductive Natural Language Inference Dataset from AI2.
 
 ### Supported Tasks and Leaderboards
 

@@ -55,8 +75,6 @@ the Abductive Natural Language Inference Dataset from AI2
 
 ## Dataset Structure
 
-`
-
 ### Data Instances
 
 #### anli

@@ -150,22 +168,15 @@ The data fields are the same among all splits.
 ### Citation Information
 
 ```
-@InProceedings{anli,
-    author = "Chandra, Bhagavatula
-        and Ronan, Le Bras
-        and Chaitanya, Malaviya
-        and Keisuke, Sakaguchi
-        and Ari, Holtzman
-        and Hannah, Rashkin
-        and Doug, Downey
-        and Scott, Wen-tau Yih
-        and Yejin, Choi",
-    title = "Abductive Commonsense Reasoning",
-    year = "2020",
+@inproceedings{Bhagavatula2020Abductive,
+    title={Abductive Commonsense Reasoning},
+    author={Chandra Bhagavatula and Ronan Le Bras and Chaitanya Malaviya and Keisuke Sakaguchi and Ari Holtzman and Hannah Rashkin and Doug Downey and Wen-tau Yih and Yejin Choi},
+    booktitle={International Conference on Learning Representations},
+    year={2020},
+    url={https://openreview.net/forum?id=Byg1v1HKDB}
 }
 ```
 
-
 ### Contributions
 
 Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@thomwolf](https://github.com/thomwolf), [@mariamabarham](https://github.com/mariamabarham), [@lewtun](https://github.com/lewtun), [@lhoestq](https://github.com/lhoestq) for adding this dataset.
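The abductive NLI task this card tags (`natural-language-inference`, `multiple-choice`) asks which of two hypotheses better explains a pair of observations. A hypothetical sketch of one instance; the field names and example text are illustrative assumptions, not quoted from this commit:

```python
# One illustrative abductive-NLI instance: given observations O1 and O2,
# choose the hypothesis that better explains what happened in between.
example = {
    "observation_1": "Dotty was being very grumpy.",
    "observation_2": "Dotty felt much better afterwards.",
    "hypothesis_1": "Dotty ate something bad.",
    "hypothesis_2": "Dotty called her friend to talk about her feelings.",
    "label": 2,  # hypothesis_2 is the more plausible explanation
}

def plausible_hypothesis(ex):
    """Return the text of the hypothesis the label points at."""
    return ex["hypothesis_1"] if ex["label"] == 1 else ex["hypothesis_2"]
```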

datasets/discofuse/README.md

Lines changed: 21 additions & 8 deletions
@@ -1,8 +1,24 @@
 ---
+annotations_creators:
+- machine-generated
 language:
 - en
-paperswithcode_id: discofuse
+language_creators:
+- found
+license:
+- cc-by-sa-3.0
+multilinguality:
+- monolingual
 pretty_name: DiscoFuse
+size_categories:
+- 10M<n<100M
+source_datasets:
+- original
+task_categories:
+- text2text-generation
+task_ids:
+- text2text-generation-other-sentence-fusion
+paperswithcode_id: discofuse
 ---
 
 # Dataset Card for "discofuse"

@@ -33,17 +49,16 @@ pretty_name: DiscoFuse
 
 ## Dataset Description
 
-- **Homepage:** [https://github.com/google-research-datasets/discofuse](https://github.com/google-research-datasets/discofuse)
-- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
-- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+- **Repository:** https://github.com/google-research-datasets/discofuse
+- **Paper:** [DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion](https://arxiv.org/abs/1902.10526)
 - **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
 - **Size of downloaded dataset files:** 5764.06 MB
 - **Size of the generated dataset:** 20547.64 MB
 - **Total amount of disk used:** 26311.70 MB
 
 ### Dataset Summary
 
-DISCOFUSE is a large scale dataset for discourse-based sentence fusion.
+DiscoFuse is a large scale dataset for discourse-based sentence fusion.
 
 ### Supported Tasks and Leaderboards
 

@@ -180,7 +195,7 @@ The data fields are the same among all splits.
 
 ### Licensing Information
 
-[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+The data is licensed under [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.
 
 ### Citation Information
 

@@ -192,10 +207,8 @@ The data fields are the same among all splits.
     note = {arXiv preprint arXiv:1902.10526},
     year = {2019}
 }
-
 ```
 
-
 ### Contributions
 
 Thanks to [@thomwolf](https://github.com/thomwolf), [@patrickvonplaten](https://github.com/patrickvonplaten), [@mariamabarham](https://github.com/mariamabarham), [@lewtun](https://github.com/lewtun) for adding this dataset.
