Commit 5abd91c

Merge branch 'main' of github.com:huggingface/datasets into fix-4591

2 parents: 88a1112 + e662d75

File tree: 817 files changed (+2613 / -2469 lines)


.circleci/config.yml

Lines changed: 4 additions & 4 deletions

@@ -19,7 +19,7 @@ jobs:
       - run: pip install .[tests]
       - run: pip install -r additional-tests-requirements.txt --no-deps
       - run: pip install pyarrow --upgrade
-      - run: HF_SCRIPTS_VERSION=master HF_ALLOW_CODE_EVAL=1 python -m pytest -d --tx 2*popen//python=python3.6 --dist loadfile -sv ./tests/
+      - run: HF_SCRIPTS_VERSION=main HF_ALLOW_CODE_EVAL=1 python -m pytest -d --tx 2*popen//python=python3.6 --dist loadfile -sv ./tests/

   run_dataset_script_tests_pyarrow_6:
     working_directory: ~/datasets
@@ -36,7 +36,7 @@ jobs:
       - run: pip install .[tests]
       - run: pip install -r additional-tests-requirements.txt --no-deps
       - run: pip install pyarrow==6.0.0
-      - run: HF_SCRIPTS_VERSION=master HF_ALLOW_CODE_EVAL=1 python -m pytest -d --tx 2*popen//python=python3.6 --dist loadfile -sv ./tests/
+      - run: HF_SCRIPTS_VERSION=main HF_ALLOW_CODE_EVAL=1 python -m pytest -d --tx 2*popen//python=python3.6 --dist loadfile -sv ./tests/

   run_dataset_script_tests_pyarrow_latest_WIN:
     working_directory: ~/datasets
@@ -56,7 +56,7 @@ jobs:
           pip install pyarrow --upgrade
       - run: |
           conda activate py37
-          $env:HF_SCRIPTS_VERSION="master"
+          $env:HF_SCRIPTS_VERSION="main"
           python -m pytest -n 2 --dist loadfile -sv ./tests/

   run_dataset_script_tests_pyarrow_6_WIN:
@@ -77,7 +77,7 @@ jobs:
           pip install pyarrow==6.0.0
       - run: |
           conda activate py37
-          $env:HF_SCRIPTS_VERSION="master"
+          $env:HF_SCRIPTS_VERSION="main"
           python -m pytest -n 2 --dist loadfile -sv ./tests/

   check_code_quality:
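
All four CI jobs flip the same switch: `HF_SCRIPTS_VERSION` now points the test run at the `main` branch instead of `master`. As a hypothetical sketch of how a test helper might consume such a variable (the env var name comes from the CI config above, but the resolution logic below is illustrative, not the library's actual implementation):

```python
import os

# Hypothetical sketch: resolve the branch that raw dataset-script URLs
# should point at, defaulting to "main" after the branch rename.
def resolve_scripts_version(default: str = "main") -> str:
    return os.environ.get("HF_SCRIPTS_VERSION", default)

def raw_script_url(dataset_name: str) -> str:
    # Build a raw.githubusercontent.com URL for the dataset script on
    # the resolved branch; path layout mirrors datasets/<name>/<name>.py.
    version = resolve_scripts_version()
    return (
        "https://raw.githubusercontent.com/huggingface/datasets/"
        f"{version}/datasets/{dataset_name}/{dataset_name}.py"
    )

# With HF_SCRIPTS_VERSION=main exported in CI, this yields a main-branch URL:
print(raw_script_url("squad"))
```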

.github/ISSUE_TEMPLATE/add-dataset.md

Lines changed: 1 addition & 1 deletion

@@ -14,4 +14,4 @@ assignees: ''
 - **Data:** *link to the Github repository or current dataset location*
 - **Motivation:** *what are some good reasons to have this dataset*

-Instructions to add a new dataset can be found [here](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md).
+Instructions to add a new dataset can be found [here](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md).

.github/workflows/benchmarks.yaml

Lines changed: 2 additions & 2 deletions

@@ -22,7 +22,7 @@ jobs:
           dvc repro --force

           git fetch --prune
-          dvc metrics diff --show-json master > report.json
+          dvc metrics diff --show-json main > report.json

           python ./benchmarks/format.py report.json report.md

@@ -35,7 +35,7 @@ jobs:
           dvc repro --force

           git fetch --prune
-          dvc metrics diff --show-json master > report.json
+          dvc metrics diff --show-json main > report.json

           python ./benchmarks/format.py report.json report.md

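This workflow now compares benchmark metrics against `main` via `dvc metrics diff --show-json` and then hands the JSON to `./benchmarks/format.py` to produce a markdown report. That script is not shown in this diff; as a minimal sketch of the conversion step, assuming dvc's documented diff shape of `{path: {metric: {"old": ..., "new": ..., "diff": ...}}}` (both the shape and this implementation are assumptions, not the repo's actual code):

```python
import json
import sys

# Minimal sketch (not the repo's actual benchmarks/format.py): render the
# JSON from `dvc metrics diff --show-json main` as a markdown table.
def format_report(report_path: str, output_path: str) -> None:
    with open(report_path) as f:
        report = json.load(f)

    lines = ["| path | metric | old | new | diff |", "|---|---|---|---|---|"]
    for path, metrics in report.items():
        for metric, values in metrics.items():
            lines.append(
                f"| {path} | {metric} | {values.get('old')} "
                f"| {values.get('new')} | {values.get('diff')} |"
            )

    with open(output_path, "w") as f:
        f.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    # Usage mirrors the workflow step: format.py report.json report.md
    format_report(sys.argv[1], sys.argv[2])
```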

.github/workflows/build_documentation.yml

Lines changed: 1 addition & 1 deletion

@@ -3,7 +3,7 @@ name: Build documentation
 on:
   push:
     branches:
-      - master
+      - main
       - doc-builder*
       - v*-release


.github/workflows/test-audio.yml

Lines changed: 2 additions & 2 deletions

@@ -3,7 +3,7 @@ name: Test audio
 on:
   pull_request:
     branches:
-      - master
+      - main

 jobs:
   test:
@@ -27,4 +27,4 @@ jobs:
           pip install pyarrow --upgrade
       - name: Test audio with pytest
         run: |
-          HF_SCRIPTS_VERSION=master python -m pytest -n 2 -sv ./tests/features/test_audio.py
+          HF_SCRIPTS_VERSION=main python -m pytest -n 2 -sv ./tests/features/test_audio.py

.github/workflows/update-hub-repositories.yaml

Lines changed: 1 addition & 1 deletion

@@ -3,7 +3,7 @@ name: Update Hub repositories
 on:
   push:
     branches:
-      - master
+      - main

 jobs:
   update-hub-repositories:

ADD_NEW_DATASET.md

Lines changed: 24 additions & 24 deletions

@@ -70,11 +70,11 @@ You are now ready to start the process of adding the dataset. We will create the

    ```bash
    git fetch upstream
-   git rebase upstream/master
+   git rebase upstream/main
    git checkout -b a-descriptive-name-for-my-changes
    ```

-   **Do not** work on the `master` branch.
+   **Do not** work on the `main` branch.

 3. Create your dataset folder under `datasets/<your_dataset_name>`:

@@ -96,9 +96,9 @@ You are now ready to start the process of adding the dataset. We will create the
 - Download/open the data to see how it looks like
 - While you explore and read about the dataset, you can complete some sections of the dataset card (the online form or the one you have just created at `./datasets/<your_dataset_name>/README.md`). You can just copy the information you meet in your readings in the relevant sections of the dataset card (typically in `Dataset Description`, `Dataset Structure` and `Dataset Creation`).

-  If you need more information on a section of the dataset card, a detailed guide is in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/master/templates/README_guide.md.
+  If you need more information on a section of the dataset card, a detailed guide is in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/main/templates/README_guide.md.

-  There is a also a (very detailed) example here: https://github.com/huggingface/datasets/tree/master/datasets/eli5.
+  There is a also a (very detailed) example here: https://github.com/huggingface/datasets/tree/main/datasets/eli5.

 Don't spend too much time completing the dataset card, just copy what you find when exploring the dataset documentation. If you can't find all the information it's ok. You can always spend more time completing the dataset card while we are reviewing your PR (see below) and the dataset card will be open for everybody to complete them afterwards. If you don't know what to write in a section, just leave the `[More Information Needed]` text.

@@ -109,31 +109,31 @@ Now let's get coding :-)

 The dataset script is the main entry point to load and process the data. It is a python script under `datasets/<your_dataset_name>/<your_dataset_name>.py`.

-There is a detailed explanation on how the library and scripts are organized [here](https://huggingface.co/docs/datasets/master/about_dataset_load.html).
+There is a detailed explanation on how the library and scripts are organized [here](https://huggingface.co/docs/datasets/main/about_dataset_load.html).

 Note on naming: the dataset class should be camel case, while the dataset short_name is its snake case equivalent (ex: `class BookCorpus` for the dataset `book_corpus`).

-To add a new dataset, you can start from the empty template which is [in the `templates` folder](https://github.com/huggingface/datasets/blob/master/templates/new_dataset_script.py):
+To add a new dataset, you can start from the empty template which is [in the `templates` folder](https://github.com/huggingface/datasets/blob/main/templates/new_dataset_script.py):

 ```bash
 cp ./templates/new_dataset_script.py ./datasets/<your_dataset_name>/<your_dataset_name>.py
 ```

-And then go progressively through all the `TODO` in the template 🙂. If it's your first dataset addition and you are a bit lost among the information to fill in, you can take some time to read the [detailed explanation here](https://huggingface.co/docs/datasets/master/dataset_script.html).
+And then go progressively through all the `TODO` in the template 🙂. If it's your first dataset addition and you are a bit lost among the information to fill in, you can take some time to read the [detailed explanation here](https://huggingface.co/docs/datasets/main/dataset_script.html).

 You can also start (or copy any part) from one of the datasets of reference listed below. The main criteria for choosing among these reference dataset is the format of the data files (JSON/JSONL/CSV/TSV/text) and whether you need or don't need several configurations (see above explanations on configurations). Feel free to reuse any parts of the following examples and adapt them to your case:

-- question-answering: [squad](https://github.com/huggingface/datasets/blob/master/datasets/squad/squad.py) (original data are in json)
-- natural language inference: [snli](https://github.com/huggingface/datasets/blob/master/datasets/snli/snli.py) (original data are in text files with tab separated columns)
-- POS/NER: [conll2003](https://github.com/huggingface/datasets/blob/master/datasets/conll2003/conll2003.py) (original data are in text files with one token per line)
-- sentiment analysis: [allocine](https://github.com/huggingface/datasets/blob/master/datasets/allocine/allocine.py) (original data are in jsonl files)
-- text classification: [ag_news](https://github.com/huggingface/datasets/blob/master/datasets/ag_news/ag_news.py) (original data are in csv files)
-- translation: [flores](https://github.com/huggingface/datasets/blob/master/datasets/flores/flores.py) (original data come from text files - one per language)
-- summarization: [billsum](https://github.com/huggingface/datasets/blob/master/datasets/billsum/billsum.py) (original data are in json files)
-- benchmark: [glue](https://github.com/huggingface/datasets/blob/master/datasets/glue/glue.py) (original data are various formats)
-- multilingual: [xquad](https://github.com/huggingface/datasets/blob/master/datasets/xquad/xquad.py) (original data are in json)
-- multitask: [matinf](https://github.com/huggingface/datasets/blob/master/datasets/matinf/matinf.py) (original data need to be downloaded by the user because it requires authentication)
-- speech recognition: [librispeech_asr](https://github.com/huggingface/datasets/blob/master/datasets/librispeech_asr/librispeech_asr.py) (original data is in .flac format)
+- question-answering: [squad](https://github.com/huggingface/datasets/blob/main/datasets/squad/squad.py) (original data are in json)
+- natural language inference: [snli](https://github.com/huggingface/datasets/blob/main/datasets/snli/snli.py) (original data are in text files with tab separated columns)
+- POS/NER: [conll2003](https://github.com/huggingface/datasets/blob/main/datasets/conll2003/conll2003.py) (original data are in text files with one token per line)
+- sentiment analysis: [allocine](https://github.com/huggingface/datasets/blob/main/datasets/allocine/allocine.py) (original data are in jsonl files)
+- text classification: [ag_news](https://github.com/huggingface/datasets/blob/main/datasets/ag_news/ag_news.py) (original data are in csv files)
+- translation: [flores](https://github.com/huggingface/datasets/blob/main/datasets/flores/flores.py) (original data come from text files - one per language)
+- summarization: [billsum](https://github.com/huggingface/datasets/blob/main/datasets/billsum/billsum.py) (original data are in json files)
+- benchmark: [glue](https://github.com/huggingface/datasets/blob/main/datasets/glue/glue.py) (original data are various formats)
+- multilingual: [xquad](https://github.com/huggingface/datasets/blob/main/datasets/xquad/xquad.py) (original data are in json)
+- multitask: [matinf](https://github.com/huggingface/datasets/blob/main/datasets/matinf/matinf.py) (original data need to be downloaded by the user because it requires authentication)
+- speech recognition: [librispeech_asr](https://github.com/huggingface/datasets/blob/main/datasets/librispeech_asr/librispeech_asr.py) (original data is in .flac format)

 While you are developing the dataset script you can list test it by opening a python interpreter and running the script (the script is dynamically updated each time you modify it):

@@ -286,18 +286,18 @@ Here are the step to open the Pull-Request on the main repo.
 It is a good idea to sync your copy of the code with the original
 repository regularly. This way you can quickly account for changes:

-- If you haven't pushed your branch yet, you can rebase on upstream/master:
+- If you haven't pushed your branch yet, you can rebase on upstream/main:

   ```bash
   git fetch upstream
-  git rebase upstream/master
+  git rebase upstream/main
   ```

 - If you have already pushed your branch, do not rebase but merge instead:

   ```bash
   git fetch upstream
-  git merge upstream/master
+  git merge upstream/main
   ```

 Push the changes to your account using:
@@ -334,7 +334,7 @@ Creating the dataset card goes in two steps:

 - **Very important as well:** On the right side of the tagging app, you will also find an expandable section called **Show Markdown Data Fields**. This gives you a starting point for the description of the fields in your dataset: you should paste it into the **Data Fields** section of the [online form](https://huggingface.co/datasets/card-creator/) (or your local README.md), then modify the description as needed. Briefly describe each of the fields and indicate if they have a default value (e.g. when there is no label). If the data has span indices, describe their attributes (character level or word level, contiguous or not, etc). If the datasets contains example IDs, state whether they have an inherent meaning, such as a mapping to other datasets or pointing to relationships between data points.

-  Example from the [ELI5 card](https://github.com/huggingface/datasets/tree/master/datasets/eli5#data-fields):
+  Example from the [ELI5 card](https://github.com/huggingface/datasets/tree/main/datasets/eli5#data-fields):

   Data Fields:
   - q_id: a string question identifier for each example, corresponding to its ID in the Pushshift.io Reddit submission dumps.
@@ -343,9 +343,9 @@ Creating the dataset card goes in two steps:
   - title_urls: list of the extracted URLs, the nth element of the list was replaced by URL_n


-- **Very nice to have but optional for now:** Complete all you can find in the dataset card using the detailed instructions for completed it which are in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/master/templates/README_guide.md.
+- **Very nice to have but optional for now:** Complete all you can find in the dataset card using the detailed instructions for completed it which are in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/main/templates/README_guide.md.

-  Here is a completed example: https://github.com/huggingface/datasets/tree/master/datasets/eli5 for inspiration
+  Here is a completed example: https://github.com/huggingface/datasets/tree/main/datasets/eli5 for inspiration

 If you don't know what to write in a field and can find it, write: `[More Information Needed]`

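The template and reference scripts this file links to all follow the same builder pattern. As a rough minimal sketch of that layout, assuming the public `datasets.GeneratorBasedBuilder` API (the `_URL` and field names below are hypothetical, not taken from any dataset in the repo):

```python
import json

import datasets

# Hypothetical source file; a real script would point at the dataset's
# actual download location.
_URL = "https://example.com/data.jsonl"

class MyDataset(datasets.GeneratorBasedBuilder):
    """Minimal sketch of a dataset script (camel-case class, snake-case folder)."""

    def _info(self):
        # Declare the schema of the examples this script yields.
        return datasets.DatasetInfo(
            description="A toy dataset used to illustrate the script layout.",
            features=datasets.Features(
                {"id": datasets.Value("string"), "text": datasets.Value("string")}
            ),
        )

    def _split_generators(self, dl_manager):
        # Download the data and declare one split per file.
        path = dl_manager.download_and_extract(_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"filepath": path}
            )
        ]

    def _generate_examples(self, filepath):
        # Yield (key, example) pairs matching the features declared in _info.
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                record = json.loads(line)
                yield idx, {"id": record["id"], "text": record["text"]}
```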

CONTRIBUTING.md

Lines changed: 6 additions & 6 deletions

@@ -41,7 +41,7 @@ If you would like to work on any of the open Issues:
    git checkout -b a-descriptive-name-for-my-changes
    ```

-   **do not** work on the `master` branch.
+   **do not** work on the `main` branch.

 4. Set up a development environment by running the following command in a virtual environment:

@@ -73,7 +73,7 @@ If you would like to work on any of the open Issues:

    ```bash
    git fetch upstream
-   git rebase upstream/master
+   git rebase upstream/main
    ```

 Push the changes to your account using:
@@ -97,15 +97,15 @@ Improving the documentation of datasets is an ever increasing effort and we invi

 If you see that a dataset card is missing information that you are in a position to provide (as an author of the dataset or as an experienced user), the best thing you can do is to open a Pull Request on the Hugging Face Hub. To to do, go to the "Files and versions" tab of the dataset page and edit the `README.md` file. We provide:

-* a [template](https://github.com/huggingface/datasets/blob/master/templates/README.md)
-* a [guide](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md) describing what information should go into each of the paragraphs
-* and if you need inspiration, we recommend looking through a [completed example](https://github.com/huggingface/datasets/blob/master/datasets/eli5/README.md)
+* a [template](https://github.com/huggingface/datasets/blob/main/templates/README.md)
+* a [guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md) describing what information should go into each of the paragraphs
+* and if you need inspiration, we recommend looking through a [completed example](https://github.com/huggingface/datasets/blob/main/datasets/eli5/README.md)

 Note that datasets that are outside of a namespace (`squad`, `imagenet-1k`, etc.) are maintained on GitHub. In this case you have to open a Pull request on GitHub to edit the file at `datasets/<dataset-name>/README.md`.

 If you are a **dataset author**... you know what to do, it is your dataset after all ;) ! We would especially appreciate if you could help us fill in information about the process of creating the dataset, and take a moment to reflect on its social impact and possible limitations if you haven't already done so in the dataset paper or in another data statement.

-If you are a **user of a dataset**, the main source of information should be the dataset paper if it is available: we recommend pulling information from there into the relevant paragraphs of the template. We also eagerly welcome discussions on the [Considerations for Using the Data](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md#considerations-for-using-the-data) based on existing scholarship or personal experience that would benefit the whole community.
+If you are a **user of a dataset**, the main source of information should be the dataset paper if it is available: we recommend pulling information from there into the relevant paragraphs of the template. We also eagerly welcome discussions on the [Considerations for Using the Data](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md#considerations-for-using-the-data) based on existing scholarship or personal experience that would benefit the whole community.

 Finally, if you want more information on the how and why of dataset cards, we strongly recommend reading the foundational works [Datasheets for Datasets](https://arxiv.org/abs/1803.09010) and [Data Statements for NLP](https://www.aclweb.org/anthology/Q18-1041/).
