
Commit 0a96b16

docs: share dataset on the hub
1 parent 5e2cf2a commit 0a96b16

3 files changed: +183 -86 lines changed

docs/source/index.rst

Lines changed: 1 addition & 1 deletion

@@ -59,8 +59,8 @@ The documentation is organized in six parts:
    :maxdepth: 2
    :caption: Adding new datasets/metrics

-   add_dataset
    share_dataset
+   add_dataset
    add_metric

 .. toctree::

docs/source/loading_datasets.rst

Lines changed: 9 additions & 1 deletion

@@ -184,7 +184,15 @@ In the following example we specify which subset of the files to use with the ``data_files`` parameter:
     >>> from datasets import load_dataset
     >>> c4_subset = load_dataset('allenai/c4', data_files='en/c4-train.0000*-of-01024.json.gz')

-In this example, ``load_dataset`` will return all the files that match the Unix style pattern passed in ``data_files``.
+You can also specify custom splits:
+
+.. code-block::
+
+    >>> data_files = {"validation": "en/c4-validation.*.json.gz"}
+    >>> c4_validation = load_dataset("allenai/c4", data_files=data_files, split="validation")
+
+In these examples, ``load_dataset`` will return all the files that match the Unix style pattern passed in ``data_files``.
 If you don't specify which data files to use, it will use all the data files (here all C4 is about 13TB of data).
docs/source/share_dataset.rst

Lines changed: 173 additions & 84 deletions
@@ -1,18 +1,25 @@
 Sharing your dataset
 =============================================

-Once you've written a new dataset loading script as detailed on the :doc:`add_dataset` page, you may want to share it with the community for instance on the `HuggingFace Hub <https://huggingface.co/datasets>`__. There are two options to do that:
+Once you have your dataset, you may want to share it with the community, for instance on the `HuggingFace Hub <https://huggingface.co/datasets>`__. There are two options to do that:

-- add it as a canonical dataset by opening a pull-request on the `GitHub repository for 🤗 Datasets <https://github.com/huggingface/datasets>`__,
 - directly upload it on the Hub as a community provided dataset.
+- add it as a canonical dataset by opening a pull-request on the `GitHub repository for 🤗 Datasets <https://github.com/huggingface/datasets>`__.
+
+Both options offer the same features, such as:
+
+- dataset versioning
+- commit history and diffs
+- metadata for discoverability
+- dataset cards for documentation, licensing, limitations, etc.

 Here are the main differences between these two options.

 - **Community provided** datasets:
   * are faster to share (no reviewing process)
   * can contain the data files themselves on the Hub
   * are identified under the namespace of a user or organization: ``thomwolf/my_dataset`` or ``huggingface/our_dataset``
-  * are flagged as ``unsafe`` by default because a dataset contains executable code so the users need to inspect and opt-in to use the datasets
+  * are flagged as ``unsafe`` by default because a dataset may contain executable code, so users need to inspect it and opt in before using it

 - **Canonical** datasets:
   * are slower to add (need to go through the reviewing process on the GitHub repo)
@@ -22,81 +29,7 @@ Here are the main differences between these two options.

 .. note::

-    The distinctions between "canonical" and "community provided" datasets is made purely based on the selected sharing workflow and don't involve any ranking, decision or opinion regarding the content of the dataset it-self.
-
-.. _canonical-dataset:
-
-Sharing a "canonical" dataset
---------------------------------
-
-To add a "canonical" dataset to the library, you need to go through the following steps:
-
-**1. Fork the** `🤗 Datasets repository <https://github.com/huggingface/datasets>`__ by clicking on the 'Fork' button on the repository's home page. This creates a copy of the code under your GitHub user account.
-
-**2. Clone your fork** to your local disk, and add the base repository as a remote:
-
-.. code::
-
-    git clone https://github.com/<your_Github_handle>/datasets
-    cd datasets
-    git remote add upstream https://github.com/huggingface/datasets.git
-
-**3. Create a new branch** to hold your development changes:
-
-.. code::
-
-    git checkout -b my-new-dataset
-
-.. note::
-
-    **Do not** work on the ``master`` branch.
-
-**4. Set up a development environment** by running the following command **in a virtual environment**:
-
-.. code::
-
-    pip install -e ".[dev]"
-
-.. note::
-
-    If 🤗 Datasets was already installed in the virtual environment, remove
-    it with ``pip uninstall datasets`` before reinstalling it in editable
-    mode with the ``-e`` flag.
-
-**5. Create a new folder with your dataset name** inside the `datasets folder <https://github.com/huggingface/datasets/tree/master/datasets>`__ of the repository and add the dataset script you wrote and tested while following the instructions on the :doc:`add_dataset` page.
-
-**6. Format your code.** Run black and isort so that your newly added files look nice with the following command:
-
-.. code::
-
-    make style
-    make quality
-
-**7.** Once you're happy with your dataset script file, add your changes and make a commit to **record your changes locally**:
-
-.. code::
-
-    git add datasets/<my-new-dataset>
-    git commit
-
-It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes:
-
-.. code::
-
-    git fetch upstream
-    git rebase upstream/master
-
-Push the changes to your account using:
-
-.. code::
-
-    git push -u origin my-new-dataset
-
-**8.** We also recommend adding **tests** and **metadata** to the dataset script if possible. Go through the :ref:`adding-tests` section to do so.
-
-**9.** Once you are satisfied with the dataset, go the webpage of your fork on GitHub and click on "Pull request" to **open a pull-request** on the `main github repository <https://github.com/huggingface/datasets>`__ for review.
+    The distinction between "community provided" and "canonical" datasets is made purely based on the selected sharing workflow and doesn't involve any ranking, decision or opinion regarding the content of the dataset itself.

 .. _community-dataset:

@@ -114,6 +47,18 @@ In this page, we will show you how to share a dataset with the community on the
 Prepare your dataset for uploading
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+You can either have your dataset in a supported format (csv/jsonl/json/parquet/txt), or use a dataset script to define how to load your data.
+
+If your dataset is in a supported format, you're all set!
+Otherwise, you need a dataset script. It is simply a Python script whose role is to define:
+
+- the feature types of your data
+- how your dataset is split into train/validation/test (or any other splits)
+- how to download the data
+- how to process the data
+
+The dataset script is mandatory if your dataset is not in one of the supported formats, or if you need more control over how to define your dataset.
+
 We have seen in the :doc:`dataset script tutorial <add_dataset>` how to write a dataset loading script. Let's see how you can share it on the
 `🤗 Datasets Hub <https://huggingface.co/datasets>`__.
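
For illustration only, a minimal dataset script along the lines described in this hunk might look like the sketch below. The file name, URLs and field names are hypothetical placeholders and are not part of the commit; the :doc:`add_dataset` page remains the reference for writing a real script.

.. code-block:: python

    # my_dataset.py -- hypothetical minimal dataset script (illustrative sketch only)
    import json

    import datasets

    _URLS = {  # placeholder download locations
        "train": "https://example.com/my_dataset/train.jsonl",
        "test": "https://example.com/my_dataset/test.jsonl",
    }

    class MyDataset(datasets.GeneratorBasedBuilder):
        """A toy dataset used to illustrate the structure of a dataset script."""

        def _info(self):
            # Feature types of the data
            return datasets.DatasetInfo(
                description="Illustrative example dataset.",
                features=datasets.Features(
                    {
                        "text": datasets.Value("string"),
                        "label": datasets.ClassLabel(names=["neg", "pos"]),
                    }
                ),
            )

        def _split_generators(self, dl_manager):
            # How to download the data and how it is split
            paths = dl_manager.download_and_extract(_URLS)
            return [
                datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": paths["train"]}),
                datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"filepath": paths["test"]}),
            ]

        def _generate_examples(self, filepath):
            # How to process the raw data into examples
            with open(filepath, encoding="utf-8") as f:
                for idx, line in enumerate(f):
                    record = json.loads(line)
                    yield idx, {"text": record["text"], "label": record["label"]}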

@@ -209,10 +154,10 @@ Check the directory before pushing to the 🤗 Datasets Hub.

 Make sure there are no garbage files in the directory you'll upload. It should only have:

-- a `your_dataset_name.py` file, which is the dataset script;
+- a `your_dataset_name.py` file, which is the dataset script (optional if your data files are already in the supported formats csv/jsonl/json/parquet/txt);
+- the raw data files (json, csv, txt, mp3, png, etc.) that you need for your dataset;
 - an optional `dataset_infos.json` file, which contains metadata about your dataset like the split sizes;
 - optional dummy data files, which contain only a small subset of the dataset for tests and preview;
-- your raw data files (json, csv, txt, etc.) that you need for your dataset

 Other files can safely be deleted.
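
As a rough sketch of what such an upload directory could contain (all names are hypothetical and simply mirror the bullet list above):

.. code-block::

    your_dataset_name/
    ├── your_dataset_name.py     # dataset script (optional for csv/jsonl/json/parquet/txt data)
    ├── dataset_infos.json       # optional metadata such as split sizes
    ├── dummy/                   # optional dummy data for tests and preview
    ├── train.csv                # raw data files
    └── test.csv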

@@ -276,6 +221,18 @@ Anyone can load it from code:
     >>> dataset = load_dataset("namespace/your_dataset_name")

+If your dataset doesn't have a dataset script, then by default all your data will be loaded in the "train" split.
+You can specify which files go to which split with the ``data_files`` parameter.
+
+Let's say your dataset repository contains one CSV file for the train split, and one CSV file for your test split. Then you can load it with:
+
+.. code-block::
+
+    >>> data_files = {"train": "train.csv", "test": "test.csv"}
+    >>> dataset = load_dataset("namespace/your_dataset_name", data_files=data_files)
+
 You may specify a version by using the ``script_version`` flag in the ``load_dataset`` function:

 .. code-block::
@@ -285,11 +242,90 @@ You may specify a version by using the ``script_version`` flag in the ``load_dataset`` function:
     >>> script_version="main" # tag name, or branch name, or commit hash
     >>> )

+You can find more information in the guide on :doc:`how to load a dataset </loading_datasets>`.
+
+.. _canonical-dataset:
+
+Sharing a "canonical" dataset
+--------------------------------
+
+Add your dataset to the GitHub repository
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To add a "canonical" dataset to the library, you need to go through the following steps:
+
+**1. Fork the** `🤗 Datasets repository <https://github.com/huggingface/datasets>`__ by clicking on the 'Fork' button on the repository's home page. This creates a copy of the code under your GitHub user account.
+
+**2. Clone your fork** to your local disk, and add the base repository as a remote:
+
+.. code::
+
+    git clone https://github.com/<your_Github_handle>/datasets
+    cd datasets
+    git remote add upstream https://github.com/huggingface/datasets.git
+
+**3. Create a new branch** to hold your development changes:
+
+.. code::
+
+    git checkout -b my-new-dataset
+
+.. note::
+
+    **Do not** work on the ``master`` branch.
+
+**4. Set up a development environment** by running the following command **in a virtual environment**:
+
+.. code::
+
+    pip install -e ".[dev]"
+
+.. note::
+
+    If 🤗 Datasets was already installed in the virtual environment, remove
+    it with ``pip uninstall datasets`` before reinstalling it in editable
+    mode with the ``-e`` flag.
+
+**5. Create a new folder with your dataset name** inside the `datasets folder <https://github.com/huggingface/datasets/tree/master/datasets>`__ of the repository and add the dataset script you wrote and tested while following the instructions on the :doc:`add_dataset` page.
+
+**6. Format your code.** Run black and isort so that your newly added files look nice, with the following commands:
+
+.. code::
+
+    make style
+    make quality
+
+**7.** Once you're happy with your dataset script file, add your changes and make a commit to **record your changes locally**:
+
+.. code::
+
+    git add datasets/<my-new-dataset>
+    git commit
+
+It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes:
+
+.. code::
+
+    git fetch upstream
+    git rebase upstream/master
+
+Push the changes to your account using:
+
+.. code::
+
+    git push -u origin my-new-dataset
+
+**8.** We also recommend adding **tests** and **metadata** to the dataset script if possible. Go through the :ref:`adding-tests` section to do so.
+
+**9.** Once you are satisfied with the dataset, go to the webpage of your fork on GitHub and click on "Pull request" to **open a pull-request** on the `main github repository <https://github.com/huggingface/datasets>`__ for review.

 .. _adding-tests:

 Adding tests and metadata to the dataset
----------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 We recommend adding testing data and checksum metadata to your dataset so its behavior can be tested and verified, and the generated dataset can be certified. In this section we'll explain how you can add two objects to the repository to do just that:
@@ -302,7 +338,7 @@ We recommend adding testing data and checksum metadata to your dataset so its behavior can be tested and verified
 In the rest of this section, you should make sure that you run all of the commands **from the root** of your local ``datasets`` repository.

 1. Adding metadata
-^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~~~~~~~~~~

 You can check that the new dataset loading script works correctly and create the ``dataset_infos.json`` file at the same time by running the command:
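
The command itself lies just outside this hunk and is unchanged by the commit; for reference, it should be the ``datasets-cli test`` command run from the repository root, along these lines (treat the exact flags as an assumption and check the surrounding docs; the folder name is a placeholder):

.. code::

    datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs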

@@ -373,7 +409,7 @@ If the command was successful, you should now have a ``dataset_infos.json`` file
     }

 2. Adding dummy data
-^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~~~~~~~~~~

 Now that we have the metadata prepared, we can also create some dummy data for automated testing. You can use the following command to get in-detail instructions on how to create the dummy data:
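
Again, the command referenced here sits outside this hunk; it should be the ``datasets-cli dummy_data`` command, roughly as follows (the folder name is a placeholder and the exact invocation is an assumption, not part of this commit):

.. code::

    datasets-cli dummy_data datasets/<your-dataset-folder>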

@@ -465,7 +501,7 @@ Usage of the command:

 3. Testing
-^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~~~~~~~~~~

 Now test that both the real data and the dummy data work correctly. Go back to the root of your datasets folder and use the following command:
@@ -496,3 +532,56 @@ and make sure you follow the exact instructions provided by the command.
 - Your dataset script might require a difficult dummy data structure. In this case make sure you fully understand the data folder logic created by the function ``_split_generators(...)`` and expected by the function ``_generate_examples(...)`` of your dataset script. Also take a look at `tests/README.md` which lists different possible cases of how the dummy data should be created.

 - If the dummy data tests still fail, open a PR in the main repository on github and make a remark in the description that you need help creating the dummy data and we will be happy to help you.
+
+
+Add a Dataset Card
+--------------------------------
+
+Once your dataset is ready for sharing, feel free to write and add a Dataset Card to document your dataset.
+
+The Dataset Card is a ``README.md`` file that you may add to your dataset repository.
+
+At the top of the Dataset Card, you can define the metadata of your dataset for discoverability:
+
+- annotations_creators
+- language_creators
+- languages
+- licenses
+- multilinguality
+- pretty_name
+- size_categories
+- source_datasets
+- task_categories
+- task_ids
+- paperswithcode_id
+
+It may contain diverse sections to document all the relevant aspects of your dataset:
+
+- Dataset Description
+    - Dataset Summary
+    - Supported Tasks and Leaderboards
+    - Languages
+- Dataset Structure
+    - Data Instances
+    - Data Fields
+    - Data Splits
+- Dataset Creation
+    - Curation Rationale
+    - Source Data
+        - Initial Data Collection and Normalization
+        - Who are the source language producers?
+    - Annotations
+        - Annotation process
+        - Who are the annotators?
+    - Personal and Sensitive Information
+- Considerations for Using the Data
+    - Social Impact of Dataset
+    - Discussion of Biases
+    - Other Known Limitations
+- Additional Information
+    - Dataset Curators
+    - Licensing Information
+    - Citation Information
+    - Contributions
+
+You can find more information about each section in the `Dataset Card guide <https://github.com/huggingface/datasets/blob/master/templates/README_guide.md>`_.
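
For illustration, the metadata listed above is written as a YAML block at the top of the ``README.md``. A hypothetical card for a small English classification dataset could start like this; all values are examples only, and the Dataset Card guide linked above documents the allowed values:

.. code-block:: yaml

    ---
    annotations_creators:
    - crowdsourced
    language_creators:
    - found
    languages:
    - en
    licenses:
    - cc-by-4.0
    multilinguality:
    - monolingual
    pretty_name: My Dataset
    size_categories:
    - 10K<n<100K
    source_datasets:
    - original
    task_categories:
    - text-classification
    task_ids:
    - sentiment-classification
    paperswithcode_id: null
    ---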
