Once you have your dataset, you may want to share it with the community, for instance on the `HuggingFace Hub <https://huggingface.co/datasets>`__. There are two options to do so:
- directly upload it on the Hub as a community provided dataset.
- add it as a canonical dataset by opening a pull-request on the `GitHub repository for 🤗 Datasets <https://github.com/huggingface/datasets>`__,
Both options offer the same features such as:
- dataset versioning
- commit history and diffs
- metadata for discoverability
- dataset cards for documentation, licensing, limitations, etc.
Here are the main differences between these two options.
- **Community provided** datasets:
* are faster to share (no reviewing process)
* can contain the data files themselves on the Hub
* are identified under the namespace of a user or organization: ``thomwolf/my_dataset`` or ``huggingface/our_dataset``
* are flagged as ``unsafe`` by default because a dataset may contain executable code, so users need to inspect the code and opt in before using the dataset
- **Canonical** datasets:
* are slower to add (they need to go through the reviewing process on the GitHub repo)
.. note::
The distinction between "community provided" and "canonical" datasets is made purely based on the selected sharing workflow and doesn't involve any ranking, decision or opinion regarding the content of the dataset itself.
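In practice, loading works the same way for both; only the identifier differs. Here is a minimal sketch (``squad`` is an existing canonical dataset, while ``thomwolf/my_dataset`` is the illustrative namespaced name from above):

.. code-block::

   >>> from datasets import load_dataset
   >>> canonical = load_dataset("squad")                # canonical dataset: no namespace
   >>> community = load_dataset("thomwolf/my_dataset")  # community dataset: user/organization namespace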
.. _community-dataset:
Sharing a "community provided" dataset
----------------------------------------

In this page, we will show you how to share a dataset with the community on the `HuggingFace Hub <https://huggingface.co/datasets>`__.

Check the directory before pushing to the 🤗 Datasets Hub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Make sure there are no garbage files in the directory you'll upload. It should only have:
- a `your_dataset_name.py` file, which is the dataset script (optional if your data files are already in one of the supported formats: csv/jsonl/json/parquet/txt);
- the raw data files (json, csv, txt, mp3, png, etc.) that you need for your dataset;
- an optional `dataset_infos.json` file, which contains metadata about your dataset like the split sizes;
- optional dummy data files, which contain only a small subset of the dataset for tests and preview.
Other files can safely be deleted.
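As an illustration, a clean dataset directory might look like the following before the upload (all file names here are placeholders):

.. code::

   my_dataset/
   ├── my_dataset.py        # the dataset script (optional for supported raw formats)
   ├── dataset_infos.json   # optional metadata
   ├── dummy/               # optional dummy data for tests and preview
   └── data.csv             # the raw data file(s)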
You may specify a version by using the ``script_version`` flag in the ``load_dataset`` function:
.. code-block::

   >>> dataset = load_dataset(
   >>>     '<namespace>/<dataset_name>',
   >>>     script_version="main"  # tag name, or branch name, or commit hash
   >>> )
You can find more information in the guide on :doc:`how to load a dataset </loading_datasets>`.
.. _canonical-dataset:
Sharing a "canonical" dataset
--------------------------------
Add your dataset to the GitHub repository
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To add a "canonical" dataset to the library, you need to go through the following steps:
**1. Fork the** `🤗 Datasets repository <https://github.com/huggingface/datasets>`__ by clicking on the 'Fork' button on the repository's home page. This creates a copy of the code under your GitHub user account.
**2. Clone your fork** to your local disk, and add the base repository as a remote:
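A typical way to do this looks as follows (``<your-github-handle>`` is a placeholder for your own GitHub username):

.. code::

   git clone https://github.com/<your-github-handle>/datasets
   cd datasets
   git remote add upstream https://github.com/huggingface/datasets.git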
**3. Create a new branch** to hold your development changes:
.. code::
git checkout -b my-new-dataset
.. note::
**Do not** work on the ``master`` branch.
**4. Set up a development environment** by running the following command **in a virtual environment**:
.. code::
pip install -e ".[dev]"
.. note::
If 🤗 Datasets was already installed in the virtual environment, remove it with ``pip uninstall datasets`` before reinstalling it in editable mode with the ``-e`` flag.
**5. Create a new folder with your dataset name** inside the `datasets folder <https://github.com/huggingface/datasets/tree/master/datasets>`__ of the repository and add the dataset script you wrote and tested while following the instructions on the :doc:`add_dataset` page.
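For example (a sketch; ``my_new_dataset`` and the script path are placeholders):

.. code::

   mkdir datasets/my_new_dataset
   cp /path/to/my_new_dataset.py datasets/my_new_dataset/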
**6. Format your code.** Run black and isort so that your newly added files look nice with the following commands:
.. code::
make style
make quality
**7.** Once you're happy with your dataset script file, add your changes and make a commit to **record your changes locally**:
.. code::
git add datasets/<my-new-dataset>
git commit
It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes:
.. code::
git fetch upstream
git rebase upstream/master
Push the changes to your account using:
.. code::
git push -u origin my-new-dataset
**8.** We also recommend adding **tests** and **metadata** to the dataset script if possible. Go through the :ref:`adding-tests` section to do so.
**9.** Once you are satisfied with the dataset, go to the webpage of your fork on GitHub and click on "Pull request" to **open a pull-request** on the `main GitHub repository <https://github.com/huggingface/datasets>`__ for review.
.. _adding-tests:
Adding tests and metadata to the dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We recommend adding testing data and checksum metadata to your dataset so its behavior can be tested and verified, and the generated dataset can be certified. In this section we'll explain how you can add two objects to the repository to do just that:
- the ``dataset_infos.json`` file, which contains the metadata of the dataset;
- the dummy data files, a small subset of the dataset used for automated tests.
In the rest of this section, you should make sure that you run all of the commands **from the root** of your local ``datasets`` repository.
1. Adding metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~
You can check that the new dataset loading script works correctly and create the ``dataset_infos.json`` file at the same time by running the command:
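At the time of writing, this was done with the ``datasets-cli`` tool; treat the exact flags below as an assumption and double-check them against the current CLI help:

.. code::

   # test the loading script and save the computed metadata to dataset_infos.json
   datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs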
If the command was successful, you should now have a ``dataset_infos.json`` file in your dataset folder.
2. Adding dummy data
~~~~~~~~~~~~~~~~~~~~~~~~~~
Now that we have the metadata prepared we can also create some dummy data for automated testing. You can use the following command to get in-detail instructions on how to create the dummy data:
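The helper is a ``datasets-cli`` command along the following lines (the exact subcommand name is an assumption; check the CLI help if it differs):

.. code::

   # prints step-by-step instructions for creating the dummy data of your dataset
   datasets-cli dummy_data datasets/<your-dataset-folder>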
3. Testing
~~~~~~~~~~~~~~~~~~~~~~~~~~
Now test that both the real data and the dummy data work correctly. Go back to the root of your datasets folder and use the following command:
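As a sketch, the invocations looked roughly like this at the time; the exact test names are assumptions, so follow `tests/README.md` if they differ:

.. code::

   # run the loading script against the real data
   RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_real_dataset_<your-dataset-name>

   # run the loading script against the dummy data
   RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_dataset_all_configs_<your-dataset-name>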
If the dummy data tests fail, here are some tips:
- Your dataset script might require a complex dummy data structure. In this case, make sure you fully understand the data folder logic created by the function ``_split_generators(...)`` and expected by the function ``_generate_examples(...)`` of your dataset script. Also take a look at `tests/README.md`, which lists different possible cases of how the dummy data should be created.
- If the dummy data tests still fail, open a PR in the main repository on GitHub, note in the description that you need help creating the dummy data, and we will be happy to help you.
Add a Dataset Card
--------------------------------
Once your dataset is ready for sharing, feel free to write and add a Dataset Card to document your dataset.
The Dataset Card is a ``README.md`` file that you may add to your dataset repository.
At the top of the Dataset Card, you can define the metadata of your dataset for discoverability:
- annotations_creators
- language_creators
- languages
- licenses
- multilinguality
- pretty_name
- size_categories
- source_datasets
- task_categories
- task_ids
- paperswithcode_id
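These metadata are provided as a YAML block at the top of the ``README.md`` file. Here is a minimal sketch; the values are only illustrative, and the set of accepted tags is defined by the 🤗 Datasets tagging resources:

.. code::

   ---
   annotations_creators:
   - expert-generated
   languages:
   - en
   licenses:
   - mit
   multilinguality:
   - monolingual
   pretty_name: My Dataset
   task_categories:
   - text-classification
   ---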
Below the metadata, the Dataset Card may contain diverse sections to document all the relevant aspects of your dataset:
- Dataset Description

  - Dataset Summary
  - Supported Tasks and Leaderboards
  - Languages

- Dataset Structure

  - Data Instances
  - Data Fields
  - Data Splits

- Dataset Creation

  - Curation Rationale
  - Source Data

    - Initial Data Collection and Normalization
    - Who are the source language producers?

  - Annotations

    - Annotation process
    - Who are the annotators?

  - Personal and Sensitive Information

- Considerations for Using the Data

  - Social Impact of Dataset
  - Discussion of Biases
  - Other Known Limitations

- Additional Information

  - Dataset Curators
  - Licensing Information
  - Citation Information
  - Contributions
You can find more information about each section in the `Dataset Card guide <https://github.com/huggingface/datasets/blob/master/templates/README_guide.md>`_.