
Conversation

@stevhliu (Member) commented Jul 26, 2021

Organize Datasets documentation into four documentation types to improve clarity and discoverability of content.

Content to add in the very short term (feel free to add anything I'm missing):

  • A discussion of why Datasets uses Arrow, with some context and background. It would also be great to talk about Datasets speed and performance here, and if you can share any benchmarking/tests you did, that would be awesome! Finally, a discussion of how memory-mapping frees the user from RAM constraints would be very helpful (a rough sketch of the idea follows this list).
  • Explain why you would want to disable or override verifications when loading a dataset.
  • If possible, include a code sample of when the number of elements in the field of an output dictionary aren’t the same as the other fields in the output dictionary (taken from the note here).
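On the memory-mapping point, here is a rough sketch of the kind of demonstration that could go in the docs (the dataset name and the psutil-based measurement are illustrative assumptions, not benchmarked numbers):

import os

import psutil
from datasets import load_dataset

process = psutil.Process(os.getpid())
mem_before = process.memory_info().rss / (1024 * 1024)

# The Arrow table is memory-mapped from disk, so loading a large dataset
# barely increases the RAM usage of the process.
dataset = load_dataset("wikipedia", "20200501.en", split="train")  # hypothetical example

mem_after = process.memory_info().rss / (1024 * 1024)
print(f"RAM used to load the dataset: {mem_after - mem_before:.1f} MB")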

Steven added 9 commits July 16, 2021 08:44
add tutorial section
remove old files
add most of the completed how-to guides
add distributed usage guides
add datasets + arrow concept guide
@lhoestq (Member) left a comment:

Awesome job, thank you!! I love it :)

I added many comments, but most of them are minor.

Could you also add sphinx-panels to the extensions to install for the documentation?

datasets/setup.py

Lines 186 to 196 in 5528e10

"docs": [
"docutils==0.16.0",
"recommonmark",
"sphinx==3.1.2",
"sphinx-markdown-tables",
"sphinx-rtd-theme==0.4.3",
"sphinxext-opengraph==0.4.1",
"sphinx-copybutton",
"fsspec",
"s3fs",
],

This way the CI will automatically build the documentation every time you push a commit to this PR :) This is useful for looking at the actual rendering of the documentation and browsing it, so that we can give even more feedback ^^
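For reference, the "docs" extras list with sphinx-panels added might look like this (a sketch; the surrounding pins are unchanged):

"docs": [
    "docutils==0.16.0",
    "recommonmark",
    "sphinx==3.1.2",
    "sphinx-markdown-tables",
    "sphinx-rtd-theme==0.4.3",
    "sphinxext-opengraph==0.4.1",
    "sphinx-copybutton",
    "sphinx-panels",  # new: needed to build the restructured docs
    "fsspec",
    "s3fs",
],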

Finally, feel free to add some TODOs when you feel like a section could be added to explain something, so that we can start working on this as well :)

add concept guides for loading/building a dataset

Lastly, there are two specific features for machine translation: :class:`datasets.Translation` and :class:`datasets.TranslationVariableLanguages`.

[I think for the translation features, we should either add some example code like we did for the other features or remove it altogether.]
Member Author comment:

I think maybe we should add sample code for the translation classes and a brief explanation of how to use them. Otherwise we can just remove them and refer the user to the package reference.
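For example, something along these lines (the sentences and language codes are made up for illustration):

from datasets import Dataset, Features, Translation, TranslationVariableLanguages

# Translation: every example covers the same fixed set of languages
features = Features({"translation": Translation(languages=["en", "fr", "de"])})
data = {"translation": [{"en": "the cat", "fr": "le chat", "de": "die Katze"}]}
dataset = Dataset.from_dict(data, features=features)

# TranslationVariableLanguages: the languages (and the number of translations
# per language) may vary from example to example
features = Features({"translation": TranslationVariableLanguages(languages=["en", "fr"])})
data = {"translation": [{"en": "the cat", "fr": ["le chat", "la chatte"]}]}
dataset = Dataset.from_dict(data, features=features)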


This method allows Datasets to perform distributed predictions, which is important for evaluation speed in distributed settings. At the same time, you can also use complex non-additive metrics without wasting valuable GPU or CPU memory.

TODO: More explanation of how the file locks perform the synchronization, or remove this part.
Member Author comment:

TODO: Add some more explanation about how the file lock synchronization works.
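In the meantime, here is a minimal sketch of the general idea, assuming the filelock package (the actual synchronization in 🤗 Datasets is more involved, and the file names are placeholders):

from filelock import FileLock

cache_file = "metric_cache_rank0.arrow"  # hypothetical per-process cache file

# Every process acquires a lock on the shared lock file before writing,
# so predictions/references are flushed to the cache one process at a time.
with FileLock(cache_file + ".lock"):
    ...  # write this process's predictions/references to the cache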

@lhoestq (Member) commented Sep 8, 2021

I just separated the Share How-to page into three pages: share, dataset_script and dataset_card.

This way, in the share page we can explain in more detail how to share a community or a canonical dataset, focusing on their differences and the steps to upload them.

Also given that making a dataset script or a dataset card both require several steps, I feel like it's better to have dedicated pages for them.

Let me know what you think @stevhliu and others. We can still revert this change if you feel like it was better with everything in the same place.

@stevhliu (Member Author) commented Sep 9, 2021

I just added some minor changes to match the style, fix typos, etc. Great work on the conceptual guides, I learned a lot from them and I'm sure they will help a lot of other people too!

I am fine with splitting Share into three separate pages. I think this probably makes it easier for users to navigate, instead of having to scroll up and down on a really long single page.

@lhoestq (Member) left a comment:

Thanks for the corrections :)

It looks all good to me!

@lhoestq changed the title from "Docs structure" to "New documentation structure" on Sep 9, 2021
@lewtun (Member) left a comment:

Thank you so much @stevhliu for the great improvements you made to the documentation - I especially like the care you took to explain some pretty hairy concepts like caching 🥳

I left mostly small nits (feel free to ignore them if you disagree) and a few questions


## What is Arrow?

[Arrow](https://arrow.apache.org/) enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:
Member comment:

This is a super nice summary of Apache Arrow 😍
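For readers who want to see the columnar idea in action, a tiny illustrative pyarrow snippet (not part of this PR, just a sketch):

import pyarrow as pa

# Each column is stored contiguously in memory, so reading a single
# column does not require scanning entire rows.
table = pa.table({"text": ["hello", "world"], "label": [0, 1]})
print(table.column("label"))  # zero-copy access to one column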

@albertvillanova (Member) left a comment:

Again, this is awesome work! Thanks @stevhliu. Our new docs will definitely make using datasets much easier!

Some questions, comments and nits below...

@albertvillanova (Member) left a comment:

NIT: the official name is "GitHub", spelled with an uppercase "H".

Load
====

You have already seen how to load a dataset from the Hugging Face Hub. But datasets are stored in a variety of places, and sometimes you won't find the one you want on the Hub. A dataset can be on your local machine, in a Github repository, and in data structures like Python dictionaries and Pandas DataFrames. Wherever your dataset may be stored, 🤗 Datasets provides a way for you to load and use it for training.
Member comment:

Suggested change
You have already seen how to load a dataset from the Hugging Face Hub. But datasets are stored in a variety of places, and sometimes you won't find the one you want on the Hub. A dataset can be on your local machine, in a Github repository, and in data structures like Python dictionaries and Pandas DataFrames. Wherever your dataset may be stored, 🤗 Datasets provides a way for you to load and use it for training.
You have already seen how to load a dataset from the Hugging Face Hub. But datasets are stored in a variety of places, and sometimes you won't find the one you want on the Hub. A dataset can be on your local machine, in a GitHub repository, and in data structures like Python dictionaries and Pandas DataFrames. Wherever your dataset may be stored, 🤗 Datasets provides a way for you to load and use it for training.

Member comment:

  • A dataset can be on your local machine ==> A dataset can be on disk on your local machine
  • and in data structures like Python dictionaries and Pandas DataFrames ==>
    • or in your local machine RAM in data structures like Python dictionaries and Pandas DataFrames
      OR:
    • or in in-memory data structures like Python dictionaries and Pandas DataFrames
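For reference, the loading paths this paragraph mentions could be illustrated along these lines (dataset names and file paths are placeholders):

import pandas as pd
from datasets import Dataset, load_dataset

# from the Hugging Face Hub
dataset = load_dataset("squad", split="train")

# from files on disk on your local machine
dataset = load_dataset("csv", data_files="my_file.csv")

# from in-memory data structures
dataset = Dataset.from_dict({"text": ["hello", "world"]})
dataset = Dataset.from_pandas(pd.DataFrame({"text": ["hello", "world"]}))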

@albertvillanova (Member) left a comment:

Some additional comments, suggestions, and questions.

Comment on lines 13 to 15
* How to compute metrics.

* How to upload and share a dataset.
Member comment:

To keep the logical order and the same order as in the navigation.

Suggested change
* How to compute metrics.
* How to upload and share a dataset.
* How to upload and share a dataset.
* How to compute metrics.

Member comment:

Maybe add a mention of the new subsections?

  • how to create a dataset loading script,
  • how to create a dataset card

Member comment:

I would also suggest renaming the new subsections by adding a verb to their titles. Note that they all belong to the How-to section: how to load a dataset, how to process a dataset, how to stream a dataset, ...

  • Dataset script ==> Create dataset loading script
  • Dataset card ==> Create dataset card

self.citation = citation
self.url = url
2. Sub-class the base :class:`datasets.BuilderConfig` to add additional attributes of a configuration. This gives you more flexibility to specify the name and description of each configuration. These sub-classes should be listed under :obj:`datasets.DatasetBuilder.BUILDER_CONFIGS`:
Member comment:

This is redundant with the previous item. There we already had to subclass BuilderConfig in order to add additional attributes, like features, label_classes, and citation. Note that the only attributes of the base BuilderConfig are: name, version, data_dir, data_files, and description.

Member comment:

Suggested change
2. Sub-class the base :class:`datasets.BuilderConfig` to add additional attributes of a configuration. This gives you more flexibility to specify the name and description of each configuration. These sub-classes should be listed under :obj:`datasets.DatasetBuilder.BUILDER_CONFIGS`:
2. Sub-class the base :class:`datasets.BuilderConfig` to add additional attributes of a configuration. This gives you more flexibility to specify the name and description of each configuration. These sub-class instances should be listed under :obj:`datasets.DatasetBuilder.BUILDER_CONFIGS`:

Member comment:

I think this point is about creating the config instances and listing them in BUILDER_CONFIGS.

Let me try to rephrase it
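For context, the pattern under discussion might be sketched like this (class names and the extra attributes are illustrative):

import datasets

class MyConfig(datasets.BuilderConfig):
    # Sub-class of BuilderConfig that adds two extra attributes
    def __init__(self, citation=None, url=None, **kwargs):
        super().__init__(**kwargs)
        self.citation = citation
        self.url = url

class MyDataset(datasets.GeneratorBasedBuilder):
    # The config *instances* (not the sub-class itself) are listed here
    BUILDER_CONFIGS = [
        MyConfig(name="first", version=datasets.Version("1.0.0"), url="https://example.com"),
        MyConfig(name="second", version=datasets.Version("1.0.0"), url="https://example.com"),
    ]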

lhoestq and others added 2 commits September 13, 2021 16:02
Co-authored-by: Albert Villanova del Moral <[email protected]>
Co-authored-by: Albert Villanova del Moral <[email protected]>
@lhoestq (Member) commented Sep 13, 2021

Thanks a lot for all the suggestions! I'm doing the final changes based on the remaining comments, then we can merge and release v1.12 of datasets and the new documentation ^^

@lhoestq (Member) commented Sep 13, 2021

Alright, I think I took all the suggestions and comments into account :)
Thanks everyone for the help!
