
Conversation

@stevhliu (Member) commented Jul 26, 2021

Organize Datasets documentation into four documentation types to improve clarity and discoverability of content.

Content to add in the very short term (feel free to add anything I'm missing):

  • A discussion of why Datasets uses Arrow, with some context and background. It would also be great to talk about Datasets speed and performance here, and if you can share any benchmarking/tests you did, that would be awesome! Finally, a discussion of how memory-mapping frees the user from RAM constraints would be very helpful (a rough sketch of the idea follows this list).
  • Explain why you would want to disable or override verifications when loading a dataset.
  • If possible, include a code sample of when the number of elements in the field of an output dictionary aren’t the same as the other fields in the output dictionary (taken from the note here).
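On the memory-mapping point, here is a rough sketch of the kind of demonstration that could go in the docs (the dataset name and the psutil-based measurement are illustrative assumptions, not benchmarked numbers):

import os

import psutil
from datasets import load_dataset

process = psutil.Process(os.getpid())
mem_before = process.memory_info().rss / (1024 * 1024)

# The Arrow table is memory-mapped from disk, so loading a large dataset
# barely increases the RAM usage of the process.
dataset = load_dataset("wikipedia", "20200501.en", split="train")  # hypothetical example

mem_after = process.memory_info().rss / (1024 * 1024)
print(f"RAM used to load the dataset: {mem_after - mem_before:.1f} MB")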

Steven added 9 commits July 16, 2021 08:44
add tutorial section
remove old files
add most of the completed how-to guides
add distributed usage guides
add datasets + arrow concept guide
@lhoestq (Member) left a comment:

Awesome job, thank you!! I love it :)

I added many comments, but most of them are minor.

Could you also add sphinx-panels to the extensions to install for the documentation?

datasets/setup.py

Lines 186 to 196 in 5528e10

"docs": [
"docutils==0.16.0",
"recommonmark",
"sphinx==3.1.2",
"sphinx-markdown-tables",
"sphinx-rtd-theme==0.4.3",
"sphinxext-opengraph==0.4.1",
"sphinx-copybutton",
"fsspec",
"s3fs",
],

This way the CI will automatically build the documentation every time you push a commit to this PR :) This is useful for looking at the actual rendering of the documentation and browsing it, so that we can give even more feedback ^^
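For reference, the "docs" extras list with sphinx-panels added might look like this (a sketch; the surrounding pins are unchanged):

"docs": [
    "docutils==0.16.0",
    "recommonmark",
    "sphinx==3.1.2",
    "sphinx-markdown-tables",
    "sphinx-rtd-theme==0.4.3",
    "sphinxext-opengraph==0.4.1",
    "sphinx-copybutton",
    "sphinx-panels",  # new: needed to build the restructured docs
    "fsspec",
    "s3fs",
],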

Finally, feel free to add some TODOs when you feel like a section could be added to explain something, so that we can start working on this as well :)

add concept guides for loading/building a dataset

Lastly, there are two specific features for machine translation: :class:`datasets.Translation` and :class:`datasets.TranslationVariableLanguages`.

[I think for the translation features, we should either add some example code like we did for the other features or remove it altogether.]
Member Author comment:

I think maybe we should add sample code for the translation classes and a brief explanation of how to use them. Otherwise we can just remove them and refer the user to the package reference.
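For example, something along these lines (the sentences and language codes are made up for illustration):

from datasets import Dataset, Features, Translation, TranslationVariableLanguages

# Translation: every example covers the same fixed set of languages
features = Features({"translation": Translation(languages=["en", "fr", "de"])})
data = {"translation": [{"en": "the cat", "fr": "le chat", "de": "die Katze"}]}
dataset = Dataset.from_dict(data, features=features)

# TranslationVariableLanguages: the languages (and the number of translations
# per language) may vary from example to example
features = Features({"translation": TranslationVariableLanguages(languages=["en", "fr"])})
data = {"translation": [{"en": "the cat", "fr": ["le chat", "la chatte"]}]}
dataset = Dataset.from_dict(data, features=features)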


This method allows Datasets to perform distributed predictions, which is important for evaluation speed in distributed settings. At the same time, you can also use complex non-additive metrics without wasting valuable GPU or CPU memory.

TODO: More explanation of how the file locks perform the synchronization, or remove this part.
Member Author comment:

TODO: Add some more explanation about how the file lock synchronization works.
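In the meantime, here is a minimal sketch of the general idea, assuming the filelock package (the actual synchronization in 🤗 Datasets is more involved, and the file names are placeholders):

from filelock import FileLock

cache_file = "metric_cache_rank0.arrow"  # hypothetical per-process cache file

# Every process acquires a lock on the shared lock file before writing,
# so predictions/references are flushed to the cache one process at a time.
with FileLock(cache_file + ".lock"):
    ...  # write this process's predictions/references to the cache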

@lhoestq (Member) commented Sep 8, 2021

I just separated the Share How-to page into three pages: share, dataset_script and dataset_card.

This way, in the share page we can explain in more detail how to share a community or a canonical dataset, focusing on their differences and the steps to upload them.

Also given that making a dataset script or a dataset card both require several steps, I feel like it's better to have dedicated pages for them.

Let me know what you think @stevhliu and others. We can still revert this change if you feel like it was better with everything in the same place.

@stevhliu (Member Author) commented Sep 9, 2021

I just added some minor changes to match the style, fix typos, etc. Great work on the conceptual guides, I learned a lot from them and I'm sure they will help a lot of other people too!

I am fine with splitting Share into three separate pages. I think this probably makes it easier for users to navigate, instead of having to scroll up and down on a really long single page.

@lhoestq (Member) left a comment:

Thanks for the corrections :)

It looks all good to me!

@lhoestq changed the title from "Docs structure" to "New documentation structure" on Sep 9, 2021
@lewtun (Member) left a comment:

Thank you so much @stevhliu for the great improvements you made to the documentation - I especially like the care you took to explain some pretty hairy concepts like caching 🥳

I left mostly small nits (feel free to ignore them if you disagree) and a few questions


## What is Arrow?

[Arrow](https://arrow.apache.org/) enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:
Member comment:

This is a super nice summary of Apache Arrow 😍
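For readers who want to see the columnar idea in action, a tiny illustrative pyarrow snippet (not part of this PR, just a sketch):

import pyarrow as pa

# Each column is stored contiguously in memory, so reading a single
# column does not require scanning entire rows.
table = pa.table({"text": ["hello", "world"], "label": [0, 1]})
print(table.column("label"))  # zero-copy access to one column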

@albertvillanova (Member) left a comment:

Again, this is awesome work! Thanks @stevhliu. Our new docs will definitely make using datasets much easier!

Some questions, comments and nits below...

@albertvillanova (Member) left a comment:

NIT: the official name is "GitHub", spelled with an uppercase "H".

Load
====

You have already seen how to load a dataset from the Hugging Face Hub. But datasets are stored in a variety of places, and sometimes you won't find the one you want on the Hub. A dataset can be on your local machine, in a Github repository, and in data structures like Python dictionaries and Pandas DataFrames. Wherever your dataset may be stored, 🤗 Datasets provides a way for you to load and use it for training.
Member comment:

Suggested change
You have already seen how to load a dataset from the Hugging Face Hub. But datasets are stored in a variety of places, and sometimes you won't find the one you want on the Hub. A dataset can be on your local machine, in a Github repository, and in data structures like Python dictionaries and Pandas DataFrames. Wherever your dataset may be stored, 🤗 Datasets provides a way for you to load and use it for training.
You have already seen how to load a dataset from the Hugging Face Hub. But datasets are stored in a variety of places, and sometimes you won't find the one you want on the Hub. A dataset can be on your local machine, in a GitHub repository, and in data structures like Python dictionaries and Pandas DataFrames. Wherever your dataset may be stored, 🤗 Datasets provides a way for you to load and use it for training.

Member comment:

  • A dataset can be on your local machine ==> A dataset can be on disk on your local machine
  • and in data structures like Python dictionaries and Pandas DataFrames ==>
    • or in your local machine RAM in data structures like Python dictionaries and Pandas DataFrames
      OR:
    • or in in-memory data structures like Python dictionaries and Pandas DataFrames
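For reference, the loading paths this paragraph mentions could be illustrated along these lines (dataset names and file paths are placeholders):

import pandas as pd
from datasets import Dataset, load_dataset

# from the Hugging Face Hub
dataset = load_dataset("squad", split="train")

# from files on disk on your local machine
dataset = load_dataset("csv", data_files="my_file.csv")

# from in-memory data structures
dataset = Dataset.from_dict({"text": ["hello", "world"]})
dataset = Dataset.from_pandas(pd.DataFrame({"text": ["hello", "world"]}))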

@albertvillanova (Member) left a comment:

Some additional comments, suggestions, and questions.

Comment on lines 13 to 15
* How to compute metrics.

* How to upload and share a dataset.
Member comment:

To keep the logical order and the same order as in the navigation.

Suggested change
* How to compute metrics.
* How to upload and share a dataset.
* How to upload and share a dataset.
* How to compute metrics.

Member comment:

Maybe add a mention of the new subsections?

  • how to create a dataset loading script,
  • how to create a dataset card

Member comment:

I would also suggest renaming the new subsections by adding a verb to their titles. Note that they all belong to the How-to section: how to load a dataset, how to process a dataset, how to stream a dataset, ...

  • Dataset script ==> Create dataset loading script
  • Dataset card ==> Create dataset card

self.citation = citation
self.url = url
2. Sub-class the base :class:`datasets.BuilderConfig` to add additional attributes of a configuration. This gives you more flexibility to specify the name and description of each configuration. These sub-classes should be listed under :obj:`datasets.DatasetBuilder.BUILDER_CONFIGS`:
Member comment:

This is redundant with the previous item. There we already had to subclass BuilderConfig in order to add additional attributes, like features, label_classes, and citation. Note that the only attributes of the base BuilderConfig are: name, version, data_dir, data_files, and description.

Member comment:

Suggested change
2. Sub-class the base :class:`datasets.BuilderConfig` to add additional attributes of a configuration. This gives you more flexibility to specify the name and description of each configuration. These sub-classes should be listed under :obj:`datasets.DatasetBuilder.BUILDER_CONFIGS`:
2. Sub-class the base :class:`datasets.BuilderConfig` to add additional attributes of a configuration. This gives you more flexibility to specify the name and description of each configuration. These sub-class instances should be listed under :obj:`datasets.DatasetBuilder.BUILDER_CONFIGS`:

Member comment:

I think this point is about creating the config instances and listing them in BUILDER_CONFIGS.

Let me try to rephrase it
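For context, the pattern under discussion might be sketched like this (class names and the extra attributes are illustrative):

import datasets

class MyConfig(datasets.BuilderConfig):
    # Sub-class of BuilderConfig that adds two extra attributes
    def __init__(self, citation=None, url=None, **kwargs):
        super().__init__(**kwargs)
        self.citation = citation
        self.url = url

class MyDataset(datasets.GeneratorBasedBuilder):
    # The config *instances* (not the sub-class itself) are listed here
    BUILDER_CONFIGS = [
        MyConfig(name="first", version=datasets.Version("1.0.0"), url="https://example.com"),
        MyConfig(name="second", version=datasets.Version("1.0.0"), url="https://example.com"),
    ]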

lhoestq and others added 2 commits September 13, 2021 16:02
Co-authored-by: Albert Villanova del Moral <[email protected]>
Co-authored-by: Albert Villanova del Moral <[email protected]>
@lhoestq (Member) commented Sep 13, 2021

Thanks a lot for all the suggestions! I'm doing the final changes based on the remaining comments, then we can merge and release v1.12 of datasets and the new documentation ^^

@lhoestq (Member) commented Sep 13, 2021

Alright, I think I took all the suggestions and comments into account :)
Thanks everyone for the help!
