New documentation structure #2718
Conversation
add tutorial section
remove old files
add most of the completed how-to guides
add distributed usage guides
add datasets + arrow concept guide
lhoestq left a comment
Awesome job, thank you!! I love it :)
I added many comments, but most of them are minor.
Could you also add sphinx-panels to the extensions to install for the documentation?
Lines 186 to 196 in 5528e10:

```python
"docs": [
    "docutils==0.16.0",
    "recommonmark",
    "sphinx==3.1.2",
    "sphinx-markdown-tables",
    "sphinx-rtd-theme==0.4.3",
    "sphinxext-opengraph==0.4.1",
    "sphinx-copybutton",
    "fsspec",
    "s3fs",
],
```
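For reference, a minimal sketch of what that change could look like; the PyPI package is `sphinx-panels` and the Sphinx extension module is `sphinx_panels`, but the exact placement below is an assumption:

```python
# setup.py (sketch): add sphinx-panels to the "docs" extra
"docs": [
    "docutils==0.16.0",
    "recommonmark",
    "sphinx==3.1.2",
    "sphinx-markdown-tables",
    "sphinx-rtd-theme==0.4.3",
    "sphinxext-opengraph==0.4.1",
    "sphinx-copybutton",
    "sphinx-panels",  # new entry
    "fsspec",
    "s3fs",
],
```

```python
# docs/source/conf.py (sketch): enable the extension
extensions = [
    # ...existing extensions...
    "sphinx_panels",  # note the underscore in the module name
]
```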
This way the CI will automatically build the documentation every time you push a commit to this PR :) This is useful for looking at the actual rendering of the documentation and browsing it, so that we can give even more feedback ^^
Finally, feel free to add some TODOs where you feel a section could be added to explain something, so that we can start working on this as well :)
add concept guides for loading/building a dataset
> Lastly, there are two specific features for machine translation: :class:`datasets.Translation` and :class:`datasets.TranslationVariableLanguages`.
>
> [I think for the translation features, we should either add some example code like we did for the other features or remove it altogether.]
I think maybe we should add sample code for the translation classes and a brief explanation of how to use them. Otherwise, we can just remove it and refer the user to the package reference.
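For what it's worth, a minimal sketch of what such sample code could look like, using the public :class:`datasets.Translation` and :class:`datasets.TranslationVariableLanguages` features; the dataset content is made up for illustration:

```python
from datasets import Dataset, Features, Translation, TranslationVariableLanguages

# Fixed set of languages per example: each value is a dict mapping
# a language code to the translated text.
features = Features({"translation": Translation(languages=["en", "fr"])})
ds = Dataset.from_dict(
    {"translation": [{"en": "the cat", "fr": "le chat"}]},
    features=features,
)

# Variable set of languages per example: not every example needs to
# cover every language; encoding flattens to sorted parallel lists.
tvl = TranslationVariableLanguages(languages=["en", "fr", "de"])
print(tvl.encode_example({"en": "the cat", "de": "die katze"}))
# {'language': ['de', 'en'], 'translation': ['die katze', 'the cat']}
```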
docs/source/about_metrics.rst (outdated)
> This method allows Datasets to perform distributed predictions, which is important for evaluation speed in distributed settings. At the same time, you can also use complex non-additive metrics without wasting valuable GPU or CPU memory.
>
> TO DO: More explanation on how the file locks perform the synchronization, or remove this part.
TODO: Add some more explanation about how the file lock synchronization works.
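For context, 🤗 Datasets ships a file-lock utility for this coordination; roughly, each process writes its predictions to its own cache file, and locks guard against concurrent access. The sketch below uses the standalone `filelock` package to illustrate the same primitive (file paths are made up):

```python
from filelock import FileLock, Timeout

# Acquire an exclusive lock before touching the shared cache file.
lock = FileLock("/tmp/metric_cache.arrow.lock", timeout=10)
try:
    with lock:
        # Critical section: only one process can be here at a time,
        # so concurrent writes to the cache file cannot collide.
        with open("/tmp/metric_cache.arrow", "ab") as f:
            f.write(b"serialized predictions batch\n")
except Timeout:
    print("Another process is holding the lock; try again later.")
```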
add myst-parser to setup.py
fix commas
add how-to for parquet, fix sphinx-panels and minor feedback changes
I just separated the share page content. This way, in the share page, we can explain in more detail how to share a community or a canonical dataset, focusing on their differences and the steps to upload them. Also, given that making a dataset script or a dataset card both require several steps, I feel it's better to have dedicated pages for them. Let me know what you think @stevhliu and others. We can still revert this change if you feel it was better with everything in the same place.

I just added some minor changes to match the style, fix typos, etc. Great work on the conceptual guides, I learned a lot from them and I'm sure they will help a lot of other people too! I am fine with splitting the share page.
lhoestq left a comment
Thanks for the corrections :)
It looks all good to me!
lewtun left a comment
Thank you so much @stevhliu for the great improvements you made to the documentation - I especially like the care you took to explain some pretty hairy concepts like caching 🥳
I left mostly small nits (feel free to ignore them if you disagree) and a few questions
> ## What is Arrow?
>
> [Arrow](https://arrow.apache.org/) enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:
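To make the columnar layout concrete, here is a tiny illustrative sketch with `pyarrow`, the library 🤗 Datasets builds on; the data is made up:

```python
import pyarrow as pa

# Build an in-memory Arrow table: values are stored column by column,
# each column in a contiguous buffer, rather than row by row.
table = pa.table({"text": ["the cat", "le chat"], "label": [0, 1]})

# Reading a whole column is cheap: it slices that buffer directly.
print(table.column("label"))   # ChunkedArray of int64 values
print(table.schema)            # text: string, label: int64
```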
This is a super nice summary of Apache Arrow 😍
Co-authored-by: lewtun <[email protected]>
albertvillanova left a comment
Again, this is awesome work! Thanks @stevhliu. Our new docs will definitely make using datasets much easier!
Some questions, comments and nits below...
Co-authored-by: Albert Villanova del Moral <[email protected]>
albertvillanova left a comment
NIT: the official name of GitHub is spelled with an uppercase "H".
docs/source/loading.rst (outdated)
> Load
> ====
>
> You have already seen how to load a dataset from the Hugging Face Hub. But datasets are stored in a variety of places, and sometimes you won't find the one you want on the Hub. A dataset can be on your local machine, in a Github repository, and in data structures like Python dictionaries and Pandas DataFrames. Wherever your dataset may be stored, 🤗 Datasets provides a way for you to load and use it for training.
Suggested change: "in a Github repository" ==> "in a GitHub repository".
- "A dataset can be on your local machine" ==> "A dataset can be on disk on your local machine"
- "and in data structures like Python dictionaries and Pandas DataFrames" ==> "or in your local machine RAM in data structures like Python dictionaries and Pandas DataFrames", OR: "or in in-memory data structures like Python dictionaries and Pandas DataFrames"
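To make the disk vs. in-memory distinction concrete, a short sketch of both loading paths; the file name is hypothetical:

```python
import pandas as pd
from datasets import Dataset, load_dataset

# On disk: load from local files (hypothetical CSV path).
ds_disk = load_dataset("csv", data_files="my_data.csv")

# In memory: load from a Python dict or a pandas DataFrame.
ds_dict = Dataset.from_dict({"text": ["hello", "world"]})
ds_df = Dataset.from_pandas(pd.DataFrame({"text": ["hello", "world"]}))
```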
albertvillanova left a comment
Some additional comments, suggestions and questions.
docs/source/how_to.md (outdated)
> * How to compute metrics.
>
> * How to upload and share a dataset.
To keep the logical order and the same order as in the navigation:

Suggested change (swap the two items):

> * How to upload and share a dataset.
> * How to compute metrics.
Maybe add a mention of the new subsections?
- how to create a dataset loading script
- how to create a dataset card
I would suggest also renaming the new subsections by adding a verb to their titles; note that they all belong to the How-to section: how to load a dataset, how to process a dataset, how to stream a dataset, ...
- Dataset script ==> Create dataset loading script
- Dataset card ==> Create dataset card
docs/source/dataset_script.rst (outdated)
> self.citation = citation
> self.url = url
>
> 2. Sub-class the base :class:`datasets.BuilderConfig` to add additional attributes of a configuration. This gives you more flexibility to specify the name and description of each configuration. These sub-classes should be listed under :obj:`datasets.DatasetBuilder.BUILDER_CONFIGS`:
This is redundant with the previous item. There we already had to subclass BuilderConfig in order to add additional attributes, like features, label_classes and citation. Note that the only attributes of the base BuilderConfig are: name, version, data_dir, data_files and description.
Suggested change: "These sub-classes should be listed under :obj:`datasets.DatasetBuilder.BUILDER_CONFIGS`:" ==> "These sub-class instances should be listed under :obj:`datasets.DatasetBuilder.BUILDER_CONFIGS`:"
I think this point is about creating the config instances and listing them in BUILDER_CONFIGS. Let me try to rephrase it.
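As a sketch of that rephrasing: the subclass adds the extra attributes, and instances of it (not the class itself) go into ``BUILDER_CONFIGS``. Class and attribute names below are illustrative, not from the actual docs:

```python
import datasets

class MyConfig(datasets.BuilderConfig):
    """BuilderConfig with extra attributes beyond the base
    name/version/data_dir/data_files/description."""

    def __init__(self, features=None, citation=None, **kwargs):
        super().__init__(**kwargs)
        self.features = features
        self.citation = citation

class MyDataset(datasets.GeneratorBasedBuilder):
    # Instances of the subclass are listed here, one per configuration.
    # (_info, _split_generators and _generate_examples omitted.)
    BUILDER_CONFIGS = [
        MyConfig(name="plain", description="Plain text configuration"),
        MyConfig(name="annotated", description="Configuration with annotations"),
    ]
```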
Co-authored-by: Albert Villanova del Moral <[email protected]>
Co-authored-by: Albert Villanova del Moral <[email protected]>
Thanks a lot for all the suggestions! I'm doing the final changes based on the remaining comments, then we can merge and release v1.12 of `datasets`.

Alright, I think I took all the suggestions and comments into account :)
Organize Datasets documentation into four documentation types to improve clarity and discoverability of content.
Content to add in the very short term (feel free to add anything I'm missing):