@stevhliu stevhliu commented Jun 16, 2022

This PR creates separate sections in the guides for audio, vision, text, and general usage so it is easier for users to find loading, processing, or sharing guides specific to the dataset type they're working with. It'll also allow us to scale the docs to additional dataset types - like time series, tabular, etc. - while preserving the docs' information architecture.

Some other changes include:

  • Experimented with decorating text with some CSS to highlight guides specific to each modality. Hopefully, it'll be easier for users to find and realize that these different docs exist! Will experiment with this in a different PR.
  • Added deprecation warning for Metrics and redirect to Evaluate.
  • Updated set_format section to recommend using the new to_tf_dataset function if you need to convert to a TensorFlow dataset.
  • Reorganized toctree to nest general usage, audio, vision, and text sections under the how-to guides.
  • Reviewed and edited the Load and Process docs for clarity.

@stevhliu stevhliu added the documentation Improvements or additions to documentation label Jun 16, 2022
@stevhliu stevhliu requested review from lhoestq and mariosasko June 16, 2022 21:38
@lhoestq lhoestq left a comment

Nice, thank you!

We can also add a "Load text data" page; it currently feels odd not to have one ;)

In particular, users can do load_dataset("text", data_dir=...) or load_dataset("text", data_files=...), and they can use glob patterns to select several files.

The "text" loader has a few parameters. The main one is sample_by. By default sample_by="line", so one example = one line from the text files, but you can change it to "paragraph" or "document".
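
To make the three sample_by modes concrete, here is a plain-Python sketch of how the contents of one text file would be divided under each mode. The helper split_examples is hypothetical and is not the library's implementation; it assumes paragraphs are separated by a blank line, and it keeps blank lines as empty examples in "line" mode.

```python
from typing import List


def split_examples(text: str, sample_by: str = "line") -> List[str]:
    """Illustrative only: cut the contents of one text file into examples."""
    if sample_by == "line":
        # One example per line; blank lines stay as empty examples in this sketch.
        return text.splitlines()
    if sample_by == "paragraph":
        # Assumption: paragraphs are separated by a blank line.
        return [p.strip() for p in text.split("\n\n") if p.strip()]
    if sample_by == "document":
        # The whole file becomes a single example.
        return [text]
    raise ValueError(f"unknown sample_by: {sample_by!r}")


sample = "first line\nsecond line\n\nnew paragraph"
by_line = split_examples(sample, "line")
by_paragraph = split_examples(sample, "paragraph")
by_document = split_examples(sample, "document")
```

With the library itself, the equivalent choice would be made by passing sample_by to load_dataset("text", data_files=...), as described above.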


A bit unrelated, but I feel like those pages can also be grouped together in a "Dataset repository" section:

  • Share
  • Create a dataset loading script
  • Create a dataset card
  • Structure your repository

This way we can decouple "General usage" (how to use datasets) from "Dataset repositories" (how to create a repository).

@stevhliu stevhliu marked this pull request as ready for review June 24, 2022 21:58
@stevhliu stevhliu commented

Ready for review!

The toctree is a bit longer now with the sections. I think if we keep the audio/vision/text/dataset repository sections collapsed by default, and keep the general usage expanded, it may look a little cleaner and not as overwhelming. Let me know what you think! 😄

@mariosasko mariosasko left a comment

Cool!

Just one nit.

@lhoestq lhoestq left a comment

Thank you! I think it's OK to leave the new sections uncollapsed though, whichever you prefer.

@stevhliu stevhliu merged commit 28946e2 into huggingface:main Jul 7, 2022
@stevhliu stevhliu deleted the reorg-structure branch July 7, 2022 15:25