
Conversation

@stevhliu
Member

This PR is a first draft of how to create audio datasets (AudioFolder and loading script). Feel free to let me know if there are any specifics I'm missing. 🙂

@stevhliu stevhliu added the documentation Improvements or additions to documentation label Aug 23, 2022
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Aug 23, 2022

The documentation is not available anymore as the PR was closed or merged.

@lhoestq
Member

lhoestq commented Aug 23, 2022

Awesome, thanks! I think we can also encourage TAR archives, as we do for image dataset scripts (feel free to copy-paste some parts from there lol)

Contributor

@polinaeterna polinaeterna left a comment


@stevhliu Thank you so much!! ❤️❤️❤️ I'm so excited that we will finally have good documentation about audio dataset creation!

I've left some comments and suggestions; feel free to reword the suggestions, I tried to explain what I meant in general in each case. :)

Also, I think vivos would indeed be a good additional example to show how to implement streaming of TAR files, because without an example, the sentences about TARs might not be clear to users doing this for the first time.

And I really think it would be great to have a section somewhere about how to implement streaming compatibility, for those who don't know what it is and why we have to use different functions for different types of archives (I've written a detailed comment about that).


### Generate the dataset metadata (optional)

The dataset metadata you added earlier now needs to be generated and stored in a file called `dataset_infos.json`. In addition to information about a dataset's features and description, this file also contains the data file checksums used to verify integrity.
Member


FYI in #4926 I'm changing datasets-cli to output the dataset_infos in the YAML metadata: we're on the way to deprecating dataset_infos.json. The metadata includes the number of examples and bytes of the dataset per split, as well as the type of each column.

The integrity verification using file checksums can't be done with the YAML metadata alone, but integrity is still checked by verifying the number of generated examples.

You can keep this PR as is, and we can update when #4926 is merged
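For readers of this thread: the excerpt above says the metadata "needs to be generated" but doesn't show how. As a sketch of the command from the `datasets` docs of this era (the dataset path is a placeholder, and note that once #4926 lands this will write YAML metadata instead of `dataset_infos.json`):

```shell
# Generate dataset_infos.json for a local loading script.
# "path/to/my_dataset" is a placeholder for the folder containing your script.
datasets-cli test path/to/my_dataset --save_infos --all_configs
```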


Now that you've added some information about your dataset, the next step is to download the dataset and define the splits.

1. Use the [`~DownloadManager.download`] method to download and extract the given URLs. The URLs are replaced with a path to the local files. This method accepts:
Contributor


Suggested change
1. Use the [`~DownloadManager.download`] method to download and extract the given URLs. The URLs are replaced with a path to the local files. This method accepts:
1. Use the [`~DownloadManager.download`] method to download the given URLs. The URLs are replaced with a path to the local files. This method accepts:
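To give a flavor of the step under discussion, here is a self-contained sketch of how `_split_generators` hands downloaded archive paths along. `FakeDownloadManager` is a stand-in invented for illustration (the real `datasets.DownloadManager` is passed to you by the library), and the URLs and cache paths are hypothetical:

```python
# Sketch of the _split_generators flow: download() maps URLs to local paths,
# and those paths are forwarded to _generate_examples via gen_kwargs.
_URLS = {
    "train": "https://example.org/audio_train.tgz",  # hypothetical URLs
    "test": "https://example.org/audio_test.tgz",
}

class FakeDownloadManager:
    """Stand-in for datasets.DownloadManager, for illustration only."""
    def download(self, urls):
        # The real method downloads each URL and returns a local cache path;
        # here we just map every URL to a pretend path.
        return {split: f"/cache/{split}.tgz" for split in urls}

def split_generators(dl_manager):
    archive_paths = dl_manager.download(_URLS)
    # In a real script these dicts would be datasets.SplitGenerator objects.
    return [
        {"name": split, "gen_kwargs": {"archive_path": path}}
        for split, path in archive_paths.items()
    ]

splits = split_generators(FakeDownloadManager())
print(splits[0]["gen_kwargs"]["archive_path"])  # /cache/train.tgz
```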

Contributor


first of all, vivos is not available for now :(

secondly, I've been thinking about how to explain things in an easy but still correct way and ended up realizing that it's actually not possible! :D so I have the following suggestions:

  1. use https://huggingface.co/datasets/indonesian-nlp/librivox-indonesia/blob/main/librivox-indonesia.py as an example for now. it's clean, small in size, and has both streaming support and local file paths. what do you think? cc @lhoestq
  2. explain in detail, like, really in detail, what's going on in _split_generators and _generate_examples. I've written a dirty draft of how it can be done here.
    audio dataset authors who took the risk of writing their own dataset script should be advanced users, so I believe it's worth explaining everything just as it is, straight away!

let me know what you think! and if you agree, @stevhliu, feel free to integrate my notes linked in this comment, rewording them as you see fit, and don't hesitate to ping me on Slack if you need a quick check or anything; I'll be happy to help! ❤️

Contributor


another suggestion from @lhoestq !
we first describe the vivos-like approach (without extraction), and then add a section which explains how to also have your dataset extracted locally and preserve full paths to the local files (with the example I suggested). but that way we need to fix vivos first (we are waiting for their response)

Member Author

@stevhliu stevhliu Sep 9, 2022


I went with suggestion 1 to replace vivos with librivox-indonesia so we can have some nice examples right away without waiting for vivos to be fixed.

Thanks again for your super detailed notes! I think I've integrated all of them, but please let me know what you think and what can be improved for even more clarity 🥰

Contributor


Hi @stevhliu, the documentation looks very good, and thanks for using my dataset as an example, although I just created it a few days ago :-) And since it is new, I might still need to update a few small things, such as the STATS in release_stats.py. I think this will not affect the documentation you are writing. However, I also updated how the compressed TAR file is stored: I split it into audio_train.tgz and audio_test.tgz so the test data can be fetched directly when loading the test split in streaming mode. Otherwise, the loading process would first scan all the train files before it arrives at the test files, which could take some time if the dataset is huge.

@stevhliu
Member Author

stevhliu commented Sep 7, 2022

Thanks for all the great feedback @polinaeterna and @lhoestq! 🥰

I added all the other feedback, and I'll look into the librivox-indonesia script now!

@lhoestq
Member

lhoestq commented Sep 19, 2022

If you don't mind, I'm taking over this PR since we'll do a release pretty soon

@lhoestq lhoestq marked this pull request as ready for review September 19, 2022 16:09
@polinaeterna
Contributor

@lhoestq no, I do :D

@lhoestq
Member

lhoestq commented Sep 20, 2022

haha sorry ^^

Member

@lhoestq lhoestq left a comment


Love it, thanks!

I know it's still a WIP, so feel free to ignore my comments if they're not relevant


<Tip>

The reason you need to use a combination of [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`] is because data in TAR archives can't be accessed directly from their paths. Instead, you'll need to download it first and then sequentially iterate over the files within the archive!
Member


(nit)

Suggested change
The reason you need to use a combination of [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`] is because data in TAR archives can't be accessed directly from their paths. Instead, you'll need to download it first and then sequentially iterate over the files within the archive!
The reason you need to use a combination of [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`] is because data in TAR archives can't be accessed directly from their paths inside the archive. A TAR archive is just a concatenation of files, so you'll need to download it first and then sequentially iterate over the files within the archive to find the one you want!
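The "a TAR archive is just a concatenation of files" point in this suggestion can be demonstrated with Python's stdlib `tarfile` in stream mode, which makes the sequential-only access explicit (the file names and contents here are made up for the demo):

```python
import io
import tarfile

# Build a tiny in-memory TAR with two members.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in [("a.txt", b"first"), ("b.txt", b"second")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Stream mode ("r|") forbids seeking: to reach "b.txt" we must walk
# past "a.txt" first, because the archive is just headers and data
# laid end to end.
visited = []
with tarfile.open(fileobj=buf, mode="r|") as tar:
    for member in tar:
        visited.append(member.name)
        if member.name == "b.txt":
            print(tar.extractfile(member).read())  # b'second'
print(visited)  # ['a.txt', 'b.txt']
```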

- local: audio_process
title: Process audio data
- local: audio_dataset_repo
title: Create an audio dataset repo
Member


Suggested change
title: Create an audio dataset repo
title: Share an audio dataset

The audio_dataset_repo.mdx file can also be renamed to share_audio_dataset.mdx if we use this title

@lhoestq lhoestq changed the title [WIP] Docs for creating an audio dataset Docs for creating an audio dataset Sep 21, 2022
Member

@lhoestq lhoestq left a comment


Thanks @polinaeterna and @stevhliu :)

I'm merging this one to include it in the release

@lhoestq lhoestq merged commit 733e499 into huggingface:main Sep 21, 2022
@stevhliu stevhliu deleted the create-audio-datasets branch September 22, 2022 17:19