
Conversation

@stevhliu
Member

This PR is a first draft of how to create audio datasets (AudioFolder and loading script). Feel free to let me know if there are any specifics I'm missing. 🙂

@stevhliu stevhliu added the documentation Improvements or additions to documentation label Aug 23, 2022
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Aug 23, 2022

The documentation is not available anymore as the PR was closed or merged.

@lhoestq
Member

lhoestq commented Aug 23, 2022

Awesome, thanks! I think we can also encourage TAR archives, as we do for image dataset scripts (feel free to copy-paste some parts from there lol)

Contributor

@polinaeterna polinaeterna left a comment


@stevhliu Thank you so much!! ❤️❤️❤️ I'm so excited that we will finally have good documentation about audio dataset creation!

I've left some comments and suggestions; feel free to reword the suggestions, I tried to explain what I meant in general in each case. :)

Also, I think vivos would indeed be a good additional example to show how to implement streaming of TAR files, because without an example, the sentences about TARs might not be clear to users doing this for the first time.

And I really think it would be great to have a section somewhere about how to implement streaming compatibility, for those who don't know what it is and why we have to use different functions for different types of archives (I've written a detailed comment about that).


### Generate the dataset metadata (optional)

The dataset metadata you added earlier now needs to be generated and stored in a file called `dataset_infos.json`. In addition to information about a dataset's features and description, this file also contains the data file checksums used to verify integrity.
Member


FYI in #4926 I'm changing datasets-cli to output the dataset_infos in the YAML metadata: we're on the way to deprecating dataset_infos.json. The metadata includes the number of examples and bytes of the dataset per split, as well as the type of each column.

The integrity verification using file checksums can't be done with the YAML metadata alone, but integrity is still checked by verifying the number of generated examples.

You can keep this PR as is, and we can update when #4926 is merged
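For readers of this thread: the excerpt above says the metadata "needs to be generated" but doesn't show how. As a sketch of the command from the `datasets` docs of this era (the dataset path is a placeholder, and note that once #4926 lands this will write YAML metadata instead of `dataset_infos.json`):

```shell
# Generate dataset_infos.json for a local loading script.
# "path/to/my_dataset" is a placeholder for the folder containing your script.
datasets-cli test path/to/my_dataset --save_infos --all_configs
```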


Now that you've added some information about your dataset, the next step is to download the dataset and define the splits.

1. Use the [`~DownloadManager.download`] method to download and extract the given URLs. The URLs are replaced with a path to the local files. This method accepts:
Contributor


Suggested change
1. Use the [`~DownloadManager.download`] method to download and extract the given URLs. The URLs are replaced with a path to the local files. This method accepts:
1. Use the [`~DownloadManager.download`] method to download the given URLs. The URLs are replaced with a path to the local files. This method accepts:
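To give a flavor of the step under discussion, here is a self-contained sketch of how `_split_generators` hands downloaded archive paths along. `FakeDownloadManager` is a stand-in invented for illustration (the real `datasets.DownloadManager` is passed to you by the library), and the URLs and cache paths are hypothetical:

```python
# Sketch of the _split_generators flow: download() maps URLs to local paths,
# and those paths are forwarded to _generate_examples via gen_kwargs.
_URLS = {
    "train": "https://example.org/audio_train.tgz",  # hypothetical URLs
    "test": "https://example.org/audio_test.tgz",
}

class FakeDownloadManager:
    """Stand-in for datasets.DownloadManager, for illustration only."""
    def download(self, urls):
        # The real method downloads each URL and returns a local cache path;
        # here we just map every URL to a pretend path.
        return {split: f"/cache/{split}.tgz" for split in urls}

def split_generators(dl_manager):
    archive_paths = dl_manager.download(_URLS)
    # In a real script these dicts would be datasets.SplitGenerator objects.
    return [
        {"name": split, "gen_kwargs": {"archive_path": path}}
        for split, path in archive_paths.items()
    ]

splits = split_generators(FakeDownloadManager())
print(splits[0]["gen_kwargs"]["archive_path"])  # /cache/train.tgz
```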

Contributor


first of all, vivos is not available for now :(

secondly, I've been thinking about how to explain things in an easy but still correct way and ended up realizing that it's actually not possible! :D so I have the following suggestions:

  1. use https://huggingface.co/datasets/indonesian-nlp/librivox-indonesia/blob/main/librivox-indonesia.py as an example for now. it's clean, small in size, and has both streaming support and local file paths. what do you think? cc @lhoestq
  2. explain in detail, like, really in detail, what's going on in _split_generators and _generate_examples. I've written a dirty draft of how it can be done here.
    audio dataset authors who took the risk of writing their own dataset script should be advanced users, so I believe it's worth explaining everything just as it is, straight away!

let me know what you think! and if you agree, @stevhliu, feel free to integrate my notes linked in this comment, rewording them as you see fit, and don't hesitate to ping me on Slack if you need a quick check or anything; I'll be happy to help! ❤️

Contributor


another suggestion from @lhoestq !
we first describe the vivos-like approach (without extraction), and then add a section which explains how to also have your dataset extracted locally and preserve full paths to the local files (with the example I suggested). but that way we need to fix vivos first (we are waiting for their response)

Member Author

@stevhliu stevhliu Sep 9, 2022


I went with suggestion 1 to replace vivos with librivox-indonesia so we can have some nice examples right away without waiting for vivos to be fixed.

Thanks again for your super detailed notes! I think I've integrated all of them, but please let me know what you think and what can be improved for even more clarity 🥰

Contributor


Hi @stevhliu, the documentation looks very good, and thanks for using my dataset as an example, although I just created it a few days ago :-) And since it is new, I might still need to update a few small things, such as the STATS in release_stats.py. I think this will not affect the documentation you are writing. However, I also updated how the compressed TAR file is stored: I split it into audio_train.tgz and audio_test.tgz so the test data can be fetched directly when loading the test split in streaming mode. Otherwise, the loading process would first scan all the train files before it arrives at the test files, which could take some time if the dataset is huge.

@stevhliu
Member Author

stevhliu commented Sep 7, 2022

Thanks for all the great feedback @polinaeterna and @lhoestq! 🥰

I added all the other feedback, and I'll look into the librivox-indonesia script now!

@lhoestq
Member

lhoestq commented Sep 19, 2022

If you don't mind, I'm taking over this PR since we'll do a release pretty soon

@lhoestq lhoestq marked this pull request as ready for review September 19, 2022 16:09
@polinaeterna
Contributor

@lhoestq no, I do :D

@lhoestq
Member

lhoestq commented Sep 20, 2022

haha sorry ^^

Member

@lhoestq lhoestq left a comment


Love it, thanks!

I know it's still a WIP, so feel free to ignore my comments if they're not relevant


<Tip>

The reason you need to use a combination of [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`] is because data in TAR archives can't be accessed directly from their paths. Instead, you'll need to download it first and then sequentially iterate over the files within the archive!
Member


(nit)

Suggested change
The reason you need to use a combination of [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`] is because data in TAR archives can't be accessed directly from their paths. Instead, you'll need to download it first and then sequentially iterate over the files within the archive!
The reason you need to use a combination of [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`] is because data in TAR archives can't be accessed directly from their paths inside the archive. A TAR archive is just a concatenation of files, so you'll need to download it first and then sequentially iterate over the files within the archive to find the one you want!
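The "a TAR archive is just a concatenation of files" point in this suggestion can be demonstrated with Python's stdlib `tarfile` in stream mode, which makes the sequential-only access explicit (the file names and contents here are made up for the demo):

```python
import io
import tarfile

# Build a tiny in-memory TAR with two members.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in [("a.txt", b"first"), ("b.txt", b"second")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Stream mode ("r|") forbids seeking: to reach "b.txt" we must walk
# past "a.txt" first, because the archive is just headers and data
# laid end to end.
visited = []
with tarfile.open(fileobj=buf, mode="r|") as tar:
    for member in tar:
        visited.append(member.name)
        if member.name == "b.txt":
            print(tar.extractfile(member).read())  # b'second'
print(visited)  # ['a.txt', 'b.txt']
```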

- local: audio_process
title: Process audio data
- local: audio_dataset_repo
title: Create an audio dataset repo
Member


Suggested change
title: Create an audio dataset repo
title: Share an audio dataset

The audio_dataset_repo.mdx file can also be renamed to share_audio_dataset.mdx if we use this title

@lhoestq lhoestq changed the title [WIP] Docs for creating an audio dataset Docs for creating an audio dataset Sep 21, 2022
Member

@lhoestq lhoestq left a comment


Thanks @polinaeterna and @stevhliu :)

I'm merging this one to include it in the release

@lhoestq lhoestq merged commit 733e499 into huggingface:main Sep 21, 2022
@stevhliu stevhliu deleted the create-audio-datasets branch September 22, 2022 17:19