Docs for creating an audio dataset #4872
Conversation
The documentation is not available anymore as the PR was closed or merged.
Awesome, thanks! I think we can also encourage TAR archives, as in the image dataset scripts (feel free to copy-paste some parts from there lol)
@stevhliu Thank you so much!! ❤️❤️❤️ I'm so excited that we will finally have good documentation about audio dataset creation!
I've left some comments and suggestions. Feel free to reformulate the suggestions; I tried to explain what I meant in general in each case. :)
Also, I think vivos would indeed be a good additional example to show how to implement streaming of TAR files, because without an example, sentences about TARs might not be clear to users doing it for the first time.
And I really think it would be great to have a section somewhere about how to implement streaming compatibility, for those who don't know what it is and why we have to use different functions for different types of archive (I've written a detailed comment about that).
docs/source/audio_dataset_script.mdx (Outdated)

> ### Generate the dataset metadata (optional)
>
> The dataset metadata you added earlier now needs to be generated and stored in a file called `dataset_infos.json`. In addition to information about a dataset's features and description, this file also contains data file checksums to ensure integrity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, in #4926 I'm changing `datasets-cli` to output the dataset infos in the YAML metadata: we're on the way to deprecating `dataset_infos.json`. The metadata includes the number of examples and bytes of the dataset per split, as well as the type of each column.
The integrity verification using file checksums can't be done using the YAML metadata alone, but integrity is still checked by verifying the number of generated examples.
You can keep this PR as is, and we can update it when #4926 is merged.
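For readers who haven't seen the YAML metadata mentioned here, a rough sketch of what a `dataset_info` block in a dataset card header could look like for an audio dataset (all field values below are placeholders, not taken from any real dataset):

```yaml
dataset_info:
  features:
    - name: audio
      dtype: audio
    - name: transcription
      dtype: string
  splits:
    - name: train
      num_bytes: 123456789   # placeholder
      num_examples: 1000     # placeholder
```

Unlike `dataset_infos.json`, this lives in the README front matter, which is why the checksum-based verification doesn't carry over directly.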
docs/source/audio_dataset_script.mdx (Outdated)

> Now that you've added some information about your dataset, the next step is to download the dataset and define the splits.
>
> 1. Use the [`~DownloadManager.download`] method to download and extract the given URLs. The URLs are replaced with a path to the local files. This method accepts:
Suggested change (remove "and extract"):

> 1. Use the [`~DownloadManager.download`] method to download the given URLs. The URLs are replaced with a path to the local files. This method accepts:
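To make the distinction concrete, here's a minimal runnable sketch of the `_split_generators` download pattern being discussed. The URL template is hypothetical, and a stub class stands in for the real `datasets.DownloadManager` so the snippet runs without network access; the real `download` also accepts a single URL, a list, or a nested dict in the same way:

```python
# Hypothetical archive URL template (placeholder, not a real dataset)
_AUDIO_URL = "https://example.org/data/audio_{split}.tgz"

class StubDownloadManager:
    """Mimics DownloadManager.download: maps each URL to a local cache path."""
    def download(self, url_or_urls):
        # Recurse into dicts, like the real download manager does
        if isinstance(url_or_urls, dict):
            return {k: self.download(v) for k, v in url_or_urls.items()}
        return "/cache/" + url_or_urls.rsplit("/", 1)[-1]

def split_generators(dl_manager):
    # The returned structure mirrors the input: URLs are replaced by paths.
    # In a real script, these paths would feed SplitGenerator(gen_kwargs=...).
    return dl_manager.download(
        {split: _AUDIO_URL.format(split=split) for split in ("train", "test")}
    )

paths = split_generators(StubDownloadManager())
print(paths["train"])  # /cache/audio_train.tgz
```

Note that `download` alone does not extract anything, which is exactly why the suggestion drops "and extract" from the sentence.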
First of all, vivos is not available for now :(
Secondly, I've been thinking about how to explain things in an easy but still correct way and ended up realizing that it's actually not possible! :D So I have the following suggestions:
- use https://huggingface.co/datasets/indonesian-nlp/librivox-indonesia/blob/main/librivox-indonesia.py as an example for now. It's clean and small in size and has both streaming support and local file paths. What do you think? cc @lhoestq
- explain in detail, like, really in detail, what's going on in `_split_generators` and `_generate_examples`. I've written a dirty draft of how it can be done here.

Audio dataset authors who took the risk of writing their own dataset script should be advanced users, so I believe it's worth explaining everything just as it is, straight away!
Let me know what you think! And if you agree, @stevhliu, feel free to integrate my notes linked in this comment, rewording them as you see fit, and don't hesitate to ping me on Slack if you need a quick check or smth, I'll be happy to help! ❤️
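For a sense of what a detailed `_generate_examples` explanation might cover, here is a self-contained sketch of the pattern for a TAR-based audio dataset. The archive is built in memory with the standard library so the snippet runs anywhere; in a real loading script, `dl_manager.iter_archive(archive_path)` yields the same kind of `(name, file-object)` pairs, and all file names and transcriptions below are made up for illustration:

```python
import io
import tarfile

def make_archive():
    # Build a tiny in-memory TAR standing in for a downloaded audio archive
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [("clips/a.wav", b"RIFF..."), ("clips/b.wav", b"RIFF...")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf

def iter_archive(fileobj):
    # Sequential iteration over members, like DownloadManager.iter_archive
    with tarfile.open(fileobj=fileobj) as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()

# Hypothetical metadata pairing each audio path with its transcription
transcriptions = {"clips/a.wav": "hello", "clips/b.wav": "world"}

def generate_examples(archive):
    # Yield (key, example) pairs, matching archive members to metadata by path
    for key, (path, audio_bytes) in enumerate(iter_archive(archive)):
        yield key, {
            "audio": {"path": path, "bytes": audio_bytes},
            "transcription": transcriptions[path],
        }

examples = list(generate_examples(make_archive()))
print(examples[0][1]["transcription"])  # hello
```

The key point worth spelling out in the docs is that the audio bytes are read while streaming past each member, rather than by opening files at arbitrary paths.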
Another suggestion from @lhoestq!
We first describe the vivos-like approach (without extraction), and then add a section which explains how to also extract your dataset locally and preserve full paths to local files (with the example I suggested). But that way we need to fix vivos first (we are waiting for their response).
I went with suggestion 1 to replace vivos with librivox-indonesia so we can have some nice examples right away without waiting for vivos to be fixed.
Thanks again for your super detailed notes! I think I've integrated all of them, but please let me know what you think and what can be improved for even more clarity 🥰
Hi @stevhliu, the documentation looks very good, and thanks for using my dataset as an example, even though I just created it a few days ago :-) Since it is new, I might still need to update a few small things, such as the STATS in `release_stats.py`; I think this will not affect the documentation you are writing. However, I also updated how the compressed tar file is stored: I split the tar file into `audio_train.tgz` and `audio_test.tgz` so the test data can be fetched directly when the test split is loaded in streaming mode. Otherwise, the loading process would first scan all the train files before arriving at the test files, which could take some time if the dataset is huge.
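The design change described here can be sketched in a few lines. The URLs are hypothetical placeholders; the point is just that a per-split archive layout lets a streaming loader touch only the archive it needs, instead of scanning one combined TAR from the start:

```python
# One archive per split (hypothetical URLs, mirroring the layout described above)
_DATA_URLS = {
    "train": "https://example.org/librivox/audio_train.tgz",
    "test": "https://example.org/librivox/audio_test.tgz",
}

def archives_to_stream(split):
    # With a single combined TAR, this would always be the full archive;
    # with per-split archives, only the requested one is downloaded/iterated.
    return [_DATA_URLS[split]]

print(archives_to_stream("test"))  # ['https://example.org/librivox/audio_test.tgz']
```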
Thanks for all the great feedback @polinaeterna and @lhoestq! 🥰 I added all the other feedback, and I'll look into the
If you don't mind, I'm taking over this PR since we'll do a release pretty soon
@lhoestq no, I do :D
haha sorry ^^
…ndonesia as an advanced example
Love it, thanks!
I know it's still WIP, so feel free to ignore my comments if they're not relevant.
> <Tip>
>
> The reason you need to use a combination of [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`] is because data in TAR archives can't be accessed directly from their paths. Instead, you'll need to download it first and then sequentially iterate over the files within the archive!
(nit) Suggested change:

> The reason you need to use a combination of [`~DownloadManager.download`] and [`~DownloadManager.iter_archive`] is because data in TAR archives can't be accessed directly from their paths inside the archive. A TAR archive is just a concatenation of files, so you'll need to download it first and then sequentially iterate over the files within the archive to find the one you want!
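The "concatenation of files" point in the suggested wording can be seen directly with Python's standard library: each TAR member is a 512-byte header followed by its data, so finding a file means scanning headers in order rather than looking up a path. A small self-contained sketch (using the classic USTAR format to keep the byte layout predictable):

```python
import io
import tarfile

# Build a one-member TAR in memory; USTAR_FORMAT guarantees the simple
# header-then-data layout without extended header records
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    data = b"x" * 10
    info = tarfile.TarInfo("a.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

raw = buf.getvalue()
# The first 512 bytes are the header for "a.txt"; the name is stored up front
print(raw[:5])       # b'a.txt'
# The member's data starts right after the 512-byte header block
print(raw[512:522])  # b'xxxxxxxxxx'
```

Since member boundaries are only discoverable by reading each header in turn, random access by path would require scanning anyway, which is why `iter_archive` exposes sequential iteration.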
docs/source/_toctree.yml (Outdated)

> - local: audio_process
>   title: Process audio data
> - local: audio_dataset_repo
>   title: Create an audio dataset repo
Suggested change:

> title: Share an audio dataset

`audio_dataset_repo.mdx` can also be renamed `share_audio_dataset.mdx` if we use this title.
Co-authored-by: Quentin Lhoest <[email protected]>
lhoestq left a comment
Thanks @polinaeterna and @stevhliu :)
I'm merging this one to include it in the release.
This PR is a first draft of how to create audio datasets (`AudioFolder` and loading script). Feel free to let me know if there are any specificities I'm missing for this. 🙂