Commit 733e499

stevhliu, lhoestq, and polinaeterna authored
Docs for creating an audio dataset (#4872)
* 📝 add docs for creating audio dataset
* 🖍 small edits, encourage TAR archives more
* 🖍 apply polina feedbacks
* audiofolder and metadata first
* oops metadata first also in audio load
* replace vivos with librivox indonesia, describe streaming in more detail
* taking over the PR
* check if i can push to other's fork don't look at this
* git back vivos as main example, simplify instructions. add librivox-indonesia as an advanced example
* Apply some suggestions from code review
  Co-authored-by: Quentin Lhoest <[email protected]>
* Update docs/source/audio_dataset_repo.mdx
  Co-authored-by: Quentin Lhoest <[email protected]>
* fix something i don't remember what, integrate changes from #4925
* integrate #4952 to image docs too
* rename audio and image datasets guides consistently (to audio/image_dataset.mdx)
* remove outdated doc
* fix audio guide name
* fix link + minor changes

Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: polinaeterna <[email protected]>
1 parent 1a9385d commit 733e499

6 files changed: +702 -96 lines changed

docs/source/_toctree.yml

Lines changed: 3 additions & 1 deletion
@@ -50,13 +50,15 @@
 title: Load audio data
 - local: audio_process
 title: Process audio data
+- local: audio_dataset
+title: Create an audio dataset
 title: "Audio"
 - sections:
 - local: image_load
 title: Load image data
 - local: image_process
 title: Process image data
-- local: image_dataset_script
+- local: image_dataset
 title: Create an image dataset
 - local: image_classification
 title: Image classification

docs/source/about_dataset_features.mdx

Lines changed: 42 additions & 1 deletion
@@ -56,11 +56,52 @@ See the [flatten](./process#flatten) section to learn how you can extract the ne
 The array feature type is useful for creating arrays of various sizes. You can create arrays with two dimensions using [`Array2D`], and even arrays with five dimensions using [`Array5D`].
 
 ```py
->>> features = Features({'a': Array2D(shape=(1, 3), dtype='int32'))
+>>> features = Features({'a': Array2D(shape=(1, 3), dtype='int32')})
 ```
 
 The array type also allows the first dimension of the array to be dynamic. This is useful for handling sequences with variable lengths such as sentences, without having to pad or truncate the input to a uniform shape.
 
 ```py
 >>> features = Features({'a': Array3D(shape=(None, 5, 2), dtype='int32')})
 ```
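
To illustrate what a dynamic first dimension permits, here is a minimal sketch in plain Python. The helper `matches_shape` and the sample data from `make_sequence` are hypothetical and are not part of the `datasets` library; the sketch only assumes that `None` in a shape means "any length" along that axis, and it exercises `None` in the first position only.

```python
def matches_shape(value, shape):
    """Return True if the nested lists in `value` fit `shape`; None matches any length."""
    if not shape:
        # No dimensions left: the value must be a scalar, not a list.
        return not isinstance(value, list)
    if not isinstance(value, list):
        return False
    expected = shape[0]
    if expected is not None and len(value) != expected:
        return False
    return all(matches_shape(item, shape[1:]) for item in value)


def make_sequence(length):
    """Build a nested list of zeros with shape (length, 5, 2)."""
    return [[[0, 0] for _ in range(5)] for _ in range(length)]


short = make_sequence(3)
long = make_sequence(7)

print(matches_shape(short, (None, 5, 2)))  # True
print(matches_shape(long, (None, 5, 2)))   # True
print(matches_shape(short, (4, 5, 2)))     # False
```

Both sequences fit `(None, 5, 2)` without padding or truncation, while a fixed first dimension rejects any sequence of a different length.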
+
+# The Audio type
+
+Audio datasets have a column with type [`Audio`], which contains three important fields:
+
+* `array`: the decoded audio data represented as a 1-dimensional array.
+* `path`: the path to the downloaded audio file.
+* `sampling_rate`: the sampling rate of the audio data.
+
+When you load an audio dataset and call the audio column, the [`Audio`] feature automatically decodes and resamples the audio file:
+
+```py
+>>> from datasets import load_dataset, Audio
+
+>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
+>>> dataset[0]["audio"]
+{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
+0. , 0. ], dtype=float32),
+'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
+'sampling_rate': 8000}
+```
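
As a small illustration of how the `array` and `sampling_rate` fields relate, the sketch below computes a clip's duration as the number of samples divided by the samples per second. The `example` dictionary is a hypothetical stand-in for a decoded entry, not data from the dataset above, and `duration_seconds` is an illustrative helper rather than a library function.

```python
def duration_seconds(audio):
    """Length of a decoded clip in seconds: number of samples / samples per second."""
    return len(audio["array"]) / audio["sampling_rate"]


# Hypothetical stand-in for a decoded audio entry: two seconds of silence at 8 kHz.
example = {
    "array": [0.0] * 16000,
    "path": "example.wav",  # placeholder path, no file is read
    "sampling_rate": 8000,
}

print(duration_seconds(example))  # 2.0
```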
+
+<Tip warning={true}>
+
+Index into an audio dataset using the row index first and then the `audio` column - `dataset[0]["audio"]` - to avoid decoding and resampling all the audio files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
+
+</Tip>
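
The cost difference described in the tip can be sketched with a toy model. `ToyAudioDataset` below is hypothetical and is not how `datasets` is implemented; it assumes only that row access decodes one file while column access decodes every file, and it counts the decode calls.

```python
class ToyAudioDataset:
    """Toy stand-in that counts how many files get 'decoded' on access."""

    def __init__(self, paths):
        self.paths = paths
        self.decode_calls = 0

    def _decode(self, path):
        # Pretend to decode a file; a real decoder would read and resample audio.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 8000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode just this one row.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: every row in the column is decoded.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


paths = [f"clip_{i}.wav" for i in range(1000)]  # placeholder file names

row_first = ToyAudioDataset(paths)
row_first[0]["audio"]
print(row_first.decode_calls)  # 1

column_first = ToyAudioDataset(paths)
column_first["audio"][0]
print(column_first.decode_calls)  # 1000
```

Both expressions return the first clip, but in this model the column-first form does a thousand times the work.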
+
+With `decode=False`, the [`Audio`] type simply gives you the path or the bytes of the audio file, without decoding it into an `array`,
+
+```py
+>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train").cast_column("audio", Audio(decode=False))
+>>> dataset[0]
+{'audio': {'bytes': None,
+'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav'},
+'english_transcription': 'I would like to set up a joint account with my partner',
+'intent_class': 11,
+'lang_id': 4,
+'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
+'transcription': 'I would like to set up a joint account with my partner'}
+```
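
With decoding turned off, reading the audio is left to the caller. The sketch below shows one possible way to inspect such an undecoded entry using only Python's standard `wave` module. It assumes the entry is a dictionary with `bytes` and `path` keys, as in the output above, and that the file is an uncompressed WAV. The helper `read_wav_info` and the temporary file are illustrative and not part of `datasets`.

```python
import io
import os
import tempfile
import wave


def read_wav_info(audio):
    """Return (sampling_rate, n_frames) for an undecoded {'bytes', 'path'} entry."""
    # Prefer in-memory bytes when present, otherwise open the file at `path`.
    source = io.BytesIO(audio["bytes"]) if audio["bytes"] is not None else audio["path"]
    with wave.open(source, "rb") as wav:
        return wav.getframerate(), wav.getnframes()


# Write a tiny 16-bit mono WAV file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
wav_path = os.path.join(tmp_dir, "example.wav")
with wave.open(wav_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)      # 2 bytes per sample
    wav.setframerate(8000)   # 8 kHz, matching the sampling rate shown earlier
    wav.writeframes(b"\x00\x00" * 4000)  # 0.5 s of silence

entry = {"bytes": None, "path": wav_path}
print(read_wav_info(entry))  # (8000, 4000)
```

The same helper accepts an entry whose `bytes` field is populated and whose `path` is `None`, which covers both forms the text mentions.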
