Merged
Changes from 9 commits
Commits
29 commits
ea9fda8
passes all but 1 test case
TyTodd Jun 12, 2025
7be0dcf
Migrated Audio feature to use torchcodec as a backend. Fixed how form…
TyTodd Jun 13, 2025
c0d3fce
fixed audio and video features so they now pass the test_dataset_with…
TyTodd Jun 13, 2025
12511a3
added load dataset test case to test_video.py
TyTodd Jun 13, 2025
72f3ade
Modified documentation to document new torchcodec implementation of V…
TyTodd Jun 13, 2025
c1843c3
code formatting for torchcodec changes
TyTodd Jun 14, 2025
8b29d61
Merge branch 'main' into torchcodec-decoding
TyTodd Jun 14, 2025
c4a1ac0
Merge branch 'main' into torchcodec-decoding
TyTodd Jun 17, 2025
4dfff64
Merge branch 'main' into torchcodec-decoding
lhoestq Jun 17, 2025
e8b68e5
Update src/datasets/features/audio.py
TyTodd Jun 17, 2025
e9a4a14
added backwards compatibility support and _hf_encoded for Audio feature.
TyTodd Jun 17, 2025
6c0e425
move AudioDecoder to its own file
lhoestq Jun 18, 2025
e74a9ee
naming
lhoestq Jun 18, 2025
28e0173
docs
lhoestq Jun 18, 2025
c50c505
style
lhoestq Jun 18, 2025
806a4ba
update tests
lhoestq Jun 19, 2025
f5a53c4
Merge branch 'main' into torchcodec-decoding
lhoestq Jun 19, 2025
3ee5f90
no torchcodec for windows
lhoestq Jun 19, 2025
eb6324c
further cleaning
lhoestq Jun 19, 2025
8a1e0bc
fix
lhoestq Jun 19, 2025
661b574
install ffmpeg in ci
lhoestq Jun 19, 2025
8036265
fix ffmpeg installation
lhoestq Jun 19, 2025
b582c5b
fix mono backward compatibility
lhoestq Jun 19, 2025
4e265db
fix ffmpeg
lhoestq Jun 19, 2025
f043c0c
again
lhoestq Jun 19, 2025
37763db
fix mono backward compat
lhoestq Jun 19, 2025
5198748
fix tests
lhoestq Jun 19, 2025
f06ef21
fix tests
lhoestq Jun 19, 2025
4a637bd
again
lhoestq Jun 19, 2025
19 changes: 8 additions & 11 deletions docs/source/about_dataset_features.mdx
@@ -53,7 +53,7 @@ See the [flatten](./process#flatten) section to learn how you can extract the ne

</Tip>

The array feature type is useful for creating arrays of various sizes. You can create arrays with two dimensions using [`Array2D`], and even arrays with five dimensions using [`Array5D`].

```py
>>> features = Features({'a': Array2D(shape=(1, 3), dtype='int32')})
@@ -69,9 +69,9 @@ The array type also allows the first dimension of the array to be dynamic. This

Audio datasets have a column with type [`Audio`], which contains three important fields:

* `array`: the decoded audio data represented as a 1-dimensional array.
* `path`: the path to the downloaded audio file.
* `sampling_rate`: the sampling rate of the audio data.
- `array`: the decoded audio data represented as a 1-dimensional array.
- `path`: the path to the downloaded audio file.
- `sampling_rate`: the sampling rate of the audio data.

When you load an audio dataset and call the audio column, the [`Audio`] feature automatically decodes and resamples the audio file:

@@ -80,10 +80,7 @@ When you load an audio dataset and call the audio column, the [`Audio`] feature

>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
>>> dataset[0]["audio"]
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
0. , 0. ], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
'sampling_rate': 8000}
<torchcodec.decoders._audio_decoder.AudioDecoder object at 0x11642b6a0>
```
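
Because `dataset[0]["audio"]` now returns a decoder object rather than a dict, code written against the old `array` / `sampling_rate` layout has to pull the samples out explicitly. The following is a minimal sketch, assuming the object exposes torchcodec's `get_all_samples()` and that the result carries `data` (shaped channels × samples) and `sample_rate`. The helper name `to_legacy_dict` and the `FakeSamples` / `FakeDecoder` stand-ins are hypothetical; they exist only so the snippet runs without torchcodec or a dataset download.

```python
from dataclasses import dataclass

import numpy as np


def to_legacy_dict(decoder, path=None):
    """Rebuild the pre-torchcodec {'array', 'path', 'sampling_rate'} dict.

    `decoder` is anything with a `get_all_samples()` method whose result has
    `data` (channels x samples) and `sample_rate` attributes.
    """
    samples = decoder.get_all_samples()
    array = np.asarray(samples.data, dtype=np.float32)
    if array.ndim > 1:
        # Average the channels to get the 1-dimensional array the old format used.
        array = array.mean(axis=0)
    return {"array": array, "path": path, "sampling_rate": samples.sample_rate}


# Hypothetical stand-ins for a real decoder, used only for illustration.
@dataclass
class FakeSamples:
    data: np.ndarray
    sample_rate: int


class FakeDecoder:
    def get_all_samples(self):
        stereo = np.array([[0.0, 0.5, 1.0], [0.0, -0.5, 1.0]], dtype=np.float32)
        return FakeSamples(data=stereo, sample_rate=8000)


legacy = to_legacy_dict(FakeDecoder(), path="example.wav")
print(legacy["array"].shape, legacy["sampling_rate"])
```

With a real decoder, `data` would be a tensor rather than a NumPy array; the sketch relies on `np.asarray` to convert it, which is an assumption worth checking for tensors that do not live on the CPU.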

<Tip warning={true}>
@@ -92,7 +89,7 @@ Index into an audio dataset using the row index first and then the `audio` colum

</Tip>

With `decode=False`, the [`Audio`] type simply gives you the path or the bytes of the audio file, without decoding it into an `array`,
With `decode=False`, the [`Audio`] type simply gives you the path or the bytes of the audio file, without decoding it into a torchcodec `AudioDecoder` object:

```py
>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train").cast_column("audio", Audio(decode=False))
@@ -126,7 +123,7 @@ Index into an image dataset using the row index first and then the `image` colum

</Tip>

With `decode=False`, the [`Image`] type simply gives you the path or the bytes of the image file, without decoding it into a `PIL.Image`,

```py
>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train").cast_column("image", Image(decode=False))
@@ -146,4 +143,4 @@ You can also define a dataset of images from numpy arrays:
And in this case the numpy arrays are encoded into PNG (or TIFF if the pixels values precision is important).

For multi-channels arrays like RGB or RGBA, only uint8 is supported. If you use a larger precision, you get a warning and the array is downcasted to uint8.
For gray-scale images you can use the integer or float precision you want as long as it is compatible with `Pillow`. A warning is shown if your image integer or float precision is too high, and in this case the array is downcasted: an int64 array is downcasted to int32, and a float64 array is downcasted to float32.
12 changes: 3 additions & 9 deletions docs/source/audio_dataset.mdx
@@ -10,10 +10,9 @@ dataset = load_dataset("<username>/my_dataset")

There are several methods for creating and sharing an audio dataset:

* Create an audio dataset from local files in python with [`Dataset.push_to_hub`]. This is an easy way that requires only a few steps in python.

* Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.
- Create an audio dataset from local files in python with [`Dataset.push_to_hub`]. This is an easy way that requires only a few steps in python.

- Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.

<Tip>

@@ -28,10 +27,7 @@ You can load your own dataset using the paths to your audio files. Use the [`~Da
```py
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
>>> audio_dataset[0]["audio"]
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
0. , 0. ], dtype=float32),
'path': 'path/to/audio_1',
'sampling_rate': 16000}
<torchcodec.decoders._audio_decoder.AudioDecoder object at 0x11642b6a0>
```

Then upload the dataset to the Hugging Face Hub using [`Dataset.push_to_hub`]:
@@ -51,7 +47,6 @@ my_dataset/

## AudioFolder


The `AudioFolder` is a dataset builder designed to quickly load an audio dataset with several thousand audio files without requiring you to write any code.

<Tip>
@@ -101,7 +96,6 @@ If all audio files are contained in a single directory or if they are not on the

</Tip>


If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file `metadata.jsonl` or a Parquet file `metadata.parquet`.

```
8 changes: 2 additions & 6 deletions docs/source/audio_load.mdx
@@ -8,18 +8,14 @@ Audio decoding is based on the [`soundfile`](https://github.com/bastibe/python-s
To work with audio datasets, you need to have the `audio` dependencies installed.
Check out the [installation](./installation#audio) guide to learn how to install it.


## Local files

You can load your own dataset using the paths to your audio files. Use the [`~Dataset.cast_column`] function to take a column of audio file paths, and cast it to the [`Audio`] feature:

```py
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
>>> audio_dataset[0]["audio"]
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
0. , 0. ], dtype=float32),
'path': 'path/to/audio_1',
'sampling_rate': 16000}
<torchcodec.decoders._audio_decoder.AudioDecoder object at 0x11642b6a0>
```

## AudioFolder
@@ -99,7 +95,7 @@ For a guide on how to load any type of dataset, take a look at the <a class="und

## Audio decoding

By default, audio files are decoded sequentially as NumPy arrays when you iterate on a dataset.
By default, audio files are decoded sequentially as torchcodec [`AudioDecoder`](https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.AudioDecoder.html#torchcodec.decoders.AudioDecoder) objects when you iterate on a dataset.
However it is possible to speed up the dataset significantly using multithreaded decoding:

```python
47 changes: 26 additions & 21 deletions docs/source/audio_process.mdx
@@ -7,7 +7,6 @@ This guide shows specific methods for processing audio datasets. Learn how to:

For a guide on how to process any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./process">general process guide</a>.


## Cast

The [`~Dataset.cast_column`] function is used to cast a column to another feature to be decoded. When you use this function with the [`Audio`] feature, you can resample the sampling rate:
@@ -22,16 +21,22 @@ The [`~Dataset.cast_column`] function is used to cast a column to another featur
Audio files are decoded and resampled on-the-fly, so the next time you access an example, the audio file is resampled to 16kHz:

```py
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05, 2.1729663e-04, 2.2145823e-04, ...,
3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
'sampling_rate': 16000}
>>> ad = dataset[0]["audio"]
>>> ad
<torchcodec.decoders._audio_decoder.AudioDecoder object at 0x11642b6a0>
>>> ad.get_all_samples().sample_rate
16000
```
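
To check a cast without relying on the printed repr, the decoded samples can be inspected directly. The sketch below assumes the same `get_all_samples()` interface, returning an object with `data` and `sample_rate` attributes. `describe_audio` and the stub decoder are hypothetical names, used only to keep the example runnable offline.

```python
from types import SimpleNamespace

import numpy as np


def describe_audio(decoder):
    """Return (sample_rate, num_channels, duration_in_seconds) for a decoder-like object."""
    samples = decoder.get_all_samples()
    data = np.asarray(samples.data)
    # A 1-D array is treated as a single channel.
    num_channels = data.shape[0] if data.ndim > 1 else 1
    num_samples = data.shape[-1]
    return samples.sample_rate, num_channels, num_samples / samples.sample_rate


# Hypothetical stub: two seconds of mono silence at 16 kHz.
stub = SimpleNamespace(
    get_all_samples=lambda: SimpleNamespace(data=np.zeros((1, 32000)), sample_rate=16000)
)

rate, channels, seconds = describe_audio(stub)
print(rate, channels, seconds)
```

Passing `dataset[0]["audio"]` instead of the stub should report the resampled rate after `cast_column`, provided the decoder behaves as assumed here.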

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/resample.gif"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/resample-dark.gif"/>
<img
class="block dark:hidden"
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/resample.gif"
/>
<img
class="hidden dark:block"
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/resample-dark.gif"
/>
</div>

## Map
@@ -40,30 +45,30 @@ The [`~Dataset.map`] function helps preprocess your entire dataset at once. Depe

- For pretrained speech recognition models, load a feature extractor and tokenizer and combine them in a `processor`:

```py
>>> from transformers import AutoFeatureExtractor, Wav2Vec2CTCTokenizer, Wav2Vec2Processor

>>> model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
# after defining a vocab.json file you can instantiate a tokenizer object:
>>> tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
>>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
```

- For fine-tuned speech recognition models, you only need to load a `processor`:

```py
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```

When you use [`~Dataset.map`] with your preprocessing function, include the `audio` column to ensure you're actually resampling the audio data:

```py
>>> def prepare_dataset(batch):
... audio = batch["audio"]
... batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
... batch["input_values"] = processor(audio.get_all_samples().data, sampling_rate=audio["sampling_rate"]).input_values[0]
... batch["input_length"] = len(batch["input_values"])
... with processor.as_target_processor():
... batch["labels"] = processor(batch["sentence"]).input_ids
12 changes: 6 additions & 6 deletions docs/source/create_dataset.mdx
@@ -4,8 +4,8 @@ Sometimes, you may need to create a dataset if you're working with your own data

In this tutorial, you'll learn how to use 🤗 Datasets low-code methods for creating all types of datasets:

* Folder-based builders for quickly creating an image or audio dataset
* `from_` methods for creating datasets from local files
- Folder-based builders for quickly creating an image or audio dataset
- `from_` methods for creating datasets from local files

## File-based builders

@@ -24,10 +24,10 @@ To get the list of supported formats and code examples, follow this guide [here]

There are two folder-based builders, [`ImageFolder`] and [`AudioFolder`]. These are low-code methods for quickly creating an image or speech and audio dataset with several thousand examples. They are great for rapidly prototyping computer vision and speech models before scaling to a larger dataset. Folder-based builders take your data and automatically generate the dataset's features, splits, and labels. Under the hood:

* [`ImageFolder`] uses the [`~datasets.Image`] feature to decode an image file. Many image extension formats are supported, such as jpg and png, but other formats are also supported. You can check the complete [list](https://github.com/huggingface/datasets/blob/b5672a956d5de864e6f5550e493527d962d6ae55/src/datasets/packaged_modules/imagefolder/imagefolder.py#L39) of supported image extensions.
* [`AudioFolder`] uses the [`~datasets.Audio`] feature to decode an audio file. Audio extensions such as wav and mp3 are supported, and you can check the complete [list](https://github.com/huggingface/datasets/blob/b5672a956d5de864e6f5550e493527d962d6ae55/src/datasets/packaged_modules/audiofolder/audiofolder.py#L39) of supported audio extensions.
- [`ImageFolder`] uses the [`~datasets.Image`] feature to decode an image file. Many image extension formats are supported, such as jpg and png, but other formats are also supported. You can check the complete [list](https://github.com/huggingface/datasets/blob/b5672a956d5de864e6f5550e493527d962d6ae55/src/datasets/packaged_modules/imagefolder/imagefolder.py#L39) of supported image extensions.
- [`AudioFolder`] uses the [`~datasets.Audio`] feature to decode an audio file. Extensions such as wav, mp3, and even mp4 are supported, and you can check the complete [list](https://ffmpeg.org/ffmpeg-formats.html) of supported audio extensions. Decoding is done via ffmpeg.
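
Since decoding now goes through ffmpeg, a quick pre-flight check can save a confusing failure later. This is a rough sketch using only the standard library; `ffmpeg_on_path` is a hypothetical helper name. It only looks for the `ffmpeg` executable, which is an assumption of convenience: the decoder may link against FFmpeg's shared libraries instead, so treat the result as a hint rather than a guarantee.

```python
import shutil


def ffmpeg_on_path():
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if ffmpeg_on_path():
    print("ffmpeg found at", shutil.which("ffmpeg"))
else:
    print("ffmpeg not found on PATH; audio decoding may fail until it is installed")
```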

The dataset splits are generated from the repository structure, and the label names are automatically inferred from the directory name.

For example, if your image dataset (it is the same for an audio dataset) is stored like this:

@@ -44,7 +44,7 @@ pokemon/test/water/wartortle.png
Then this is how the folder-based builder generates an example:

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/folder-based-builder.png"/>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/folder-based-builder.png" />
</div>

Create the image dataset by specifying `imagefolder` in [`load_dataset`]:
14 changes: 1 addition & 13 deletions docs/source/installation.md
@@ -30,7 +30,7 @@ You should install 🤗 Datasets in a [virtual environment](https://docs.python.
```bash
# Activate the virtual environment
source .env/bin/activate

# Deactivate the virtual environment
deactivate
```
@@ -65,18 +65,6 @@ To work with audio datasets, you need to install the [`Audio`] feature as an ext
pip install datasets[audio]
```

<Tip warning={true}>

To decode mp3 files, you need to have at least version 1.1.0 of the `libsndfile` system library. Usually, it's bundled with the python [`soundfile`](https://github.com/bastibe/python-soundfile) package, which is installed as an extra audio dependency for 🤗 Datasets.
For Linux, the required version of `libsndfile` is bundled with `soundfile` starting from version 0.12.0. You can run the following command to determine which version of `libsndfile` is being used by `soundfile`:

```bash
python -c "import soundfile; print(soundfile.__libsndfile_version__)"
```

</Tip>


## Vision

To work with image datasets, you need to install the [`Image`] feature as an extra dependency: