78 changes: 46 additions & 32 deletions docs/source/_toctree.yml
@@ -23,38 +23,52 @@
- sections:
- local: how_to
title: Overview
- local: loading
title: Load
- local: process
title: Process
- local: audio_process
title: Process audio data
- local: image_process
title: Process image data
- local: stream
title: Stream
- local: use_with_tensorflow
title: Use with TensorFlow
- local: use_with_pytorch
title: Use with PyTorch
- local: share
title: Share
- local: dataset_script
title: Create a dataset loading script
- local: dataset_card
title: Create a dataset card
- local: repository_structure
title: Structure your repository
- local: cache
title: Cache management
- local: filesystems
title: Cloud storage
- local: faiss_es
title: Search index
- local: how_to_metrics
title: Metrics
- local: beam
title: Beam Datasets
- sections:
- local: loading
title: Load
- local: process
title: Process
- local: stream
title: Stream
- local: use_with_tensorflow
title: Use with TensorFlow
- local: use_with_pytorch
title: Use with PyTorch
- local: share
title: Share
- local: dataset_script
title: Create a dataset loading script
- local: dataset_card
title: Create a dataset card
- local: repository_structure
title: Structure your repository
- local: cache
title: Cache management
- local: filesystems
title: Cloud storage
- local: faiss_es
title: Search index
- local: how_to_metrics
title: Metrics
- local: beam
title: Beam Datasets
title: "General usage"
- sections:
- local: audio_load
title: Load audio data
- local: audio_process
title: Process audio data
title: "Audio"
- sections:
- local: image_load
title: Load image data
- local: image_process
title: Process image data
title: "Vision"
- sections:
- local: nlp_process
title: Process text data
title: "Text"
title: "How-to guides"
- sections:
- local: about_arrow
49 changes: 49 additions & 0 deletions docs/source/audio_load.mdx
@@ -0,0 +1,49 @@
# Load audio data

Audio datasets are loaded from the `audio` column, which contains three important fields:

* `array`: the decoded audio data represented as a 1-dimensional array.
* `path`: the path to the downloaded audio file.
* `sampling_rate`: the sampling rate of the audio data.

When you load an audio dataset and call the `audio` column, the [`Audio`] feature automatically decodes and resamples the audio file:

```py
>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
>>> dataset[0]["audio"]
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
0. , 0. ], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
'sampling_rate': 8000}
```
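
To make the three fields concrete, here is a minimal sketch that builds the same kind of dictionary by hand with Python's standard `wave` module. The helper names `write_sine_wav` and `decode_wav` are hypothetical and are not part of 🤗 Datasets; the [`Audio`] feature does its own decoding internally, so treat this only as an illustration of what `array`, `path`, and `sampling_rate` hold.

```python
import math
import os
import struct
import tempfile
import wave


def write_sine_wav(path, sampling_rate=8000, n_frames=80, freq=440.0):
    """Write a short mono 16-bit PCM sine tone so there is something to decode."""
    samples = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / sampling_rate))
        for i in range(n_frames)
    ]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(sampling_rate)
        f.writeframes(struct.pack(f"<{n_frames}h", *samples))


def decode_wav(path):
    """Return a dict shaped like one `audio` entry: array, path, sampling_rate."""
    with wave.open(path, "rb") as f:
        n_frames = f.getnframes()
        raw = f.readframes(n_frames)
        sampling_rate = f.getframerate()
    ints = struct.unpack(f"<{n_frames}h", raw)
    # Scale 16-bit integers to floats in [-1.0, 1.0), a 1-dimensional sequence.
    return {
        "array": [s / 32768.0 for s in ints],
        "path": path,
        "sampling_rate": sampling_rate,
    }


wav_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_sine_wav(wav_path)
audio = decode_wav(wav_path)
print(audio["sampling_rate"], len(audio["array"]))
```
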

<Tip warning={true}>

Index into an audio dataset by row first and then by the `audio` column, as in `dataset[0]["audio"]`, so that only that one example is decoded and resampled. Indexing the column first, as in `dataset["audio"][0]`, decodes and resamples every audio file in the dataset, which is slow for a large dataset.

</Tip>
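
The cost difference between the two access orders can be sketched with a toy class that counts how many files get decoded. `ToyAudioDataset` is a hypothetical stand-in written for this illustration, not the real `Dataset` implementation; it only assumes that decoding happens lazily when the `audio` column is read.

```python
class ToyAudioDataset:
    """Toy stand-in for a dataset whose `audio` column is decoded lazily on access."""

    def __init__(self, paths):
        self.paths = paths
        self.decode_calls = 0

    def _decode(self, path):
        # Placeholder for the real decoding and resampling work.
        self.decode_calls += 1
        return {"array": [0.0], "path": path, "sampling_rate": 8000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested example.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: every file is decoded before the result can be indexed.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


paths = [f"clip_{i}.wav" for i in range(1000)]

row_first = ToyAudioDataset(paths)
row_first[0]["audio"]
print(row_first.decode_calls)  # 1

column_first = ToyAudioDataset(paths)
column_first["audio"][0]
print(column_first.decode_calls)  # 1000
```
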

## Path

The `path` is useful for loading your own dataset. Use the [`~Dataset.cast_column`] function to take a column of audio file paths and decode each one into an `array` with the [`Audio`] feature:

```py
>>> audio_dataset = audio_dataset.cast_column("paths_to_my_audio_files", Audio())
```

If you only want to load the underlying path to the audio dataset without decoding the audio file into an `array`, set `decode=False` in the [`Audio`] feature:

```py
>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train").cast_column('audio', Audio(decode=False))
>>> dataset[0]
{'audio': {'bytes': None,
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav'},
'english_transcription': 'I would like to set up a joint account with my partner',
'intent_class': 11,
'lang_id': 4,
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
'transcription': 'I would like to set up a joint account with my partner'}
```

128 changes: 32 additions & 96 deletions docs/source/audio_process.mdx
Original file line number Diff line number Diff line change
@@ -1,93 +1,31 @@
# Process audio data

🤗 Datasets supports an [`Audio`] feature, enabling users to load and process raw audio files for training. This guide will show you how to:
This guide shows specific methods for processing audio datasets. Learn how to:

- Load your own custom audio dataset.
- Resample audio files.
- Use [`Dataset.map`] with audio files.
- Resample audio to a different sampling rate.
- Use [`~Dataset.map`] with audio datasets.

## Installation
For a guide on processing any type of dataset, take a look at the <a class="bg-pink-200 dark:bg-pink-500 px-1 rounded font-bold underline decoration-pink-900 text-pink-900 dark:bg-pink-500" href="./process">general process guide</a>.

The [`Audio`] feature should be installed as an extra dependency in 🤗 Datasets. Install the [`Audio`] feature (and its dependencies) with pip:
## Cast

```bash
pip install datasets[audio]
```

<Tip warning={true}>

On Linux, non-Python dependency on `libsndfile` package must be installed manually, using your distribution package manager, for example:

```bash
sudo apt-get install libsndfile1
```

</Tip>

To support loading audio datasets containing MP3 files, users should additionally install [torchaudio](https://pytorch.org/audio/stable/index.html), so that audio data is handled with high performance.

```bash
pip install torchaudio
```

<Tip warning={true}>

torchaudio's `sox_io` [backend](https://pytorch.org/audio/stable/backend.html#) supports decoding `mp3` files. Unfortunately, the `sox_io` backend is only available on Linux/macOS, and is not supported by Windows.

</Tip>

Then you can load an audio dataset the same way you would load a text dataset. For example, load the [Common Voice](https://huggingface.co/datasets/common_voice) dataset with the Turkish configuration:

```py
>>> from datasets import load_dataset, load_metric, Audio
>>> common_voice = load_dataset("common_voice", "tr", split="train")
```

## Audio datasets

Audio datasets commonly have an `audio` and `path` or `file` column.

`audio` is the actual audio file that is loaded and resampled on-the-fly upon calling it.

```py
>>> common_voice[0]["audio"]
{'array': array([ 0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ...,
-8.8930130e-05, -3.8027763e-05, -2.9146671e-05], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/05be0c29807a73c9b099873d2f5975dae6d05e9f7d577458a2466ecb9a2b0c6b/cv-corpus-6.1-2020-12-11/tr/clips/common_voice_tr_21921195.mp3',
'sampling_rate': 48000}
```

When you access an audio file, it is automatically decoded and resampled. Generally, you should query an audio file like: `common_voice[0]["audio"]`. If you query an audio file with `common_voice["audio"][0]` instead, **all** the audio files in your dataset will be decoded and resampled. This process can take a long time if you have a large dataset.

`path` or `file` is an absolute path to an audio file.

```py
>>> common_voice[0]["path"]
/root/.cache/huggingface/datasets/downloads/extracted/05be0c29807a73c9b099873d2f5975dae6d05e9f7d577458a2466ecb9a2b0c6b/cv-corpus-6.1-2020-12-11/tr/clips/common_voice_tr_21921195.mp3
```

The `path` is useful if you want to load your own audio dataset. In this case, provide a column of audio file paths to [`Dataset.cast_column`]:
The [`~Dataset.cast_column`] function casts a column to another feature, which then controls how the column is decoded. When you use this function with the [`Audio`] feature, you can resample the audio to a different sampling rate:

```py
>>> my_audio_dataset = my_audio_dataset.cast_column("paths_to_my_audio_files", Audio())
```

## Resample
>>> from datasets import load_dataset, Audio

Some models expect the audio data to have a certain sampling rate due to how the model was pretrained. For example, the [XLSR-Wav2Vec2](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) model expects the input to have a sampling rate of 16kHz, but an audio file from the Common Voice dataset has a sampling rate of 48kHz. You can use [`Dataset.cast_column`] to downsample the sampling rate to 16kHz:

```py
>>> common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))
>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
```

The next time you load the audio file, the [`Audio`] feature will load and resample it to 16kHz:
Audio files are decoded and resampled on-the-fly, so the next time you access an example, the audio file is resampled to 16kHz:

```py
>>> common_voice_train[0]["audio"]
{'array': array([ 0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ...,
-7.4556941e-05, -1.4621433e-05, -5.7861507e-05], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/05be0c29807a73c9b099873d2f5975dae6d05e9f7d577458a2466ecb9a2b0c6b/cv-corpus-6.1-2020-12-11/tr/clips/common_voice_tr_21921195.mp3',
'sampling_rate': 16000}
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05, 2.1729663e-04, 2.2145823e-04, ...,
3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
'sampling_rate': 16000}
```
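
As a rough illustration of what changing the sampling rate does to the `array`, the sketch below resamples with plain linear interpolation. The function `resample_linear` is hypothetical and deliberately simplistic: it applies no anti-aliasing filter and is not the algorithm the [`Audio`] feature uses. It only shows that going from 8kHz to 16kHz doubles the number of samples for the same duration.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Resample by linear interpolation (illustrative only, no anti-aliasing filter)."""
    if not samples:
        return []
    n_out = round(len(samples) * target_rate / orig_rate)
    out = []
    for i in range(n_out):
        # Position of output sample i on the input time axis.
        pos = i * orig_rate / target_rate
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


one_second_8k = [0.0] * 8000
one_second_16k = resample_linear(one_second_8k, 8000, 16000)
print(len(one_second_8k), len(one_second_16k))  # 8000 16000

ramp = resample_linear([0.0, 1.0], 8000, 16000)
print(ramp)  # [0.0, 0.5, 1.0, 1.0]
```
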

<div class="flex justify-center">
@@ -97,31 +35,29 @@ The next time you load the audio file, the [`Audio`] feature will load and resam

## Map

Just like text datasets, you can apply a preprocessing function over an entire dataset with [`Dataset.map`], which is useful for preprocessing all of your audio data at once. Start with a [speech recognition model](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) of your choice, and load a `processor` object that contains:
The [`~Dataset.map`] function helps preprocess your entire dataset at once. Depending on the type of model you're working with, you'll need to either load a [feature extractor](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoFeatureExtractor) or a [processor](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoProcessor).

1. A feature extractor to convert the speech signal to the model's input format. Every speech recognition model on the 🤗 [Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) contains a predefined feature extractor that can be easily loaded with `AutoFeatureExtractor.from_pretrained(...)`.
- For pretrained speech recognition models, load a feature extractor and tokenizer and combine them in a `processor`:

2. A tokenizer to convert the model's output format to text. Fine-tuned speech recognition models, such as [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h), contain a predefined tokenizer that can be easily loaded with `AutoTokenizer.from_pretrained(...)`.
```py
>>> from transformers import AutoFeatureExtractor, Wav2Vec2CTCTokenizer, Wav2Vec2Processor

>>> model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
>>> # After defining a vocab.json file, instantiate a tokenizer from it.
>>> # AutoTokenizer cannot be constructed directly, so use the concrete CTC tokenizer class:
>>> tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
>>> # Combine the two objects with the processor constructor rather than from_pretrained:
>>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
```

```py
>>> from transformers import AutoTokenizer, AutoFeatureExtractor, Wav2Vec2Processor
>>> model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
>>> # after defining a vocab.json file you can instantiate a tokenizer object:
>>> tokenizer = AutoTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
>>> processor = Wav2Vec2Processor.from_pretrained(feature_extractor=feature_extractor, tokenizer=tokenizer)
```
- For fine-tuned speech recognition models, you only need to load a `processor`:

For fine-tuned speech recognition models, you can simply load a predefined `processor` object with:
```py
>>> from transformers import AutoProcessor

```py
>>> from transformers import Wav2Vec2Processor
>>> processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
```
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```

Make sure to include the `audio` key in your preprocessing function when you call [`Dataset.map`] so that you are actually resampling the audio data:
When you use [`~Dataset.map`] with your preprocessing function, include the `audio` column to ensure you're actually resampling the audio data:

```py
>>> def prepare_dataset(batch):
@@ -131,5 +67,5 @@ Make sure to include the `audio` key in your preprocessing function when you cal
... with processor.as_target_processor():
... batch["labels"] = processor(batch["sentence"]).input_ids
... return batch
>>> common_voice_train = common_voice_train.map(prepare_dataset, remove_columns=common_voice_train.column_names)
>>> dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)
```
35 changes: 19 additions & 16 deletions docs/source/how_to.md
@@ -1,25 +1,28 @@
# Overview

Our how-to guides will show you how to complete a specific task. These guides are intended to help you apply your knowledge of 🤗 Datasets to real-world problems you may encounter. Want to flatten a column or load a dataset from a local file? We got you covered! You should already be familiar and comfortable with the 🤗 Datasets basics, and if you aren't, we recommend reading our [tutorial](./tutorial) first.
The how-to guides offer a more comprehensive overview of all the tools 🤗 Datasets offers and how to use them. This will help you tackle some of the messier real-world datasets, where you may need to manipulate the dataset structure or content to get it ready for training.

The how-to guides will cover eight key areas of 🤗 Datasets:
The guides assume you are familiar and comfortable with the 🤗 Datasets basics. We recommend newer users check out our [tutorials](tutorial) first.

* How to load a dataset from other data sources.
<Tip>

* How to process a dataset.
Interested in learning more? Take a look at [Chapter 5](https://huggingface.co/course/chapter5/1?fw=pt) of the Hugging Face course!

* How to use a dataset with your favorite ML/DL framework.
</Tip>

* How to stream large datasets.
The guides cover four key areas of 🤗 Datasets:

* How to upload and share a dataset.
<div>
<span class="bg-pink-200 text-pink-900 dark:bg-pink-500 px-1 rounded font-bold">General usage</span>: Functions for general dataset loading and processing. The functions shown in this section are applicable across all dataset modalities.
</div>
<div>
<span class="bg-yellow-200 text-yellow-900 dark:bg-yellow-500 px-1 rounded font-bold">Audio</span>: How to load, process, and share audio datasets.
</div>
<div>
<span class="bg-green-200 text-green-900 dark:bg-green-500 px-1 rounded font-bold">Vision</span>: How to load, process, and share image datasets.
</div>
<div>
<span class="bg-blue-200 text-blue-900 dark:bg-blue-500 px-1 rounded font-bold">NLP</span>: How to load, process, and share NLP datasets.
</div>

* How to create a dataset loading script.

* How to create a dataset card.

* How to compute metrics.

* How to manage the cache.

You can also find guides on how to process massive datasets with Beam, how to integrate with cloud storage providers, and how to add an index to search your dataset.
If you have any questions about 🤗 Datasets, feel free to join and ask the community on our [forum](https://discuss.huggingface.co/c/datasets/10).
6 changes: 6 additions & 0 deletions docs/source/how_to_metrics.mdx
@@ -1,5 +1,11 @@
# Metrics

<Tip warning={true}>

Metrics will soon be deprecated in 🤗 Datasets. To learn more about how to use metrics, take a look at our newest library 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index)! In addition to metrics, we've also added more tools for evaluating models and datasets.

</Tip>

Metrics are important for evaluating a model's predictions. In the tutorial, you learned how to compute a metric over an entire evaluation set. You have also seen how to load a metric.

This guide will show you how to: