Commit 28946e2

stevhliu and Mishig Davaadorj authored
Create new sections for audio and vision in guides (#4519)
* 📝 first draft
* 📝 create modality specific pages
* 📝 create NLP section
* 📝 update set_format section
* 📝 add use tf/torch to toctree
* 🖍 remove visual cues
* 🖍 apply quentin review
* 🖍 minor edits
* 🖍 apply mario review
* 🖍 collapse some sections
* 🖍 try collapse again
* Update _toctree.yml
* 🖍 collapse all nested sections except for general usage
* 🖍 add link to install dependencies for audio/vision sections
* ✨ add text decoration for different guides
* 🖍 remove text decorations for now

Co-authored-by: Mishig Davaadorj <[email protected]>
1 parent e662d75 commit 28946e2

12 files changed: 577 additions & 558 deletions

docs/source/_toctree.yml

Lines changed: 50 additions & 32 deletions
Original file line number | Diff line number | Diff line change
@@ -23,38 +23,56 @@
2323
- sections:
2424
- local: how_to
2525
title: Overview
26-
- local: loading
27-
title: Load
28-
- local: process
29-
title: Process
30-
- local: audio_process
31-
title: Process audio data
32-
- local: image_process
33-
title: Process image data
34-
- local: stream
35-
title: Stream
36-
- local: use_with_tensorflow
37-
title: Use with TensorFlow
38-
- local: use_with_pytorch
39-
title: Use with PyTorch
40-
- local: share
41-
title: Share
42-
- local: dataset_script
43-
title: Create a dataset loading script
44-
- local: dataset_card
45-
title: Create a dataset card
46-
- local: repository_structure
47-
title: Structure your repository
48-
- local: cache
49-
title: Cache management
50-
- local: filesystems
51-
title: Cloud storage
52-
- local: faiss_es
53-
title: Search index
54-
- local: how_to_metrics
55-
title: Metrics
56-
- local: beam
57-
title: Beam Datasets
26+
- sections:
27+
- local: loading
28+
title: Load
29+
- local: process
30+
title: Process
31+
- local: stream
32+
title: Stream
33+
- local: use_with_tensorflow
34+
title: Use with TensorFlow
35+
- local: use_with_pytorch
36+
title: Use with PyTorch
37+
- local: cache
38+
title: Cache management
39+
- local: filesystems
40+
title: Cloud storage
41+
- local: faiss_es
42+
title: Search index
43+
- local: how_to_metrics
44+
title: Metrics
45+
- local: beam
46+
title: Beam Datasets
47+
title: "General usage"
48+
- sections:
49+
- local: audio_load
50+
title: Load audio data
51+
- local: audio_process
52+
title: Process audio data
53+
title: "Audio"
54+
- sections:
55+
- local: image_load
56+
title: Load image data
57+
- local: image_process
58+
title: Process image data
59+
title: "Vision"
60+
- sections:
61+
- local: nlp_load
62+
title: Load text data
63+
- local: nlp_process
64+
title: Process text data
65+
title: "Text"
66+
- sections:
67+
- local: share
68+
title: Share
69+
- local: dataset_script
70+
title: Create a dataset loading script
71+
- local: dataset_card
72+
title: Create a dataset card
73+
- local: repository_structure
74+
title: Structure your repository
75+
title: "Dataset repository"
5876
title: "How-to guides"
5977
- sections:
6078
- local: about_arrow

docs/source/audio_load.mdx

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
1+
# Load audio data
2+
3+
Audio datasets are loaded from the `audio` column, which contains three important fields:
4+
5+
* `array`: the decoded audio data represented as a 1-dimensional array.
6+
* `path`: the path to the downloaded audio file.
7+
* `sampling_rate`: the sampling rate of the audio data.
8+
9+
<Tip>
10+
11+
To work with audio datasets, you need to have the `audio` dependency installed. Check out the [installation](./installation#audio) guide to learn how to install it.
12+
13+
</Tip>
14+
15+
When you load an audio dataset and call the `audio` column, the [`Audio`] feature automatically decodes and resamples the audio file:
16+
17+
```py
18+
>>> from datasets import load_dataset, Audio
19+
20+
>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
21+
>>> dataset[0]["audio"]
22+
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
23+
0. , 0. ], dtype=float32),
24+
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
25+
'sampling_rate': 8000}
26+
```
27+
28+
<Tip warning={true}>
29+
30+
Index into an audio dataset using the row index first and then the `audio` column - `dataset[0]["audio"]` - to avoid decoding and resampling all the audio files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
31+
32+
</Tip>
33+
34+
For a guide on how to load any type of dataset, take a look at the [general loading guide](./loading).
35+
36+
## Local files
37+
38+
The `path` is useful for loading your own dataset. Use the [`~Dataset.cast_column`] function to take a column of audio file paths, and decode it into `array`'s with the [`Audio`] feature:
39+
40+
```py
41+
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
42+
```
43+
44+
If you only want to load the underlying path to the audio dataset without decoding the audio file into an `array`, set `decode=False` in the [`Audio`] feature:
45+
46+
```py
47+
>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train").cast_column("audio", Audio(decode=False))
48+
>>> dataset[0]
49+
{'audio': {'bytes': None,
50+
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav'},
51+
'english_transcription': 'I would like to set up a joint account with my partner',
52+
'intent_class': 11,
53+
'lang_id': 4,
54+
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
55+
'transcription': 'I would like to set up a joint account with my partner'}
56+
```
57+
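
The warning in `audio_load.mdx` about indexing the row first can be illustrated with a toy model of a lazily decoded column. This is only a sketch under stated assumptions: `ToyAudioDataset` and `decode_calls` are hypothetical names, the "decoding" is a stand-in, and none of this is the 🤗 Datasets implementation. It only shows why row-first access touches one file while column-first access touches all of them.

```python
# Toy model of a lazily decoded column; NOT the 🤗 Datasets implementation.
# `ToyAudioDataset` and `decode_calls` are hypothetical names used only here.


class ToyAudioDataset:
    """Stores file paths and decodes them only when they are accessed."""

    def __init__(self, paths):
        self._paths = list(paths)
        self.decode_calls = 0  # counts how many files were "decoded"

    def _decode(self, path):
        # Stand-in for real audio decoding, which is the expensive step.
        self.decode_calls += 1
        return {"path": path, "array": [0.0, 0.0], "sampling_rate": 8000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested example.
            return {"audio": self._decode(self._paths[key])}
        if key == "audio":
            # Column access: every file in the column is decoded.
            return [self._decode(path) for path in self._paths]
        raise KeyError(key)


paths = [f"clip_{i}.wav" for i in range(1000)]

row_first = ToyAudioDataset(paths)
_ = row_first[0]["audio"]
print(row_first.decode_calls)  # 1

column_first = ToyAudioDataset(paths)
_ = column_first["audio"][0]
print(column_first.decode_calls)  # 1000
```

In this toy, both expressions return the same first example, but the column-first form pays the decoding cost for every file before the `[0]` is applied.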

docs/source/audio_process.mdx

Lines changed: 32 additions & 96 deletions
@@ -1,93 +1,31 @@
11
# Process audio data
22

3-
🤗 Datasets supports an [`Audio`] feature, enabling users to load and process raw audio files for training. This guide will show you how to:
3+
This guide shows specific methods for processing audio datasets. Learn how to:
44

5-
- Load your own custom audio dataset.
6-
- Resample audio files.
7-
- Use [`Dataset.map`] with audio files.
5+
- Resample the sampling rate.
6+
- Use [`~Dataset.map`] with audio datasets.
87

9-
## Installation
8+
For a guide on how to process any type of dataset, take a look at the [general process guide](./process).
109

11-
The [`Audio`] feature should be installed as an extra dependency in 🤗 Datasets. Install the [`Audio`] feature (and its dependencies) with pip:
10+
## Cast
1211

13-
```bash
14-
pip install datasets[audio]
15-
```
16-
17-
<Tip warning={true}>
18-
19-
On Linux, non-Python dependency on `libsndfile` package must be installed manually, using your distribution package manager, for example:
20-
21-
```bash
22-
sudo apt-get install libsndfile1
23-
```
24-
25-
</Tip>
26-
27-
To support loading audio datasets containing MP3 files, users should additionally install [torchaudio](https://pytorch.org/audio/stable/index.html), so that audio data is handled with high performance.
28-
29-
```bash
30-
pip install torchaudio
31-
```
32-
33-
<Tip warning={true}>
34-
35-
torchaudio's `sox_io` [backend](https://pytorch.org/audio/stable/backend.html#) supports decoding `mp3` files. Unfortunately, the `sox_io` backend is only available on Linux/macOS, and is not supported by Windows.
36-
37-
</Tip>
38-
39-
Then you can load an audio dataset the same way you would load a text dataset. For example, load the [Common Voice](https://huggingface.co/datasets/common_voice) dataset with the Turkish configuration:
40-
41-
```py
42-
>>> from datasets import load_dataset, load_metric, Audio
43-
>>> common_voice = load_dataset("common_voice", "tr", split="train")
44-
```
45-
46-
## Audio datasets
47-
48-
Audio datasets commonly have an `audio` and `path` or `file` column.
49-
50-
`audio` is the actual audio file that is loaded and resampled on-the-fly upon calling it.
51-
52-
```py
53-
>>> common_voice[0]["audio"]
54-
{'array': array([ 0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ...,
55-
-8.8930130e-05, -3.8027763e-05, -2.9146671e-05], dtype=float32),
56-
'path': '/root/.cache/huggingface/datasets/downloads/extracted/05be0c29807a73c9b099873d2f5975dae6d05e9f7d577458a2466ecb9a2b0c6b/cv-corpus-6.1-2020-12-11/tr/clips/common_voice_tr_21921195.mp3',
57-
'sampling_rate': 48000}
58-
```
59-
60-
When you access an audio file, it is automatically decoded and resampled. Generally, you should query an audio file like: `common_voice[0]["audio"]`. If you query an audio file with `common_voice["audio"][0]` instead, **all** the audio files in your dataset will be decoded and resampled. This process can take a long time if you have a large dataset.
61-
62-
`path` or `file` is an absolute path to an audio file.
63-
64-
```py
65-
>>> common_voice[0]["path"]
66-
/root/.cache/huggingface/datasets/downloads/extracted/05be0c29807a73c9b099873d2f5975dae6d05e9f7d577458a2466ecb9a2b0c6b/cv-corpus-6.1-2020-12-11/tr/clips/common_voice_tr_21921195.mp3
67-
```
68-
69-
The `path` is useful if you want to load your own audio dataset. In this case, provide a column of audio file paths to [`Dataset.cast_column`]:
12+
The [`~Dataset.cast_column`] function is used to cast a column to another feature to be decoded. When you use this function with the [`Audio`] feature, you can resample the sampling rate:
7013

7114
```py
72-
>>> my_audio_dataset = my_audio_dataset.cast_column("paths_to_my_audio_files", Audio())
73-
```
74-
75-
## Resample
15+
>>> from datasets import load_dataset, Audio
7616

77-
Some models expect the audio data to have a certain sampling rate due to how the model was pretrained. For example, the [XLSR-Wav2Vec2](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) model expects the input to have a sampling rate of 16kHz, but an audio file from the Common Voice dataset has a sampling rate of 48kHz. You can use [`Dataset.cast_column`] to downsample the sampling rate to 16kHz:
78-
79-
```py
80-
>>> common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))
17+
>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
18+
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
8119
```
8220

83-
The next time you load the audio file, the [`Audio`] feature will load and resample it to 16kHz:
21+
Audio files are decoded and resampled on-the-fly, so the next time you access an example, the audio file is resampled to 16kHz:
8422

8523
```py
86-
>>> common_voice_train[0]["audio"]
87-
{'array': array([ 0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ...,
88-
-7.4556941e-05, -1.4621433e-05, -5.7861507e-05], dtype=float32),
89-
'path': '/root/.cache/huggingface/datasets/downloads/extracted/05be0c29807a73c9b099873d2f5975dae6d05e9f7d577458a2466ecb9a2b0c6b/cv-corpus-6.1-2020-12-11/tr/clips/common_voice_tr_21921195.mp3',
90-
'sampling_rate': 16000}
24+
>>> dataset[0]["audio"]
25+
{'array': array([ 2.3443763e-05, 2.1729663e-04, 2.2145823e-04, ...,
26+
3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
27+
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
28+
'sampling_rate': 16000}
9129
```
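
To make the effect of changing the sampling rate concrete, here is a minimal sketch of resampling with plain linear interpolation. The function `resample_linear` is hypothetical and the interpolation method is an assumption chosen for readability; it is not the algorithm the [`Audio`] feature uses. The point is only that the clip keeps its duration, so going from 8kHz to 16kHz doubles the number of samples.

```python
# Illustrative resampler using linear interpolation. This is an assumption
# made for clarity; it is not the algorithm the `Audio` feature uses.


def resample_linear(samples, orig_rate, target_rate):
    """Resample a 1-D list of floats from orig_rate to target_rate."""
    if not samples:
        return []
    # Keep the duration constant: n_out / target_rate == n_in / orig_rate.
    n_out = round(len(samples) * target_rate / orig_rate)
    step = orig_rate / target_rate  # input samples per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


one_second_8k = [0.0] * 8000
one_second_16k = resample_linear(one_second_8k, 8000, 16000)
print(len(one_second_16k))  # 16000
```

Production resamplers typically apply band-limited filtering rather than linear interpolation, so treat this only as a picture of how the array length relates to the sampling rate.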
9230

9331
<div class="flex justify-center">
@@ -97,31 +35,29 @@ The next time you load the audio file, the [`Audio`] feature will load and resam
9735

9836
## Map
9937

100-
Just like text datasets, you can apply a preprocessing function over an entire dataset with [`Dataset.map`], which is useful for preprocessing all of your audio data at once. Start with a [speech recognition model](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) of your choice, and load a `processor` object that contains:
38+
The [`~Dataset.map`] function helps preprocess your entire dataset at once. Depending on the type of model you're working with, you'll need to either load a [feature extractor](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoFeatureExtractor) or a [processor](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoProcessor).
10139

102-
1. A feature extractor to convert the speech signal to the model's input format. Every speech recognition model on the 🤗 [Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) contains a predefined feature extractor that can be easily loaded with `AutoFeatureExtractor.from_pretrained(...)`.
40+
- For pretrained speech recognition models, load a feature extractor and tokenizer and combine them in a `processor`:
10341

104-
2. A tokenizer to convert the model's output format to text. Fine-tuned speech recognition models, such as [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h), contain a predefined tokenizer that can be easily loaded with `AutoTokenizer.from_pretrained(...)`.
42+
```py
43+
>>> from transformers import AutoTokenizer, AutoFeatureExtractor, AutoProcessor
10544

106-
For pretrained speech recognition models, such as [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53), a tokenizer needs to be created from the target text as explained [here](https://huggingface.co/blog/fine-tune-wav2vec2-english). The following example demonstrates how to load a feature extractor, tokenizer and processor for a pretrained speech recognition model:
45+
>>> model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
46+
# after defining a vocab.json file you can instantiate a tokenizer object:
47+
>>> tokenizer = AutoTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
48+
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
49+
>>> processor = AutoProcessor.from_pretrained(feature_extractor=feature_extractor, tokenizer=tokenizer)
50+
```
10751

108-
```py
109-
>>> from transformers import AutoTokenizer, AutoFeatureExtractor, Wav2Vec2Processor
110-
>>> model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
111-
>>> # after defining a vocab.json file you can instantiate a tokenizer object:
112-
>>> tokenizer = AutoTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
113-
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
114-
>>> processor = Wav2Vec2Processor.from_pretrained(feature_extractor=feature_extractor, tokenizer=tokenizer)
115-
```
52+
- For fine-tuned speech recognition models, you only need to load a `processor`:
11653

117-
For fine-tuned speech recognition models, you can simply load a predefined `processor` object with:
54+
```py
55+
>>> from transformers import AutoProcessor
11856

119-
```py
120-
>>> from transformers import Wav2Vec2Processor
121-
>>> processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
122-
```
57+
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
58+
```
12359

124-
Make sure to include the `audio` key in your preprocessing function when you call [`Dataset.map`] so that you are actually resampling the audio data:
60+
When you use [`~Dataset.map`] with your preprocessing function, include the `audio` column to ensure you're actually resampling the audio data:
12561

12662
```py
12763
>>> def prepare_dataset(batch):
@@ -131,5 +67,5 @@ Make sure to include the `audio` key in your preprocessing function when you cal
13167
... with processor.as_target_processor():
13268
... batch["labels"] = processor(batch["sentence"]).input_ids
13369
... return batch
134-
>>> common_voice_train = common_voice_train.map(prepare_dataset, remove_columns=common_voice_train.column_names)
70+
>>> dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)
13571
```
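
The `map(..., remove_columns=dataset.column_names)` call above can be pictured with a simplified, list-of-dicts model. This is a sketch, not the library's code: `toy_map` and `prepare_example` are hypothetical helpers, the "processor" is replaced by trivial stand-ins, and the preprocessing function is assumed to return only the new columns.

```python
# Toy version of map(..., remove_columns=...) over a list of dict rows.
# `toy_map` and `prepare_example` are hypothetical, not part of 🤗 Datasets.


def toy_map(rows, function, remove_columns=()):
    """Apply `function` to each row, dropping `remove_columns` from the result."""
    mapped = []
    for row in rows:
        output = function(dict(row))  # pass a copy so the input is not mutated
        kept = {k: v for k, v in row.items() if k not in remove_columns}
        kept.update(output)  # add the columns produced by `function`
        mapped.append(kept)
    return mapped


def prepare_example(row):
    # Stand-in for a real processor: derive model inputs from the audio
    # array and labels from the transcription text.
    audio = row["audio"]
    return {
        "input_values": audio["array"],
        "input_length": len(audio["array"]),
        "labels": [ord(ch) for ch in row["sentence"]],
    }


rows = [
    {"audio": {"array": [0.0, 0.1, -0.1], "sampling_rate": 16000}, "sentence": "hi"},
]
column_names = list(rows[0].keys())
processed = toy_map(rows, prepare_example, remove_columns=column_names)
print(sorted(processed[0].keys()))  # ['input_length', 'input_values', 'labels']
```

Passing every original column name to `remove_columns` leaves only the model-ready fields, which is the usual reason for that argument in a preprocessing step like this one.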

docs/source/how_to.md

Lines changed: 12 additions & 16 deletions
@@ -1,25 +1,21 @@
11
# Overview
22

3-
Our how-to guides will show you how to complete a specific task. These guides are intended to help you apply your knowledge of 🤗 Datasets to real-world problems you may encounter. Want to flatten a column or load a dataset from a local file? We got you covered! You should already be familiar and comfortable with the 🤗 Datasets basics, and if you aren't, we recommend reading our [tutorial](./tutorial) first.
3+
The how-to guides offer a more comprehensive overview of all the tools 🤗 Datasets offers and how to use them. This will help you tackle messier real-world datasets where you may need to manipulate the dataset structure or content to get it ready for training.
44

5-
The how-to guides will cover eight key areas of 🤗 Datasets:
5+
The guides assume you are familiar and comfortable with the 🤗 Datasets basics. We recommend newer users check out our [tutorials](tutorial) first.
66

7-
* How to load a dataset from other data sources.
7+
<Tip>
88

9-
* How to process a dataset.
9+
Interested in learning more? Take a look at [Chapter 5](https://huggingface.co/course/chapter5/1?fw=pt) of the Hugging Face course!
1010

11-
* How to use a dataset with your favorite ML/DL framework.
11+
</Tip>
1212

13-
* How to stream large datasets.
13+
The guides are organized into five sections:
1414

15-
* How to upload and share a dataset.
15+
- **General usage**: Functions for general dataset loading and processing. The functions shown in this section are applicable across all dataset modalities.
16+
- **Audio**: How to load, process, and share audio datasets.
17+
- **Vision**: How to load, process, and share image datasets.
18+
- **Text**: How to load, process, and share text datasets.
19+
- **Dataset repository**: How to share and upload a dataset to the [Hub](https://huggingface.co/datasets).
1620

17-
* How to create a dataset loading script.
18-
19-
* How to create a dataset card.
20-
21-
* How to compute metrics.
22-
23-
* How to manage the cache.
24-
25-
You can also find guides on how to process massive datasets with Beam, how to integrate with cloud storage providers, and how to add an index to search your dataset.
21+
If you have any questions about 🤗 Datasets, feel free to join and ask the community on our [forum](https://discuss.huggingface.co/c/datasets/10).

docs/source/how_to_metrics.mdx

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,11 @@
11
# Metrics
22

3+
<Tip warning={true}>
4+
5+
Metrics will soon be deprecated in 🤗 Datasets. To learn more about how to use metrics, take a look at our newest library 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index)! In addition to metrics, we've also added more tools for evaluating models and datasets.
6+
7+
</Tip>
8+
39
Metrics are important for evaluating a model's predictions. In the tutorial, you learned how to compute a metric over an entire evaluation set. You have also seen how to load a metric.
410

511
This guide will show you how to:
