huggingface · stevhliu · Jul 7, 2022 · Jun 14, 2022 · Jun 14, 2022 · Jun 15, 2022
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
@@ -23,38 +23,52 @@
 - sections:
   - local: how_to
     title: Overview
-  - local: loading
-    title: Load
-  - local: process
-    title: Process
-  - local: audio_process
-    title: Process audio data
-  - local: image_process
-    title: Process image data
-  - local: stream
-    title: Stream
-  - local: use_with_tensorflow
-    title: Use with TensorFlow
-  - local: use_with_pytorch
-    title: Use with PyTorch
-  - local: share
-    title: Share
-  - local: dataset_script
-    title: Create a dataset loading script
-  - local: dataset_card
-    title: Create a dataset card
-  - local: repository_structure
-    title: Structure your repository
-  - local: cache
-    title: Cache management
-  - local: filesystems
-    title: Cloud storage
-  - local: faiss_es
-    title: Search index
-  - local: how_to_metrics
-    title: Metrics
-  - local: beam
-    title: Beam Datasets
+  - sections:
+    - local: loading
+      title: Load
+    - local: process
+      title: Process
+    - local: stream
+      title: Stream
+    - local: use_with_tensorflow
+      title: Use with TensorFlow
+    - local: use_with_pytorch
+      title: Use with PyTorch
+    - local: share
+      title: Share
+    - local: dataset_script
+      title: Create a dataset loading script
+    - local: dataset_card
+      title: Create a dataset card
+    - local: repository_structure
+      title: Structure your repository
+    - local: cache
+      title: Cache management
+    - local: filesystems
+      title: Cloud storage
+    - local: faiss_es
+      title: Search index
+    - local: how_to_metrics
+      title: Metrics
+    - local: beam
+      title: Beam Datasets
+    title: "General usage"
+  - sections:
+    - local: audio_load
+      title: Load audio data
+    - local: audio_process
+      title: Process audio data
+    title: "Audio"
+  - sections:
+    - local: image_load
+      title: Load image data
+    - local: image_process
+      title: Process image data
+    title: "Vision"
+  - sections:
+    - local: nlp_process
+      title: Process text data
+    title: "Text"
   title: "How-to guides"
 - sections:
   - local: about_arrow

diff --git a/docs/source/audio_load.mdx b/docs/source/audio_load.mdx
@@ -0,0 +1,49 @@
+# Load audio data
+
+Audio datasets are loaded from the `audio` column, which contains three important fields:
+
+* `array`: the decoded audio data represented as a 1-dimensional array.
+* `path`: the path to the downloaded audio file.
+* `sampling_rate`: the sampling rate of the audio data.
+
+When you load an audio dataset and call the `audio` column, the [`Audio`] feature automatically decodes and resamples the audio file:
+
+```py
+>>> from datasets import load_dataset, Audio
+
+>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
+>>> dataset[0]["audio"]
+{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
+         0.        ,  0.        ], dtype=float32),
+ 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
+ 'sampling_rate': 8000}
+```
+
+<Tip warning={true}>
+
+Index into an audio dataset using the row index first and then the `audio` column - `dataset[0]["audio"]` - to decode and resample only that example. Indexing the column first - `dataset["audio"][0]` - decodes and resamples every audio file in the dataset, which can take a long time on a large dataset.
+
+</Tip>
+
+## Path
+
+The `path` is useful for loading your own dataset. Use the [`~Dataset.cast_column`] function to take a column of audio file paths and decode it into `array` data with the [`Audio`] feature:
+
+```py
+>>> audio_dataset = audio_dataset.cast_column("paths_to_my_audio_files", Audio())
+```
+
+If you only want to load the underlying path to the audio file without decoding it into an `array`, set `decode=False` in the [`Audio`] feature:
+
+```py
+>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train").cast_column('audio', Audio(decode=False))
+>>> dataset[0]
+{'audio': {'bytes': None,
+  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav'},
+ 'english_transcription': 'I would like to set up a joint account with my partner',
+ 'intent_class': 11,
+ 'lang_id': 4,
+ 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
+ 'transcription': 'I would like to set up a joint account with my partner'}
+```
+
diff --git a/docs/source/audio_process.mdx b/docs/source/audio_process.mdx
@@ -1,93 +1,31 @@
 # Process audio data
 
-🤗 Datasets supports an [`Audio`] feature, enabling users to load and process raw audio files for training. This guide will show you how to:
+This guide shows specific methods for processing audio datasets. Learn how to:
 
-- Load your own custom audio dataset.
-- Resample audio files.
-- Use [`Dataset.map`] with audio files.
+- Resample audio to a different sampling rate.
+- Use [`~Dataset.map`] with audio datasets.
 
-## Installation
+For a guide on processing any type of dataset, take a look at the <a class="bg-pink-200 dark:bg-pink-500 px-1 rounded font-bold underline decoration-pink-900 text-pink-900" href="./process">general process guide</a>.
 
-The [`Audio`] feature should be installed as an extra dependency in 🤗 Datasets. Install the [`Audio`] feature (and its dependencies) with pip:
+## Cast
 
-```bash
-pip install datasets[audio]
-```
-
-<Tip warning={true}>
-
-On Linux, non-Python dependency on `libsndfile` package must be installed manually, using your distribution package manager, for example:
-
-```bash
-sudo apt-get install libsndfile1
-```
-
-</Tip>
-
-To support loading audio datasets containing MP3 files, users should additionally install [torchaudio](https://pytorch.org/audio/stable/index.html), so that audio data is handled with high performance.
-
-```bash
-pip install torchaudio
-```
-
-<Tip warning={true}>
-
-torchaudio's `sox_io` [backend](https://pytorch.org/audio/stable/backend.html#) supports decoding `mp3` files. Unfortunately, the `sox_io` backend is only available on Linux/macOS, and is not supported by Windows.
-
-</Tip>
-
-Then you can load an audio dataset the same way you would load a text dataset. For example, load the [Common Voice](https://huggingface.co/datasets/common_voice) dataset with the Turkish configuration:
-
-```py
->>> from datasets import load_dataset, load_metric, Audio
->>> common_voice = load_dataset("common_voice", "tr", split="train")
-```
-
-## Audio datasets
-
-Audio datasets commonly have an `audio` and `path` or `file` column.
-
-`audio` is the actual audio file that is loaded and resampled on-the-fly upon calling it.
-
-```py
->>> common_voice[0]["audio"]
-{'array': array([ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
-    -8.8930130e-05, -3.8027763e-05, -2.9146671e-05], dtype=float32),
-'path': '/root/.cache/huggingface/datasets/downloads/extracted/05be0c29807a73c9b099873d2f5975dae6d05e9f7d577458a2466ecb9a2b0c6b/cv-corpus-6.1-2020-12-11/tr/clips/common_voice_tr_21921195.mp3',
-'sampling_rate': 48000}
-```
-
-When you access an audio file, it is automatically decoded and resampled. Generally, you should query an audio file like: `common_voice[0]["audio"]`. If you query an audio file with `common_voice["audio"][0]` instead, **all** the audio files in your dataset will be decoded and resampled. This process can take a long time if you have a large dataset.
-
-`path` or `file` is an absolute path to an audio file.
-
-```py
->>> common_voice[0]["path"]
-/root/.cache/huggingface/datasets/downloads/extracted/05be0c29807a73c9b099873d2f5975dae6d05e9f7d577458a2466ecb9a2b0c6b/cv-corpus-6.1-2020-12-11/tr/clips/common_voice_tr_21921195.mp3
-```
-
-The `path` is useful if you want to load your own audio dataset. In this case, provide a column of audio file paths to [`Dataset.cast_column`]:
+The [`~Dataset.cast_column`] function casts a column to another feature to be decoded. When you use this function with the [`Audio`] feature, you can resample the audio to a different sampling rate:
 
 ```py
->>> my_audio_dataset = my_audio_dataset.cast_column("paths_to_my_audio_files", Audio())
-```
-
-## Resample
+>>> from datasets import load_dataset, Audio
 
-Some models expect the audio data to have a certain sampling rate due to how the model was pretrained. For example, the [XLSR-Wav2Vec2](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) model expects the input to have a sampling rate of 16kHz, but an audio file from the Common Voice dataset has a sampling rate of 48kHz. You can use [`Dataset.cast_column`] to downsample the sampling rate to 16kHz:
-
-```py
->>> common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))
+>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
+>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
 ```
 
-The next time you load the audio file, the [`Audio`] feature will load and resample it to 16kHz:
+Audio files are decoded and resampled on-the-fly, so the next time you access an example, the audio file is resampled to 16kHz:
 
 ```py
->>> common_voice_train[0]["audio"]
-{'array': array([ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
--7.4556941e-05, -1.4621433e-05, -5.7861507e-05], dtype=float32),
-'path': '/root/.cache/huggingface/datasets/downloads/extracted/05be0c29807a73c9b099873d2f5975dae6d05e9f7d577458a2466ecb9a2b0c6b/cv-corpus-6.1-2020-12-11/tr/clips/common_voice_tr_21921195.mp3',
-'sampling_rate': 16000}
+>>> dataset[0]["audio"]
+{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
+         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
+ 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
+ 'sampling_rate': 16000}
 ```
 
 <div class="flex justify-center">
@@ -97,31 +35,29 @@ The next time you load the audio file, the [`Audio`] feature will load and resam
 
 ## Map
 
-Just like text datasets, you can apply a preprocessing function over an entire dataset with [`Dataset.map`], which is useful for preprocessing all of your audio data at once. Start with a [speech recognition model](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) of your choice, and load a `processor` object that contains:
+The [`~Dataset.map`] function helps preprocess your entire dataset at once. Depending on the type of model you're working with, you'll need to load either a [feature extractor](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoFeatureExtractor) or a [processor](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoProcessor).
 
-1. A feature extractor to convert the speech signal to the model's input format. Every speech recognition model on the 🤗 [Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) contains a predefined feature extractor that can be easily loaded with `AutoFeatureExtractor.from_pretrained(...)`.
+- For pretrained speech recognition models, load a feature extractor and tokenizer and combine them in a `processor`:
 
-2. A tokenizer to convert the model's output format to text. Fine-tuned speech recognition models, such as [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h), contain a predefined tokenizer that can be easily loaded with `AutoTokenizer.from_pretrained(...)`.
+    ```py
+    >>> from transformers import AutoFeatureExtractor, Wav2Vec2CTCTokenizer, Wav2Vec2Processor
 
+    >>> model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
+    # after defining a vocab.json file you can instantiate a tokenizer object:
+    >>> tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
+    >>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
+    # combine the feature extractor and tokenizer by constructing the processor directly:
+    >>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
+    ```
 
-```py
->>> from transformers import AutoTokenizer, AutoFeatureExtractor, Wav2Vec2Processor
->>> model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
->>> # after defining a vocab.json file you can instantiate a tokenizer object:
->>> tokenizer = AutoTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
->>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
->>> processor = Wav2Vec2Processor.from_pretrained(feature_extractor=feature_extractor, tokenizer=tokenizer)
-```
+- For fine-tuned speech recognition models, you only need to load a `processor`:
 
-For fine-tuned speech recognition models, you can simply load a predefined `processor` object with:
+    ```py
+    >>> from transformers import AutoProcessor
 
-```py
->>> from transformers import Wav2Vec2Processor
->>> processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
-```
+    >>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
+    ```
 
-Make sure to include the `audio` key in your preprocessing function when you call [`Dataset.map`] so that you are actually resampling the audio data:
+When you use [`~Dataset.map`] with your preprocessing function, include the `audio` column to ensure you're actually resampling the audio data:
 
 ```py
 >>> def prepare_dataset(batch):
@@ -131,5 +67,5 @@ Make sure to include the `audio` key in your preprocessing function when you cal
 ...     with processor.as_target_processor():
 ...         batch["labels"] = processor(batch["sentence"]).input_ids
 ...     return batch
->>> common_voice_train = common_voice_train.map(prepare_dataset, remove_columns=common_voice_train.column_names)
+>>> dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)
 ```
diff --git a/docs/source/how_to.md b/docs/source/how_to.md
@@ -1,25 +1,28 @@
 # Overview
 
-Our how-to guides will show you how to complete a specific task. These guides are intended to help you apply your knowledge of 🤗 Datasets to real-world problems you may encounter. Want to flatten a column or load a dataset from a local file? We got you covered! You should already be familiar and comfortable with the 🤗 Datasets basics, and if you aren't, we recommend reading our [tutorial](./tutorial) first.
+The how-to guides give a more comprehensive overview of the tools in 🤗 Datasets and how to use them. They will help you tackle some of the messier real-world datasets, where you may need to manipulate the dataset structure or content to get it ready for training.
 
-The how-to guides will cover eight key areas of 🤗 Datasets:
+The guides assume you are familiar and comfortable with the 🤗 Datasets basics. We recommend newer users check out our [tutorials](tutorial) first.
 
-* How to load a dataset from other data sources.
+<Tip>
 
-* How to process a dataset.
+Interested in learning more? Take a look at [Chapter 5](https://huggingface.co/course/chapter5/1?fw=pt) of the Hugging Face course!
 
-* How to use a dataset with your favorite ML/DL framework.
+</Tip>
 
-* How to stream large datasets.
+The guides cover four key areas of 🤗 Datasets:
 
-* How to upload and share a dataset.
+<div>
+    <span class="bg-pink-200 text-pink-900 dark:bg-pink-500 px-1 rounded font-bold">General usage</span>: Functions for general dataset loading and processing. The functions shown in this section are applicable across all dataset modalities.
+</div>
+<div>
+    <span class="bg-yellow-200 text-yellow-900 dark:bg-yellow-500 px-1 rounded font-bold">Audio</span>: How to load, process, and share audio datasets.
+</div>
+<div>
+    <span class="bg-green-200 text-green-900 dark:bg-green-500 px-1 rounded font-bold">Vision</span>: How to load, process, and share image datasets.
+</div>
+<div>
+    <span class="bg-blue-200 text-blue-900 dark:bg-blue-500 px-1 rounded font-bold">Text</span>: How to load, process, and share text datasets.
+</div>
 
-* How to create a dataset loading script.
-
-* How to create a dataset card.
-
-* How to compute metrics.
-
-* How to manage the cache.
-
-You can also find guides on how to process massive datasets with Beam, how to integrate with cloud storage providers, and how to add an index to search your dataset.
+If you have any questions about 🤗 Datasets, feel free to join and ask the community on our [forum](https://discuss.huggingface.co/c/datasets/10).
diff --git a/docs/source/how_to_metrics.mdx b/docs/source/how_to_metrics.mdx
@@ -1,5 +1,11 @@
 # Metrics
 
+<Tip warning={true}>
+
+Metrics will soon be deprecated in 🤗 Datasets. To learn more about how to use metrics, take a look at our newest library 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index)! In addition to metrics, we've also added more tools for evaluating models and datasets.
+
+</Tip>
+
 Metrics are important for evaluating a model's predictions. In the tutorial, you learned how to compute a metric over an entire evaluation set. You have also seen how to load a metric. 
 
 This guide will show you how to: