4 changes: 3 additions & 1 deletion .github/workflows/build_documentation.yml
@@ -6,13 +6,15 @@ on:
       - main
       - doc-builder*
       - v*-release
+      - v*-patch
 
 jobs:
-  build:
+  build:
     uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
     with:
       commit_sha: ${{ github.sha }}
       package: datasets
       notebook_folder: datasets_doc
     secrets:
       token: ${{ secrets.HUGGINGFACE_PUSH }}
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
3 changes: 0 additions & 3 deletions README.md
@@ -134,9 +134,6 @@ For more details on using the library, check the quick start page in the documen
 - Writing your own dataset loading script: https://huggingface.co/docs/datasets/dataset_script
 - etc.
 
-Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb)
-
 # Add a new dataset to the Hub
 
 We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).
9 changes: 9 additions & 0 deletions docs/source/_config.py
@@ -1,2 +1,11 @@
+# docstyle-ignore
+INSTALL_CONTENT = """
+# Datasets installation
+! pip install datasets transformers
+# To install from source instead of the last release, comment the command above and uncomment the following one.
+# ! pip install git+https://github.com/huggingface/datasets.git
+"""
+
+notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
 default_branch_name = "main"
 version_prefix = ""
112 changes: 91 additions & 21 deletions docs/source/quickstart.mdx
@@ -1,5 +1,19 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
 # Quickstart
 
+[[open-in-colab]]
+
 This quickstart is intended for developers who are ready to dive into the code and see an example of how to integrate 🤗 Datasets into their model training workflow. If you're a beginner, we recommend starting with our [tutorials](./tutorial), where you'll get a more thorough introduction.
 
 Each dataset is unique, and depending on the task, some datasets may require additional steps to prepare it for training. But you can always use 🤗 Datasets tools to load and process a dataset. The fastest and easiest way to get started is by loading an existing dataset from the [Hugging Face Hub](https://huggingface.co/datasets). There are thousands of datasets to choose from, spanning many tasks. Choose the type of dataset you want to work with, and let's get started!
@@ -33,17 +47,34 @@ Start by installing 🤗 Datasets:
 pip install datasets
 ```
 
-To work with audio datasets, install the [`Audio`] feature:
+🤗 Datasets also supports audio and image data formats:
 
-```bash
-pip install datasets[audio]
-```
+* To work with audio datasets, install the [`Audio`] feature:
+
+    ```bash
+    pip install datasets[audio]
+    ```
 
-To work with image datasets, install the [`Image`] feature:
+* To work with image datasets, install the [`Image`] feature:
 
-```bash
-pip install datasets[vision]
-```
+    ```bash
+    pip install datasets[vision]
+    ```
+
+Besides 🤗 Datasets, make sure your preferred machine learning framework is installed:
+
+<frameworkcontent>
+<pt>
+```bash
+pip install torch
+```
+</pt>
+<tf>
+```bash
+pip install tensorflow
+```
+</tf>
+</frameworkcontent>
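A quick way to confirm the install picked up a working version (a minimal check, not tied to any one extra):

```py
>>> import datasets
>>> print(datasets.__version__)  # prints e.g. "2.x.y" when the install succeeded
```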

 ## Audio
 
@@ -116,16 +147,19 @@ Use the [`~Dataset.set_format`] function to set the dataset format to `torch` an
 ```
 </pt>
 <tf>
-Use the [`~Dataset.to_tf_dataset`] function to set the dataset format to be compatible with TensorFlow. You'll also need to import a [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorWithPadding) from 🤗 Transformers to combine the varying sequence lengths into a single batch of equal lengths:
+
+Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
+TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset`
+with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.
 
 ```py
 >>> import tensorflow as tf
 
->>> tf_dataset = dataset.to_tf_dataset(
-...     columns=["input_values"],
-...     label_cols=["labels"],
+>>> tf_dataset = model.prepare_tf_dataset(
+...     dataset,
 ...     batch_size=4,
-...     shuffle=True)
+...     shuffle=True,
+... )
 ```
 </tf>
 </frameworkcontent>
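Since the returned `tf.data.Dataset` is already collated and batched, it can go straight into Keras. A minimal sketch, assuming `model` is the 🤗 Transformers TF model being fine-tuned in this section:

```py
>>> model.compile(optimizer="adam")  # no explicit loss: 🤗 Transformers models can compute their own internal loss
>>> model.fit(tf_dataset, epochs=3)
```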
@@ -190,6 +224,42 @@ Wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc
 >>> dataloader = DataLoader(dataset, collate_fn=collate_fn, batch_size=4)
 ```
 </pt>
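The `collate_fn` referenced above is defined earlier in the guide; a minimal sketch of what it does, assuming each example carries a `pixel_values` tensor and an integer `labels` field:

```py
>>> import torch

>>> def collate_fn(examples):
...     pixel_values = torch.stack([example["pixel_values"] for example in examples])
...     labels = torch.tensor([example["labels"] for example in examples])
...     return {"pixel_values": pixel_values, "labels": labels}
```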
+<tf>
+
+Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
+TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset`
+with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.
+
+Before you start, make sure you have up-to-date versions of `albumentations` and `cv2` installed:
+
+```bash
+pip install -U albumentations opencv-python
+```
+
+```py
+>>> import albumentations
+>>> import numpy as np
+
+>>> transform = albumentations.Compose([
+...     albumentations.RandomCrop(width=256, height=256),
+...     albumentations.HorizontalFlip(p=0.5),
+...     albumentations.RandomBrightnessContrast(p=0.2),
+... ])
+
+>>> def transforms(examples):
+...     examples["pixel_values"] = [
+...         transform(image=np.array(image))["image"] for image in examples["image"]
+...     ]
+...     return examples
+
+>>> dataset.set_transform(transforms)
+>>> tf_dataset = model.prepare_tf_dataset(
+...     dataset,
+...     batch_size=4,
+...     shuffle=True,
+... )
+```
+</tf>
 </frameworkcontent>
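Note that [`~Dataset.set_transform`] applies the albumentations pipeline lazily, when an example is accessed, rather than rewriting the dataset up front. A small sketch, assuming the dataset and transforms above:

```py
>>> example = dataset[0]             # the transform runs here, on access
>>> example["pixel_values"].shape    # e.g. (256, 256, 3) after the 256x256 random crop
```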

 **6**. Start training with your machine learning framework! Check out the 🤗 Transformers [image classification guide](https://huggingface.co/docs/transformers/tasks/image_classification) for an end-to-end example of how to train a model on an image dataset.
@@ -259,19 +329,19 @@ Use the [`~Dataset.set_format`] function to set the dataset format to `torch` an
 ```
 </pt>
 <tf>
-Use the [`~Dataset.to_tf_dataset`] function to set the dataset format to be compatible with TensorFlow. You'll also need to import a [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorWithPadding) from 🤗 Transformers to combine the varying sequence lengths into a single batch of equal lengths:
+
+Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
+TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset`
+with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.
 
 ```py
 >>> import tensorflow as tf
->>> from transformers import DataCollatorWithPadding
 
->>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
->>> tf_dataset = dataset.to_tf_dataset(
-...     columns=["input_ids", "token_type_ids", "attention_mask"],
-...     label_cols=["labels"],
-...     batch_size=2,
-...     collate_fn=data_collator,
-...     shuffle=True)
+>>> tf_dataset = model.prepare_tf_dataset(
+...     dataset,
+...     batch_size=4,
+...     shuffle=True,
+... )
 ```
 </tf>
 </frameworkcontent>
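`prepare_tf_dataset` also accepts an explicit `collate_fn` for full control over padding. A sketch, assuming the `tokenizer` from the tokenization step earlier in this section, which reproduces what the removed `DataCollatorWithPadding` example did by hand:

```py
>>> from transformers import DataCollatorWithPadding

>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
>>> tf_dataset = model.prepare_tf_dataset(
...     dataset,
...     batch_size=4,
...     shuffle=True,
...     collate_fn=data_collator,
... )
```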