Commit 971e33e

Fix Overview.ipynb & detach Jupyter Notebooks from datasets repository (#5902)

alvarobartt, stevhliu, and mariosasko authored

* Fix and re-run `Overview.ipynb`
* Update `quickstart.mdx`
* Re-order subsections so that `Text` goes first
* Add missing install instructions for the machine learning frameworks
* Add [[open-in-colab]] button
* Add missing license
* Fix references to new sub-sections
* Remove unnecessary exclamation marks (my guess was that the exclamation mark was used for highlighting, but it's not, so reverted: 🤗 Datasets! -> 🤗 Datasets)
* Apply suggestions from code review
* Add `datasets_doc` to host notebooks in `huggingface/notebooks`
* Add `notebooks/README.md` (as of this commit, the URLs throw a 404 because they point to not-yet-pushed notebooks, to be pushed as part of `build_documentation`)
* Revert `Image` and `Text` renames
* Remove reference to `to_tf_dataset`
* Add deprecation message in `Overview.ipynb`, in favor of https://github.com/huggingface/notebooks/blob/main/datasets_doc/quickstart.ipynb
* Add `transformers`, `torch`, and `tensorflow` to the `docs` extra, so that the `TFPreTrainedModel.prepare_tf_dataset` and `DataLoader` examples build properly
* Add `albumentations` to extend data preparation
* Minor improvements

Co-authored-by: Steven Liu <[email protected]>
Co-authored-by: Mario Šaško <[email protected]>
1 parent f3da7a5 commit 971e33e

File tree

7 files changed: +1155 −2435 lines


.github/workflows/build_documentation.yml

Lines changed: 3 additions & 1 deletion

```diff
@@ -6,13 +6,15 @@ on:
       - main
       - doc-builder*
       - v*-release
+      - v*-patch
 
 jobs:
-  build:
+  build:
     uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
     with:
       commit_sha: ${{ github.sha }}
       package: datasets
+      notebook_folder: datasets_doc
     secrets:
       token: ${{ secrets.HUGGINGFACE_PUSH }}
       hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
```

README.md

Lines changed: 0 additions & 3 deletions

```diff
@@ -134,9 +134,6 @@ For more details on using the library, check the quick start page in the documen
 - Writing your own dataset loading script: https://huggingface.co/docs/datasets/dataset_script
 - etc.
 
-Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb)
-
 # Add a new dataset to the Hub
 
 We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).
```

docs/source/_config.py

Lines changed: 9 additions & 0 deletions

```diff
@@ -1,2 +1,11 @@
+# docstyle-ignore
+INSTALL_CONTENT = """
+# Datasets installation
+! pip install datasets transformers
+# To install from source instead of the last release, comment the command above and uncomment the following one.
+# ! pip install git+https://github.com/huggingface/datasets.git
+"""
+
+notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
 default_branch_name = "main"
 version_prefix = ""
```
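
For context, `notebook_first_cells` is the hook that `doc-builder` reads when it converts these doc pages into the notebooks pushed to `huggingface/notebooks`. A minimal sketch of how such a hook could be consumed, assuming `nbformat` and a hypothetical notebook path (the real conversion logic lives in `huggingface/doc-builder`):

```python
# Sketch only: prepend the configured install cell to an already-converted
# notebook. The actual implementation lives in huggingface/doc-builder.
import nbformat

INSTALL_CONTENT = """
# Datasets installation
! pip install datasets transformers
"""
notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]

nb = nbformat.read("datasets_doc/quickstart.ipynb", as_version=4)  # hypothetical path
nb.cells = [
    nbformat.v4.new_code_cell(cell["content"])
    for cell in notebook_first_cells
    if cell["type"] == "code"
] + nb.cells
nbformat.write(nb, "datasets_doc/quickstart.ipynb")
```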

docs/source/quickstart.mdx

Lines changed: 91 additions & 21 deletions

````diff
@@ -1,5 +1,19 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
 # Quickstart
 
+[[open-in-colab]]
+
 This quickstart is intended for developers who are ready to dive into the code and see an example of how to integrate 🤗 Datasets into their model training workflow. If you're a beginner, we recommend starting with our [tutorials](./tutorial), where you'll get a more thorough introduction.
 
 Each dataset is unique, and depending on the task, some datasets may require additional steps to prepare it for training. But you can always use 🤗 Datasets tools to load and process a dataset. The fastest and easiest way to get started is by loading an existing dataset from the [Hugging Face Hub](https://huggingface.co/datasets). There are thousands of datasets to choose from, spanning many tasks. Choose the type of dataset you want to work with, and let's get started!
````

````diff
@@ -33,17 +47,34 @@ Start by installing 🤗 Datasets:
 pip install datasets
 ```
 
-To work with audio datasets, install the [`Audio`] feature:
+🤗 Datasets also support audio and image data formats:
 
-```bash
-pip install datasets[audio]
-```
+* To work with audio datasets, install the [`Audio`] feature:
+
+  ```bash
+  pip install datasets[audio]
+  ```
+
+* To work with image datasets, install the [`Image`] feature:
 
-To work with image datasets, install the [`Image`] feature:
+  ```bash
+  pip install datasets[vision]
+  ```
 
+Besides 🤗 Datasets, make sure your preferred machine learning framework is installed:
+
+<frameworkcontent>
+<pt>
+```bash
+pip install torch
+```
+</pt>
+<tf>
 ```bash
-pip install datasets[vision]
+pip install tensorflow
 ```
+</tf>
+</frameworkcontent>
 
 ## Audio
 
````
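
The new install section can be smoke-tested with a one-liner; a minimal sketch, assuming nothing beyond a successful `pip install`:

```python
# Confirm the package imports after installation.
import datasets

print(datasets.__version__)
```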

````diff
@@ -116,16 +147,19 @@ Use the [`~Dataset.set_format`] function to set the dataset format to `torch` an
 ```
 </pt>
 <tf>
-Use the [`~Dataset.to_tf_dataset`] function to set the dataset format to be compatible with TensorFlow. You'll also need to import a [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorWithPadding) from 🤗 Transformers to combine the varying sequence lengths into a single batch of equal lengths:
+
+Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
+TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset`
+with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.
 
 ```py
 >>> import tensorflow as tf
 
->>> tf_dataset = dataset.to_tf_dataset(
-...     columns=["input_values"],
-...     label_cols=["labels"],
+>>> tf_dataset = model.prepare_tf_dataset(
+...     dataset,
 ...     batch_size=4,
-...     shuffle=True)
+...     shuffle=True,
+... )
 ```
 </tf>
 </frameworkcontent>
````
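
Once `prepare_tf_dataset` has produced the `tf.data.Dataset`, it plugs straight into Keras. A minimal sketch of that downstream step, assuming `model` is the TF model already loaded earlier in the guide:

```python
# Transformers TF models can be compiled without an explicit loss; when the
# batches contain labels, the model computes its own loss internally.
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))
model.fit(tf_dataset, epochs=3)
```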

````diff
@@ -190,6 +224,42 @@ Wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc
 >>> dataloader = DataLoader(dataset, collate_fn=collate_fn, batch_size=4)
 ```
 </pt>
+<tf>
+
+Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
+TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset`
+with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.
+
+Before you start, make sure you have up-to-date versions of `albumentations` and `cv2` installed:
+
+```bash
+pip install -U albumentations opencv-python
+```
+
+```py
+>>> import albumentations
+>>> import numpy as np
+
+>>> transform = albumentations.Compose([
+...     albumentations.RandomCrop(width=256, height=256),
+...     albumentations.HorizontalFlip(p=0.5),
+...     albumentations.RandomBrightnessContrast(p=0.2),
+... ])
+
+>>> def transforms(examples):
+...     examples["pixel_values"] = [
+...         transform(image=np.array(image))["image"] for image in examples["image"]
+...     ]
+...     return examples
+
+>>> dataset.set_transform(transforms)
+>>> tf_dataset = model.prepare_tf_dataset(
+...     dataset,
+...     batch_size=4,
+...     shuffle=True,
+... )
+```
+</tf>
 </frameworkcontent>
 
 **6**. Start training with your machine learning framework! Check out the 🤗 Transformers [image classification guide](https://huggingface.co/docs/transformers/tasks/image_classification) for an end-to-end example of how to train a model on an image dataset.
````
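
A note on the new snippet: `set_transform` applies the albumentations pipeline lazily at access time, so the random crop and flip are re-sampled on every pass instead of being baked into the dataset. A small sanity check, assuming RGB images of at least 256x256 pixels:

```python
# Accessing an example triggers the on-the-fly transform; after
# RandomCrop(width=256, height=256) the array has shape (256, 256, 3).
example = dataset[0]
print(example["pixel_values"].shape)
```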

````diff
@@ -259,19 +329,19 @@ Use the [`~Dataset.set_format`] function to set the dataset format to `torch` an
 ```
 </pt>
 <tf>
-Use the [`~Dataset.to_tf_dataset`] function to set the dataset format to be compatible with TensorFlow. You'll also need to import a [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorWithPadding) from 🤗 Transformers to combine the varying sequence lengths into a single batch of equal lengths:
+
+Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
+TensorFlow, and ready to train/fine-tune a model, as it wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset`
+with collation and batching, so one can pass it directly to Keras methods like `fit()` without further modification.
 
 ```py
 >>> import tensorflow as tf
->>> from transformers import DataCollatorWithPadding
-
->>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
->>> tf_dataset = dataset.to_tf_dataset(
-...     columns=["input_ids", "token_type_ids", "attention_mask"],
-...     label_cols=["labels"],
-...     batch_size=2,
-...     collate_fn=data_collator,
-...     shuffle=True)
+
+>>> tf_dataset = model.prepare_tf_dataset(
+...     dataset,
+...     batch_size=4,
+...     shuffle=True,
+... )
 ```
 </tf>
 </frameworkcontent>
````
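
One detail the new snippet leaves implicit: `prepare_tf_dataset` also accepts the tokenizer, in which case each batch is padded on the fly, which is what stands in for the removed `DataCollatorWithPadding`. A hedged end-to-end sketch (the checkpoint name is illustrative, not necessarily the one the guide uses):

```python
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Illustrative checkpoint; substitute the model used earlier in the guide.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")

# Passing `tokenizer` lets prepare_tf_dataset pad each batch dynamically,
# standing in for the explicit DataCollatorWithPadding used previously.
tf_dataset = model.prepare_tf_dataset(
    dataset,
    batch_size=4,
    shuffle=True,
    tokenizer=tokenizer,
)
```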
