
Commit e5bf4f3

Rocketknight1, lhoestq, and stevhliu authored
First draft of the docs for TF + Datasets (#4457)
* First draft of the docs for TF + Datasets
* Update docs/source/use_with_tensorflow.mdx (Co-authored-by: Quentin Lhoest <[email protected]>)
* Update docs/source/use_with_tensorflow.mdx (Co-authored-by: Quentin Lhoest <[email protected]>)
* Add some example situations to clarify
* mention nd-arrays
* add to toc
* Link to transformers examples
* Missing links added
* Update docs/source/use_with_tensorflow.mdx (Co-authored-by: Steven Liu <[email protected]>)
* Update docs/source/use_with_tensorflow.mdx (Co-authored-by: Steven Liu <[email protected]>)
* Update docs/source/use_with_tensorflow.mdx (Co-authored-by: Steven Liu <[email protected]>)
* Update docs/source/use_with_tensorflow.mdx (Co-authored-by: Steven Liu <[email protected]>)
* Update docs/source/use_with_tensorflow.mdx (Co-authored-by: Steven Liu <[email protected]>)
* Pushing fixes from the review!
* Fixing link to to_tf_dataset docs
* Adding quick introduction
* Update docs/source/use_with_tensorflow.mdx (Co-authored-by: Quentin Lhoest <[email protected]>)
* Update docs/source/use_with_tensorflow.mdx (Co-authored-by: Quentin Lhoest <[email protected]>)
* Update docs/source/use_with_tensorflow.mdx (Co-authored-by: Quentin Lhoest <[email protected]>)

Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
1 parent 5994036 commit e5bf4f3

File tree

2 files changed: +198, -0 lines changed

docs/source/_toctree.yml

Lines changed: 2 additions & 0 deletions

@@ -33,6 +33,8 @@
   title: Process image data
 - local: stream
   title: Stream
+- local: use_with_tensorflow
+  title: Use with TensorFlow
 - local: use_with_pytorch
   title: Use with PyTorch
 - local: share
docs/source/use_with_tensorflow.mdx

Lines changed: 196 additions & 0 deletions

# Using Datasets with TensorFlow

This document is a quick introduction to using `datasets` with TensorFlow, with a particular focus on how to get
`tf.Tensor` objects out of our datasets, and how to stream data from Hugging Face `Dataset` objects to Keras methods
like `model.fit()`.

## Dataset format

By default, datasets return regular Python objects: integers, floats, strings, lists, etc.

To get TensorFlow tensors instead, you can set the format of the dataset to `tf`:

```py
>>> from datasets import Dataset
>>> data = [[1, 2],[3, 4]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 2])>}
>>> ds[:2]
{'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[1, 2],
       [3, 4]])>}
```

<Tip>

A [`Dataset`] object is a wrapper of an Arrow table, which allows fast reads from arrays in the dataset to TensorFlow tensors.

</Tip>

This can be useful for converting your dataset to a dict of `Tensor` objects, or for writing a generator to load TF
samples from it. If you wish to convert the entire dataset to `Tensor`, simply query the full dataset:

```py
>>> ds[:]
{'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[1, 2],
       [3, 4]])>}
```

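For instance, a minimal generator over the formatted dataset could look like the sketch below (the helper name is made up for illustration; the "Data loading" section later describes a more convenient approach):

```py
>>> def gen_tf_samples(dataset):
...     # yields one formatted sample (a dict of tf.Tensor objects) at a time
...     for i in range(len(dataset)):
...         yield dataset[i]
>>> next(gen_tf_samples(ds))
{'data': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 2])>}
```
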
## N-dimensional arrays

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
In particular, a TensorFlow formatted dataset outputs a `RaggedTensor` instead of a single tensor:

```py
>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.RaggedTensor [[1, 2], [3, 4]]>}
```

To get a single tensor, you must explicitly use one of the `Array` feature types and specify the shape of your tensors:

```py
>>> from datasets import Dataset, Features, Array2D
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> features = Features({"data": Array2D(shape=(2, 2), dtype='int64')})
>>> ds = Dataset.from_dict({"data": data}, features=features)
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[1, 2],
       [3, 4]])>}
>>> ds[:2]
{'data': <tf.Tensor: shape=(2, 2, 2), dtype=int64, numpy=
array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [7, 8]]])>}
```

## Other feature types

[`ClassLabel`] data are properly converted to tensors:

```py
>>> from datasets import Dataset, Features, ClassLabel
>>> data = [0, 0, 1]
>>> features = Features({"data": ClassLabel(names=["negative", "positive"])})
>>> ds = Dataset.from_dict({"data": data}, features=features)
>>> ds = ds.with_format("tf")
>>> ds[:3]
{'data': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 0, 1])>}
```

Strings are also supported:

```py
>>> from datasets import Dataset
>>> text = ["foo", "bar"]
>>> data = [0, 1]
>>> ds = Dataset.from_dict({"text": text, "data": data})
>>> ds = ds.with_format("tf")
>>> ds[:2]
{'text': <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'foo', b'bar'], dtype=object)>,
 'data': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([0, 1])>}
```

You can also explicitly format certain columns and leave the other columns unformatted:

```py
>>> ds = ds.with_format("tf", columns=["data"], output_all_columns=True)
>>> ds[:2]
{'data': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([0, 1])>,
 'text': ['foo', 'bar']}
```

The [`Image`] and [`Audio`] feature types are not supported yet.

## Data loading

Although you can load individual samples and batches just by indexing into your dataset, this won't work if you want
to use Keras methods like `fit()` and `predict()`. You could write a generator function that shuffles and loads batches
from your dataset and call `fit()` on that, but that sounds like a lot of unnecessary work. Instead, if you want to stream
data from your dataset on-the-fly, we recommend converting your dataset to a `tf.data.Dataset` using the
`to_tf_dataset()` method.

The `tf.data.Dataset` class covers a wide range of use cases - it is often created from Tensors in memory, or using a load function to read files on disk
or external storage. The dataset can be transformed arbitrarily with the `map()` method, or methods like `batch()`
and `shuffle()` can be used to create a dataset that's ready for training. These methods do not modify the stored data
in any way - instead, the methods build a data pipeline graph that will be executed when the dataset is iterated over,
usually during model training or inference. This is different from the `map()` method of Hugging Face `Dataset` objects,
which runs the map function immediately and saves the new or changed columns.

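For instance, here is a minimal sketch of such a pipeline built from in-memory values (the function and numbers are arbitrary); none of the steps run until the dataset is iterated over:

```py
>>> import tensorflow as tf
>>> tf_pipeline = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
>>> # map/shuffle/batch only extend the pipeline graph; nothing is computed yet
>>> tf_pipeline = tf_pipeline.map(lambda x: x * 2).shuffle(buffer_size=4).batch(2)
>>> for batch in tf_pipeline:  # the pipeline executes here, as batches are requested
...     print(batch)
```
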
Since the entire data preprocessing pipeline can be compiled in a `tf.data.Dataset`, this approach allows for massively
parallel, asynchronous data loading and training. However, the requirement for graph compilation can be a limitation,
particularly for Hugging Face tokenizers, which are usually not (yet!) compilable as part of a TF graph. As a result,
we usually advise pre-processing the dataset as a Hugging Face dataset, where arbitrary Python functions can be
used, and then converting to `tf.data.Dataset` afterwards using `to_tf_dataset()` to get a batched dataset ready for
training. To see examples of this approach, please see the [examples](https://github.com/huggingface/transformers/tree/main/examples) or [notebooks](https://huggingface.co/docs/transformers/notebooks) for `transformers`.

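As a rough sketch of that workflow (the checkpoint name and column names are illustrative assumptions, and `transformers` is assumed to be installed), the tokenization step happens eagerly with `Dataset.map()` before any conversion to TF:

```py
>>> from datasets import Dataset
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # illustrative checkpoint
>>> ds = Dataset.from_dict({"text": ["hello there", "goodbye cruel world"], "labels": [0, 1]})
>>> # map() runs immediately: the tokenized columns are computed now and stored in the dataset
>>> ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)
```

The tokenized dataset can then be handed to `to_tf_dataset()`, as described in the next section.
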
### Using `to_tf_dataset()`

Using `to_tf_dataset()` is straightforward. Once your dataset is preprocessed and ready, simply call it like so:

```py
>>> from datasets import Dataset
>>> data = {"inputs": [[1, 2],[3, 4]], "labels": [0, 1]}
>>> ds = Dataset.from_dict(data)
>>> tf_ds = ds.to_tf_dataset(
...     columns=["inputs"],
...     label_cols=["labels"],
...     batch_size=2,
...     shuffle=True
... )
```

The returned `tf_ds` object here is now fully ready to train on, and can be passed directly to `model.fit()`! Note
that you set the batch size when creating the dataset, and so you don't need to specify it when calling `fit()`:

```py
>>> model.fit(tf_ds, epochs=2)
```

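The `model` above can be any compiled Keras model whose input names match the dataset columns. As a purely illustrative sketch (the toy architecture and loss below are arbitrary choices, not something prescribed by `datasets`), a model able to consume the `inputs` column from this example could be built like this:

```py
>>> import tensorflow as tf
>>> inputs = tf.keras.Input(shape=(2,), dtype="int64", name="inputs")  # name matches the dataset column
>>> outputs = tf.keras.layers.Dense(2, activation="softmax")(tf.cast(inputs, tf.float32))
>>> model = tf.keras.Model(inputs, outputs)
>>> model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
>>> model.fit(tf_ds, epochs=2)
```
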
For a full description of the arguments, please see the [`~Dataset.to_tf_dataset`] documentation. In many cases,
you will also need to add a `collate_fn` to your call. This is a function that takes multiple elements of the dataset
and combines them into a single batch. When all elements have the same length, the built-in default collator will
suffice, but for more complex tasks a custom collator may be necessary. In particular, many tasks have samples
with varying sequence lengths which will require a [data collator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator) that can pad batches correctly. You can see examples
of this in the `transformers` NLP [examples](https://github.com/huggingface/transformers/tree/main/examples) and
[notebooks](https://huggingface.co/docs/transformers/notebooks), where variable sequence lengths are very common.

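For example, continuing from the tokenized (but unpadded) dataset sketched earlier, a padding collator from `transformers` could be passed in like this - again a sketch of one common option rather than a required recipe:

```py
>>> from transformers import DataCollatorWithPadding
>>> # pads every batch to the length of its longest sample and returns TF tensors
>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
>>> tf_ds = ds.to_tf_dataset(
...     columns=["input_ids", "attention_mask"],
...     label_cols=["labels"],
...     batch_size=2,
...     shuffle=True,
...     collate_fn=data_collator
... )
```
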
### When to use `to_tf_dataset()`

The astute reader may have noticed at this point that we have offered two approaches to achieve the same goal - if you
want to pass your dataset to a TensorFlow model, you can either convert the dataset to a `Tensor` or `dict` of `Tensors`
using `.with_format('tf')`, or you can convert the dataset to a `tf.data.Dataset` with `to_tf_dataset()`. Either of these
can be passed to `model.fit()`, so which should you choose?

The key thing to recognize is that when you convert the whole dataset to `Tensor`s, it is static and fully loaded into
RAM. This is simple and convenient, but if any of the following apply, you should probably use `to_tf_dataset()`
instead:

- Your dataset is too large to fit in RAM. `to_tf_dataset()` streams only one batch at a time, so even very large
  datasets can be handled with this method.
- You want to apply random transformations using `dataset.with_transform()` or the `collate_fn`. This is
  common in several modalities, such as image augmentations when training vision models, or random masking when training
  masked language models. Using `to_tf_dataset()` will apply those transformations
  at the moment when a batch is loaded, which means the same samples will get different augmentations each time
  they are loaded. This is usually what you want - see the sketch after this list.
- Your data has a variable dimension, such as input texts in NLP that consist of varying
  numbers of tokens. When you create a batch with samples with a variable dimension, the standard solution is to
  pad the shorter samples to the length of the longest one. When you stream samples from a dataset with `to_tf_dataset`,
  you can apply this padding to each batch via your `collate_fn`. However, if you want to convert
  such a dataset to dense `Tensor`s, then you will have to pad samples to the length of the longest sample in *the
  entire dataset!* This can result in huge amounts of padding, which wastes memory and reduces your model's speed.

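As a minimal sketch of the second point above (the noise function is a made-up stand-in for a real augmentation), a transform registered with `with_transform()` is re-run every time a sample is read, so each batch that `to_tf_dataset()` loads sees fresh random values:

```py
>>> import numpy as np
>>> from datasets import Dataset
>>> def add_noise(batch):
...     # hypothetical augmentation: fresh random noise on every read
...     batch["data"] = [np.array(x, dtype=np.float32) + np.random.normal(scale=0.1, size=2) for x in batch["data"]]
...     return batch
>>> ds = Dataset.from_dict({"data": [[1.0, 2.0], [3.0, 4.0]], "labels": [0, 1]})
>>> ds = ds.with_transform(add_noise)
>>> ds[0]["data"]  # different values every time this sample is accessed
```
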
### Caveats and limitations

Right now, `to_tf_dataset()` always returns a batched dataset - we will add support for unbatched datasets soon!
