# Using Datasets with TensorFlow

This document is a quick introduction to using `datasets` with TensorFlow, with a particular focus on how to get
`tf.Tensor` objects out of our datasets, and how to stream data from Hugging Face `Dataset` objects to Keras methods
like `model.fit()`.

## Dataset format

By default, datasets return regular Python objects: integers, floats, strings, lists, etc.

To get TensorFlow tensors instead, you can set the format of the dataset to `tf`:

```py
>>> from datasets import Dataset
>>> data = [[1, 2],[3, 4]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 2])>}
>>> ds[:2]
{'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[1, 2],
       [3, 4]])>}
```

<Tip>

A [`Dataset`] object is a wrapper of an Arrow table, which allows fast reads from arrays in the dataset to TensorFlow tensors.

</Tip>

This can be useful for converting your dataset to a dict of `Tensor` objects, or for writing a generator to load TF
samples from it. If you wish to convert the entire dataset to `Tensor`, simply query the full dataset:

```py
>>> ds[:]
{'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[1, 2],
       [3, 4]])>}
```

## N-dimensional arrays

If your dataset consists of N-dimensional arrays, you will see that by default they are treated as nested lists.
In particular, a TensorFlow-formatted dataset outputs a `RaggedTensor` instead of a single tensor:

```py
>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.RaggedTensor [[1, 2], [3, 4]]>}
```

To get a single tensor, you must explicitly use the Array feature type and specify the shape of your tensors:

```py
>>> from datasets import Dataset, Features, Array2D
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')})
>>> ds = Dataset.from_dict({"data": data}, features=features)
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.Tensor: shape=(2, 2), dtype=int32, numpy=
 array([[1, 2],
        [3, 4]])>}
>>> ds[:2]
{'data': <tf.Tensor: shape=(2, 2, 2), dtype=int32, numpy=
 array([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]])>}
```

## Other feature types

[`ClassLabel`] data are properly converted to tensors:

```py
>>> from datasets import Dataset, Features, ClassLabel
>>> data = [0, 0, 1]
>>> features = Features({"data": ClassLabel(names=["negative", "positive"])})
>>> ds = Dataset.from_dict({"data": data}, features=features)
>>> ds = ds.with_format("tf")
>>> ds[:3]
{'data': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 0, 1])>}
```

Strings are also supported:

```py
>>> from datasets import Dataset, Features
>>> text = ["foo", "bar"]
>>> data = [0, 1]
>>> ds = Dataset.from_dict({"text": text, "data": data})
>>> ds = ds.with_format("tf")
>>> ds[:2]
{'text': <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'foo', b'bar'], dtype=object)>,
 'data': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([0, 1])>}
```

You can also explicitly format certain columns and leave the other columns unformatted:

```py
>>> ds = ds.with_format("tf", columns=["data"], output_all_columns=True)
>>> ds[:2]
{'data': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([0, 1])>,
 'text': ['foo', 'bar']}
```

The [`Image`] and [`Audio`] feature types are not supported yet.

## Data loading

Although you can load individual samples and batches just by indexing into your dataset, this won't work if you want
to use Keras methods like `fit()` and `predict()`. You could write a generator function that shuffles and loads batches
from your dataset and call `fit()` on that, but that sounds like a lot of unnecessary work. Instead, if you want to stream
data from your dataset on the fly, we recommend converting your dataset to a `tf.data.Dataset` using the
`to_tf_dataset()` method.

The `tf.data.Dataset` class covers a wide range of use-cases - it is often created from Tensors in memory, or using a load function to read files on disk
or external storage. The dataset can be transformed arbitrarily with the `map()` method, or methods like `batch()`
and `shuffle()` can be used to create a dataset that's ready for training. These methods do not modify the stored data
in any way - instead, the methods build a data pipeline graph that will be executed when the dataset is iterated over,
usually during model training or inference. This is different from the `map()` method of Hugging Face `Dataset` objects,
which runs the map function immediately and saves the new or changed columns.
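
To make the distinction concrete, here is a minimal sketch of a `tf.data` pipeline built from in-memory tensors; the
`map()`, `shuffle()` and `batch()` steps are only recorded in the pipeline graph and don't run until the dataset is
iterated over:

```py
>>> import tensorflow as tf
>>> pipeline = (
...     tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
...     .map(lambda x: x * 2)    # recorded in the pipeline, not executed yet
...     .shuffle(buffer_size=4)  # likewise lazy
...     .batch(2)
... )
>>> for batch in pipeline:  # the map/shuffle/batch steps actually run here
...     print(batch)
```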

Since the entire data preprocessing pipeline can be compiled in a `tf.data.Dataset`, this approach allows for massively
parallel, asynchronous data loading and training. However, the requirement for graph compilation can be a limitation,
particularly for Hugging Face tokenizers, which are usually not (yet!) compilable as part of a TF graph. As a result,
we usually advise pre-processing the dataset as a Hugging Face dataset, where arbitrary Python functions can be
used, and then converting to `tf.data.Dataset` afterwards using `to_tf_dataset()` to get a batched dataset ready for
training. To see examples of this approach, please see the [examples](https://github.com/huggingface/transformers/tree/main/examples) or [notebooks](https://huggingface.co/docs/transformers/notebooks) for `transformers`.
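
As a rough sketch of that two-step workflow (the tokenizer checkpoint and the toy data below are placeholders for
illustration, not part of the original example):

```py
>>> from datasets import Dataset
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder checkpoint
>>> ds = Dataset.from_dict({"text": ["hello world", "goodbye world"], "label": [0, 1]})
>>> # Arbitrary Python pre-processing runs eagerly on the Hugging Face dataset...
>>> ds = ds.map(lambda batch: tokenizer(batch["text"], padding="max_length", max_length=8, truncation=True), batched=True)
>>> # ...and only afterwards is it wrapped as a batched tf.data.Dataset ready for training
>>> tf_ds = ds.to_tf_dataset(
...     columns=["input_ids", "attention_mask"],
...     label_cols=["label"],
...     batch_size=2,
...     shuffle=True,
... )
```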

### Using `to_tf_dataset()`

Using `to_tf_dataset()` is straightforward. Once your dataset is preprocessed and ready, simply call it like so:

```py
>>> from datasets import Dataset
>>> data = {"inputs": [[1, 2],[3, 4]], "labels": [0, 1]}
>>> ds = Dataset.from_dict(data)
>>> tf_ds = ds.to_tf_dataset(
            columns=["inputs"],
            label_cols=["labels"],
            batch_size=2,
            shuffle=True
            )
```

The returned `tf_ds` object here is now fully ready to train on, and can be passed directly to `model.fit()`! Note
that you set the batch size when creating the dataset, and so you don't need to specify it when calling `fit()`:

```py
>>> model.fit(tf_ds, epochs=2)
```
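
The `model` here is assumed to be a compiled Keras model whose input signature matches the dataset's features. For the
toy `inputs`/`labels` data above, a minimal illustrative sketch (the layer sizes are arbitrary) could be:

```py
>>> import tensorflow as tf
>>> # Purely illustrative: any compiled Keras model with a matching input shape would work
>>> model = tf.keras.Sequential([
...     tf.keras.layers.Dense(4, activation="relu", input_shape=(2,)),  # each toy "inputs" sample has length 2
...     tf.keras.layers.Dense(1, activation="sigmoid"),
... ])
>>> model.compile(optimizer="adam", loss="binary_crossentropy")
```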

For a full description of the arguments, please see the [`~Dataset.to_tf_dataset`] documentation. In many cases,
you will also need to add a `collate_fn` to your call. This is a function that takes multiple elements of the dataset
and combines them into a single batch. When all elements have the same length, the built-in default collator will
suffice, but for more complex tasks a custom collator may be necessary. In particular, many tasks have samples
with varying sequence lengths which will require a [data collator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator) that can pad batches correctly. You can see examples
of this in the `transformers` NLP [examples](https://github.com/huggingface/transformers/tree/main/examples) and
[notebooks](https://huggingface.co/docs/transformers/notebooks), where variable sequence lengths are very common.
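
As a rough illustration, a hand-written `collate_fn` that pads a hypothetical `input_ids` column to the longest sample
in each batch might look like this (the column names are placeholders and not part of the toy dataset above):

```py
>>> import numpy as np
>>> def pad_collate_fn(samples):
...     # `samples` is a list of dicts, one per example in the batch
...     max_len = max(len(s["input_ids"]) for s in samples)
...     input_ids = np.stack(
...         [np.pad(np.asarray(s["input_ids"]), (0, max_len - len(s["input_ids"]))) for s in samples]
...     )
...     labels = np.asarray([s["labels"] for s in samples])
...     return {"input_ids": input_ids, "labels": labels}
>>> tf_ds = ds.to_tf_dataset(
...     columns=["input_ids"],
...     label_cols=["labels"],
...     batch_size=2,
...     collate_fn=pad_collate_fn,
... )
```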

### When to use `to_tf_dataset()`

The astute reader may have noticed at this point that we have offered two approaches to achieve the same goal - if you
want to pass your dataset to a TensorFlow model, you can either convert the dataset to a `Tensor` or `dict` of `Tensors`
using `.with_format('tf')`, or you can convert the dataset to a `tf.data.Dataset` with `to_tf_dataset()`. Either of these
can be passed to `model.fit()`, so which should you choose?

The key thing to recognize is that when you convert the whole dataset to `Tensor`s, it is static and fully loaded into
RAM. This is simple and convenient, but if any of the following apply, you should probably use `to_tf_dataset()`
instead:

- Your dataset is too large to fit in RAM. `to_tf_dataset()` streams only one batch at a time, so even very large
  datasets can be handled with this method.
- You want to apply random transformations using `dataset.with_transform()` or the `collate_fn`. This is
  common in several modalities, such as image augmentations when training vision models, or random masking when training
  masked language models. Using `to_tf_dataset()` will apply those transformations
  at the moment when a batch is loaded, which means the same samples will get different augmentations each time
  they are loaded. This is usually what you want (see the sketch after this list).
- Your data has a variable dimension, such as input texts in NLP that consist of varying
  numbers of tokens. When you create a batch with samples with a variable dimension, the standard solution is to
  pad the shorter samples to the length of the longest one. When you stream samples from a dataset with `to_tf_dataset`,
  you can apply this padding to each batch via your `collate_fn`. However, if you want to convert
  such a dataset to dense `Tensor`s, then you will have to pad samples to the length of the longest sample in *the
  entire dataset!* This can result in huge amounts of padding, which wastes memory and reduces your model's speed.
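
As a hedged sketch of the random-transformation case (the jitter function below is made up for illustration, applied
to the toy `inputs`/`labels` dataset from earlier):

```py
>>> import numpy as np
>>> def random_jitter(batch):
...     # Runs each time a batch is loaded, so every epoch sees freshly perturbed samples
...     batch["inputs"] = [np.asarray(x, dtype=np.float32) + np.random.normal(scale=0.1, size=len(x)) for x in batch["inputs"]]
...     return batch
>>> ds = ds.with_transform(random_jitter)
>>> tf_ds = ds.to_tf_dataset(columns=["inputs"], label_cols=["labels"], batch_size=2, shuffle=True)
```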

### Caveats and limitations

Right now, `to_tf_dataset()` always returns a batched dataset - we will add support for unbatched datasets soon!