First draft of the docs for TF + Datasets #4457
Merged
c9f3ba0
First draft of the docs for TF + Datasets
Rocketknight1 c46d978
Update docs/source/use_with_tensorflow.mdx
Rocketknight1 f25414d
Update docs/source/use_with_tensorflow.mdx
Rocketknight1 14397ce
Add some example situations to clarify
Rocketknight1 98383ff
Merge branch 'master' into add_tf_docs
lhoestq 088d966
mention nd-arrays
lhoestq 273844a
add to toc
lhoestq ae74186
Link to transformers examples
Rocketknight1 7568800
Missing links added
Rocketknight1 51c1174
Update docs/source/use_with_tensorflow.mdx
Rocketknight1 fcde55e
Update docs/source/use_with_tensorflow.mdx
Rocketknight1 1e7fe76
Update docs/source/use_with_tensorflow.mdx
Rocketknight1 32aa89c
Update docs/source/use_with_tensorflow.mdx
Rocketknight1 b7abf5a
Update docs/source/use_with_tensorflow.mdx
Rocketknight1 e41f8a8
Pushing fixes from the review!
Rocketknight1 8e21ced
Fixing link to to_tf_dataset docs
Rocketknight1 07ab1d7
Adding quick introduction
Rocketknight1 35aad8f
Update docs/source/use_with_tensorflow.mdx
Rocketknight1 91a2784
Update docs/source/use_with_tensorflow.mdx
Rocketknight1 1a5faa1
Update docs/source/use_with_tensorflow.mdx
Rocketknight1 23a3ebd
Merge branch 'master' into add_tf_docs
lhoestq
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,196 @@ | ||
| # Using Datasets with TensorFlow | ||
|
|
||
| This document is a quick introduction to using `datasets` with TensorFlow, with a particular focus on how to get | ||
| `tf.Tensor` objects out of our datasets, and how to stream data from Hugging Face `Dataset` objects to Keras methods | ||
| like `model.fit()`. | ||
|
|
||
| ### Dataset format | ||
|
|
||
| By default, datasets return regular Python objects: integers, floats, strings, lists, etc. | ||
|
|
||
| To get TensorFlow tensors instead, you can set the format of the dataset to `tf`: | ||
|
|
||
| ```py | ||
| >>> from datasets import Dataset | ||
| >>> data = [[1, 2],[3, 4]] | ||
| >>> ds = Dataset.from_dict({"data": data}) | ||
| >>> ds = ds.with_format("tf") | ||
| >>> ds[0] | ||
| {'data': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 2])>} | ||
| >>> ds[:2] | ||
| {'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy= | ||
| array([[1, 2], | ||
| [3, 4]])>} | ||
| ``` | ||
|
|
||
| <Tip> | ||
|
|
||
| A [`Dataset`] object is a wrapper around an Arrow table, which allows fast reads from arrays in the dataset to TensorFlow tensors. | ||
|
|
||
| </Tip> | ||
|
|
||
| This can be useful for converting your dataset to a dict of `Tensor` objects, or for writing a generator to load TF | ||
| samples from it. If you wish to convert the entire dataset to `Tensor`, simply query the full dataset: | ||
|
|
||
| ```py | ||
| >>> ds[:] | ||
| {'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy= | ||
| array([[1, 2], | ||
| [3, 4]])>} | ||
| ``` | ||
|
|
||
| ### N-dimensional arrays | ||
|
||
|
|
||
| If your dataset consists of N-dimensional arrays, you will see that by default they are treated as nested lists. | ||
| In particular, a TensorFlow formatted dataset outputs a `RaggedTensor` instead of a single tensor: | ||
|
|
||
| ```py | ||
| >>> from datasets import Dataset | ||
| >>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]] | ||
| >>> ds = Dataset.from_dict({"data": data}) | ||
| >>> ds = ds.with_format("tf") | ||
| >>> ds[0] | ||
| {'data': <tf.RaggedTensor [[1, 2], [3, 4]]>} | ||
| ``` | ||
|
|
||
| To get a single tensor, you must explicitly use the Array feature type and specify the shape of your tensors: | ||
|
|
||
| ```py | ||
| >>> from datasets import Dataset, Features, Array2D | ||
| >>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]] | ||
| >>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')}) | ||
| >>> ds = Dataset.from_dict({"data": data}, features=features) | ||
| >>> ds = ds.with_format("tf") | ||
| >>> ds[0] | ||
| {'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy= | ||
| array([[1, 2], | ||
| [3, 4]])>} | ||
| >>> ds[:2] | ||
| {'data': <tf.Tensor: shape=(2, 2, 2), dtype=int64, numpy= | ||
| array([[[1, 2], | ||
| [3, 4]], | ||
|
|
||
| [[5, 6], | ||
| [7, 8]]])>} | ||
| ``` | ||
|
|
||
|
|
||
| ### Other feature types | ||
|
||
|
|
||
| [`ClassLabel`] data are properly converted to tensors: | ||
|
|
||
| ```py | ||
| >>> from datasets import Dataset, Features, ClassLabel | ||
| >>> data = [0, 0, 1] | ||
| >>> features = Features({"data": ClassLabel(names=["negative", "positive"])}) | ||
| >>> ds = Dataset.from_dict({"data": data}, features=features) | ||
| >>> ds = ds.with_format("tf") | ||
| >>> ds[:3] | ||
| {'data': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 0, 1])>} | ||
| ``` | ||
|
|
||
| Strings are also supported: | ||
|
|
||
| ```py | ||
| >>> from datasets import Dataset | ||
| >>> text = ["foo", "bar"] | ||
| >>> data = [0, 1] | ||
| >>> ds = Dataset.from_dict({"text": text, "data": data}) | ||
| >>> ds = ds.with_format("tf") | ||
| >>> ds[:2] | ||
| {'text': <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'foo', b'bar'], dtype=object)>, | ||
| 'data': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([0, 1])>} | ||
| ``` | ||
|
|
||
| You can also explicitly format certain columns and leave the other columns unformatted: | ||
|
|
||
| ```py | ||
| >>> ds = ds.with_format("tf", columns=["data"], output_all_columns=True) | ||
| >>> ds[:2] | ||
| {'data': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([0, 1])>, | ||
| 'text': ['foo', 'bar']} | ||
| ``` | ||
|
|
||
| The [`Image`] and [`Audio`] feature types are not supported yet. | ||
|
|
||
| ## Data loading | ||
|
|
||
| Although you can load individual samples and batches just by indexing into your dataset, this won't work if you want | ||
| to use Keras methods like `fit()` and `predict()`. You could write a generator function that shuffles and loads batches | ||
| from your dataset and `fit()` on that, but that sounds like a lot of unnecessary work. Instead, if you want to stream | ||
| data from your dataset on-the-fly, we recommend converting your dataset to a `tf.data.Dataset` using the | ||
| `to_tf_dataset()` method. | ||
|
|
||
| The `tf.data.Dataset` class covers a wide range of use cases: it is often created from tensors in memory, or by using a load function to read files on disk | ||
| or external storage. The dataset can be transformed arbitrarily with the `map()` method, or methods like `batch()` | ||
| and `shuffle()` can be used to create a dataset that's ready for training. These methods do not modify the stored data | ||
| in any way - instead, the methods build a data pipeline graph that will be executed when the dataset is iterated over, | ||
| usually during model training or inference. This is different from the `map()` method of Hugging Face `Dataset` objects, | ||
| which runs the map function immediately and saves the new or changed columns. | ||
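The lazy behaviour described above can be seen in a minimal sketch. It assumes only that TensorFlow is installed; the toy integer data and the doubling function are made up for illustration:

```python
import tensorflow as tf

# Build a small in-memory dataset of integers.
base = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])

# map() and batch() only describe the pipeline; `base` itself is unchanged.
pipeline = base.map(lambda x: x * 2).batch(2)

# The pipeline is executed only when it is iterated over.
batches = [batch.tolist() for batch in pipeline.as_numpy_iterator()]
original = [int(x) for x in base.as_numpy_iterator()]

print(batches)   # [[2, 4], [6, 8]]
print(original)  # [1, 2, 3, 4]
```

Iterating over `base` again still yields the original values, because `map()` returned a new pipeline rather than modifying stored data.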
|
|
||
| Since the entire data preprocessing pipeline can be compiled in a `tf.data.Dataset`, this approach allows for massively | ||
| parallel, asynchronous data loading and training. However, the requirement for graph compilation can be a limitation, | ||
| particularly for Hugging Face tokenizers, which are usually not (yet!) compilable as part of a TF graph. As a result, | ||
| we usually advise pre-processing the dataset as a Hugging Face dataset, where arbitrary Python functions can be | ||
| used, and then converting to `tf.data.Dataset` afterwards using `to_tf_dataset()` to get a batched dataset ready for | ||
| training. For examples of this approach, see the [examples](https://github.com/huggingface/transformers/tree/main/examples) or [notebooks](https://huggingface.co/docs/transformers/notebooks) for `transformers`. | ||
|
|
||
| ### Using `to_tf_dataset()` | ||
|
|
||
| Using `to_tf_dataset()` is straightforward. Once your dataset is preprocessed and ready, simply call it like so: | ||
|
|
||
| ```py | ||
| >>> from datasets import Dataset | ||
| >>> data = {"inputs": [[1, 2],[3, 4]], "labels": [0, 1]} | ||
| >>> ds = Dataset.from_dict(data) | ||
| >>> tf_ds = ds.to_tf_dataset( | ||
| columns=["inputs"], | ||
| label_cols=["labels"], | ||
| batch_size=2, | ||
| shuffle=True | ||
| ) | ||
| ``` | ||
|
|
||
| The returned `tf_ds` object here is now fully ready to train on, and can be passed directly to `model.fit()`! Note | ||
| that you set the batch size when creating the dataset, and so you don't need to specify it when calling `fit()`: | ||
|
|
||
| ```py | ||
| >>> model.fit(tf_ds, epochs=2) | ||
| ``` | ||
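The `model` in the call above is not defined in this document. A self-contained sketch might look like the following, where the two-class `Dense` model is purely hypothetical and `tf.data.Dataset.from_tensor_slices` is used as a stand-in for the batched dataset that `to_tf_dataset()` returns:

```python
import numpy as np
import tensorflow as tf

# Stand-in for `tf_ds`: two samples with two features each, already batched.
inputs = np.array([[1.0, 2.0], [3.0, 4.0]], dtype="float32")
labels = np.array([0, 1], dtype="int64")
tf_ds = tf.data.Dataset.from_tensor_slices((inputs, labels)).batch(2)

# Hypothetical two-class classifier; any compiled Keras model would do.
model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# No batch_size argument here: the dataset is already batched.
history = model.fit(tf_ds, epochs=2, verbose=0)
```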
|
|
||
| For a full description of the arguments, please see the [`~Dataset.to_tf_dataset`] documentation. In many cases, | ||
| you will also need to add a `collate_fn` to your call. This is a function that takes multiple elements of the dataset | ||
| and combines them into a single batch. When all elements have the same length, the built-in default collator will | ||
| suffice, but for more complex tasks a custom collator may be necessary. In particular, many tasks have samples | ||
| with varying sequence lengths which will require a [data collator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator) that can pad batches correctly. You can see examples | ||
| of this in the `transformers` NLP [examples](https://github.com/huggingface/transformers/tree/main/examples) and | ||
| [notebooks](https://huggingface.co/docs/transformers/notebooks), where variable sequence lengths are very common. | ||
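To make the idea of a padding collator concrete, here is a minimal NumPy sketch. It assumes the collator receives a list of per-sample dicts and returns a dict of arrays, as the `transformers` data collators do; the name `pad_collate`, the `pad_value` argument and the sample columns are illustrative only:

```python
import numpy as np

def pad_collate(features, pad_value=0):
    """Combine samples into one batch, padding 1-D columns to the batch maximum."""
    batch = {}
    for key in features[0]:
        arrays = [np.asarray(sample[key]) for sample in features]
        if arrays[0].ndim == 1:
            # Pad on the right up to the longest sample in this batch only.
            max_len = max(len(a) for a in arrays)
            arrays = [
                np.pad(a, (0, max_len - len(a)), constant_values=pad_value)
                for a in arrays
            ]
        batch[key] = np.stack(arrays)
    return batch

samples = [
    {"input_ids": [5, 6, 7], "labels": 0},
    {"input_ids": [8], "labels": 1},
]
batch = pad_collate(samples)
print(batch["input_ids"].tolist())  # [[5, 6, 7], [8, 0, 0]]
```

A function like this would be passed as `collate_fn=pad_collate` in the `to_tf_dataset()` call. For real tokenized data, the ready-made collators in `transformers` are usually the better choice, since they also know the tokenizer's padding token.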
|
|
||
| ### When to use `to_tf_dataset()` | ||
|
|
||
| The astute reader may have noticed at this point that we have offered two approaches to achieve the same goal - if you | ||
| want to pass your dataset to a TensorFlow model, you can either convert the dataset to a `Tensor` or `dict` of `Tensors` | ||
| using `.with_format('tf')`, or you can convert the dataset to a `tf.data.Dataset` with `to_tf_dataset()`. Either of these | ||
| can be passed to `model.fit()`, so which should you choose? | ||
|
|
||
| The key thing to recognize is that when you convert the whole dataset to `Tensor`s, it is static and fully loaded into | ||
| RAM. This is simple and convenient, but if any of the following apply, you should probably use `to_tf_dataset()` | ||
| instead: | ||
|
|
||
| - Your dataset is too large to fit in RAM. `to_tf_dataset()` streams only one batch at a time, so even very large | ||
| datasets can be handled with this method. | ||
| - You want to apply random transformations using `dataset.with_transform()` or the `collate_fn`. This is | ||
| common in several modalities, such as image augmentations when training vision models, or random masking when training | ||
| masked language models. Using `to_tf_dataset()` will apply those transformations | ||
| at the moment when a batch is loaded, which means the same samples will get different augmentations each time | ||
| they are loaded. This is usually what you want. | ||
| - Your data has a variable dimension, such as input texts in NLP that consist of varying | ||
| numbers of tokens. When you create a batch with samples with a variable dimension, the standard solution is to | ||
| pad the shorter samples to the length of the longest one. When you stream samples from a dataset with `to_tf_dataset()`, | ||
| you can apply this padding to each batch via your `collate_fn`. However, if you want to convert | ||
| such a dataset to dense `Tensor`s, then you will have to pad samples to the length of the longest sample in *the | ||
| entire dataset!* This can result in huge amounts of padding, which wastes memory and reduces your model's speed. | ||
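The padding cost in the last point can be illustrated with a small calculation. The token counts below are invented for the example, with one unusually long sample:

```python
lengths = [3, 4, 5, 6, 7, 100]  # hypothetical token counts per sample
batch_size = 2

# Dense conversion: every sample is padded to the longest sample overall.
global_total = max(lengths) * len(lengths)

# Per-batch padding: each batch is padded only to its own longest sample.
batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
per_batch_total = sum(max(b) * len(b) for b in batches)

real_tokens = sum(lengths)

print(real_tokens, per_batch_total, global_total)  # 125 220 600
```

In this toy case, padding to the global maximum stores 600 values for 125 real tokens, while per-batch padding stores 220. The single long sample only inflates the batch it lands in.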
|
|
||
| ### Caveats and limitations | ||
|
|
||
| Right now, `to_tf_dataset()` always returns a batched dataset. We will add support for unbatched datasets soon! | ||