@@ -23,7 +23,7 @@ array([[1, 2],

<Tip>

- A [`Dataset`] object is a wrapper of an Arrow table, which allows fast zero-copy reads from arrays in the dataset to TensorFlow tensors.
+ A [`Dataset`] object is a wrapper of an Arrow table, which allows fast reads from arrays in the dataset to TensorFlow tensors.

</Tip>

@@ -162,8 +162,9 @@ For a full description of the arguments, please see the `to_tf_dataset()` docume
you will also need to add a `collate_fn` to your call. This is a function that takes multiple elements of the dataset
and combines them into a single batch. When all elements have the same length, the built-in default collator will
suffice, but for more complex tasks a custom collator may be necessary. In particular, many tasks have samples
- with varying sequence lengths which will require a collator that can pad batches correctly. (Link to transformers
- collators or examples here?)
+ with varying sequence lengths which will require a data collator that can pad batches correctly. You can see examples
+ of this in the `transformers` NLP [examples](https://github.com/huggingface/transformers/tree/main/examples) and
+ [notebooks](https://huggingface.co/docs/transformers/notebooks), where variable sequence lengths are very common.

### When to use to_tf_dataset

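As a concrete illustration of the collator point in the hunk above (not part of the diff itself), here is a minimal sketch of passing a padding collator from `transformers` to `to_tf_dataset()`. The dataset ("glue"/"cola") and checkpoint ("bert-base-cased") are placeholders chosen for illustration, not taken from this change.

```py
# Minimal sketch: per-batch padding via a `transformers` data collator.
# The dataset and checkpoint below are illustrative placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

dataset = load_dataset("glue", "cola", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Tokenize without padding, so each sample keeps its natural length.
dataset = dataset.map(lambda batch: tokenizer(batch["sentence"]), batched=True)

# The collator pads every batch to the length of its longest sample.
data_collator = DataCollatorWithPadding(tokenizer, return_tensors="np")

tf_dataset = dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask", "label"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
```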
@@ -186,7 +187,7 @@ instead:
- Your data has a variable dimension, such as input texts in NLP that consist of varying
  numbers of tokens. When you create a batch with samples with a variable dimension, the standard solution is to
  pad the shorter samples to the length of the longest one. When you stream samples from a dataset with `to_tf_dataset`,
- you can apply this padding to each batch via your `collate_fn`. (link examples here?) However, if you want to convert
+ you can apply this padding to each batch via your `collate_fn`. However, if you want to convert
  such a dataset to dense `Tensor`s, then you will have to pad samples to the length of the longest sample in *the
  entire dataset!* This can result in huge amounts of padding, which wastes memory and reduces your model's speed.

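To make the per-batch versus whole-dataset padding contrast concrete (again, outside the diff itself), here is a sketch of a hand-written `collate_fn` that pads each streamed batch only to its own longest sample. The `pad_batch` helper and the pad token id of 0 are hypothetical choices for illustration, and `dataset` is assumed to be a tokenized `Dataset` like the one in the sketch above.

```py
import numpy as np

def pad_batch(features, pad_token_id=0):
    """Hypothetical collator: pad `input_ids` to this batch's max length only."""
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids = np.full((len(features), max_len), pad_token_id, dtype=np.int64)
    attention_mask = np.zeros((len(features), max_len), dtype=np.int64)
    for i, f in enumerate(features):
        length = len(f["input_ids"])
        input_ids[i, :length] = f["input_ids"]
        attention_mask[i, :length] = 1
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# `dataset` is assumed to be a tokenized `Dataset` as in the earlier sketch.
tf_dataset = dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask"],
    batch_size=16,
    shuffle=True,
    collate_fn=pad_batch,
)
```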