Commit ae74186

Link to transformers examples

committed
1 parent 273844a

File tree

1 file changed: +5 -4 lines changed


docs/source/use_with_tensorflow.mdx

Lines changed: 5 additions & 4 deletions
@@ -23,7 +23,7 @@ array([[1, 2],
 
 <Tip>
 
-A [`Dataset`] object is a wrapper of an Arrow table, which allows fast zero-copy reads from arrays in the dataset to TensorFlow tensors.
+A [`Dataset`] object is a wrapper of an Arrow table, which allows fast reads from arrays in the dataset to TensorFlow tensors.
 
 </Tip>

@@ -162,8 +162,9 @@ For a full description of the arguments, please see the `to_tf_dataset()` docume
 you will also need to add a `collate_fn` to your call. This is a function that takes multiple elements of the dataset
 and combines them into a single batch. When all elements have the same length, the built-in default collator will
 suffice, but for more complex tasks a custom collator may be necessary. In particular, many tasks have samples
-with varying sequence lengths which will require a collator that can pad batches correctly. (Link to transformers
-collators or examples here?)
+with varying sequence lengths which will require a data collator that can pad batches correctly. You can see examples
+of this in the `transformers` NLP [examples](https://github.com/huggingface/transformers/tree/main/examples) and
+[notebooks](https://huggingface.co/docs/transformers/notebooks), where variable sequence lengths are very common.
 
 ### When to use to_tf_dataset
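For context (not part of this commit), here is a minimal sketch of the pattern the updated text points to: a `transformers` data collator passed as `collate_fn`, so that each batch streamed by `to_tf_dataset()` is padded to the length of its own longest sample. The `dataset`, the tokenizer checkpoint, and the column names below are assumptions for illustration only.

```py
# Minimal sketch: pad variable-length batches with a transformers data collator.
# Assumes `dataset` is a tokenized `datasets.Dataset` with "input_ids",
# "attention_mask", and "label" columns.
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_dataset = dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask"],
    label_cols=["label"],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator,  # pads each batch to its own longest sequence
)
```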

@@ -186,7 +187,7 @@ instead:
 - Your data has a variable dimension, such as input texts in NLP that consist of varying
 numbers of tokens. When you create a batch with samples with a variable dimension, the standard solution is to
 pad the shorter samples to the length of the longest one. When you stream samples from a dataset with `to_tf_dataset`,
-you can apply this padding to each batch via your `collate_fn`. (link examples here?) However, if you want to convert
+you can apply this padding to each batch via your `collate_fn`. However, if you want to convert
 such a dataset to dense `Tensor`s, then you will have to pad samples to the length of the longest sample in *the
 entire dataset!* This can result in huge amounts of padding, which wastes memory and reduces your model's speed.
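To make the trade-off in that paragraph concrete, here is a hedged sketch (again, not from the commit) of what converting a variable-length column to a dense tensor implies: every sample gets padded to the longest sequence in the entire dataset, whereas a padding `collate_fn` only pads each batch to that batch's maximum. The `dataset` and the "input_ids" column name are assumptions.

```py
# Illustrative only: dense conversion forces padding to the global maximum length.
import tensorflow as tf

lengths = [len(ids) for ids in dataset["input_ids"]]
global_max = max(lengths)  # longest sample in the *entire* dataset

padded = [ids + [0] * (global_max - len(ids)) for ids in dataset["input_ids"]]
dense = tf.constant(padded)  # shape: (num_samples, global_max), mostly padding

# With to_tf_dataset() and a padding collate_fn, each batch is instead padded
# only to the longest sample in that batch, which is usually much shorter.
```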
