
Commit 23a3ebd

Merge branch 'master' into add_tf_docs

2 parents 1a5faa1 + e60c99f

File tree

5 files changed: +216 −0 lines changed

docs/source/_toctree.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -35,6 +35,8 @@
     title: Stream
   - local: use_with_tensorflow
     title: Use with TensorFlow
+  - local: use_with_pytorch
+    title: Use with PyTorch
   - local: share
     title: Share
   - local: dataset_script
```

docs/source/how_to.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -8,6 +8,8 @@ The how-to guides will cover eight key areas of 🤗 Datasets:
 
 * How to process a dataset.
 
+* How to use a dataset with your favorite ML/DL framework.
+
 * How to stream large datasets.
 
 * How to upload and share a dataset.
```

docs/source/use_with_pytorch.mdx

Lines changed: 199 additions & 0 deletions

# Use with PyTorch

This document is a quick introduction to using `datasets` with PyTorch, with a particular focus on how to get
`torch.Tensor` objects out of our datasets, and how to use a PyTorch `DataLoader` and a Hugging Face `Dataset`
with the best performance.

## Dataset format

By default, datasets return regular Python objects: integers, floats, strings, lists, etc.

To get PyTorch tensors instead, you can set the format of the dataset to `torch` using [`Dataset.with_format`]:

```py
>>> from datasets import Dataset
>>> data = [[1, 2],[3, 4]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': tensor([1, 2])}
>>> ds[:2]
{'data': tensor([[1, 2],
        [3, 4]])}
```

<Tip>

A [`Dataset`] object is a wrapper of an Arrow table, which allows fast zero-copy reads from arrays in the dataset to PyTorch tensors.

</Tip>

To load the data as tensors on a GPU, specify the `device` argument:

```py
>>> import torch
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> ds = ds.with_format("torch", device=device)
>>> ds[0]
{'data': tensor([1, 2], device='cuda:0')}
```

## N-dimensional arrays

If your dataset consists of N-dimensional arrays, they are treated as nested lists by default.
In particular, a PyTorch-formatted dataset outputs nested lists instead of a single tensor:

```py
>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': [tensor([1, 2]), tensor([3, 4])]}
```

To get a single tensor, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:

```py
>>> from datasets import Dataset, Features, Array2D
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')})
>>> ds = Dataset.from_dict({"data": data}, features=features)
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': tensor([[1, 2],
        [3, 4]])}
>>> ds[:2]
{'data': tensor([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]])}
```

## Other feature types

[`ClassLabel`] data are properly converted to tensors:

```py
>>> from datasets import Dataset, Features, ClassLabel
>>> data = [0, 0, 1]
>>> features = Features({"data": ClassLabel(names=["negative", "positive"])})
>>> ds = Dataset.from_dict({"data": data}, features=features)
>>> ds = ds.with_format("torch")
>>> ds[:3]
{'data': tensor([0, 0, 1])}
```

However, since it's not possible to convert text data to PyTorch tensors, you can't format a `string` column to PyTorch.
Instead, you can explicitly format certain columns and leave the other columns unformatted:

```py
>>> from datasets import Dataset
>>> text = ["foo", "bar"]
>>> data = [0, 1]
>>> ds = Dataset.from_dict({"text": text, "data": data})
>>> ds = ds.with_format("torch", columns=["data"], output_all_columns=True)
>>> ds[:2]
{'data': tensor([0, 1]), 'text': ['foo', 'bar']}
```

The [`Image`] and [`Audio`] feature types are not supported yet.
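
Until then, one possible workaround (a sketch, not an official recommendation) is to leave the [`Image`] column unformatted and convert the decoded PIL images to tensors yourself in an on-the-fly transform. The file path and the transform name are illustrative:

```py
>>> import numpy as np
>>> import torch
>>> from datasets import Dataset, Features, Image
>>> ds = Dataset.from_dict({"image": ["path/to/image.png"]}, features=Features({"image": Image()}))
>>> def pil_to_tensor(batch):
...     # the Image feature decodes files to PIL images; convert them to float32 tensors
...     return {"image": [torch.as_tensor(np.asarray(img), dtype=torch.float32) for img in batch["image"]]}
>>> ds.set_transform(pil_to_tensor)
```
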
## Data loading

Like `torch.utils.data.Dataset` objects, a [`Dataset`] can be passed directly to a PyTorch `DataLoader`:

```py
>>> import numpy as np
>>> from datasets import Dataset
>>> from torch.utils.data import DataLoader
>>> data = np.random.rand(16)
>>> label = np.random.randint(0, 2, size=16)
>>> ds = Dataset.from_dict({"data": data, "label": label}).with_format("torch")
>>> dataloader = DataLoader(ds, batch_size=4)
>>> for batch in dataloader:
...     print(batch)
{'data': tensor([0.0047, 0.4979, 0.6726, 0.8105]), 'label': tensor([0, 1, 0, 1])}
{'data': tensor([0.4832, 0.2723, 0.4259, 0.2224]), 'label': tensor([0, 0, 0, 0])}
{'data': tensor([0.5837, 0.3444, 0.4658, 0.6417]), 'label': tensor([0, 1, 0, 0])}
{'data': tensor([0.7022, 0.1225, 0.7228, 0.8259]), 'label': tensor([1, 1, 1, 1])}
```
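
Since a [`Dataset`] supports random access, the usual `DataLoader` arguments work as expected; for example, a shuffled variant of the loader above:

```py
>>> dataloader = DataLoader(ds, batch_size=4, shuffle=True)
```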

### Optimize data loading

There are several ways to increase the speed at which your data is loaded, which can save you time, especially if you are working with large datasets.
PyTorch offers parallelized data loading, ways to retrieve batches of indices instead of individual examples, and streaming to progressively download datasets.

#### Use multiple workers

You can parallelize data loading with the `num_workers` argument of a PyTorch `DataLoader` and get higher throughput.

Under the hood, the `DataLoader` starts `num_workers` processes.
Each process reloads the dataset passed to the `DataLoader` and is used to query examples.
Reloading the dataset inside a worker doesn't fill up your RAM, since it simply memory-maps the dataset again from your disk.

```py
>>> import numpy as np
>>> from datasets import Dataset, load_from_disk
>>> from torch.utils.data import DataLoader
>>> data = np.random.rand(10_000)
>>> Dataset.from_dict({"data": data}).save_to_disk("my_dataset")
>>> ds = load_from_disk("my_dataset").with_format("torch")
>>> dataloader = DataLoader(ds, batch_size=32, num_workers=4)
```

#### Use a BatchSampler

By default, the PyTorch `DataLoader` loads batches of data from a dataset one item at a time, like this:

```py
batch = [dataset[idx] for idx in range(start, end)]
```

Unfortunately, this performs numerous read operations on the dataset.
It is more efficient to query batches of examples using a list:

```py
batch = dataset[start:end]
# or
batch = dataset[list_of_indices]
```

For the PyTorch `DataLoader` to query batches using a list, you can use a `BatchSampler`:

```py
>>> from torch.utils.data.sampler import BatchSampler, RandomSampler
>>> sampler = BatchSampler(RandomSampler(ds), batch_size=32, drop_last=False)
>>> dataloader = DataLoader(ds, sampler=sampler, batch_size=None)  # batch_size=None: the sampler already yields full batches
```

Moreover, this is particularly useful if you used [`Dataset.set_transform`] to apply a transform on-the-fly when examples are accessed.
You must use a `BatchSampler` if you want the transform to be given full batches instead of receiving `batch_size` times one single element.
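
For instance, here is a minimal sketch of a batched on-the-fly transform used with the `BatchSampler` above (the transform and its scaling are illustrative):

```py
>>> import torch
>>> def scale(batch):
...     # receives a full batch of 32 examples at once thanks to the BatchSampler
...     return {"data": torch.as_tensor(batch["data"]) * 2}
>>> ds.set_transform(scale)
>>> next(iter(dataloader))["data"].shape
torch.Size([32])
```
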
### Stream data

Loading a dataset in streaming mode is useful to progressively download the data you need while iterating over the dataset.
A streaming dataset formatted as `torch` inherits from `torch.utils.data.IterableDataset`, so you can pass it to a `DataLoader`:

```py
>>> import numpy as np
>>> from datasets import Dataset, load_dataset
>>> from torch.utils.data import DataLoader
>>> data = np.random.rand(10_000)
>>> Dataset.from_dict({"data": data}).push_to_hub("<username>/my_dataset")  # Upload to the Hugging Face Hub
>>> ds = load_dataset("<username>/my_dataset", streaming=True, split="train").with_format("torch")
>>> dataloader = DataLoader(ds, batch_size=32)
```

If the dataset is split into several shards (i.e. if the dataset consists of multiple data files), then you can stream in parallel using `num_workers`:

```py
>>> ds = load_dataset("c4", "en", streaming=True, split="train").with_format("torch")
>>> ds.n_shards
1024
>>> dataloader = DataLoader(ds, batch_size=32, num_workers=4)
```

In this case, each worker is given a subset of the list of shards to stream from: with 1024 shards and `num_workers=4`, each worker streams from 256 of them.

src/datasets/table.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -1749,6 +1749,8 @@ def array_cast(array: pa.Array, pa_type: pa.DataType, allow_number_to_str=True):
             raise TypeError(
                 f"Couldn't cast array of type {array.type} to {pa_type} since allow_number_to_str is set to {allow_number_to_str}"
             )
+        if pa.types.is_null(pa_type) and not pa.types.is_null(array.type):
+            raise TypeError(f"Couldn't cast array of type {array.type} to {pa_type}")
         return array.cast(pa_type)
     raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{pa_type}")
```
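
A quick sketch of the behavior this guard adds (illustrative, run against this version of `datasets`):

```py
import pyarrow as pa
from datasets.table import array_cast

# Casting a null-typed array to null still works:
array_cast(pa.array([None, None]), pa.null())

# Casting a non-null-typed array to null now fails fast with a TypeError:
try:
    array_cast(pa.array([1, 2]), pa.null())
except TypeError as err:
    print(err)  # Couldn't cast array of type int64 to null
```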

tests/test_table.py

Lines changed: 11 additions & 0 deletions

```diff
@@ -1034,3 +1034,14 @@ def test_cast_array_to_features_nested_with_null_values():
     assert casted_array.to_pylist() == [
         {"foo": [[], [0]]}
     ]  # empty list because of https://github.com/huggingface/datasets/issues/3676
+
+
+def test_cast_array_to_features_to_null_type():
+    # same type
+    arr = pa.array([[None, None]])
+    assert cast_array_to_feature(arr, Sequence(Value("null"))).type == pa.list_(pa.null())
+
+    # different type
+    arr = pa.array([[None, 1]])
+    with pytest.raises(TypeError):
+        cast_array_to_feature(arr, Sequence(Value("null")))
```
