Commit 546c7bb

[docs] Complete to_iterable_dataset (#6158)
complete to_iterable_dataset docs
1 parent a7f8d90 commit 546c7bb

2 files changed: +36 -0 lines changed

docs/source/access.mdx

Lines changed: 9 additions & 0 deletions
@@ -100,6 +100,15 @@ An [`IterableDataset`] is loaded when you set the `streaming` parameter to `True
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F5C520>, 'label': 6}
```

You can also create an [`IterableDataset`] from an *existing* [`Dataset`], which is faster than streaming mode because the dataset is streamed from local files:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("rotten_tomatoes", split="train")
>>> iterable_dataset = dataset.to_iterable_dataset()
```

An [`IterableDataset`] progressively iterates over a dataset one example at a time, so you don't have to wait for the whole dataset to download before you can use it. As you can imagine, this is quite useful for large datasets you want to use immediately!

However, this means an [`IterableDataset`]'s behavior is different from a regular [`Dataset`]. You don't get random access to examples in an [`IterableDataset`]. Instead, you should iterate over its elements, for example, by calling `next(iter())` or with a `for` loop to return the next item from the [`IterableDataset`]:
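The sequential-access pattern described above can be illustrated without the library itself: an iterable dataset supports only forward iteration, never indexing. A minimal sketch using a plain Python generator (`example_stream` is a hypothetical stand-in, not part of `datasets`):

```python
def example_stream():
    """Yield examples one at a time, like an IterableDataset does."""
    for i in range(3):
        yield {"text": f"example {i}", "label": i % 2}

# next(iter(...)) returns the first example without materializing the rest
first = next(iter(example_stream()))
print(first)  # {'text': 'example 0', 'label': 0}

# a for loop walks the remaining examples strictly in order
for example in example_stream():
    print(example["label"])
```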

docs/source/stream.mdx

Lines changed: 27 additions & 0 deletions
@@ -51,6 +51,33 @@ You can find more details in the [Dataset vs. IterableDataset guide](./about_map
</Tip>

## Convert from a Dataset

If you have an existing [`Dataset`] object, you can convert it to an [`IterableDataset`] with the [`~Dataset.to_iterable_dataset`] function. This is actually faster than setting the `streaming=True` argument in [`load_dataset`] because the data is streamed from local files.

```py
>>> from datasets import load_dataset

# faster 🐇
>>> dataset = load_dataset("food101", split="train")
>>> iterable_dataset = dataset.to_iterable_dataset()

# slower 🐢
>>> iterable_dataset = load_dataset("food101", split="train", streaming=True)
```
The [`~Dataset.to_iterable_dataset`] function supports sharding when the [`IterableDataset`] is instantiated. This is useful when working with big datasets and you'd like to shuffle the dataset or enable fast parallel loading with a PyTorch DataLoader.

```py
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("food101", split="train")
>>> iterable_dataset = dataset.to_iterable_dataset(num_shards=64)  # shard the dataset
>>> iterable_dataset = iterable_dataset.shuffle(buffer_size=10_000)  # shuffles the shard order and uses a shuffle buffer when you start iterating
>>> dataloader = torch.utils.data.DataLoader(iterable_dataset, num_workers=4)  # assigns 64 / 4 = 16 shards from the shuffled list of shards to each worker when you start iterating
```
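The `64 / 4 = 16` figure in the comment above is plain arithmetic: the shards are divided evenly among the DataLoader workers. A sketch of such an assignment (the round-robin split here is only illustrative; `datasets` handles the actual distribution internally):

```python
num_shards = 64   # from to_iterable_dataset(num_shards=64)
num_workers = 4   # from DataLoader(..., num_workers=4)

# distribute shard indices across workers, one possible even split
shards = list(range(num_shards))
per_worker = [shards[w::num_workers] for w in range(num_workers)]

# every worker ends up with num_shards / num_workers shards
print(len(per_worker[0]))  # 16, matching the comment above
```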
## Shuffle

Like a regular [`Dataset`] object, you can also shuffle an [`IterableDataset`] with [`IterableDataset.shuffle`].
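Because a streamed dataset cannot be shuffled globally, `IterableDataset.shuffle` relies on a fixed-size buffer (in addition to shuffling the shard order). A minimal sketch of buffer-based shuffling in plain Python, showing why a larger `buffer_size` gives better mixing — illustrative only, not the library's implementation:

```python
import random

def buffer_shuffle(iterable, buffer_size, seed=42):
    """Yield items in approximately random order using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        buffer.append(item)
        if len(buffer) == buffer_size:
            # emit a random element from the buffer, keeping memory bounded
            yield buffer.pop(rng.randrange(len(buffer)))
    # drain whatever is left in the buffer at the end of the stream
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffer_shuffle(range(10), buffer_size=4))
print(shuffled)  # same 10 items, in a buffer-limited random order
```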
