Commit 546c7bb

[docs] Complete to_iterable_dataset (#6158)
complete to_iterable_dataset docs
1 parent a7f8d90 commit 546c7bb

2 files changed: +36 -0 lines changed

docs/source/access.mdx

Lines changed: 9 additions & 0 deletions
@@ -100,6 +100,15 @@ An [`IterableDataset`] is loaded when you set the `streaming` parameter to `True
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F5C520>, 'label': 6}
```

You can also create an [`IterableDataset`] from an *existing* [`Dataset`], which is faster than streaming mode because the dataset is streamed from local files:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("rotten_tomatoes", split="train")
>>> iterable_dataset = dataset.to_iterable_dataset()
```

An [`IterableDataset`] progressively iterates over a dataset one example at a time, so you don't have to wait for the whole dataset to download before you can use it. As you can imagine, this is quite useful for large datasets you want to use immediately!

However, this means an [`IterableDataset`]'s behavior is different from a regular [`Dataset`]. You don't get random access to examples in an [`IterableDataset`]. Instead, you should iterate over its elements, for example, by calling `next(iter())` or with a `for` loop to return the next item from the [`IterableDataset`]:
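The sequential-access pattern described above can be illustrated without the library itself: an iterable dataset supports only forward iteration, never indexing. A minimal sketch using a plain Python generator (`example_stream` is a hypothetical stand-in, not part of `datasets`):

```python
def example_stream():
    """Yield examples one at a time, like an IterableDataset does."""
    for i in range(3):
        yield {"text": f"example {i}", "label": i % 2}

# next(iter(...)) returns the first example without materializing the rest
first = next(iter(example_stream()))
print(first)  # {'text': 'example 0', 'label': 0}

# a for loop walks the remaining examples strictly in order
for example in example_stream():
    print(example["label"])
```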

docs/source/stream.mdx

Lines changed: 27 additions & 0 deletions
@@ -51,6 +51,33 @@ You can find more details in the [Dataset vs. IterableDataset guide](./about_map
</Tip>

## Convert from a Dataset

If you have an existing [`Dataset`] object, you can convert it to an [`IterableDataset`] with the [`~Dataset.to_iterable_dataset`] function. This is actually faster than setting the `streaming=True` argument in [`load_dataset`] because the data is streamed from local files.

```py
>>> from datasets import load_dataset

# faster 🐇
>>> dataset = load_dataset("food101", split="train")
>>> iterable_dataset = dataset.to_iterable_dataset()

# slower 🐢
>>> iterable_dataset = load_dataset("food101", split="train", streaming=True)
```
The [`~Dataset.to_iterable_dataset`] function supports sharding when the [`IterableDataset`] is instantiated. This is useful when working with big datasets and you'd like to shuffle the dataset or enable fast parallel loading with a PyTorch DataLoader.

```py
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("food101", split="train")
>>> iterable_dataset = dataset.to_iterable_dataset(num_shards=64)  # shard the dataset
>>> iterable_dataset = iterable_dataset.shuffle(buffer_size=10_000)  # shuffles the shard order and uses a shuffle buffer when you start iterating
>>> dataloader = torch.utils.data.DataLoader(iterable_dataset, num_workers=4)  # assigns 64 / 4 = 16 shards from the shuffled list of shards to each worker when you start iterating
```
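The `64 / 4 = 16` figure in the comment above is plain arithmetic: the shards are divided evenly among the DataLoader workers. A sketch of such an assignment (the round-robin split here is only illustrative; `datasets` handles the actual distribution internally):

```python
num_shards = 64   # from to_iterable_dataset(num_shards=64)
num_workers = 4   # from DataLoader(..., num_workers=4)

# distribute shard indices across workers, one possible even split
shards = list(range(num_shards))
per_worker = [shards[w::num_workers] for w in range(num_workers)]

# every worker ends up with num_shards / num_workers shards
print(len(per_worker[0]))  # 16, matching the comment above
```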
## Shuffle

Like a regular [`Dataset`] object, you can also shuffle an [`IterableDataset`] with [`IterableDataset.shuffle`].
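Because a streamed dataset cannot be shuffled globally, `IterableDataset.shuffle` relies on a fixed-size buffer (in addition to shuffling the shard order). A minimal sketch of buffer-based shuffling in plain Python, showing why a larger `buffer_size` gives better mixing — illustrative only, not the library's implementation:

```python
import random

def buffer_shuffle(iterable, buffer_size, seed=42):
    """Yield items in approximately random order using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        buffer.append(item)
        if len(buffer) == buffer_size:
            # emit a random element from the buffer, keeping memory bounded
            yield buffer.pop(rng.randrange(len(buffer)))
    # drain whatever is left in the buffer at the end of the stream
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffer_shuffle(range(10), buffer_size=4))
print(shuffled)  # same 10 items, in a buffer-limited random order
```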
