diff --git a/docs/source/access.mdx b/docs/source/access.mdx
index 483f56d1b6e..f925056e0cf 100644
--- a/docs/source/access.mdx
+++ b/docs/source/access.mdx
@@ -100,6 +100,15 @@ An [`IterableDataset`] is loaded when you set the `streaming` parameter to `True
 {'image': , 'label': 6}
 ```
+You can also create an [`IterableDataset`] from an *existing* [`Dataset`]. This is faster than streaming mode because the dataset is streamed from local files:
+
+```py
+>>> from datasets import load_dataset
+
+>>> dataset = load_dataset("rotten_tomatoes", split="train")
+>>> iterable_dataset = dataset.to_iterable_dataset()
+```
+
 An [`IterableDataset`] progressively iterates over a dataset one example at a time, so you don't have to wait for the whole dataset to download before you can use it. As you can imagine, this is quite useful for large datasets you want to use immediately!
 
 However, this means an [`IterableDataset`]'s behavior is different from a regular [`Dataset`]. You don't get random access to examples in an [`IterableDataset`]. Instead, you should iterate over its elements, for example, by calling `next(iter())` or with a `for` loop to return the next item from the [`IterableDataset`]:
diff --git a/docs/source/stream.mdx b/docs/source/stream.mdx
index e64e691da3c..8dab82b61a4 100644
--- a/docs/source/stream.mdx
+++ b/docs/source/stream.mdx
@@ -51,6 +51,33 @@ You can find more details in the [Dataset vs. IterableDataset guide](./about_map
 
+## Convert from a Dataset
+
+If you have an existing [`Dataset`] object, you can convert it to an [`IterableDataset`] with the [`~Dataset.to_iterable_dataset`] function. This is faster than setting `streaming=True` in [`load_dataset`] because the data is streamed from local files:
+
+```py
+>>> from datasets import load_dataset
+
+# faster 🐇
+>>> dataset = load_dataset("food101", split="train")
+>>> iterable_dataset = dataset.to_iterable_dataset()
+
+# slower 🐢
+>>> iterable_dataset = load_dataset("food101", split="train", streaming=True)
+```
+
+The [`~Dataset.to_iterable_dataset`] function supports sharding when the [`IterableDataset`] is instantiated. This is useful for big datasets when you'd like to shuffle the dataset or enable fast parallel loading with a PyTorch `DataLoader`:
+
+```py
+>>> import torch
+>>> from datasets import load_dataset
+
+>>> dataset = load_dataset("food101", split="train")
+>>> iterable_dataset = dataset.to_iterable_dataset(num_shards=64)  # shard the dataset
+>>> iterable_dataset = iterable_dataset.shuffle(buffer_size=10_000)  # shuffles the shard order and uses a shuffle buffer when you start iterating
+>>> dataloader = torch.utils.data.DataLoader(iterable_dataset, num_workers=4)  # assigns 64 / 4 = 16 shards from the shuffled list of shards to each worker when you start iterating
+```
+
 ## Shuffle
 
 Like a regular [`Dataset`] object, you can also shuffle an [`IterableDataset`] with [`IterableDataset.shuffle`].