Dataset Streaming #2375
Changes from 44 commits
@@ -0,0 +1,138 @@

Load a Dataset in Streaming mode
==============================================================

When a dataset is in streaming mode, you can iterate over it directly, without having to download the entire dataset.
The data are downloaded progressively as you iterate over the dataset.
You can enable dataset streaming by passing ``streaming=True`` to the :func:`load_dataset` function to get an iterable dataset.

This is useful if you don't have enough space on your disk to download the dataset, or if you don't want to wait for your dataset to be downloaded before using it.

Here is a demonstration:

.. code-block:: python

    >>> from datasets import load_dataset
    >>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
    >>> print(next(iter(dataset)))
    {'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visit to Malawi. Chief Napoleon conveyed the desperate need for a program to intervene and care for the orphans and vulnerable children (OVC) in Malawi, and John committed to help...

Even though the dataset is 1.2 terabytes of data, you can start using it right away. Under the hood, it downloaded only the first examples of the dataset.
Suggested change:

- Even though the dataset is 1.2 terabytes of data, you can start using it right away. Under the hood, it downloaded only the first examples of the dataset
+ Even though the dataset is 1.2 terabytes of data, you can start using it right away! Under the hood, it only downloaded the first example of the dataset.
Also, does it download one example or more than one when we use next(iter(dataset))?
It downloads the first examples (buffering + caching) and yields the first one.
perfect!
Is it possible to create a classic Dataset from an IterableDataset?
One application that I have in mind is picking the first N examples of a huge dataset, collecting them in a standard Dataset, and then doing all my exploration / preprocessing / task preparation etc. on that dataset.
E.g. something like
from datasets import load_dataset
dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
# create a `Dataset` i can play with?
sample = dataset.select(range(100))
Sure definitely :)
I was thinking of adding something like this in a next PR.
Maybe IterableDataset.to_map_style_dataset() ?
To get only the first examples we can also add IterableDataset.take(n)
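Until such helpers exist, the same pattern can be sketched in plain Python. This is a workaround, not the eventual `datasets` API; the helper names `take` and `to_columns` are hypothetical, and the resulting dict of columns is the shape that `Dataset.from_dict` accepts:

```python
from itertools import count, islice

def take(iterable, n):
    # Collect the first n examples of a (possibly unbounded) stream.
    return list(islice(iterable, n))

def to_columns(examples):
    # Turn a list of dict examples into a dict of columns, the shape
    # accepted by Dataset.from_dict for building a map-style dataset.
    if not examples:
        return {}
    return {key: [ex[key] for ex in examples] for key in examples[0]}

# A simulated unbounded stream, standing in for an IterableDataset.
stream = ({"text": f"example {i}"} for i in count())

first_100 = take(stream, 100)
columns = to_columns(first_100)
```

With the real library you would then do something like `Dataset.from_dict(columns)` to get a map-style dataset to explore.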
yeah, both those features would be great!
I like the IterableDataset.take(n) as well. Could we also have an IterableDataset.sample(n) taking a random sample?
a random sample would be very neat as well. here we might want to use something like reservoir sampling to deal with unbounded streams: https://en.wikipedia.org/wiki/Reservoir_sampling
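For reference, the linked technique (Algorithm R) keeps a uniform sample of k items while reading each stream element exactly once, using O(k) memory. A minimal sketch in plain Python, not part of the `datasets` API:

```python
import random

def reservoir_sample(iterable, k, seed=None):
    # Algorithm R: after i items have been seen, each of them is in
    # the reservoir with probability k / i.
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # pick a slot in [0, i]
            if j < k:
                reservoir[j] = item         # replace with prob. k/(i+1)
    return reservoir

sample = reservoir_sample(range(1_000_000), 5, seed=0)
```

If the stream has fewer than k items, the whole stream is returned.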
As soon as we have .take(), you can do iterable_dataset.shuffle(buffer_size=buffer_size, seed=seed).take(n) to take random samples.
This could be simplified by adding a .sample() method indeed
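The buffer-based shuffle mentioned here can be sketched as a generator. This is a simplification of the idea behind `shuffle(buffer_size=...)`, not the library's actual implementation:

```python
import random
from itertools import islice

def buffered_shuffle(iterable, buffer_size, seed=None):
    # Approximate shuffle for a stream: keep a fixed-size buffer, emit a
    # random buffer slot, and refill that slot with the next incoming example.
    rng = random.Random(seed)
    it = iter(iterable)
    buffer = list(islice(it, buffer_size))
    for item in it:
        j = rng.randrange(len(buffer))
        yield buffer[j]
        buffer[j] = item
    rng.shuffle(buffer)  # drain the remaining buffered examples
    yield from buffer
```

A small buffer_size gives only weak shuffling; a buffer_size at least the dataset size gives an exact shuffle.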
How about slicing support, i.e. iterable_dataset[100:200], to get an iterator or Dataset at a particular slice?
I would like to avoid allowing users to get items using __getitem__ since it's not a map-style dataset.
So I agree it would be nice to get a slice of the data, but with a different API. Maybe something like
sliced_dataset = iterable_dataset.skip(100).take(100)
What do you think?
This is pretty close to the tf.data.Dataset API, which is also an iterable dataset.
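The proposed skip/take combinators behave like their tf.data counterparts and can be sketched with `itertools` (free functions here for illustration, not the eventual `datasets` methods):

```python
from itertools import islice

def skip(iterable, n):
    # Advance past the first n examples (cf. tf.data.Dataset.skip).
    it = iter(iterable)
    next(islice(it, n, n), None)  # itertools "consume" recipe
    return it

def take(iterable, n):
    # Yield at most n examples (cf. tf.data.Dataset.take).
    return islice(iterable, n)

# "Slice" examples 100..199 out of a stream, without __getitem__:
sliced = list(take(skip(range(1000), 100), 100))
```

Both functions consume the stream lazily, so only the first 200 examples are ever pulled from the source.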
Yes, very cool @lhoestq