Dataset Streaming #2375
Conversation
lewtun left a comment:
this feature looks mega cool and will solve a major pain point i've experienced with task templates for large datasets! i left a few nits in the docs and a couple of questions
docs/source/dataset_streaming.rst (outdated diff):

    >>> print(next(iter(dataset)))
    {'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visit to Malawi. Chief Napoleon conveyed the desperate need for a program to intervene and care for the orphans and vulnerable children (OVC) in Malawi, and John committed to help...

    Even though the dataset is 1.2 terabytes of data, you can start using it right away. Under the hood, it downloaded only the first examples of the dataset
nit:

- Even though the dataset is 1.2 terabytes of data, you can start using it right away. Under the hood, it downloaded only the first examples of the dataset
+ Even though the dataset is 1.2 terabytes of data, you can start using it right away! Under the hood, it only downloaded the first example of the dataset.
also, does it download 1 or more than one example when we use next(iter(dataset))?
It downloads the first examples (buffering + caching) and yields the first one.
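For illustration, a minimal sketch of that behavior (dataset and config names are taken from the surrounding thread):

```python
from datasets import load_dataset

# Streaming mode: nothing is downloaded up front
dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)

# next() triggers the download of a first buffered block of examples,
# but only the first example is yielded
example = next(iter(dataset))
print(example["text"][:100])
```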
perfect!
    It is common to use several datasets to train a model. For example, BERT was trained on a mix of Wikipedia and BookCorpus.
    You can mix several iterable datasets together using :func:`datasets.interleave_datasets`.
    By default, the resulting dataset alternates between the original datasets, but you can also define sampling probabilities to sample randomly from the different datasets.
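A minimal sketch of the interleaving described in the quoted docs (the second config name is just an illustrative second stream):

```python
from datasets import load_dataset, interleave_datasets

en = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)
fr = load_dataset("oscar", "unshuffled_deduplicated_fr", split="train", streaming=True)

# Alternates between the two streams by default; with probabilities it
# instead samples randomly from them, here 80% English / 20% French
mixed = interleave_datasets([en, fr], probabilities=[0.8, 0.2], seed=42)
print(next(iter(mixed)))
```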
wow, this is very cool!
    .. note::

        The dataset that is returned is a :class:`datasets.IterableDataset`, not the classic map-style :class:`datasets.Dataset`. To get examples from an iterable dataset, you have to iterate over it, for example with a for loop. To get the very last example of the dataset, you first have to iterate over all the previous examples.
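For illustration, consuming a few examples from the stream created above (a sketch):

```python
# Iterate lazily: only the examples actually consumed are downloaded
for i, example in enumerate(dataset):
    print(example["text"][:50])
    if i == 2:  # stop early; the rest of the stream is never fetched
        break
```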
is it possible to create a classic Dataset from an IterableDataset?
one application that i have in mind is picking the first N examples of a huge dataset, collecting them in a standard Dataset and then doing all my exploration / preprocessing / task preparation etc on that dataset.
e.g. something like
```python
from datasets import load_dataset

dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)

# create a `Dataset` i can play with?
sample = dataset.select(range(100))
```
Sure, definitely :)
I was thinking of adding something like this in a next PR. Maybe `IterableDataset.to_map_style_dataset()`?
To get only the first examples, we can also add `IterableDataset.take(n)`.
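Until such methods land, a workaround sketch using plain itertools (the method names above are proposals, not current API):

```python
from itertools import islice

from datasets import load_dataset

dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)

# Materialize the first 100 streamed examples as a plain list of dicts;
# a take(n) / to_map_style_dataset() method could wrap this kind of pattern
first_100 = list(islice(iter(dataset), 100))
```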
yeah, both those features would be great!
I like `IterableDataset.take(n)` as well. Could we also have an `IterableDataset.sample(n)` taking a random sample?
a random sample would be very neat as well. here we might want to use something like reservoir sampling to deal with unbounded streams: https://en.wikipedia.org/wiki/Reservoir_sampling
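A minimal sketch of that idea (Algorithm R; `k` is the sample size, not tied to the library):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Uniformly sample k items from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # keep item with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```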
As soon as we have `.take()`, you can do `iterable_dataset.shuffle(buffer_size=buffer_size, seed=seed).take(n)` to take random samples.
This could indeed be simplified by adding a `.sample()` method.
How about slicing support, i.e. `iterable_dataset[100:200]`, to get an iterator or Dataset at a particular slice?
I would like to avoid allowing users to get items using `__getitem__` since it's not a map-style dataset.
So I agree it would be nice to get a slice of the data, but with a different API. Maybe something like

sliced_dataset = iterable_dataset.skip(100).take(100)

What do you think? This is pretty close to the `tf.data.Dataset` API, which is also an iterable dataset.
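For comparison, a sketch of the analogous tf.data chain (illustrative values):

```python
import tensorflow as tf

# tf.data composes skip/take the same way on an iterable pipeline
ds = tf.data.Dataset.range(1_000).skip(100).take(100)
print(next(iter(ds)))  # -> tf.Tensor(100, shape=(), dtype=int64)
```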
Yes, very cool @lhoestq
lhoestq left a comment:
Thanks for the feedback :) I took your comments into account
lewtun left a comment:
LGTM 🚀
thomwolf left a comment:
This is really super cool!
Co-authored-by: Thomas Wolf <[email protected]>
Dataset Streaming
API
The current API is sketched below. I already implemented a few methods, including a "torch" format to get a torch.utils.data.IterableDataset.

I would love to have your opinion on the API design :)
Implementation details
Streaming
Data streaming is done using fsspec, which has nice caching features.

To make dataset streaming work, I extend the open function of dataset scripts to support opening remote files without downloading them entirely. It also works with remote compressed archives (currently only zip is supported).

I also extend the os.path.join function to support navigation in remote compressed archives, since it has to deal with the "::" separator used by fsspec.

Finally, I also added a retry mechanism in case the connection fails during data streaming.
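For illustration, the fsspec chaining pattern this builds on (the URL and file name are hypothetical):

```python
import fsspec

# The "::" separator chains filesystems: here we read one member of a
# remote zip archive without downloading the whole archive up front
url = "zip://data/train.txt::https://example.com/archive.zip"
with fsspec.open(url, "rt") as f:
    first_line = f.readline()
```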
Transforms
An IterableDataset wraps an ExamplesIterable instance. There are different subclasses depending on the transforms we want to apply, e.g. one that applies a map function on the fly (see the sketch below).
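An illustrative sketch of that lazy-wrapper idea (class and attribute names are assumptions, not the PR's actual code):

```python
class MappedExamplesIterable:
    """Wraps an examples iterable and applies a function lazily."""

    def __init__(self, examples_iterable, function):
        self.examples_iterable = examples_iterable
        self.function = function

    def __iter__(self):
        # The function is only applied as examples are consumed
        for key, example in self.examples_iterable:
            yield key, self.function(example)
```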
DatasetBuilder

I use the same builders as usual. I just added a new method, _get_examples_iterable_for_split, to get an ExamplesIterable for a given split. Currently only the GeneratorBasedBuilder and the ArrowBasedBuilder implement it. The BeamBasedBuilder doesn't implement it yet, which means that datasets like wikipedia and natural_questions can't be loaded as an IterableDataset for now.
Other details
I may have to change many dataset scripts to use download instead of download_and_extract when extraction is not needed. This will avoid errors for streaming.
EDIT: Actually, I just check the file extension and do extraction only if needed.
EDIT2: It's not possible to stream from .tar.gz files without downloading the file completely, since gzip streams have to be decompressed from the start (unlike zip archives, whose central directory allows fetching individual members). For now, I raise an error if one wants to load a streaming dataset based on .tar.gz files.
TODO
usual stuff:
- pip install datasets[streaming]