Dataset Streaming #2375
Conversation
lewtun left a comment:
this feature looks mega cool and will solve a major pain point i've experienced with task templates for large datasets! i left a few nits in the docs and a couple of questions
docs/source/dataset_streaming.rst (outdated diff):

    >>> print(next(iter(dataset)))
    {'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visit to Malawi. Chief Napoleon conveyed the desperate need for a program to intervene and care for the orphans and vulnerable children (OVC) in Malawi, and John committed to help...

    Even though the dataset is 1.2 terabytes of data, you can start using it right away. Under the hood, it downloaded only the first examples of the dataset
nit:

- Even though the dataset is 1.2 terabytes of data, you can start using it right away. Under the hood, it downloaded only the first examples of the dataset
+ Even though the dataset is 1.2 terabytes of data, you can start using it right away! Under the hood, it only downloaded the first example of the dataset.
also, does it download 1 or more than one example when we use next(iter(dataset))?
It downloads the first examples (buffering + caching) and yields the first one.
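For illustration, a minimal sketch of that behavior (dataset and config names are taken from the surrounding thread):

```python
from datasets import load_dataset

# Streaming mode: nothing is downloaded up front
dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)

# next() triggers the download of a first buffered block of examples,
# but only the first example is yielded
example = next(iter(dataset))
print(example["text"][:100])
```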
perfect!
    It is common to use several datasets to train a model. For example, BERT was trained on a mix of Wikipedia and BookCorpus.
    You can mix several iterable datasets together using :func:`datasets.interleave_datasets`.
    By default, the resulting dataset alternates between the original datasets, but you can also define sampling probabilities to sample randomly from the different datasets.
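A minimal sketch of the interleaving described in the quoted docs (the second config name is just an illustrative second stream):

```python
from datasets import load_dataset, interleave_datasets

en = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)
fr = load_dataset("oscar", "unshuffled_deduplicated_fr", split="train", streaming=True)

# Alternates between the two streams by default; with probabilities it
# instead samples randomly from them, here 80% English / 20% French
mixed = interleave_datasets([en, fr], probabilities=[0.8, 0.2], seed=42)
print(next(iter(mixed)))
```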
wow, this is very cool!
    .. note::

        The dataset that is returned is a :class:`datasets.IterableDataset`, not the classic map-style :class:`datasets.Dataset`. To get examples from an iterable dataset, you have to iterate over it, for example with a for loop. To get the very last example of the dataset, you first have to iterate over all the previous examples.
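For illustration, consuming a few examples from the stream created above (a sketch):

```python
# Iterate lazily: only the examples actually consumed are downloaded
for i, example in enumerate(dataset):
    print(example["text"][:50])
    if i == 2:  # stop early; the rest of the stream is never fetched
        break
```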
is it possible to create a classic Dataset from an IterableDataset?
one application that i have in mind is picking the first N examples of a huge dataset, collecting them in a standard Dataset and then doing all my exploration / preprocessing / task preparation etc on that dataset.
e.g. something like
```python
from datasets import load_dataset

dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)

# create a `Dataset` i can play with?
sample = dataset.select(range(100))
```
Sure, definitely :)
I was thinking of adding something like this in a next PR. Maybe `IterableDataset.to_map_style_dataset()`?
To get only the first examples, we can also add `IterableDataset.take(n)`.
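Until such methods land, a workaround sketch using plain itertools (the method names above are proposals, not current API):

```python
from itertools import islice

from datasets import load_dataset

dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)

# Materialize the first 100 streamed examples as a plain list of dicts;
# a take(n) / to_map_style_dataset() method could wrap this kind of pattern
first_100 = list(islice(iter(dataset), 100))
```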
yeah, both those features would be great!
I like `IterableDataset.take(n)` as well. Could we also have an `IterableDataset.sample(n)` taking a random sample?
a random sample would be very neat as well. here we might want to use something like reservoir sampling to deal with unbounded streams: https://en.wikipedia.org/wiki/Reservoir_sampling
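A minimal sketch of that idea (Algorithm R; `k` is the sample size, not tied to the library):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Uniformly sample k items from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # keep item with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```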
As soon as we have `.take()`, you can do `iterable_dataset.shuffle(buffer_size=buffer_size, seed=seed).take(n)` to take random samples.
This could indeed be simplified by adding a `.sample()` method.
How about slicing support, i.e. `iterable_dataset[100:200]`, to get an iterator or Dataset at a particular slice?
I would like to avoid allowing users to get items using `__getitem__` since it's not a map-style dataset.
So I agree it would be nice to get a slice of the data, but with a different API. Maybe something like

sliced_dataset = iterable_dataset.skip(100).take(100)

What do you think? This is pretty close to the `tf.data.Dataset` API, which is also an iterable dataset.
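For comparison, a sketch of the analogous tf.data chain (illustrative values):

```python
import tensorflow as tf

# tf.data composes skip/take the same way on an iterable pipeline
ds = tf.data.Dataset.range(1_000).skip(100).take(100)
print(next(iter(ds)))  # -> tf.Tensor(100, shape=(), dtype=int64)
```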
Yes, very cool @lhoestq
lhoestq left a comment:
Thanks for the feedback :) I took your comments into account
lewtun left a comment:
LGTM 🚀
thomwolf left a comment:
This is really super cool!
Co-authored-by: Thomas Wolf <[email protected]>
Dataset Streaming
API
The current API is sketched below. I already implemented a few methods, including a "torch" format to get a torch.utils.data.IterableDataset.

I would love to have your opinion on the API design :)
Implementation details
Streaming
Data streaming is done using fsspec, which has nice caching features.

To make dataset streaming work, I extend the open function of dataset scripts to support opening remote files without downloading them entirely. It also works with remote compressed archives (currently only zip is supported).

I also extend the os.path.join function to support navigation in remote compressed archives, since it has to deal with the "::" separator used by fsspec.

Finally, I also added a retry mechanism in case the connection fails during data streaming.
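For illustration, the fsspec chaining pattern this builds on (the URL and file name are hypothetical):

```python
import fsspec

# The "::" separator chains filesystems: here we read one member of a
# remote zip archive without downloading the whole archive up front
url = "zip://data/train.txt::https://example.com/archive.zip"
with fsspec.open(url, "rt") as f:
    first_line = f.readline()
```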
Transforms
An IterableDataset wraps an ExamplesIterable instance. There are different subclasses depending on the transforms we want to apply, e.g. one that applies a map function on the fly (see the sketch below).
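An illustrative sketch of that lazy-wrapper idea (class and attribute names are assumptions, not the PR's actual code):

```python
class MappedExamplesIterable:
    """Wraps an examples iterable and applies a function lazily."""

    def __init__(self, examples_iterable, function):
        self.examples_iterable = examples_iterable
        self.function = function

    def __iter__(self):
        # The function is only applied as examples are consumed
        for key, example in self.examples_iterable:
            yield key, self.function(example)
```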
DatasetBuilder

I use the same builders as usual. I just added a new method, _get_examples_iterable_for_split, to get an ExamplesIterable for a given split. Currently only the GeneratorBasedBuilder and the ArrowBasedBuilder implement it. The BeamBasedBuilder doesn't implement it yet, which means that datasets like wikipedia and natural_questions can't be loaded as an IterableDataset for now.
Other details
I may have to change many dataset scripts to use download instead of download_and_extract when extraction is not needed. This will avoid errors for streaming.
EDIT: Actually, I just check the file extension and do extraction only if needed.
EDIT2: It's not possible to stream from .tar.gz files without downloading the file completely, since gzip streams have to be decompressed from the start (unlike zip archives, whose central directory allows fetching individual members). For now, I raise an error if one wants to load a streaming dataset based on .tar.gz files.
TODO
usual stuff:
- pip install datasets[streaming]