Dataset Streaming #2375 (Merged)

Commits (55):
00d69e0  add streaming module (lhoestq)
587bbb9  make oscar streaming compatible (lhoestq)
ed6f7c3  minor (lhoestq)
bcd5b92  minor (lhoestq)
00a4a57  use right name for format type (lhoestq)
45bd319  add shuffle (lhoestq)
ca4a4b3  shuffle examples generator's source data files (lhoestq)
8abcc73  add transform_batch_size (lhoestq)
40e6014  clean shuffling buffer (lhoestq)
bd9b940  iterable_dataset factory (lhoestq)
7fb47d9  support hub datasets and arrow based builder (lhoestq)
c461499  Merge branch 'master' into streaming (lhoestq)
ff0702d  add merge_datasets with probabilities (lhoestq)
f14b51b  style (lhoestq)
eae82d2  add retries if server disconnection occurs (lhoestq)
5b7843a  add aiohttp to setup.py (lhoestq)
a3df1e4  allow streaming from zip files (lhoestq)
97bce1c  replace download_and_extract by simple download when there's no extra… (lhoestq)
4b655a9  Merge branch 'master' into streaming (lhoestq)
dc5309a  re-organize code (lhoestq)
3008988  Merge branch 'master' into dataset-streaming (lhoestq)
8579c71  start tests (lhoestq)
080a083  more tests (lhoestq)
6172953  more tests (lhoestq)
f4b84eb  add `streaming` argument to `load_dataset` (lhoestq)
daede36  allow streaming from private repos (lhoestq)
e2f26dc  Merge branch 'master' into dataset-streaming (lhoestq)
064ab00  Revert "replace download_and_extract by simple download when there's …" (lhoestq)
bbd1389  fix import (lhoestq)
ed174e5  use py int instead of np int (lhoestq)
5fd7edb  start documentation (lhoestq)
79014f8  add batched parameter, add n_shards (lhoestq)
369d238  import from main init (lhoestq)
602b985  docs (lhoestq)
42da548  Merge branch 'master' into dataset-streaming (lhoestq)
ba7bbca  docs (lhoestq)
4fa549e  fix docs (lhoestq)
4fa1e0f  add missing language codes for oscar using pycountry (lhoestq)
6a6e21f  add missing sections in oscar dataset card (lhoestq)
39f717a  add gz support + add tests (lhoestq)
c9acd44  remove constrains on fsspec and s3fs for py3.6 (lhoestq)
f5cf3f3  fix test (lhoestq)
c1a63bf  fix test on windows (lhoestq)
1bf093f  style (lhoestq)
20aba4d  rename to interleave_datasets + comments (lhoestq)
9c1a2e1  Merge branch 'master' into dataset-streaming (lhoestq)
b180bc8  lewis' comments (lhoestq)
a657d03  typing in gzip (lhoestq)
cd23946  move interleave_datasets in combine.py (lhoestq)
7f19f4a  add pretty_name to OSCAR (lhoestq)
5ab438c  docs (lhoestq)
8f68a43  Merge branch 'master' into dataset-streaming (lhoestq)
0ee60af  Update src/datasets/combine.py (lhoestq)
ed9569a  fix docstring (lhoestq)
593e229  docstrings again (lhoestq)
Load a Dataset in Streaming mode
================================

When a dataset is in streaming mode, you can iterate over it directly without having to download the entire dataset.
The data are downloaded progressively as you iterate over the dataset.
You can enable dataset streaming by passing ``streaming=True`` to the :func:`load_dataset` function to get an iterable dataset.

This is useful if you don't have enough space on your disk to download the dataset, or if you don't want to wait for your dataset to be downloaded before using it.

Here is a demonstration:

.. code-block::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
    >>> print(next(iter(dataset)))
    {'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visit to Malawi. Chief Napoleon conveyed the desperate need for a program to intervene and care for the orphans and vulnerable children (OVC) in Malawi, and John committed to help...
Even though the dataset is 1.2 terabytes of data, you can start using it right away. Under the hood, it only downloaded the first examples of the dataset for buffering, and returned the first one.

.. note::

    The returned dataset is a :class:`datasets.IterableDataset`, not the classic map-style :class:`datasets.Dataset`. To get examples from an iterable dataset, you have to iterate over it, for example with a for loop: to get the very last example of the dataset, you first have to iterate over all the previous ones.
    Therefore, iterable datasets are mostly useful for iterative jobs like training a model, not for jobs that require random access to examples.
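For instance, you can peek at the first few examples with standard Python iteration tools. Here is a minimal sketch using ``itertools.islice`` (plain Python iteration, not a ``datasets`` API):

.. code-block::

    >>> from itertools import islice
    >>> for example in islice(dataset, 3):  # only the first 3 examples are downloaded
    ...     print(example["text"][:100])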
Shuffling the dataset: ``shuffle``
----------------------------------

Shuffle the dataset
~~~~~~~~~~~~~~~~~~~

To shuffle your dataset, the :func:`datasets.IterableDataset.shuffle` method fills a buffer of size ``buffer_size`` and randomly samples examples from this buffer.
The selected examples in the buffer are replaced by new examples.

For instance, if your dataset contains 1,000,000 examples but ``buffer_size`` is set to 1,000, then ``shuffle`` will initially select a random example from only the first 1,000 examples in the buffer.
Once an example is selected, its place in the buffer is filled by the next (i.e. the 1,001st) example, maintaining the 1,000-example buffer.
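Conceptually, this buffer-based shuffling works roughly like the following sketch (a simplified illustration of the technique, not the library's actual implementation):

.. code-block::

    import random

    def buffered_shuffle(iterable, buffer_size, seed=None):
        rng = random.Random(seed)
        buffer = []
        for example in iterable:
            if len(buffer) < buffer_size:
                buffer.append(example)          # first, fill the buffer
            else:
                i = rng.randrange(buffer_size)  # pick a random slot in the buffer
                yield buffer[i]                 # yield the buffered example...
                buffer[i] = example             # ...and put the new one in its place
        rng.shuffle(buffer)                     # finally, drain the remaining buffer
        yield from buffer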
.. note::

    For perfect shuffling, you would need to set ``buffer_size`` to be greater than the size of your dataset, but in that case the full dataset is downloaded into the buffer.

Moreover, for larger datasets that are sharded into multiple files, :func:`datasets.IterableDataset.shuffle` also shuffles the order of the shards.

.. code-block::

    >>> shuffled_dataset = dataset.shuffle(buffer_size=10_000, seed=42)
    >>> print(next(iter(shuffled_dataset)))
    {'text': 'In this role, she oversees the day-to-day operations of the agency’s motoring services divisions (Vehicle Titles & Registration, Motor Vehicles, Motor Carrier, Enforcement, Consumer Relations and the Automobile Burglary & Theft Prevention Authority) to ensure they are constantly improving and identifying opportunities to become more efficient and effective in service delivery...
    >>> print(dataset.n_shards)
    670

In this example, the shuffle buffer contains 10,000 examples that were downloaded from one random shard of the dataset (here it actually comes from the 480th shard out of 670).
The example was selected randomly from this buffer, and its slot was filled by the 10,001st example of the dataset shard.
Reshuffle the dataset at each epoch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The seed used to shuffle the dataset is the one you specify in :func:`datasets.IterableDataset.shuffle`, but often you want to reshuffle the dataset with a different seed after each epoch.
To do so, between epochs you can simply tell the dataset which epoch you're at, and the data will be shuffled using an effective seed of ``seed + epoch``.

For example, your training loop can look like this:

.. code-block::

    >>> for epoch in range(epochs):
    ...     shuffled_dataset.set_epoch(epoch)
    ...     for example in shuffled_dataset:
    ...         ...

In this case, the dataset is shuffled with ``seed + 0`` in the first epoch and with ``seed + 1`` in the second, so it is reshuffled at every epoch. This randomizes both the shuffle buffer and the order of the shards.
Processing data with ``map``
----------------------------

As with :class:`datasets.Dataset` objects, you can process your data using ``map``. This is useful if you want to transform the data or rename/remove columns.
Since the examples of a :class:`datasets.IterableDataset` are downloaded progressively, the :func:`datasets.IterableDataset.map` method processes the examples on the fly as you iterate over the dataset (contrary to :func:`datasets.Dataset.map`, which processes all the examples directly).

This example shows how to tokenize your dataset:

.. code-block::

    >>> from transformers import AutoTokenizer
    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    >>> tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"]))
    >>> print(next(iter(tokenized_dataset)))
    {'input_ids': [101, 11047, 10497, 7869, 2352...], 'token_type_ids': [0, 0, 0, 0, 0...], 'attention_mask': [1, 1, 1, 1, 1...]}

Tokenizers are written in Rust and use parallelism to speed up tokenization. To leverage this parallelism, you can process the examples batch by batch. Note that the output examples are still returned one by one:

.. code-block::

    >>> tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"]), batched=True)  # the default batch_size is 1000, but you can specify another one if needed
    >>> print(next(iter(tokenized_dataset)))
    {'input_ids': [101, 11047, 10497, 7869, 2352...], 'token_type_ids': [0, 0, 0, 0, 0...], 'attention_mask': [1, 1, 1, 1, 1...]}
Mix several iterable datasets together with ``interleave_datasets``
-------------------------------------------------------------------

It is common to use several datasets to train a model. For example, BERT was trained on a mix of Wikipedia and BookCorpus.
You can mix several iterable datasets together using :func:`datasets.interleave_datasets`.

By default, the resulting dataset alternates between the original datasets, but you can also define sampling probabilities to sample randomly from the different datasets.

[Review comment from a maintainer: wow, this is very cool!]
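A rough sketch of the two interleaving modes (a conceptual illustration under simplifying assumptions, not the library internals):

.. code-block::

    import random
    from itertools import cycle

    def interleave(iterables, probabilities=None, seed=None):
        iterators = [iter(it) for it in iterables]
        rng = random.Random(seed)
        try:
            if probabilities is None:
                # round-robin: alternate between the sources
                for iterator in cycle(iterators):
                    yield next(iterator)
            else:
                # at each step, pick a source at random according to the probabilities
                while True:
                    (iterator,) = rng.choices(iterators, weights=probabilities)
                    yield next(iterator)
        except StopIteration:
            return  # stop once one of the sources runs out of examples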
For example, if you want a dataset in several languages:

.. code-block::

    >>> from datasets import interleave_datasets
    >>> from itertools import islice
    >>> en_dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
    >>> fr_dataset = load_dataset('oscar', "unshuffled_deduplicated_fr", split='train', streaming=True)
    >>>
    >>> multilingual_dataset = interleave_datasets([en_dataset, fr_dataset])
    >>> print(list(islice(multilingual_dataset, 2)))
    [{'text': 'Mtendere Village was inspired by the vision...'}, {'text': "Média de débat d'idées, de culture et de littérature..."}]
    >>>
    >>> multilingual_dataset_with_oversampling = interleave_datasets([en_dataset, fr_dataset], probabilities=[0.8, 0.2], seed=42)
    >>> print(list(islice(multilingual_dataset_with_oversampling, 2)))
    [{'text': 'Mtendere Village was inspired by the vision...'}, {'text': 'Lily James cannot fight the music...'}]
Working with NumPy, pandas, PyTorch and TensorFlow
--------------------------------------------------

This part is still experimental and breaking changes may happen in the near future.

It is possible to get a ``torch.utils.data.IterableDataset`` from a :class:`datasets.IterableDataset` by setting the dataset format to "torch", just as for a :class:`datasets.Dataset`:

.. code-block::

    >>> import torch
    >>> tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"], return_tensors="pt"))
    >>> torch_tokenized_dataset = tokenized_dataset.with_format("torch")
    >>> assert isinstance(torch_tokenized_dataset, torch.utils.data.IterableDataset)
    >>> print(next(iter(torch_tokenized_dataset)))
    {'input_ids': tensor([[101, 11047, 10497, 7869, 2352...]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0...]]), 'attention_mask': tensor([[1, 1, 1, 1, 1...]])}
For now, only the PyTorch format is supported, but support for TensorFlow and others will be added soon.
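Since the result is a genuine ``torch.utils.data.IterableDataset``, it can be passed to a ``torch.utils.data.DataLoader``. A minimal sketch, assuming one example at a time (``batch_size=None`` disables automatic batching, which would otherwise fail on variable-length examples):

.. code-block::

    >>> from torch.utils.data import DataLoader
    >>> dataloader = DataLoader(torch_tokenized_dataset, batch_size=None)  # no collation, examples pass through one by one
    >>> for example in dataloader:
    ...     ...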
Conversations:

Comment:
Is it possible to create a classic `Dataset` from an `IterableDataset`? One application that I have in mind is picking the first N examples of a huge dataset, collecting them in a standard `Dataset`, and then doing all my exploration / preprocessing / task preparation on that dataset, e.g. something like …
lhoestq:
Sure, definitely :) I was thinking of adding something like this in a next PR. Maybe `IterableDataset.to_map_style_dataset()`? To get only the first examples, we can also add `IterableDataset.take(n)`.
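A hypothetical sketch of that use case with today's API (`Dataset.from_dict` is an existing constructor; the helper methods discussed above don't exist yet):

```python
from itertools import islice
from datasets import Dataset, load_dataset

# materialize the first N streamed examples into a regular map-style Dataset
iterable_ds = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)
first_n = list(islice(iterable_ds, 1000))
map_style_ds = Dataset.from_dict({key: [ex[key] for ex in first_n] for key in first_n[0]})
```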
Comment:
Yeah, both those features would be great!
Comment:
I like the `IterableDataset.take(n)` as well. Could we also have an `IterableDataset.sample(n)` taking a random sample?
Comment:
A random sample would be very neat as well. Here we might want to use something like reservoir sampling to deal with unbounded streams: https://en.wikipedia.org/wiki/Reservoir_sampling
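For reference, a minimal sketch of the reservoir sampling idea linked above (Algorithm R; illustrative only, not part of the library):

```python
import random

def reservoir_sample(stream, n, seed=None):
    # keep a uniform random sample of n items from a stream of unknown length, in one pass
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < n:
            reservoir.append(item)    # fill the reservoir with the first n items
        else:
            j = rng.randrange(i + 1)  # replace an existing item with probability n / (i + 1)
            if j < n:
                reservoir[j] = item
    return reservoir
```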
lhoestq:
As soon as we have `.take()`, you can do `iterable_dataset.shuffle(buffer_size=buffer_size, seed=seed).take(n)` to take random samples. This could be simplified by adding a `.sample()` method indeed.
Comment:
How about slicing support, i.e. `iterable_dataset[100:200]`, to get an iterator or `Dataset` at a particular slice?
lhoestq:
I would like to avoid allowing users to get items using `__getitem__`, since it's not a map-style dataset. So I agree it would be nice to get a slice of the data, but with a different API, maybe something like … What do you think?
This is pretty close to the tf.data.Dataset API, which is also an iterable dataset.
Comment:
Yes, very cool @lhoestq