Oversampling strategy for iterable datasets in interleave_datasets #4893

Description

@lhoestq

In #4831 @ylacombe added an oversampling strategy to interleave_datasets. However, it currently doesn't work for datasets loaded with load_dataset(..., streaming=True), which are IterableDataset objects.

It would be nice to extend interleave_datasets to iterable datasets as well, so that they support this oversampling strategy:

>>> from datasets import interleave_datasets
>>> from datasets.iterable_dataset import IterableDataset, ExamplesIterable
>>> def gen(values):  # yield inside a comprehension is a SyntaxError on Python >= 3.8
...     yield from ((i, {"a": i}) for i in values)
>>> d1 = IterableDataset(ExamplesIterable(gen, {"values": [0, 1, 2]}))
>>> d2 = IterableDataset(ExamplesIterable(gen, {"values": [10, 11, 12, 13]}))
>>> d3 = IterableDataset(ExamplesIterable(gen, {"values": [20, 21, 22, 23, 24]}))
>>> dataset = interleave_datasets([d1, d2, d3])  # already supported
>>> [x["a"] for x in dataset]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")  # not supported yet
>>> [x["a"] for x in dataset]
[0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 0, 24]

This can be implemented by adding the strategy to both CyclingMultiSourcesExamplesIterable and RandomlyCyclingMultiSourcesExamplesIterable, which are used by _interleave_iterable_datasets in iterable_dataset.py.
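For anyone picking this up, the core of the "all_exhausted" strategy can be sketched with plain Python iterators, independent of the library internals. The function name below is hypothetical (it is not the datasets API); the idea is simply to cycle over the sources, restart any source that runs out, and stop once every source has been exhausted at least once:

```python
def interleave_all_exhausted(iterables):
    """Hypothetical sketch of an "all_exhausted" stopping strategy:
    alternate over the sources round-robin, restart a source when it
    runs out (oversampling it), and stop once every source has been
    exhausted at least once."""
    iterators = [iter(it) for it in iterables]
    exhausted = [False] * len(iterators)
    while not all(exhausted):
        for i in range(len(iterators)):
            try:
                yield next(iterators[i])
            except StopIteration:
                exhausted[i] = True
                if all(exhausted):
                    return  # every source has been seen in full: stop
                iterators[i] = iter(iterables[i])  # restart this source
                yield next(iterators[i])

values = list(interleave_all_exhausted([[0, 1, 2], [10, 11, 12, 13], [20, 21, 22, 23, 24]]))
```

The exact point at which iteration stops, and how restarts would interact with shuffling and probabilities, are design choices; in the library the logic would live in the __iter__ of the cycling iterables rather than in a free function like this.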

I would be happy to share some guidance if anyone would like to give it a shot :)


Labels

good second issue: Issues a bit more difficult than "Good First" issues
