Closed
Labels
good second issue: Issues a bit more difficult than "Good First" issues
Description
In #4831 @ylacombe added an oversampling strategy for interleave_datasets. However, right now it doesn't work for datasets loaded using load_dataset(..., streaming=True), which are IterableDataset objects.
It would be nice to extend interleave_datasets so that iterable datasets support this oversampling strategy as well:
>>> from datasets import interleave_datasets
>>> from datasets.iterable_dataset import IterableDataset, ExamplesIterable
>>> # each function returns a generator of (key, example) pairs
>>> d1 = IterableDataset(ExamplesIterable(lambda: ((i, {"a": i}) for i in [0, 1, 2]), {}))
>>> d2 = IterableDataset(ExamplesIterable(lambda: ((i, {"a": i}) for i in [10, 11, 12, 13]), {}))
>>> d3 = IterableDataset(ExamplesIterable(lambda: ((i, {"a": i}) for i in [20, 21, 22, 23, 24]), {}))
>>> dataset = interleave_datasets([d1, d2, d3]) # is supported
>>> [x["a"] for x in dataset]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted") # is not supported yet
>>> [x["a"] for x in dataset]
[0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 10, 24]
This can be implemented by adding the strategy to both CyclingMultiSourcesExamplesIterable and RandomlyCyclingMultiSourcesExamplesIterable, which are used by _interleave_iterable_datasets in iterable_dataset.py.
I would be happy to share some guidance if anyone would like to give it a shot :)
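To make the expected behavior concrete, here is a minimal sketch of the cycling logic over plain Python iterables. It is not the code in iterable_dataset.py: the function name interleave_cycling and its factory-based signature are made up for illustration, and it assumes every source is non-empty and can be re-created by calling its factory again. It also finishes the current round before stopping, which is a choice made here so that the outputs match the ones listed above.

```python
from typing import Any, Callable, Iterable, Iterator, List

_END = object()  # sentinel marking "no more items" in the lookahead buffer


def interleave_cycling(
    factories: List[Callable[[], Iterable[Any]]],
    stopping_strategy: str = "first_exhausted",
) -> Iterator[Any]:
    """Hypothetical helper: round-robin over sources with a stopping strategy.

    Each factory returns a fresh iterable, so an exhausted source can be
    restarted (oversampled) when the strategy is "all_exhausted".
    """
    if stopping_strategy == "first_exhausted":
        should_stop = any
    elif stopping_strategy == "all_exhausted":
        should_stop = all
    else:
        raise ValueError(f"unknown stopping_strategy: {stopping_strategy!r}")

    iterators = [iter(make()) for make in factories]
    # One-item lookahead per source, so exhaustion is noticed as soon as
    # the last item of a source has been yielded.
    pending = [next(it, _END) for it in iterators]
    if not pending or any(item is _END for item in pending):
        return  # an empty source could never be restarted meaningfully

    exhausted = [False] * len(factories)
    while True:
        for i, make in enumerate(factories):
            yield pending[i]
            pending[i] = next(iterators[i], _END)
            if pending[i] is _END:
                exhausted[i] = True
                # Restart the source so later rounds can keep drawing from it.
                iterators[i] = iter(make())
                pending[i] = next(iterators[i])
        # The strategy is checked once per completed round.
        if should_stop(exhausted):
            return


d1 = lambda: [0, 1, 2]
d2 = lambda: [10, 11, 12, 13]
d3 = lambda: [20, 21, 22, 23, 24]
first = list(interleave_cycling([d1, d2, d3]))
oversampled = list(interleave_cycling([d1, d2, d3], stopping_strategy="all_exhausted"))
```

With "first_exhausted" the loop ends after the round in which the shortest source runs out; with "all_exhausted" shorter sources are restarted until the longest one has been fully consumed. A randomly cycling variant would presumably keep the same exhausted bookkeeping and only change how the next source index is picked.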