Oversampling strategy for iterable datasets in interleave_datasets #4893

Description

@lhoestq

In #4831 @ylacombe added an oversampling strategy to interleave_datasets. However, it currently doesn't work for datasets loaded with load_dataset(..., streaming=True), which are IterableDataset objects.

It would be nice to extend interleave_datasets to iterable datasets as well, so that they support this oversampling strategy:

>>> from datasets import interleave_datasets
>>> from datasets.iterable_dataset import IterableDataset, ExamplesIterable
>>> def gen(values):  # yield inside a comprehension is a SyntaxError on Python >= 3.8
...     yield from ((i, {"a": i}) for i in values)
>>> d1 = IterableDataset(ExamplesIterable(gen, {"values": [0, 1, 2]}))
>>> d2 = IterableDataset(ExamplesIterable(gen, {"values": [10, 11, 12, 13]}))
>>> d3 = IterableDataset(ExamplesIterable(gen, {"values": [20, 21, 22, 23, 24]}))
>>> dataset = interleave_datasets([d1, d2, d3])  # already supported
>>> [x["a"] for x in dataset]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")  # not supported yet
>>> [x["a"] for x in dataset]
[0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 0, 24]

This can be implemented by adding the strategy to both CyclingMultiSourcesExamplesIterable and RandomlyCyclingMultiSourcesExamplesIterable, which are used by _interleave_iterable_datasets in iterable_dataset.py.
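For anyone picking this up, the core of the "all_exhausted" strategy can be sketched with plain Python iterators, independent of the library internals. The function name below is hypothetical (it is not the datasets API); the idea is simply to cycle over the sources, restart any source that runs out, and stop once every source has been exhausted at least once:

```python
def interleave_all_exhausted(iterables):
    """Hypothetical sketch of an "all_exhausted" stopping strategy:
    alternate over the sources round-robin, restart a source when it
    runs out (oversampling it), and stop once every source has been
    exhausted at least once."""
    iterators = [iter(it) for it in iterables]
    exhausted = [False] * len(iterators)
    while not all(exhausted):
        for i in range(len(iterators)):
            try:
                yield next(iterators[i])
            except StopIteration:
                exhausted[i] = True
                if all(exhausted):
                    return  # every source has been seen in full: stop
                iterators[i] = iter(iterables[i])  # restart this source
                yield next(iterators[i])

values = list(interleave_all_exhausted([[0, 1, 2], [10, 11, 12, 13], [20, 21, 22, 23, 24]]))
```

The exact point at which iteration stops, and how restarts would interact with shuffling and probabilities, are design choices; in the library the logic would live in the __iter__ of the cycling iterables rather than in a free function like this.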

I would be happy to share some guidance if anyone would like to give it a shot :)


Labels

good second issue: Issues a bit more difficult than "Good First" issues
