Make interleave_datasets more robust #3064

@sbmaruf

Description

Is your feature request related to a problem? Please describe.
Right now there are a few hiccups when using interleave_datasets. The interleaved dataset iterates until the smallest dataset completes its iterator, so larger datasets may never complete a full epoch of iteration.
This also creates problems for epoch calculation, since there is no way to track how many epochs each dataset in interleave_datasets has completed.
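To illustrate the issue, here is a minimal Python model of the behavior described above, using plain lists instead of actual dataset objects (a sketch only, not the real datasets implementation; the function name is made up):

```python
def interleave_until_smallest(datasets):
    """Round-robin over the datasets, stopping as soon as any one is exhausted."""
    iters = [iter(d) for d in datasets]
    while True:
        for it in iters:
            try:
                yield next(it)
            except StopIteration:
                # The smallest dataset ran out, so iteration ends here and
                # the remaining items of the larger datasets are never seen.
                return

print(list(interleave_until_smallest([[1, 2, 3, 4], [10, 11]])))
# [1, 10, 2, 11, 3] -- items 4 and the second epoch of [10, 11] are lost
```

The larger dataset contributes only part of its contents before iteration stops, which is exactly the "incomplete epoch" problem.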

Describe the solution you'd like
For interleave_datasets module,

  • Add a boolean argument --stop-iter to interleave_datasets that controls whether the datasets may iterate indefinitely. With --stop-iter=False, iteration should not raise a StopIteration exception.
  • Keep an internal list variable iter_cnt that tracks how many times (in steps/epochs) each dataset has been iterated at a given point.
  • Add an argument --max-iter (list type) that specifies the maximum number of times each dataset may iterate. Once one dataset reaches its --max-iter, the other datasets should continue sampling, and only when all datasets have finished their respective --max-iter should StopIteration be raised.
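A rough sketch of the proposed semantics, again using plain Python iterables rather than actual dataset objects (the function name, signature, and per-dataset epoch counts are all hypothetical, not part of the datasets API):

```python
def interleave_with_max_iter(datasets, max_iter):
    """Round-robin over datasets; each dataset restarts until it has
    completed its max_iter epochs, then is skipped; stop only when all
    datasets are finished. Assumes every dataset is non-empty."""
    iters = [iter(d) for d in datasets]
    iter_cnt = [0] * len(datasets)      # epochs completed per dataset
    active = [True] * len(datasets)     # which datasets still yield items
    while any(active):
        for i, d in enumerate(datasets):
            if not active[i]:
                continue
            try:
                yield next(iters[i])
            except StopIteration:
                iter_cnt[i] += 1
                if iter_cnt[i] >= max_iter[i]:
                    active[i] = False   # this dataset is done for good
                else:
                    iters[i] = iter(d)  # restart for the next epoch
                    yield next(iters[i])

print(list(interleave_with_max_iter([[1, 2], [10]], [1, 2])))
# [1, 10, 2, 10] -- [1, 2] runs 1 epoch, [10] runs 2 epochs
```

Here the smaller dataset keeps being resampled until every dataset has exhausted its own epoch budget, and only then does the generator stop.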

Note: I'm new to the datasets API. Maybe these features already exist in datasets.

Since multitask training is one of the latest trends, I believe this feature would make the datasets API more popular.

@lhoestq

Labels: enhancement (New feature or request)
