Make interleave_datasets more robust #3064

@sbmaruf

Description

Is your feature request related to a problem? Please describe.
Right now there are a few hiccups when using interleave_datasets. The interleaved dataset iterates until the smallest dataset completes its iterator, so larger datasets may never complete a full epoch of iteration.
This also creates problems for epoch calculation, since there is no way to track how many epochs each dataset in interleave_datasets has completed.
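To illustrate the issue, here is a minimal Python model of the behavior described above, using plain lists instead of actual dataset objects (a sketch only, not the real datasets implementation; the function name is made up):

```python
def interleave_until_smallest(datasets):
    """Round-robin over the datasets, stopping as soon as any one is exhausted."""
    iters = [iter(d) for d in datasets]
    while True:
        for it in iters:
            try:
                yield next(it)
            except StopIteration:
                # The smallest dataset ran out, so iteration ends here and
                # the remaining items of the larger datasets are never seen.
                return

print(list(interleave_until_smallest([[1, 2, 3, 4], [10, 11]])))
# [1, 10, 2, 11, 3] -- items 4 and the second epoch of [10, 11] are lost
```

The larger dataset contributes only part of its contents before iteration stops, which is exactly the "incomplete epoch" problem.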

Describe the solution you'd like
For interleave_datasets module,

  • Add a boolean argument --stop-iter to interleave_datasets that controls whether the datasets may iterate indefinitely. With --stop-iter=False, iteration should not raise a StopIteration exception.
  • Keep an internal list variable iter_cnt that tracks how many times (in steps/epochs) each dataset has been iterated at a given point.
  • Add an argument --max-iter (list type) that specifies the maximum number of times each dataset may iterate. Once one dataset reaches its --max-iter, the other datasets should continue sampling, and only when all datasets have finished their respective --max-iter should StopIteration be raised.
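A rough sketch of the proposed semantics, again using plain Python iterables rather than actual dataset objects (the function name, signature, and per-dataset epoch counts are all hypothetical, not part of the datasets API):

```python
def interleave_with_max_iter(datasets, max_iter):
    """Round-robin over datasets; each dataset restarts until it has
    completed its max_iter epochs, then is skipped; stop only when all
    datasets are finished. Assumes every dataset is non-empty."""
    iters = [iter(d) for d in datasets]
    iter_cnt = [0] * len(datasets)      # epochs completed per dataset
    active = [True] * len(datasets)     # which datasets still yield items
    while any(active):
        for i, d in enumerate(datasets):
            if not active[i]:
                continue
            try:
                yield next(iters[i])
            except StopIteration:
                iter_cnt[i] += 1
                if iter_cnt[i] >= max_iter[i]:
                    active[i] = False   # this dataset is done for good
                else:
                    iters[i] = iter(d)  # restart for the next epoch
                    yield next(iters[i])

print(list(interleave_with_max_iter([[1, 2], [10]], [1, 2])))
# [1, 10, 2, 10] -- [1, 2] runs 1 epoch, [10] runs 2 epochs
```

Here the smaller dataset keeps being resampled until every dataset has exhausted its own epoch budget, and only then does the generator stop.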

Note: I'm new to the datasets API. Maybe these features already exist in datasets.

Since multitask training is one of the latest trends, I believe this feature would make the datasets API more popular.

@lhoestq

Labels: enhancement (New feature or request)
