
Add some iteration method on a dataset column (specific for inference) #4180

@Narsil

Description

Is your feature request related to a problem? Please describe.

Currently, dataset["audio"] loads EVERY element of the dataset into RAM, which can be quite large for an audio dataset.
Having an iterator (or sequence) type of object would make inference with transformers' pipeline easier to use and much less memory hungry.
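
For illustration, the difference between the two access patterns today looks roughly like this (a sketch; the dataset name is just an example):

from datasets import load_dataset

dataset = load_dataset("common_voice", "en", split="test")  # any audio dataset

# Materializes and decodes the entire column in RAM at once:
all_audio = dataset["audio"]

# Row-by-row access only decodes one example at a time:
one_audio = dataset[0]["audio"]  # {"array": np.array(...), "sampling_rate": ...}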

Describe the solution you'd like

For a non-breaking change:

for audio in dataset.iterate("audio"):
    # {"array": np.array(...), "sampling_rate":...}

For a breaking-change solution (not necessary), change the type of dataset["audio"] to a sequence type so that

pipe = pipeline(model="...")
for out in pipe(dataset["audio"]):
    # {"text":....}

could work.
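
For concreteness, the sequence-type object could behave roughly like the lazy view sketched below (purely illustrative, not an actual datasets API):

class LazyColumn:
    """Sequence-like view over a single column; rows are decoded on access."""

    def __init__(self, dataset, key):
        self.dataset = dataset
        self.key = key

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        # Only the requested row is loaded/decoded here.
        return self.dataset[i][self.key]

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]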

Describe alternatives you've considered

def iterate(dataset, key):
    for item in dataset:
        yield item[key]

pipe = pipeline(model="...")
for out in pipe(iterate(dataset, "audio")):
    # {"text": ...}

This works but requires the helper function, which feels slightly clunky.
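
A generator expression would avoid the named helper, though it reads about the same (again just a sketch, reusing the pipe from above):

for out in pipe(item["audio"] for item in dataset):
    # {"text": ...}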

Additional context

The context is actually to showcase better integration between pipeline and datasets in the Quicktour demo: https://github.com/huggingface/transformers/pull/16723/files

@lhoestq
