**Is your feature request related to a problem? Please describe.**
Currently, `dataset["audio"]` loads every element of the dataset into RAM, which can be quite large for an audio dataset.
Having an iterator (or sequence) type of object would make inference with `transformers`' `pipeline` easier to use and much less memory-hungry.
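A minimal sketch of the behavior described above; the dataset name is only illustrative, and any dataset with an `Audio`-typed column behaves the same way:

```python
from datasets import load_dataset

# Illustrative dataset; any dataset with an Audio-typed column behaves the same way.
ds = load_dataset("PolyAI/minds14", "en-US", split="train")

# Accessing the column decodes and materializes every audio file in RAM at once:
all_audio = ds["audio"]

# Iterating example by example only keeps one decoded audio file in memory at a time:
for example in ds:
    audio = example["audio"]  # {"path": ..., "array": np.ndarray, "sampling_rate": int}
```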
**Describe the solution you'd like**
For a non-breaking change:

```python
for audio in dataset.iterate("audio"):
    # {"array": np.array(...), "sampling_rate": ...}
```

For a breaking-change solution (not necessary), changing the type of `dataset["audio"]` to a sequence type, so that
```python
pipe = pipeline(model="...")
for out in pipe(dataset["audio"]):
    # {"text": ...}
```

could work.
**Describe alternatives you've considered**
```python
def iterate(dataset, key):
    for item in dataset:
        yield item[key]  # {"array": ..., "sampling_rate": ...}

for out in pipe(iterate(dataset, "audio")):
    # {"text": ...}
```

This works, but needing the helper function feels slightly clunky.
**Additional context**
The actual context is to showcase better integration between `pipeline` and `datasets` in the Quicktour demo: https://github.com/huggingface/transformers/pull/16723/files