Conversation

mariosasko (Collaborator) commented Sep 27, 2022

Use `pa.Table.to_reader` to make iteration over examples/batches faster in `Dataset.{__iter__, map}`.

TODO:

  • benchmarking (the only benchmark so far: iterating over single examples of bookcorpus (75M examples) in Colab is approx. 2.3x faster)
  • check whether iterating over bigger chunks + slicing to fetch individual examples in `_iter` yields better performance

HuggingFaceDocBuilderDev commented Sep 27, 2022

The documentation is not available anymore as the PR was closed or merged.

mariosasko (Collaborator, Author) commented

I ran some benchmarks (focused on the data-fetching part of `__iter__`), and the combination `table.to_reader(batch_size)` + `RecordBatch.slice` performs best (script with the results). I think we can choose an implicit `batch_size=10` in the final implementation to avoid problems when fetching large examples.

lhoestq (Member) left a comment


Nice, thank you! The benchmarks are super helpful as well; I think you can link to them in comments in the code.

@mariosasko mariosasko merged commit 1ea4d09 into main Sep 29, 2022
@mariosasko mariosasko deleted the iter-with-reader branch September 29, 2022 15:48
@lhoestq lhoestq mentioned this pull request Oct 14, 2022

4 participants