Faster map w/ input_columns & faster slicing w/ Iterable keys #2246

norabelrose · 2021-04-21T19:49:07Z

map now uses with_format to load only the needed columns into memory when input_columns is set.
Slicing datasets with Iterables of indices now uses a new Table.fast_gather method, implemented with np.searchsorted, to find the appropriate batch indices all at once. pa.concat_tables is no longer used for this; we just call pa.Table.from_batches with a list of all the batch slices.
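
To illustrate the lookup described above, here is a minimal sketch of how np.searchsorted can map many global row indices to batch positions in one call. It is not the PR's Table.fast_gather implementation: the function name locate_rows and the negative-index handling are assumptions made for this example, and only NumPy is used.

```python
import numpy as np


def locate_rows(batch_lengths, indices):
    """Map global row indices to (batch index, offset within batch) arrays.

    Hypothetical helper for illustration; not the code from this PR.
    """
    # offsets[k] is the global index of the first row of batch k;
    # offsets[-1] is the total number of rows.
    offsets = np.concatenate(([0], np.cumsum(batch_lengths)))
    num_rows = offsets[-1]

    indices = np.asarray(indices, dtype=np.int64)
    # Assumption: negative indices count from the end, as in Python sequences.
    indices = np.where(indices < 0, indices + num_rows, indices)
    if np.any((indices < 0) | (indices >= num_rows)):
        raise IndexError("row index out of range")

    # side="right" assigns an index equal to a batch's first row to that
    # batch rather than to the previous one.
    batch_ids = np.searchsorted(offsets, indices, side="right") - 1
    return batch_ids, indices - offsets[batch_ids]


batch_ids, local = locate_rows([3, 2, 4], [0, 4, 8, -1, 3])
```

Each (batch index, offset) pair identifies a one-row slice of a record batch; a list of such slices is the kind of input pa.Table.from_batches accepts.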

Together these changes have sped up batched map() calls over subsets of columns quite considerably in my initial testing.

lhoestq

Thanks a lot!
This all looks good to me :)

Could you just run make style to format the code and make the CI green? :)

norabelrose · 2021-04-23T17:56:35Z

@lhoestq Just fixed the code style issues. I think it should be good to merge now :)

lhoestq

Thanks! LGTM

We may need to improve the case when input_columns is used on a dataset formatted with output_all_columns=True. Maybe this is something we can take care of in another PR.

Faster map w/ input_columns & faster slicing w/ Iterable keys

4bfc5ea

norabelrose mentioned this pull request Apr 21, 2021

Filtering/mapping on one column is very slow #2193

Closed

norabelrose marked this pull request as draft April 21, 2021 20:10

norabelrose added 2 commits April 21, 2021 13:19

Attempt at a fix for _query_table for negative indices

6772073

Silly "The truth value of an array with more than one element is ambiguous" fix

b5f246d

norabelrose marked this pull request as ready for review April 21, 2021 20:49

lhoestq reviewed Apr 23, 2021

Code style fixes

b0511c6

lhoestq approved these changes Apr 26, 2021

lhoestq merged commit 5adf06a into huggingface:master Apr 26, 2021

This was referenced Apr 28, 2021

Bug in Dataset.class_encode_column #2272

Closed

Fix iterable interface expected by numpy #2270

Closed

mariosasko mentioned this pull request Sep 12, 2022

Preserve non-input_colums in Dataset.map if input_columns are specified #4971

Merged

albertvillanova mentioned this pull request Sep 22, 2022

Re-apply input columns change #5008

Merged
