@norabelrose
Contributor

@lhoestq Fixes #2193

  • map now uses with_format to load only the needed columns into memory when input_columns is set
  • Slicing datasets with Iterables of indices now uses a new Table.fast_gather method, implemented with np.searchsorted, to find the appropriate batch indices all at once. pa.concat_tables is no longer used for this; we just call pa.Table.from_batches with a list of all the batch slices.
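
To illustrate the np.searchsorted idea from the second bullet, here is a minimal sketch, not the actual Table.fast_gather code from this PR. It assumes the table is stored as consecutive record batches and models it only by the row count of each batch; the helper name `locate_rows` and the example sizes are hypothetical.

```python
import numpy as np


def locate_rows(batch_lengths, indices):
    """Map global row indices to (batch index, offset within batch) pairs.

    Hypothetical helper for illustration only; it is not part of `datasets`.
    """
    # offsets[i] is the global index of the first row of batch i;
    # the final entry is the total number of rows.
    offsets = np.cumsum([0] + list(batch_lengths))
    indices = np.asarray(indices)
    if indices.size and (indices.min() < 0 or indices.max() >= offsets[-1]):
        raise IndexError("row index out of range")
    # side="right" returns, for each index, the position of the first offset
    # strictly greater than it; subtracting 1 gives the containing batch.
    batch_ids = np.searchsorted(offsets, indices, side="right") - 1
    # Position of each requested row inside its batch.
    return batch_ids, indices - offsets[batch_ids]


# Three batches holding 3, 2 and 4 rows; look up five rows in one call.
batch_ids, local = locate_rows([3, 2, 4], [0, 2, 3, 8, 4])
print(batch_ids.tolist())  # [0, 0, 1, 2, 1]
print(local.tolist())      # [0, 2, 0, 3, 1]
```

All lookups are resolved by one vectorized binary search over the batch offsets rather than a Python loop per index. The resulting (batch, offset) pairs could then be used to slice each batch before the slices are handed to pa.Table.from_batches, as the bullet describes.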

Together these changes have sped up batched map() calls over subsets of columns quite considerably in my initial testing.

@norabelrose norabelrose marked this pull request as ready for review April 21, 2021 20:49
Member

@lhoestq lhoestq left a comment

Thanks a lot !
This looks all good to me :)

Could you just run make style to format the code and make the CI green ? :)

@norabelrose
Contributor Author

@lhoestq Just fixed the code style issues. I think it should be good to merge now :)

Member

@lhoestq lhoestq left a comment

Thanks ! LGTM

We may need to improve handling of the case where input_columns is used on a dataset formatted with output_all_columns=True. Maybe this is something we can take care of in another PR.

Development

Successfully merging this pull request may close these issues.

Filtering/mapping on one column is very slow

2 participants