Faster column validation and reordering #6636
Conversation
mariosasko left a comment:
Thanks for working on this!
Besides `Dataset.set_format`, we also use this inefficient check in the following methods:

- `Dataset.select_columns`
- `Dataset.remove_columns`
- `Dataset.rename_columns`
- `Dataset.map` (the `input_columns` and `remove_columns` checks)
- `IterableDataset.select_columns`
So, let's also update them in this PR to be consistent across the codebase.
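A minimal sketch (not the actual `datasets` code; the helper names are made up for illustration) of the kind of change requested here — replacing the per-column linear membership scan with a single set difference:

```python
# Hypothetical helpers illustrating the validation change discussed above.

def check_columns_slow(columns, column_names):
    # O(len(columns) * len(column_names)): each `in` test scans the list
    return [col for col in columns if col not in column_names]

def check_columns_fast(columns, column_names):
    # O(len(columns) + len(column_names)): build sets once, diff once
    missing = set(columns) - set(column_names)
    # preserve the caller's ordering when reporting missing columns
    return [col for col in columns if col in missing]

cols = ["a", "b", "z"]
names = ["a", "b", "c"]
assert check_columns_slow(cols, names) == check_columns_fast(cols, names) == ["z"]
```

Both versions report the same missing columns; only the asymptotic cost differs, which matters once tables have tens of thousands of columns.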
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Co-authored-by: Mario Šaško <[email protected]>
…th94/datasets into faster-column-validation-set-format
Thanks @mariosasko, I made the changes. However, I did some tests with […]

Edit: Ah, just realized you can avoid the issue with inferring features altogether when you set the format to arrow (or pandas).
mariosasko left a comment:
Indeed, let's also improve this column reordering logic to avoid quadratic time complexity.
Two comments to fix the CI failure.
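An illustrative sketch (not the actual `datasets` implementation) of how the quadratic reordering can be avoided: build a name-to-index map once instead of calling `list.index()` for every target column.

```python
# Hypothetical helpers contrasting the two reordering strategies.

def reorder_indices_slow(current_names, target_names):
    # O(n^2) overall: list.index scans current_names for each target column
    return [current_names.index(name) for name in target_names]

def reorder_indices_fast(current_names, target_names):
    # O(n) overall: one pass to build the lookup, one pass to resolve targets
    index_of = {name: i for i, name in enumerate(current_names)}
    return [index_of[name] for name in target_names]

current = ["c", "a", "b"]
target = ["a", "b", "c"]
assert reorder_indices_slow(current, target) == reorder_indices_fast(current, target) == [1, 2, 0]
```

The resulting index list can then be used to select columns from the underlying Arrow table in the desired order in a single pass.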
mariosasko left a comment:
I took the liberty of applying the code suggestions so that we can include this PR in the next release.
Benchmarks (PyArrow==8.0.0): benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, benchmark_map_filter.json
Undo the changes in `arrow_writer.py` from #6636 (see #6663).

* Add test
* Apply suggestions from code review
* Nits

Co-authored-by: mariosasko <[email protected]>

I work with bioinformatics data, and these tables often have thousands or even tens of thousands of features. They are also accompanied by metadata that I do not want to pass to the model. When I call `set_format('pt', columns=large_column_list)`, it can take several minutes to finish. The culprit is the following check: `any(col not in self._data.column_names for col in columns)`. Replacing it with `set(columns) - set(self._data.column_names)` is more efficient.
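A minimal, self-contained reproduction of the check described above, with the set-based replacement (note that both operands of `-` must be sets). The column counts here are made up for illustration and are far smaller than the real tables that triggered the slowdown:

```python
# Simulated column names; real bioinformatics tables can have 10x more.
column_names = [f"feature_{i}" for i in range(2_000)]
large_column_list = column_names[:1_500]

# Original check: every `in` scans the whole list, so the total cost is
# O(len(large_column_list) * len(column_names)).
has_missing_slow = any(col not in column_names for col in large_column_list)

# Set-based check: near-linear even for tens of thousands of columns.
missing = set(large_column_list) - set(column_names)
has_missing_fast = bool(missing)

assert not has_missing_slow and not has_missing_fast
```

At the scale in the report (tens of thousands of columns), the quadratic version is what turns a format change into a multi-minute operation, while the set difference finishes almost instantly.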