Revert the changes in arrow_writer.py from #6636 (#6664)
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
mariosasko
left a comment
Hi! We can't revert this as the "reverted" implementation has quadratic time complexity. Instead, let's fix it:
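To illustrate the concern above (a minimal sketch, not the actual arrow_writer.py code): testing each column name against a *list* of schema names is an O(N) scan per lookup, so resolving N columns becomes O(N²), while hashing the names into a set first makes the whole pass linear.

```python
# Hypothetical sketch of the quadratic pattern (not the library's code):
# `col in schema_names` against a list scans the list on every check.
schema_names = [f"col{i}" for i in range(2_000)]
batch_cols = list(reversed(schema_names))

# O(N^2): list membership inside the comprehension
cols_slow = [col for col in batch_cols if col in schema_names]

# O(N): build a set once, then each membership check is O(1) on average
schema_set = set(schema_names)
cols_fast = [col for col in batch_cols if col in schema_set]

assert cols_fast == cols_slow
```

With tens of thousands of features (as in the bioinformatics use case mentioned below), the difference between the two variants is substantial per batch.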
I agree, but it's the implementation we have had so far. Why don't we:
The fix is straightforward, so one patch release (after this PR is merged) is enough. Btw, let's also add a test to
Would you mind adding such a test, as you're more familiar with the codebase?
```python
    extra_cols = [col for col in batch_cols if col not in schema_cols]
    cols = common_cols + extra_cols
else:
    cols = list(batch_examples)
```
I feel like we should really avoid this extra copy, especially if the inner iterable is large.
This negligible optimization (`write_examples_on_file` and `write_batch` are called every `writer_batch_size`) goes against good code practices.
We wouldn't use Python for this project if we wanted to optimize every aspect of the API.
I'm not sure it's negligible. #6636's OP stated:
I work with bioinformatics data and often these tables have thousands and even tens of thousands of features.
We'd create a list of tens of thousands of strings for every batch, for every processing step (e.g., a map).
And it's easy to remove (just `cols = batch_examples` instead of copying it into a list).
Among other things, this library is about large data processing efficiency, so I think it'd be nice to consider it.
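A small sketch of this point (the variable names are hypothetical): a dict's keys view preserves insertion order and can be iterated directly, so the explicit `list(...)` copy is only needed if the result must be an actual list.

```python
# Hypothetical batch with several feature columns
batch_examples = {f"feature_{i}": [0.0] for i in range(5)}

cols_copy = list(batch_examples)   # allocates a new list holding every key
cols_view = batch_examples.keys()  # a live view: no copy, same iteration order

assert list(cols_view) == cols_copy == [f"feature_{i}" for i in range(5)]
```

For a mapping with tens of thousands of keys, skipping the copy avoids allocating a fresh list of that size on every batch.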
mariosasko
left a comment
Thanks!
(The CI failure is not related to the changes)
Benchmark bot (PyArrow==8.0.0), updated benchmarks: benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, benchmark_map_filter.json
#6636 broke `write_examples_on_file` and `write_batch` from the `ArrowWriter` class. I'm undoing these changes. See #6663.

Note the current implementation doesn't keep the order of the columns and the schema, thus setting a wrong schema for each column.