Revert the changes in arrow_writer.py from #6636
#6664
Changes from all commits
c98dc7d
4612879
0160c13
73affc0
23a7880
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -426,20 +426,15 @@ def write_examples_on_file(self):` |
| 426 | 426 | `         """Write stored examples from the write-pool of examples. It makes a table out of the examples and write it."""` |
| 427 | 427 | `         if not self.current_examples:` |
| 428 | 428 | `             return` |
| 429 | | `-` |
| 430 | | `-        # order the columns properly` |
| | 429 | `+        # preserve the order the columns` |
| 431 | 430 | `         if self.schema:` |
| 432 | 431 | `             schema_cols = set(self.schema.names)` |
| 433 | | `-            common_cols, extra_cols = [], []` |
| 434 | | `-            for col in self.current_examples[0][0]:` |
| 435 | | `-                if col in schema_cols:` |
| 436 | | `-                    common_cols.append(col)` |
| 437 | | `-                else:` |
| 438 | | `-                    extra_cols.append(col)` |
| | 432 | `+            examples_cols = self.current_examples[0][0].keys()  # .keys() preserves the order (unlike set)` |
| | 433 | `+            common_cols = [col for col in self.schema.names if col in examples_cols]` |
| | 434 | `+            extra_cols = [col for col in examples_cols if col not in schema_cols]` |
| 439 | 435 | `             cols = common_cols + extra_cols` |
| 440 | 436 | `         else:` |
| 441 | 437 | `             cols = list(self.current_examples[0][0])` |
| 442 | | `-` |
| 443 | 438 | `         batch_examples = {}` |
| 444 | 439 | `         for col in cols:` |
| 445 | 440 | `             # We use row[0][col] since current_examples contains (example, key) tuples.` |
| | | `@@ -549,14 +544,12 @@ def write_batch(` |
| 549 | 544 | `         try_features = self._features if self.pa_writer is None and self.update_features else None` |
| 550 | 545 | `         arrays = []` |
| 551 | 546 | `         inferred_features = Features()` |
| | 547 | `+        # preserve the order the columns` |
| 552 | 548 | `         if self.schema:` |
| 553 | 549 | `             schema_cols = set(self.schema.names)` |
| 554 | | `-            common_cols, extra_cols = [], []` |
| 555 | | `-            for col in batch_examples:` |
| 556 | | `-                if col in schema_cols:` |
| 557 | | `-                    common_cols.append(col)` |
| 558 | | `-                else:` |
| 559 | | `-                    extra_cols.append(col)` |
| | 550 | `+            batch_cols = batch_examples.keys()  # .keys() preserves the order (unlike set)` |
| | 551 | `+            common_cols = [col for col in self.schema.names if col in batch_cols]` |
| | 552 | `+            extra_cols = [col for col in batch_cols if col not in schema_cols]` |
| 560 | 553 | `             cols = common_cols + extra_cols` |
| 561 | 554 | `         else:` |
| 562 | 555 | `             cols = list(batch_examples)` |

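
The two hunks above differ mainly in which ordering wins for columns shared with the schema. The following is a minimal standalone sketch of both variants, not code from the PR itself: the function names `order_columns_loop` and `order_columns_schema_first`, the plain list `schema_names`, and the sample `example` dict are all hypothetical stand-ins for `self.schema.names` and the stored examples.

```python
def order_columns_loop(schema_names, example):
    """Variant from the removed lines: walk the example's keys,
    so shared columns keep the example's order."""
    schema_cols = set(schema_names)
    common_cols, extra_cols = [], []
    for col in example:
        if col in schema_cols:
            common_cols.append(col)
        else:
            extra_cols.append(col)
    return common_cols + extra_cols


def order_columns_schema_first(schema_names, example):
    """Variant from the added lines: shared columns follow the schema's order,
    extra columns follow the example's order."""
    schema_cols = set(schema_names)
    example_cols = example.keys()  # dict view: keeps insertion order, fast membership test
    common_cols = [col for col in schema_names if col in example_cols]
    extra_cols = [col for col in example_cols if col not in schema_cols]
    return common_cols + extra_cols


# Hypothetical inputs, chosen so the two variants disagree.
schema_names = ["id", "text", "label"]
example = {"label": 1, "extra": 0.5, "id": 7, "text": "hello"}

loop_order = order_columns_loop(schema_names, example)
schema_order = order_columns_schema_first(schema_names, example)

print(loop_order)    # ['label', 'id', 'text', 'extra']
print(schema_order)  # ['id', 'text', 'label', 'extra']
```

When the example's keys already appear in schema order, both variants return the same list; they only diverge when the example lists shared columns in a different order than the schema.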
Contributor
Author
I feel like we should really avoid this extra copy, especially if the inner iterable is large.
Collaborator
This negligible optimization ( We wouldn't use Python for this project if we wanted to optimize every aspect of the API.
Contributor
Author
I'm not sure it's negligible. #6636's OP stated:
We'd create a list of tens of thousands of strings for every batch, for every processing step (e.g., a And it's easy to remove (just Among other things, this library is about large data processing efficiency, so I think it'd be nice to consider it.
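
To make the disagreement concrete, here is a small sketch of what a key copy involves. It assumes the "extra copy" refers to a call like `list(batch_examples)`, which is inferred from where the comment is attached and is not stated in the thread. The dict contents and the figure of 50,000 columns are hypothetical, and the sketch takes no position on whether the cost matters in practice.

```python
import sys

# Hypothetical batch: 50,000 column names, each mapped to an empty column.
batch_examples = {f"col_{i}": [] for i in range(50_000)}

cols_copy = list(batch_examples)   # builds a new list with one reference per key
cols_view = batch_examples.keys()  # a view over the dict; allocates no per-key storage

# Both expose the columns in the same (insertion) order.
same_order = list(cols_view) == cols_copy

# The copy holds references to the existing key strings; the strings are not duplicated.
shares_keys = cols_copy[0] is next(iter(batch_examples))

# In CPython the list's size grows with the number of keys; the view's does not.
print(sys.getsizeof(cols_copy), sys.getsizeof(cols_view))
```

Iterating the dict or its `.keys()` view directly avoids allocating the list, at the price of not having an indexable, independent snapshot of the column names.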