Add row_id and prefetch to parquet reader
#65
…pache#6041)" This reverts commit 741bbf6. # Conflicts: # arrow-flight/Cargo.toml # arrow-flight/gen/Cargo.toml # arrow-flight/src/arrow.flight.protocol.rs # arrow-integration-testing/Cargo.toml
This reverts commit 2983dc1.
This reverts commit 244d8bd.
…pache#6041)" This reverts commit 7750691.
nathanielc
left a comment
LGTM. I read through the tests and they are great. I don't have enough context on the implementation to give much feedback but it was pretty straightforward.
    batch_size,
    array_reader,
    apply_range(selection, reader.num_rows(), self.offset, self.limit),
    // TODO what do we do here?
Did you figure out if None was fine here?
It is fine for us because we only use the sync reader, but it is a challenge for upstreaming.
Part of the topk optimizations in DQE:
- A `row_id` column, which contains the absolute row offset of each returned row in the underlying file. This is what allows us to build a row selection once we have a topk.
- A `prefetch` capability, which allows us to prefetch columns when evaluating row filters. The idea is that it could be better to prefetch the columns we need to decode (if they are small) rather than wait for the row selection returned from our row filters. I did not end up using this, but I think it may be useful if used more intelligently in DQE, so I am leaving it in.
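
To illustrate how absolute row offsets can be turned into a row selection, here is a minimal sketch. It is not the code in this PR: the `Selector` enum and `selection_from_row_ids` function are hypothetical names made up for this example, loosely modeled on the skip/select runs of the parquet crate's `RowSelection`. It assumes the topk step yields a set of `row_id` values and that the total row count of the file is known.

```rust
/// Hypothetical stand-in for a run-length row selector (illustration only,
/// not the parquet crate's type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Selector {
    /// Skip this many rows without decoding them.
    Skip(usize),
    /// Decode this many rows.
    Select(usize),
}

/// Build run-length selectors from absolute row offsets (`row_id` values).
///
/// Assumptions: ids may arrive unsorted or duplicated (topk output is ordered
/// by the sort key, not by file position); ids >= `total_rows` are ignored.
pub fn selection_from_row_ids(row_ids: &[u64], total_rows: u64) -> Vec<Selector> {
    let mut ids: Vec<u64> = row_ids
        .iter()
        .copied()
        .filter(|&id| id < total_rows)
        .collect();
    ids.sort_unstable();
    ids.dedup();

    let mut out = Vec::new();
    let mut next = 0u64; // first row not yet covered by any selector
    let mut i = 0;
    while i < ids.len() {
        let start = ids[i];
        let mut end = start + 1;
        i += 1;
        // Extend the run while the ids stay consecutive.
        while i < ids.len() && ids[i] == end {
            end += 1;
            i += 1;
        }
        if start > next {
            out.push(Selector::Skip((start - next) as usize));
        }
        out.push(Selector::Select((end - start) as usize));
        next = end;
    }
    // Cover any rows after the last selected run.
    if next < total_rows {
        out.push(Selector::Skip((total_rows - next) as usize));
    }
    out
}
```

For example, row ids `[7, 2, 3, 9]` in a 10-row file become skip 2, select 2, skip 3, select 1, skip 1, select 1, so a second pass only decodes the four rows that made it into the topk.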