Skip to content

Conversation

@thinkharderdev
Copy link
Collaborator

Part of topk optimizations in DQE:

  1. Modify the parquet reader to allow us to project a row_id column which will contain the absolute row offset of each returned row in the underlying file. This is what allows us to build a row selection once we have a topk
  2. Add a prefetch capability which allows us to prefetch columns when evaluating row filters. The idea being that it could be better to prefetch the columns we need to decode (if they are small) rather than wait for the row selection returned from our row filters. Did not end up using this but I think it may be useful if used more intelligently in DQE so leaving it in

@thinkharderdev thinkharderdev requested a review from a team May 5, 2025 16:45
Copy link

@nathanielc nathanielc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I read through the tests and they are great. I don't have enough context on the implementation to give much feedback but it was pretty straightforward.

batch_size,
array_reader,
apply_range(selection, reader.num_rows(), self.offset, self.limit),
// TODO what do we do here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you figure out if None was fine here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine for us because we only use the sync reader but a challenge for upstreaming.

@thinkharderdev thinkharderdev merged commit 35b8115 into v53 May 7, 2025
9 of 18 checks passed
@thinkharderdev thinkharderdev deleted the rowid branch May 7, 2025 16:36
avantgardnerio pushed a commit that referenced this pull request Dec 17, 2025
avantgardnerio pushed a commit that referenced this pull request Dec 17, 2025
avantgardnerio pushed a commit that referenced this pull request Dec 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants