Add `row_id` and `prefetch` to parquet reader #65

thinkharderdev · 2025-05-05T16:45:03Z

Part of topk optimizations in DQE:

- Modify the parquet reader so that we can project a `row_id` column containing the absolute row offset of each returned row in the underlying file. This is what allows us to build a row selection once we have a topk result.
- Add a prefetch capability that lets us prefetch columns while evaluating row filters. The idea is that it can be better to prefetch the columns we need to decode (if they are small) rather than wait for the row selection returned by the row filters. We did not end up using this, but I think it may be useful if used more intelligently in DQE, so I am leaving it in.
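To illustrate the first point, here is a minimal sketch of how a set of absolute row offsets (such as the `row_id` values of the topk rows) could be turned into skip/select runs for a second read. This is not the code from this PR: `Selector` and `selectors_from_row_ids` are hypothetical names, and `Selector` is a local stand-in loosely modeled on the parquet crate's `RowSelector`, so the example only needs the standard library.

```rust
/// Hypothetical stand-in for a run-length row selector:
/// either skip or select `row_count` consecutive rows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Selector {
    pub row_count: u64,
    pub skip: bool,
}

/// Turn absolute row offsets (for example the `row_id` values of the
/// topk rows) into alternating skip/select runs covering `total_rows`.
/// Offsets at or beyond `total_rows` are ignored (an assumption of this sketch).
pub fn selectors_from_row_ids(row_ids: &[u64], total_rows: u64) -> Vec<Selector> {
    // Sort and de-duplicate so the runs can be built in a single pass.
    let mut ids: Vec<u64> = row_ids
        .iter()
        .copied()
        .filter(|&id| id < total_rows)
        .collect();
    ids.sort_unstable();
    ids.dedup();

    let mut out: Vec<Selector> = Vec::new();
    let mut cursor = 0u64; // first row not yet covered by any run

    for id in ids {
        // Skip the gap between the last covered row and this row, if any.
        if id > cursor {
            out.push(Selector { row_count: id - cursor, skip: true });
        }
        // Extend the previous select run when this row is adjacent to it.
        let extend = matches!(out.last(), Some(last) if !last.skip);
        if extend {
            if let Some(last) = out.last_mut() {
                last.row_count += 1;
            }
        } else {
            out.push(Selector { row_count: 1, skip: false });
        }
        cursor = id + 1;
    }

    // Skip whatever remains after the last selected row.
    if cursor < total_rows {
        out.push(Selector { row_count: total_rows - cursor, skip: true });
    }
    out
}
```

Calling `selectors_from_row_ids(&[9, 3, 4], 12)` yields five runs: skip 3, select 2, skip 4, select 1, skip 2. With the real reader, runs of this shape would presumably be converted into the parquet crate's own selection type before the second read.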

nathanielc

LGTM. I read through the tests and they are great. I don't have enough context on the implementation to give much feedback but it was pretty straightforward.

nathanielc · 2025-05-06T15:31:19Z

parquet/src/arrow/arrow_reader/mod.rs

            batch_size,
            array_reader,
            apply_range(selection, reader.num_rows(), self.offset, self.limit),
+            // TODO what do we do here?


Did you figure out if None was fine here?

thinkharderdev · May 7, 2025

It is fine for us because we only use the sync reader, but it is a challenge for upstreaming.

alamb and others added 13 commits May 1, 2025 09:41

Revert "bump tonic to 0.12 and prost to 0.13 for arrow-flight (a…

7750691

…pache#6041)" This reverts commit 741bbf6. # Conflicts: # arrow-flight/Cargo.toml # arrow-flight/gen/Cargo.toml # arrow-flight/src/arrow.flight.protocol.rs # arrow-integration-testing/Cargo.toml

Revert "fix: enable TLS roots for flight CLI client (apache#6640)"

244d8bd

This reverts commit 2983dc1.

Add rowid to parquet reader

21709a7

make sure stream has correct schema

a9a868b

handle specifying row groups

21c10c9

remove println

08b9063

add prefetching for row filter fetch

5a4b503

remove println

44c2b6f

fix bug in prefetch

4d536bd

fix properly

26eb835

remove println

35be6fa

chrono dep

484ebd8

use row_id intead of rowid

91e283d

thinkharderdev requested a review from a team May 5, 2025 16:45

github-actions bot added arrow-flight labels May 5, 2025

thinkharderdev added 2 commits May 5, 2025 12:50

Reapply "fix: enable TLS roots for flight CLI client (apache#6640)"

9dd419b

This reverts commit 244d8bd.

Reapply "bump tonic to 0.12 and prost to 0.13 for arrow-flight (a…

fc0409e

…pache#6041)" This reverts commit 7750691.

github-actions bot removed arrow-flight labels May 5, 2025

nathanielc approved these changes May 6, 2025

thinkharderdev merged commit 35b8115 into v53 May 7, 2025
9 of 18 checks passed

thinkharderdev deleted the rowid branch May 7, 2025 16:36

avantgardnerio pushed a commit that referenced this pull request Dec 17, 2025

Add row_id and prefetch to parquet reader (#65)

3eea6e1

avantgardnerio pushed a commit that referenced this pull request Dec 17, 2025

Add row_id and prefetch to parquet reader (#65)

ffad6f6

avantgardnerio pushed a commit that referenced this pull request Dec 23, 2025

Add row_id and prefetch to parquet reader (#65)

02dff9d
