🚧 Implement an experimental Parquet reader optimized for highly-selective hybrid scan reads #18011
Closed
mhaseeb123 wants to merge 109 commits into rapidsai:branch-25.06
mhaseeb123 (Member, Author) commented Feb 19, 2025
@@ -0,0 +1,224 @@
Copy pasted from reader_impl_chunking.cu for now. No need to review.
mhaseeb123 (Member, Author) commented Feb 19, 2025
@@ -0,0 +1,82 @@
Copy pasted from reader_impl_preprocess.cu. No need to review.
rapids-bot bot pushed a commit that referenced this pull request Apr 30, 2025
… metadata APIs (#18480)
Contributes to #17896. Part of #18011.
This PR adds the high-level interface (APIs) to a new experimental Parquet reader optimized for highly selective (hybrid scan) queries. The PR also adds implementations for the basic metadata-related APIs of the new reader, such as reading the file footer and PageIndex.
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- David Wendt (https://github.com/davidwendt)
- Vyas Ramasubramani (https://github.com/vyasr)
- Vukasin Milovanovic (https://github.com/vuule)
- https://github.com/nvdbaranec
URL: #18480
mhaseeb123 added a commit to mhaseeb123/cudf that referenced this pull request Apr 30, 2025
rapids-bot bot pushed a commit that referenced this pull request May 6, 2025
)
Contributes to #17896. Part of #18011.
This PR implements row group pruning with stats in the experimental Parquet reader optimized for hybrid scan queries.
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Bradley Dice (https://github.com/bdice)
Approvers:
- David Wendt (https://github.com/davidwendt)
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)
URL: #18543
rapids-bot bot pushed a commit that referenced this pull request May 15, 2025
…der (#18545)
Contributes to #17896. Part of #18011.
This PR implements row group pruning with bloom filters in the experimental Parquet reader optimized for hybrid scan queries. Dictionary-based row group pruning is still WIP in a separate branch, so this PR has empty definitions where needed.
Note: Unfortunately, we can't add any tests for this feature because we don't yet have the capability to write Parquet files with bloom filters. However, the code that filters row groups with bloom filters is identical to already tested code at: https://github.com/rapidsai/cudf/blob/branch-25.06/cpp/src/io/parquet/predicate_pushdown.cpp#L198-L240
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Nghia Truong (https://github.com/ttnghia)
- Bradley Dice (https://github.com/bdice)
URL: #18545
This was referenced May 19, 2025
rapids-bot bot pushed a commit that referenced this pull request Jun 30, 2025
Contributes to #17896. Part of #18011. Implements feature request in #9269.
This PR implements discarding of Parquet data pages using the page-level (min/max) statistics contained in the page index section of a Parquet file, in the experimental Parquet reader for optimizing hybrid scan queries.
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Kyle Edwards (https://github.com/KyleFromNVIDIA)
- Shruti Shivakumar (https://github.com/shrshi)
URL: #18873
rapids-bot bot pushed a commit that referenced this pull request Jul 7, 2025
…er (#18836)
Contributes to #17896. Part of #18011. Closes #18046.
This PR implements row group pruning using dictionary pages of Parquet column chunks in the experimental Parquet reader for optimizing hybrid scan queries.
## Tasklist
- [x] Code cleanup and add comments
- [x] Add tests with more complex types and predicates
- [x] Add special handling for FLBAs and INT96 type if needed
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- Paul Mattione (https://github.com/pmattione-nvidia)
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)
- https://github.com/nvdbaranec
- Vyas Ramasubramani (https://github.com/vyasr)
URL: #18836
This was referenced Jul 8, 2025
Member, Author
Closing as this is completed by #19308
rapids-bot bot pushed a commit that referenced this pull request Jul 24, 2025
Contributes to #17896. Completes #18011.
This PR implements table materialization functions in the experimental Parquet reader. The experimental reader now derives from the base Parquet reader and only overloads the necessary functions, reusing the base functions wherever possible. The functions reimplemented by the experimental reader are mostly identical to their base-reader counterparts, with the differences noted in the comments.
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)
URL: #19308
Description
🚧 Closes #17896
This PR implements an experimental Parquet reader optimized for highly-selective hybrid scan reads. The new experimental reader provides APIs to prune row groups and data pages based on the AST filter expression.
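
To make the pruning idea concrete, here is a minimal conceptual sketch in plain Python of row group pruning with min/max statistics for a range filter. It is not the cudf API: `RowGroupStats` and `prune_row_groups` are hypothetical names invented for illustration, and the real reader works on Parquet metadata and AST expressions rather than Python objects.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group min/max statistics for a single column."""
    min_value: int
    max_value: int


def prune_row_groups(stats, lower, upper):
    """Return indices of row groups that may hold values in [lower, upper].

    A row group is dropped only when its [min, max] range cannot overlap
    the filter range, so pruning never discards a matching row.
    """
    return [
        i
        for i, s in enumerate(stats)
        if s.max_value >= lower and s.min_value <= upper
    ]


# Three row groups covering disjoint value ranges (illustrative data only).
stats = [RowGroupStats(0, 9), RowGroupStats(10, 19), RowGroupStats(20, 29)]
surviving = prune_row_groups(stats, 12, 15)
```

In this sketch only the second row group survives a filter on the range 12 to 15, so the other two never need to be read. Page-level pruning follows the same overlap test, applied to per-page statistics instead of per-row-group ones.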
Once pruning is complete, the Parquet data itself is materialized into the table in two passes. The first pass materializes only the filter columns (columns that appear in the filter expression, also called predicate columns), and the second pass materializes only the (optionally selected) payload columns (columns that don't appear in the filter expression).
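
The two-pass flow described above can be sketched conceptually as follows. This is plain Python over in-memory lists, not the reader's actual interface: the function names, the dictionary-of-lists table, and the row mask passed between the passes are all assumptions made for illustration.

```python
def materialize_filter_columns(table, filter_names, predicate):
    """Pass 1: read only the filter columns and build a row mask."""
    filter_cols = {name: table[name] for name in filter_names}
    num_rows = len(next(iter(filter_cols.values())))
    # Evaluate the predicate row by row using only the filter columns.
    mask = [
        predicate({name: col[i] for name, col in filter_cols.items()})
        for i in range(num_rows)
    ]
    filtered = {
        name: [v for v, keep in zip(col, mask) if keep]
        for name, col in filter_cols.items()
    }
    return filtered, mask


def materialize_payload_columns(table, payload_names, mask):
    """Pass 2: read only the payload columns, keeping rows the mask selects."""
    return {
        name: [v for v, keep in zip(table[name], mask) if keep]
        for name in payload_names
    }


# Illustrative source data: "price" is the filter column, "name" the payload.
source = {
    "id": [1, 2, 3, 4],
    "price": [5, 50, 7, 80],
    "name": ["a", "b", "c", "d"],
}
filtered, mask = materialize_filter_columns(
    source, ["price"], lambda row: row["price"] > 10
)
payload = materialize_payload_columns(source, ["name"], mask)
```

The point of splitting the work this way is that the payload columns are touched only after the filter result is known, so a highly selective filter leaves little payload data to read. Note that the unselected `id` column is never materialized in either pass.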
Note that it is now the responsibility of the caller to fetch the specified byte ranges from the parquet source and provide them to the reader.
Currently, the experimental reader materializes the table in each pass in one go, without support for chunking, and supports reading from only a single Parquet source.
Checklist