Allow parquet readers to use existing datasources and metadatas #20693

Merged
rapids-bot[bot] merged 24 commits into rapidsai:main from mhaseeb123:fea/read-parquet-with-pre-populated-footer
Dec 17, 2025

@mhaseeb123 (Member) commented Nov 21, 2025

Description

Contributes to #20311. Closes #18890

This PR enables the cuDF parquet readers (chunked and non-chunked) to use pre-constructed datasource(s) and FileMetaData(s), so they can skip re-reading file footers when the caller already has them. This is particularly useful for workflows that read only the file footers first, make some decisions based on them, and then read the parquet files without re-reading the footers.
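
The footer-first workflow can be illustrated with a minimal, library-free sketch. It does not use cuDF or the real Parquet format: the toy file layout, `CountingSource`, `read_footer`, and `read_row_groups` are all hypothetical names invented for illustration. The point it shows is only that a reader which accepts an already-parsed footer avoids the extra footer reads.

```python
import io
import json
import struct

MAGIC = b"TOY1"  # hypothetical magic bytes for this toy format, not Parquet's


def write_toy_file(row_groups):
    """Serialize row groups, then a JSON footer, its 4-byte length, and the magic."""
    buf = io.BytesIO()
    entries = []
    for rows in row_groups:
        payload = json.dumps(rows).encode()
        entries.append({"offset": buf.tell(), "size": len(payload), "num_rows": len(rows)})
        buf.write(payload)
    footer = json.dumps({"row_groups": entries}).encode()
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))
    buf.write(MAGIC)
    return buf.getvalue()


class CountingSource:
    """Stand-in for a datasource that records how many reads it serves."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def size(self):
        return len(self.data)

    def read(self, offset, size):
        self.reads += 1
        return self.data[offset:offset + size]


def read_footer(source):
    """Parse the footer with two reads: the 8-byte tail, then the footer body."""
    tail = source.read(source.size() - 8, 8)
    (footer_len,) = struct.unpack("<I", tail[:4])
    if tail[4:] != MAGIC:
        raise ValueError("not a toy file")
    body = source.read(source.size() - 8 - footer_len, footer_len)
    return json.loads(body)


def read_row_groups(source, indices, footer=None):
    """Read the selected row groups, reusing `footer` when the caller supplies one."""
    if footer is None:
        footer = read_footer(source)  # fallback: pay for the footer reads again
    rows = []
    for i in indices:
        rg = footer["row_groups"][i]
        rows.extend(json.loads(source.read(rg["offset"], rg["size"])))
    return rows


# Footer-first workflow: read the footer once, decide, then read the data.
source = CountingSource(write_toy_file([[1, 2], [3, 4, 5], [6]]))
footer = read_footer(source)
wanted = [i for i, rg in enumerate(footer["row_groups"]) if rg["num_rows"] >= 2]
rows = read_row_groups(source, wanted, footer=footer)
```

In this sketch the footer-first path issues four reads in total (two for the footer, one per selected row group). Calling `read_row_groups` without the `footer` argument after the initial `read_footer` would issue six.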

@copy-pr-bot (bot) commented Nov 21, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions (bot) added the libcudf label Nov 21, 2025
mhaseeb123 added the 2 - In Progress, cuIO, non-breaking, and cudf-polars labels Nov 21, 2025
mhaseeb123 added the feature request label Nov 21, 2025
GPUtester moved this to In Progress in cuDF Python Nov 21, 2025
mhaseeb123 changed the title from "Allow parquet readers to use pre-materialized metadatas" to "Allow parquet readers to use pre-existing datasources and metadatas" Nov 21, 2025
mhaseeb123 added the 3 - Ready for Review label and removed the 2 - In Progress label Nov 21, 2025
mhaseeb123 marked this pull request as ready for review November 21, 2025 18:25
mhaseeb123 requested a review from a team as a code owner November 21, 2025 18:25
@mhaseeb123 (Member, Author) commented

CC: @JigaoLuo

@copy-pr-bot (bot) commented Dec 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@vuule (Contributor) left a comment

partial review

@JigaoLuo (Contributor) commented Dec 5, 2025

> does any of your use cases involve reading the same file via (datasource and metadata) again and again?

(Sorry for the delay. I missed your message as I was busy writing at the beginning of this month.)

As you may know, I let different threads read different row groups (RGs) of the same Parquet file. This does not mean the datasource is read multiple times.

However, one rare but valid case is the self‑join, essentially joining a table with itself (e.g., A JOIN A) for filtering purposes. In this case, I still need to fully read table A once to build the hash table, and then read A again to probe the hash table.

There are a few reasons why we did not cache A in memory:
1. Caching increases memory consumption.
2. We do not have a query optimizer.
3. We expect there may be cases where repeated reading is necessary, so we kept the design general.

This is more of a query processing & optimization discussion and slightly off‑topic for a reader of this issue. We could continue the discussion on Slack if that helps.

mhaseeb123 added the 5 - Ready to Merge label and removed the 4 - Needs Review label Dec 10, 2025
@mhaseeb123 (Member, Author) commented

pre-commit.ci autofix

@mhaseeb123 (Member, Author) commented

/ok to test 5600ae5

@mhaseeb123 (Member, Author) commented

/ok to test 87e3a2e

@mhaseeb123 (Member, Author) commented

/merge

rapids-bot (bot) merged commit 3711082 into rapidsai:main Dec 17, 2025
141 of 142 checks passed
github-project-automation (bot) moved this from In Progress to Done in cuDF Python Dec 17, 2025
mhaseeb123 deleted the fea/read-parquet-with-pre-populated-footer branch December 17, 2025 18:38

Labels

5 - Ready to Merge: Testing and reviews complete, ready to merge
cudf-polars: Issues specific to cudf-polars
cuIO: cuIO issue
feature request: New feature or request
libcudf: Affects libcudf (C++/CUDA) code.
non-breaking: Non-breaking change

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[FEA] Parquet metadata caching due to overhead in reader

7 participants