Allow parquet readers to use existing `datasource`s and `metadata`s by mhaseeb123 · Pull Request #20693 · rapidsai/cudf

mhaseeb123 · 2025-11-21T01:32:52Z

Description

Contributes to #20311. Closes #18890

This PR enables the cuDF parquet readers (chunked and non-chunked) to use pre-constructed datasource(s) and FileMetaData(s), so they can skip re-reading footers when that work has already been done. This is particularly useful for workflows that first read only the file footers, make some decisions based on them, and then read the parquet files without re-reading the footers.
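The footer-first workflow described above can be sketched with a toy, in-memory file format. This is only an illustration in Python, not libcudf code: `write_toy_file`, `CountingSource`, and `read_rows` are hypothetical names, the `TOY1` magic is made up, and the JSON footer merely stands in for a Parquet `FileMetaData`.

```python
import io
import json
import struct

MAGIC = b"TOY1"  # hypothetical magic bytes for this toy format


def write_toy_file(row_groups):
    """Serialize row groups, then a JSON footer, a 4-byte footer length, and the magic."""
    body = io.BytesIO()
    footer = {"row_groups": []}
    for rows in row_groups:
        payload = json.dumps(rows).encode()
        footer["row_groups"].append(
            {"offset": body.tell(), "size": len(payload), "max": max(rows)}
        )
        body.write(payload)
    footer_bytes = json.dumps(footer).encode()
    body.write(footer_bytes)
    body.write(struct.pack("<I", len(footer_bytes)))
    body.write(MAGIC)
    return body.getvalue()


class CountingSource:
    """Hypothetical datasource wrapper that counts how often the footer is parsed."""

    def __init__(self, data):
        self.data = data
        self.footer_reads = 0

    def read_footer(self):
        self.footer_reads += 1
        assert self.data[-4:] == MAGIC
        (length,) = struct.unpack("<I", self.data[-8:-4])
        return json.loads(self.data[-8 - length:-8])

    def read_range(self, offset, size):
        return self.data[offset:offset + size]


def read_rows(source, row_group_ids, footer=None):
    """Read the selected row groups, reusing `footer` if the caller already has one."""
    if footer is None:
        footer = source.read_footer()
    rows = []
    for i in row_group_ids:
        rg = footer["row_groups"][i]
        rows.extend(json.loads(source.read_range(rg["offset"], rg["size"])))
    return rows


source = CountingSource(write_toy_file([[1, 2, 3], [10, 20], [100, 200, 300]]))
footer = source.read_footer()  # step 1: read only the footer
# Step 2: decide which row groups are needed, using footer statistics only.
wanted = [i for i, rg in enumerate(footer["row_groups"]) if rg["max"] >= 10]
rows = read_rows(source, wanted, footer=footer)  # step 3: read data, footer reused
print(rows, source.footer_reads)
```

In this sketch the footer is parsed once, even though it is consulted both for the row-group decision and for the actual read. Calling `read_rows` without the `footer` argument would parse it a second time.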

Checklist

Merge "Enable using multithreaded setup_page_index in hybrid scan reader" (#20721) before this PR
I am familiar with the Contributing Guidelines.
Python bindings - in a future PR
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-11-21T01:32:55Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

mhaseeb123 · 2025-11-21T19:20:51Z

CC: @JigaoLuo

copy-pr-bot · 2025-12-03T01:35:36Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

vuule

partial review

cpp/src/io/parquet/experimental/hybrid_scan_helpers.hpp

cpp/src/io/parquet/reader_impl.cpp

cpp/include/cudf/io/parquet_metadata.hpp

cpp/src/io/parquet/reader_impl_helpers.cpp

JigaoLuo · 2025-12-05T21:34:16Z

> does any of your use cases involve reading the same file via (datasource and metadata) again and again?

(Sorry for the delay. I missed your message as I was busy writing at the beginning of this month.)

As you may know, I let different threads read different RGs of the same Parquet file. This does not mean the datasource is read multiple times.

However, one rare but valid case is the self‑join, essentially joining a table with itself (e.g., A JOIN A) for filtering purposes. In this case, I still need to fully read table A once to build the hash table, and then read A again to probe the hash table.

There are a few reasons why we did not cache A in memory:

1. Caching increases memory consumption.
2. We do not have a query optimizer.
3. We expect there may be cases where repeated reading is necessary, so we kept the design general.
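The self-join case can be illustrated with a small sketch, assuming a toy table whose "footer" records only the row count. `ToyTable`, `parse_footer`, and `scan` are hypothetical names and not part of cuDF. The point is only that the data is scanned twice (build, then probe) while the footer is parsed once.

```python
class ToyTable:
    """Hypothetical stand-in for a file: rows plus a footer that must be parsed."""

    def __init__(self, rows):
        self._rows = rows
        self.footer_parses = 0

    def parse_footer(self):
        self.footer_parses += 1
        return {"num_rows": len(self._rows)}

    def scan(self, footer=None):
        # Parse the footer only if the caller did not supply one.
        if footer is None:
            footer = self.parse_footer()
        for i in range(footer["num_rows"]):
            yield self._rows[i]


# Rows are (employee_id, manager_id); the self-join finds employees who manage someone.
table = ToyTable([(1, None), (2, 1), (3, 1), (4, 2)])
footer = table.parse_footer()

# Pass 1 (build): collect the join keys into a hash set from a full read of A.
manager_ids = {mgr for _, mgr in table.scan(footer) if mgr is not None}

# Pass 2 (probe): read A again and keep rows whose id appears in the hash set.
managers = [row for row in table.scan(footer) if row[0] in manager_ids]
print(managers, table.footer_parses)
```

Nothing from A is cached between the two passes except the parsed footer, which matches the memory concern listed above.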

This is more of a query processing & optimization discussion and slightly off‑topic for a reader of this issue. We could continue the discussion on Slack if that helps.

mhaseeb123 · 2025-12-10T21:38:46Z

pre-commit.ci autofix

mhaseeb123 · 2025-12-15T23:25:49Z

/ok to test 5600ae5

mhaseeb123 · 2025-12-17T01:50:38Z

/ok to test 87e3a2e

mhaseeb123 · 2025-12-17T18:37:49Z

/merge

Allow parquet readers to use pre-materialized metadatas

d01acf9

github-actions bot assigned mhaseeb123 Nov 21, 2025

github-actions bot added the "libcudf" label ("Affects libcudf (C++/CUDA) code.") Nov 21, 2025

Allow parquet readers to use external datasources

2c83509

mhaseeb123 added the labels "2 - In Progress" ("Currently a work in progress"), "cuIO" ("cuIO issue"), "non-breaking" ("Non-breaking change"), and "cudf-polars" ("Issues specific to cudf-polars") Nov 21, 2025

github-project-automation bot added this to cuDF Python Nov 21, 2025

mhaseeb123 added the "feature request" label ("New feature or request") Nov 21, 2025

Improve docs

2cb9738

GPUtester moved this to In Progress in cuDF Python Nov 21, 2025

Add more tests

67bd2bc

mhaseeb123 changed the title from "Allow parquet readers to use pre-materialized metadatas" to "Allow parquet readers to use pre-existing datasources and metadatas" Nov 21, 2025

Merge branch 'main' into fea/read-parquet-with-pre-populated-footer

8bede77

mhaseeb123 added the "3 - Ready for Review" label ("Ready for review by team") and removed the "2 - In Progress" label ("Currently a work in progress") Nov 21, 2025

mhaseeb123 marked this pull request as ready for review November 21, 2025 18:25

mhaseeb123 requested a review from a team as a code owner November 21, 2025 18:25

mhaseeb123 requested review from pmattione-nvidia, vyasr and wence- November 21, 2025 18:25

JigaoLuo mentioned this pull request Nov 21, 2025

[DO NOT MERGE] [POC] Metadata caching prototype in Parquet reader #18891

Closed

3 tasks

Fix docs

b81c66a

mhaseeb123 commented Nov 21, 2025

docs/cudf/source/conf.py (resolved)

mhaseeb123 commented Nov 21, 2025

cpp/include/cudf/io/orc_metadata.hpp (resolved)

mhaseeb123 commented Nov 21, 2025

cpp/include/cudf/io/detail/parquet.hpp (resolved)

mhaseeb123 commented Nov 21, 2025

cpp/tests/io/parquet_chunked_reader_test.cu (resolved)

cpp/tests/io/parquet_chunked_reader_test.cu (resolved)

Merge branch 'fea/multithreaded-setup-pgidx' of https://github.com/mh…

3795225

…aseeb123/cudf into fea/read-parquet-with-pre-populated-footer

JigaoLuo mentioned this pull request Nov 26, 2025

[Story] Towards a faster Parquet reader with pipelining and multistream optimization #18892

Open

pmattione-nvidia reviewed Dec 1, 2025

cpp/src/io/functions.cpp (resolved)

pmattione-nvidia reviewed Dec 1, 2025

cpp/src/io/parquet/reader_impl.cpp (resolved)

Address feedback

0ef323c

vuule reviewed Dec 3, 2025

cpp/src/io/parquet/experimental/hybrid_scan_helpers.hpp (outdated, resolved)

cpp/src/io/parquet/reader_impl.cpp (outdated, resolved)

cpp/include/cudf/io/parquet_metadata.hpp (outdated, resolved)

mhaseeb123 added 2 commits December 3, 2025 18:56

Address partial feedback

d33c9ee

Rename read_parquet_metadata to read_parquet_footers

e056495

pmattione-nvidia approved these changes Dec 3, 2025

vuule approved these changes Dec 4, 2025

cpp/src/io/parquet/reader_impl_helpers.cpp (resolved)

Add some extra checks

9e00764

mhaseeb123 added 2 commits December 9, 2025 16:55

Merge branch 'main' into fea/read-parquet-with-pre-populated-footer

25b0702

Merge branch 'main' into fea/read-parquet-with-pre-populated-footer

f938501

mhaseeb123 added the "5 - Ready to Merge" label ("Testing and reviews complete, ready to merge") and removed the "4 - Needs Review" label ("Waiting for reviewer to review or respond") Dec 10, 2025

pre-commit-ci bot and others added 5 commits December 10, 2025 21:39

[pre-commit.ci] auto code formatting

d84f08f

Merge branch 'main' into fea/read-parquet-with-pre-populated-footer

9fbe8bd

Merge branch 'main' into fea/read-parquet-with-pre-populated-footer

b5a298f

Merge branch 'main' into fea/read-parquet-with-pre-populated-footer

4bc5e10

Merge branch 'main' into fea/read-parquet-with-pre-populated-footer

5600ae5

Merge branch 'main' into fea/read-parquet-with-pre-populated-footer

87e3a2e

rapids-bot bot merged commit 3711082 into rapidsai:main Dec 17, 2025
141 of 142 checks passed

github-project-automation bot moved this from In Progress to Done in cuDF Python Dec 17, 2025

mhaseeb123 deleted the fea/read-parquet-with-pre-populated-footer branch December 17, 2025 18:38

