[DO NOT MERGE] [POC] Metadata caching prototype in Parquet reader #18891
JigaoLuo wants to merge 1 commit into rapidsai:branch-25.06 from
Conversation
Signed-off-by: Jigao Luo <jigao.luo@outlook.com>
My approach above is to make sure the footer is read only once during a call to `read_parquet`.

### How to Use (Breaking API Changes)

Decouple metadata parsing from data reads:

```cpp
auto metadata = cudf::io::read_parquet_metadata(source_info);
auto aggregate_reader_metadata_ptr = metadata.get_aggregate_reader_metadata_ptr(); // new
auto options = cudf::io::parquet_reader_options::builder(source_info).build();
options.set_aggregate_reader_metadata(aggregate_reader_metadata_ptr); // new
cudf::io::read_parquet(options);
```

You can find the example I provided in `cpp/examples/parquet_io/parquet_io_metadata_caching.cpp`.

### Key Benefits

There are two concrete benefits, both demonstrated in the example code `cpp/examples/parquet_io/parquet_io_metadata_caching.cpp`. The Parquet file used as a running example is the same one referenced in the issue.

#### 1. Bulk Read Optimization

A bulk read is a single `read_parquet` call over the whole file. It makes sense to see that, once the metadata is cached, the total read time is reduced to the millisecond level. This also matches the nsys result I show in the issue.

#### 2. Use Case: Rowgroup Iteration

A use case that is not efficient in libcudf today is iteratively reading Parquet at the rowgroup level: read one rowgroup, process it, and repeat. You will get the idea from the previous point: without metadata caching, the metadata-parsing overhead accumulates with every rowgroup read. With metadata caching, each rowgroup takes 10 ms to read.

### Summary

The short conclusion: metadata caching significantly speeds up Parquet reading. With it, GPU kernels no longer have to wait for metadata thrift-decoding.

PR Discussion Items:
Some super-early comments: I think instead of caching/passing around an internal class (…)
@mhaseeb123 Got it, thanks! I’ll tackle this after finalizing the story issue. Once addressed, I’ll request another review. Update:
Also, I found the … It is necessary to discuss how to structure dependencies for metadata caching, as this directly impacts the API design. Here are some options:
Both Option 1 and Option 2 have trade-offs that require careful discussion.
Closed, as #20693 is being merged 🚀


Description
For issue #18890, this draft PR demonstrates the performance benefits of decoupling metadata parsing from data page reads in the Parquet reader. The goal is to let repeated `read_parquet` calls reuse already-parsed metadata.