Allow parquet readers to use existing datasources and metadatas #20693
CC: @JigaoLuo |
…aseeb123/cudf into fea/read-parquet-with-pre-populated-footer
(Sorry for the delay. I missed your message because I was busy writing at the beginning of this month.) As you may know, I let different threads read different row groups (RGs) of the same Parquet file. This does not mean the datasource is read multiple times. However, one rare but valid case is the self-join, that is, joining a table with itself (e.g., A JOIN A) for filtering purposes. In this case, I still need to read table A fully once to build the hash table, and then read A again to probe the hash table. There are a few reasons why we did not cache A in memory:
0. Caching costs memory.
1. We do not have a query optimizer.
2. We expect there may be cases where repeated reading is necessary, so we kept the design general.
This is more of a query processing and optimization discussion and slightly off-topic for readers of this issue. We could continue the discussion on Slack if that helps. |
|
pre-commit.ci autofix |
|
/ok to test 5600ae5 |
|
/ok to test 87e3a2e |
|
/merge |
Description
Contributes to #20311. Closes #18890
This PR enables cuDF parquet readers (chunked and non-chunked) to use pre-constructed datasource(s) and FileMetaData(s). This lets them skip re-reading footers when possible, which saves compute time. It is particularly useful for workflows that first read only the file footers, make some decisions based on them, and then read the parquet files without re-reading the footers.
Checklist
setup_page_index in hybrid scan reader #20721 before this PR
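The footer-first workflow from the description can be illustrated with a small, library-independent sketch. It is not the cuDF API: `read_footer_bytes`, `ToyReader`, and `make_fake_file` are hypothetical names invented for this example, and the footer is left as raw bytes rather than decoded into a `FileMetaData`. The only format fact it relies on is that a Parquet file ends with the footer, a 4-byte little-endian footer length, and the `PAR1` magic.

```python
import io
import struct

MAGIC = b"PAR1"  # trailing magic bytes of a Parquet file


def read_footer_bytes(source):
    """Return the raw (still encoded) footer of a Parquet-like file object."""
    # The last 8 bytes hold the footer length (uint32, little-endian) and the magic.
    source.seek(-8, io.SEEK_END)
    footer_len, magic = struct.unpack("<I4s", source.read(8))
    if magic != MAGIC:
        raise ValueError("bad trailing magic, not a Parquet file")
    # The footer sits immediately before those 8 bytes.
    source.seek(-(8 + footer_len), io.SEEK_END)
    return source.read(footer_len)


class ToyReader:
    """Hypothetical reader that can reuse a footer fetched earlier."""

    def __init__(self, source, footer=None):
        self.source = source
        self.footer_reads = 0  # counts footer reads done by this reader
        if footer is None:
            # No pre-populated footer was supplied, so read it now.
            footer = read_footer_bytes(source)
            self.footer_reads += 1
        self.footer = footer


def make_fake_file(footer):
    """Build an in-memory file with Parquet-style framing (illustration only)."""
    body = MAGIC + b"column-data" + footer
    return io.BytesIO(body + struct.pack("<I", len(footer)) + MAGIC)


source = make_fake_file(b"fake-footer")
footer = read_footer_bytes(source)         # step 1: read only the footer
# step 2: make decisions based on the footer (omitted here)
reader = ToyReader(source, footer=footer)  # step 3: read without re-reading the footer
```

With the footer passed in, `reader.footer_reads` stays at zero; constructing `ToyReader(source)` without it triggers one footer read.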