Is your feature request related to a problem or challenge?
@nuno-faria implemented the core Parquet Metadata caching logic in the following PR:
However, it doesn't seem to help certain queries that use statistics. Specifically, I expect that the second time the query is run, it should make no network requests at all, because the ParquetMetaData is already cached:
> set datafusion.execution.parquet.cache_metadata = true;
0 row(s) fetched.
Elapsed 0.000 seconds.
> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 4.632 seconds.
> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 2.717 seconds.
Describe the solution you'd like
I would like the queries above to go faster by using the ParquetMetaData cache, so that a repeated query does not fetch the metadata over the network again.
Describe alternatives you've considered
I think this is related to the fact that there is a separate path to retrieve statistics for ListingTable, specifically https://github.com/apache/datafusion/blob/1452333cf0933d4d8da032af68bc5a3a05c62483/datafusion/datasource-parquet/src/file_format.rs#L975-L974
So to fix this issue, I think we need to check the FileMetadataCache first, and only fetch the ParquetMetaData when it is not already cached.
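
To illustrate the idea, here is a minimal sketch of a cache-first lookup. It does not use DataFusion's actual types: `FileMetadata`, `MetadataCache`, and `metadata_cached` are hypothetical stand-ins, and the `fetch` closure stands in for whatever remote read the statistics path performs today.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for parsed Parquet footer metadata.
#[derive(Debug)]
pub struct FileMetadata {
    pub num_rows: u64,
}

/// Hypothetical stand-in for a metadata cache keyed by object path.
#[derive(Default)]
pub struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
}

impl MetadataCache {
    /// Return the cached metadata for `path`, if any.
    pub fn get(&self, path: &str) -> Option<Arc<FileMetadata>> {
        self.entries.lock().unwrap().get(path).cloned()
    }

    /// Store metadata for `path`, replacing any previous entry.
    pub fn put(&self, path: &str, metadata: Arc<FileMetadata>) {
        self.entries
            .lock()
            .unwrap()
            .insert(path.to_string(), metadata);
    }
}

/// Cache-first lookup: `fetch` (the expensive remote read) only runs on a miss.
pub fn metadata_cached<F>(cache: &MetadataCache, path: &str, fetch: F) -> Arc<FileMetadata>
where
    F: FnOnce() -> FileMetadata,
{
    if let Some(hit) = cache.get(path) {
        return hit;
    }
    let metadata = Arc::new(fetch());
    cache.put(path, Arc::clone(&metadata));
    metadata
}
```

With something like this in the statistics path, the second `count(*)` above would find every file's metadata in the cache and never invoke the fetch. A real implementation would presumably also need to check that the cached entry still matches the object (for example by size or last-modified time), which this sketch leaves out.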
Additional context
No response