feat: Make parquet FileMetadata prunable for IR-plan dispatch #27535
azimafroozeh wants to merge 2 commits into pola-rs:main
Codecov Report
❌ Patch coverage is
@@            Coverage Diff             @@
##             main   #27535      +/-   ##
==========================================
- Coverage   81.42%   81.30%   -0.12%
==========================================
  Files        1837     1839       +2
  Lines      255165   255449     +284
  Branches     3179     3179
==========================================
- Hits       207759   207685      -74
- Misses      46582    46940     +358
  Partials      824      824
pub fn pruned(
    &self,
    keep_top_level_names: &[polars_utils::pl_str::PlSmallStr],
    predicate_top_level_names: &[polars_utils::pl_str::PlSmallStr],
blocking:
Could we build a lookup for one side (either the kept names or the parquet schema names)? We've had cases in the past where users had >=10,000 columns and saw quadratic slowdowns on lookups.
If we key it on the kept names, I'm thinking maybe `HashMap<PlSmallStr, bool>`, where the bool is true if we want to keep statistics and false otherwise.
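A minimal sketch of what that lookup could look like, under stated assumptions: `ColumnMeta`, `build_keep_lookup`, and `prune_columns` are hypothetical names for illustration, not types or functions from this PR, and plain `&str` / `String` stand in for `PlSmallStr`. It also assumes that a predicate column is kept even if it is missing from the kept names. The map is built once, so each schema column costs one hash lookup instead of a scan over the name slices.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one column's metadata (not the PR's type).
#[derive(Debug, Clone, PartialEq)]
struct ColumnMeta {
    name: String,
    /// Simplified stand-in for min/max statistics.
    stats: Option<(i64, i64)>,
}

/// Build the lookup once: column name -> whether to keep its statistics.
fn build_keep_lookup<'a>(keep: &[&'a str], predicate: &[&'a str]) -> HashMap<&'a str, bool> {
    // Every kept column starts out with statistics dropped.
    let mut lookup: HashMap<&'a str, bool> = keep.iter().map(|&name| (name, false)).collect();
    for &name in predicate {
        // Assumption: predicate columns are always kept, with statistics.
        lookup.insert(name, true);
    }
    lookup
}

/// Prune with one hash lookup per schema column (linear overall).
fn prune_columns(columns: &[ColumnMeta], keep: &[&str], predicate: &[&str]) -> Vec<ColumnMeta> {
    let lookup = build_keep_lookup(keep, predicate);
    columns
        .iter()
        .filter_map(|col| {
            // Columns absent from the lookup are dropped entirely.
            let keep_stats = *lookup.get(col.name.as_str())?;
            Some(ColumnMeta {
                name: col.name.clone(),
                stats: if keep_stats { col.stats } else { None },
            })
        })
        .collect()
}
```

With columns `a`, `b`, `c`, kept names `["a", "b"]`, and predicate names `["b"]`, this keeps `a` without statistics, keeps `b` with statistics, and drops `c`.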
/// `Some(cols)` ⇒ apply `pruned(cols, predicate)`. Local files only.
#[cfg(all(feature = "parquet", feature = "json"))]
#[pyfunction]
pub fn _bench_parquet_metadata_bincode_size(
suggestion:
Could we (if it's feasible) serialize to / return a JSON string, then add a test in Python that checks, e.g., that unused columns are pruned and that statistics are only kept for predicate cols?
Make parquet `FileMetadata` prunable to ship in IR plans, so distributed workers can avoid re-fetching footers from cloud storage and re-decoding thrift per query. Pruning drops non-projected columns, stats from non-predicate columns, and row-group fields with no read-path consumers; an env-gated optimizer pass (`POLARS_PRUNE_PARQUET_METADATA=1`, default off) calls `pruned()` at the end of `optimize()`. Single-node users see no behavior change; the table below shows what would ship per query when the gate is enabled.

TPCH SF=10 wire bytes per query