feat: Make parquet FileMetadata prunable for IR-plan dispatch#27535

Open
azimafroozeh wants to merge 2 commits into pola-rs:main from azimafroozeh:feat/prunable_metadata

@azimafroozeh
Collaborator

Make parquet FileMetadata prunable so that it can ship inside IR plans, which lets distributed workers avoid re-fetching footers from cloud storage and re-decoding thrift on every query. Pruning drops non-projected columns, statistics for non-predicate columns, and row-group fields with no read-path consumers. An env-gated optimizer pass (POLARS_PRUNE_PARQUET_METADATA=1, default off) calls pruned() at the end of optimize(). Single-node users see no behavior change; the table below shows what would ship per query when the gate is enabled.
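
As a rough illustration of the "default off" gate described above, the sketch below treats the pass as enabled only when the variable is exactly `1`. The helper names `gate_enabled` and `prune_gate_from_env` are hypothetical, and how the PR actually parses the variable is not shown in this conversation, so treat the exact-match rule as an assumption.

```rust
use std::env;

/// Hypothetical helper: interprets the raw value of the gate variable.
/// Assumption: only the exact string "1" enables the pass, so an unset
/// variable (or any other value) leaves pruning off.
fn gate_enabled(raw: Option<&str>) -> bool {
    matches!(raw, Some("1"))
}

/// Hypothetical helper: reads POLARS_PRUNE_PARQUET_METADATA from the
/// process environment and applies the rule above.
fn prune_gate_from_env() -> bool {
    gate_enabled(env::var("POLARS_PRUNE_PARQUET_METADATA").ok().as_deref())
}
```

Splitting the parsing rule from the environment read keeps the rule testable without mutating process-wide state.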

TPCH SF=10 wire bytes per query

| table | Q1 | Q3 | Q5 | Q6 | Q19 |
|---|---|---|---|---|---|
| lineitem | 80.1 KB (11.8×) | 52.3 KB (18.0×) | 44.2 KB (21.4×) | 78.1 KB (12.1×) | 104.5 KB (9.0×) |
| orders | | 12.5 KB (12.2×) | 10.6 KB (14.4×) | | |
| customer | | 1.0 KB (16.5×) | 664 B (25.8×) | | |
| part | | | | | 2.7 KB (7.9×) |
| supplier | | | 90 B (13.0×) | | |
| nation | | | 93 B (7.8×) | | |
| region | | | 86 B (7.2×) | | |
| TOTAL ship | 80.1 KB (11.8×) | 65.9 KB (16.9×) | 55.7 KB (20.0×) | 78.1 KB (12.1×) | 107.2 KB (9.0×) |
| vs re-fetch | 943.7 KB | 1113.7 KB | 1116.2 KB | 943.7 KB | 964.9 KB |
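
The × factors on the TOTAL ship row appear to be the re-fetch size divided by the shipped size; that reading is inferred from the numbers, not stated in the description. A small sketch to check it (the function names are illustrative only):

```rust
/// Ratio of re-fetched footer bytes to shipped (pruned) bytes.
/// Assumption: this is how the "×" factor on the TOTAL row is derived.
fn reduction_factor(refetch_kb: f64, shipped_kb: f64) -> f64 {
    refetch_kb / shipped_kb
}

/// Rounds to one decimal place, matching the precision used in the table.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}
```

For Q1, `round1(reduction_factor(943.7, 80.1))` gives 11.8, and the other four queries reproduce their listed totals the same way. The per-table factors (for example lineitem at 18.0× in Q3) are presumably relative to each table's own unpruned footer, which the table does not list.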

@github-actions github-actions Bot added the labels A-io-parquet (Area: reading/writing Parquet files), enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars), rust (Related to Rust Polars), and title needs formatting on May 7, 2026
@azimafroozeh azimafroozeh changed the title feat: make parquet FileMetadata prunable for IR-plan dispatch feat: Make parquet FileMetadata prunable for IR-plan dispatch May 7, 2026
@azimafroozeh azimafroozeh changed the title feat: Make parquet FileMetadata prunable for IR-plan dispatch feat: Make parquet FileMetadata prunable for dispatch May 7, 2026
@azimafroozeh azimafroozeh changed the title feat: Make parquet FileMetadata prunable for dispatch feat: Make parquet metadata prunable for dispatch May 7, 2026
@azimafroozeh azimafroozeh changed the title feat: Make parquet metadata prunable for dispatch feat: Make parquet FileMetadata prunable for IR-plan dispatch May 7, 2026
@codecov

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 8.24742% with 267 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.30%. Comparing base (4808230) to head (d74ca87).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...arquet/src/parquet/metadata/file_metadata_serde.rs | 0.00% | 151 Missing ⚠️ |
| ...lars-parquet/src/parquet/metadata/file_metadata.rs | 0.00% | 55 Missing ⚠️ |
| ...plan/src/plans/optimizer/parquet_metadata_prune.rs | 22.22% | 28 Missing ⚠️ |
| crates/polars-python/src/functions/io.rs | 0.00% | 18 Missing ⚠️ |
| ...-parquet/src/parquet/metadata/schema_descriptor.rs | 0.00% | 11 Missing ⚠️ |
| ...quet/src/parquet/metadata/column_chunk_metadata.rs | 25.00% | 3 Missing ⚠️ |
| crates/polars-parquet/src/arrow/read/statistics.rs | 50.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #27535      +/-   ##
==========================================
- Coverage   81.42%   81.30%   -0.12%     
==========================================
  Files        1837     1839       +2     
  Lines      255165   255449     +284     
  Branches     3179     3179              
==========================================
- Hits       207759   207685      -74     
- Misses      46582    46940     +358     
  Partials      824      824              

pub fn pruned(
&self,
keep_top_level_names: &[polars_utils::pl_str::PlSmallStr],
predicate_top_level_names: &[polars_utils::pl_str::PlSmallStr],
Collaborator

@nameexhaustion nameexhaustion May 8, 2026

blocking:

Could we make a lookup for 1 side (either the kept names or the parquet schema names)? We've had cases in the past where users had >=10,000 columns and saw quadratic slowdowns on lookups.
If we do kept names, I'm thinking maybe HashMap<PlSmallStr, bool>, where the bool indicates true if we want to keep statistics and false otherwise.
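
A minimal sketch of the lookup idea, using toy types instead of the real polars ones: `ColumnMeta`, `build_keep_lookup`, and `prune_columns` are hypothetical names, `String` stands in for `PlSmallStr`, and statistics are modeled as an opaque string. It also assumes that predicate columns are always kept, which the comment above does not spell out.

```rust
use std::collections::HashMap;

/// Toy stand-in for one column's metadata (not the polars type).
#[derive(Debug, Clone, PartialEq)]
struct ColumnMeta {
    name: String,
    statistics: Option<String>,
}

/// Builds the suggested map: column name -> "keep statistics?".
/// Assumption: a predicate column is kept even if it is not projected.
fn build_keep_lookup(keep: &[&str], predicate: &[&str]) -> HashMap<String, bool> {
    let mut lookup: HashMap<String, bool> =
        keep.iter().map(|n| (n.to_string(), false)).collect();
    // Predicate columns retain their statistics.
    for name in predicate {
        lookup.insert(name.to_string(), true);
    }
    lookup
}

/// Single pass over the schema with one hash lookup per column, so the
/// cost grows with (schema columns + kept names) rather than their product.
fn prune_columns(columns: &[ColumnMeta], lookup: &HashMap<String, bool>) -> Vec<ColumnMeta> {
    columns
        .iter()
        .filter_map(|col| {
            // Columns missing from the map are dropped entirely.
            lookup.get(col.name.as_str()).map(|&keep_stats| ColumnMeta {
                name: col.name.clone(),
                statistics: if keep_stats { col.statistics.clone() } else { None },
            })
        })
        .collect()
}
```

With 10,000 schema columns and a comparable number of kept names, this avoids the nested slice scan that a `&[PlSmallStr]` membership check per column would perform.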

/// `Some(cols)` ⇒ apply `pruned(cols, predicate)`. Local files only.
#[cfg(all(feature = "parquet", feature = "json"))]
#[pyfunction]
pub fn _bench_parquet_metadata_bincode_size(
Collaborator

@nameexhaustion nameexhaustion May 8, 2026

suggestion:

Could we (if it's feasible) serialize to / return a JSON string, then add a test in Python that inspects e.g. unused columns are pruned, and statistics are only kept for predicate cols?
