feat(python): Add extra_columns parameter to scan_parquet#22699

Merged
ritchie46 merged 2 commits into pola-rs:main from nameexhaustion:parquet-extra-columns
May 26, 2025

@nameexhaustion
Collaborator

@nameexhaustion nameexhaustion commented May 9, 2025

Supersedes #22695

Changes:

  • Adds an extra_columns parameter to scan_parquet
  • (Rust) extra_columns_policy is now added under UnifiedScanArgs
  • Introduces an internal ScanOptions Python class to consolidate input parsing of shared scan options (i.e. those in UnifiedScanArgs).

Fixes:

@github-actions github-actions Bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels May 9, 2025
@nameexhaustion nameexhaustion changed the title feat(python): Add extra_columns parameter to scan_parquet feat(python, rust!): Add extra_columns parameter to scan_parquet May 9, 2025
@nameexhaustion nameexhaustion changed the title feat(python, rust!): Add extra_columns parameter to scan_parquet feat(python,rust!): Add extra_columns parameter to scan_parquet May 9, 2025
@github-actions github-actions Bot added breaking rust Change that breaks backwards compatibility for the Rust crate rust Related to Rust Polars and removed title needs formatting labels May 9, 2025
@nameexhaustion nameexhaustion changed the title feat(python,rust!): Add extra_columns parameter to scan_parquet feat(python): Add extra_columns parameter to scan_parquet May 9, 2025
@nameexhaustion nameexhaustion force-pushed the parquet-extra-columns branch from 8ef8919 to 6d2ff40 Compare May 12, 2025 13:48
@braaannigan
Contributor

@nameexhaustion @ion-elgreco Should this extra_columns parameter for scan_parquet be surfaced in scan_delta? I get exceptions at the moment where I try to load a multifile delta table where columns have been added over time. I'd like to be able to enable the extra_columns=True argument in scan_delta.

@ion-elgreco
Contributor

@nameexhaustion @ion-elgreco Should this extra_columns parameter for scan_parquet be surfaced in scan_delta? I get exceptions at the moment where I try to load a multifile delta table where columns have been added over time. I'd like to be able to enable the extra_columns=True argument in scan_delta.

Looks like a bug to me. Providing the schema beforehand should give the reader enough information to figure out what to do with missing or extra columns.

@nameexhaustion nameexhaustion force-pushed the parquet-extra-columns branch 2 times, most recently from 15f4fea to ceccfa5 Compare May 15, 2025 13:46
@nameexhaustion
Collaborator Author

I get exceptions at the moment where I try to load a multifile delta table where columns have been added over time.

This PR addresses the case where columns are removed rather than added. The case where columns are added is already handled by allow_missing_columns.
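The distinction between the two cases can be sketched with a small pure-Python model. This is only an illustration and not how Polars implements it: reconcile_columns and its arguments are hypothetical names, and the table schema is assumed to be the current one that each file is compared against.

```python
def reconcile_columns(schema, file_columns, *, allow_missing=False, extra="raise"):
    """Compare one file's columns against the expected schema.

    Returns (columns to read from the file, columns to fill with nulls).
    Hypothetical helper for illustration only.
    """
    schema_set = set(schema)
    file_set = set(file_columns)

    # Columns present in the file but not in the schema (column was removed).
    extra_cols = [c for c in file_columns if c not in schema_set]
    # Columns in the schema but absent from the file (column was added later).
    missing_cols = [c for c in schema if c not in file_set]

    if extra_cols and extra != "ignore":
        raise ValueError(f"extra columns outside schema: {extra_cols}")
    if missing_cols and not allow_missing:
        raise ValueError(f"missing columns: {missing_cols}")

    to_read = [c for c in schema if c in file_set]
    return to_read, missing_cols


table_schema = ["id", "value"]        # current table schema
old_file = ["id", "value", "legacy"]  # written before "legacy" was removed
older_file = ["id"]                   # written before "value" was added

# Removed column: the file has an extra column, covered by the extra-columns policy.
removed_case = reconcile_columns(table_schema, old_file, extra="ignore")

# Added column: the file lacks a column, covered by the missing-columns policy.
added_case = reconcile_columns(table_schema, older_file, allow_missing=True)
```

In this model a file written before a column was removed only loads when extra columns are ignored, and a file written before a column was added only loads when missing columns are allowed.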

I'd like to be able to enable the extra_columns=True argument in scan_delta.

This will be enabled internally by default for scan_delta() in the future, so the user will not need to configure it manually.

@codecov

codecov Bot commented May 16, 2025

Codecov Report

Attention: Patch coverage is 80.11050% with 36 lines in your changes missing coverage. Please review.

Project coverage is 80.97%. Comparing base (b9dd8cd) to head (aea4dfc).
Report is 27 commits behind head on main.

Files with missing lines                                 Patch %   Lines
crates/polars-python/src/io/mod.rs                       73.43%    34 Missing ⚠️
...polars-plan/src/plans/optimizer/expand_datasets.rs    0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #22699      +/-   ##
==========================================
- Coverage   81.01%   80.97%   -0.05%     
==========================================
  Files        1671     1675       +4     
  Lines      236925   237101     +176     
  Branches     2792     2792              
==========================================
+ Hits       191956   191998      +42     
- Misses      44299    44433     +134     
  Partials      670      670              

@nameexhaustion nameexhaustion force-pushed the parquet-extra-columns branch from 2c77789 to 37766b2 Compare May 22, 2025 10:42
@nameexhaustion nameexhaustion marked this pull request as ready for review May 22, 2025 11:09
@nameexhaustion nameexhaustion requested a review from orlp as a code owner May 22, 2025 11:09
@nameexhaustion nameexhaustion marked this pull request as draft May 26, 2025 07:49
@nameexhaustion nameexhaustion marked this pull request as ready for review May 26, 2025 07:49
@ritchie46 ritchie46 merged commit 15524f4 into pola-rs:main May 26, 2025
29 checks passed

Labels

enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars

Development

Successfully merging this pull request may close these issues.

Error not raised for extra columns outside schema with combination of select(all()) and allow_missing_columns=True

4 participants