@friendlymatthew
Contributor

Which issue does this PR close?

Rationale for this change

To enable expression pushdown to file sources, we need to plumb expressions through the FileScanConfig layer. Currently, FileScanConfig only tracks column indices for projection, which limits us to simple and naive column selection.

This PR begins expression pushdown implementation by having FileScanConfig own a list of ProjectionExprs, instead of column indices. This allows file sources to eventually receive and optimize based on the actual expressions being projected.

Notes about this PR

@github-actions github-actions bot added the physical-expr, core, substrait, proto, datasource, and physical-plan labels Oct 23, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/pushdown-propogation branch 3 times, most recently from 01f0446 to 248d1b9 Compare October 23, 2025 20:49
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/pushdown-propogation branch from 248d1b9 to 9e4344a Compare October 23, 2025 23:14
Contributor

@adriangb left a comment

This generally looks great @friendlymatthew ! Let's have one more reviewer go through it though.

/// methods to manipulate and analyze the projection as a whole.
#[derive(Debug, Clone)]
- pub struct Projection {
+ pub struct ProjectionExprs {
Contributor

I do think this name is better and it's not a breaking change since Projection was introduced after the last release. To make sure we get this through as fast as possible (in particular before it does become a breaking change) could you make this its own PR?

Comment on lines 467 to 469
let projection = projection_indices.as_ref().map(|indices| {
    ProjectionExprs::from_indices(indices, table_schema.table_schema())
});
Contributor

Note: this is because we are not changing FileScanConfigBuilder. I think it makes sense to change it at some point but it's not necessary yet and we can keep things as backwards compatible as possible for as long as possible.

Contributor

Should we deprecate with_projection and add a new function with_projection_indices to help downstream users prepare for the change?

Contributor

That's a good question. The main place where this API is used by actual users is in TableProvider::scan_with_args. The projection pushed down into TableProvider is a Vec<usize> so that's all you have. I'd argue that we should change that so that the full expression tree gets pushed down there as well. But for now we can either:

  1. Put a convenience method on FileScanConfigBuilder like you say.
  2. Make users call ProjectionExprs::from_indices(&projection, &schema) or something similar, which doesn't seem too bad and keeps us from adding a method we know we will probably want to remove later.
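
To make option 2 concrete, here is a minimal std-only sketch of what converting a `Vec<usize>` projection into column expressions could look like. `Field`, `ColumnExpr`, and this `ProjectionExprs` are simplified stand-ins invented for illustration, not DataFusion's actual types, and keeping the caller's order and duplicates is an assumption of the sketch rather than confirmed behavior of the real `from_indices`.

```rust
// Simplified stand-ins for illustration; these are NOT DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
}

#[derive(Debug, Clone, PartialEq)]
struct ColumnExpr {
    name: String,
    index: usize,
}

#[derive(Debug, Clone, PartialEq)]
struct ProjectionExprs {
    exprs: Vec<ColumnExpr>,
}

impl ProjectionExprs {
    /// Build one column expression per index.
    /// Assumption of this sketch: the caller's order and duplicates are kept.
    /// Panics if an index is out of bounds for `schema`.
    fn from_indices(indices: &[usize], schema: &[Field]) -> Self {
        let exprs = indices
            .iter()
            .map(|&i| ColumnExpr {
                name: schema[i].name.clone(),
                index: i,
            })
            .collect();
        Self { exprs }
    }
}

fn main() {
    let schema = vec![
        Field { name: "a".to_string() },
        Field { name: "b".to_string() },
        Field { name: "c".to_string() },
    ];
    // What a table provider receives today: plain column indices.
    let projection: Vec<usize> = vec![2, 0];
    let exprs = ProjectionExprs::from_indices(&projection, &schema);
    for e in &exprs.exprs {
        println!("{}@{}", e.name, e.index);
    }
}
```

The point of the sketch is that the conversion is a one-liner at the call site, which is the argument for not adding a dedicated builder method.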

Comment on lines 700 to 714
- Some(proj) => proj.clone(),
+ Some(proj) => proj.ordered_column_indices(),
Contributor

I similarly think long term we want to get rid of projection_indices and some of these other helper functions (a lot of it drops away once we have the projection expressions because we can re-use the same machinery that ProjectionExec uses) but for now we do what we can to keep it backwards compatible and minimize churn.

}
}

pub fn from_indices(indices: &[usize], schema: &SchemaRef) -> Self {
Contributor

Please add documentation, including handling of duplicates, ordering, etc.

.collect_vec()
}

/// Extract the ordered column indices for a column-only projection.
Contributor

Please add more detailed documentation, e.g. what happens if the projection contains non-column expressions, what if the column expressions are nested within other expressions, etc.
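
The questions raised here (non-column expressions, columns nested inside other expressions) correspond to two different possible semantics. The sketch below contrasts them on a hypothetical, simplified `Expr` enum. It is not DataFusion's `PhysicalExpr`, the function names are invented, and it does not claim to show which behavior the PR's `ordered_column_indices` actually implements.

```rust
use std::collections::BTreeSet;

// Hypothetical, simplified expression tree; not DataFusion's `PhysicalExpr`.
#[derive(Debug, Clone)]
enum Expr {
    Column(usize),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Strict reading: succeed only when every projected expression is a bare
/// column. The projection's order and duplicates are preserved.
fn column_only_indices(projection: &[Expr]) -> Option<Vec<usize>> {
    projection
        .iter()
        .map(|e| match e {
            Expr::Column(i) => Some(*i),
            _ => None,
        })
        .collect()
}

/// Permissive reading: every column referenced anywhere in the projection,
/// including nested ones, sorted and deduplicated.
fn referenced_columns(projection: &[Expr]) -> Vec<usize> {
    fn visit(e: &Expr, out: &mut BTreeSet<usize>) {
        match e {
            Expr::Column(i) => {
                out.insert(*i);
            }
            Expr::Literal(_) => {}
            Expr::Add(l, r) => {
                visit(l, out);
                visit(r, out);
            }
        }
    }
    let mut out = BTreeSet::new();
    for e in projection {
        visit(e, &mut out);
    }
    out.into_iter().collect()
}

fn main() {
    let plain = vec![Expr::Column(2), Expr::Column(0)];
    let nested = vec![
        Expr::Column(2),
        Expr::Add(Box::new(Expr::Column(0)), Box::new(Expr::Literal(1))),
    ];
    println!("{:?}", column_only_indices(&plain));
    println!("{:?}", column_only_indices(&nested));
    println!("{:?}", referenced_columns(&nested));
}
```

Whichever reading the real method follows, the documentation should say so explicitly, since the two return different results for the same nested projection.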

@alamb alamb added the api change label Oct 27, 2025
Contributor

@alamb left a comment

Looks good to me too -- thank you @friendlymatthew

Let's address @adriangb 's comments and get this merged in

/// Each expression in the projection can reference columns from both the file
/// schema and table partition columns. If `None`, all columns from the table
/// schema are projected.
pub projection: Option<ProjectionExprs>,
Contributor

Since this is a pub field, I think it qualifies as an API change, so I marked the PR as such.

Contributor Author

In that case, I'll rename the field to indicate it holds a list of projections
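
For readers unfamiliar with the field's semantics, the doc comment in this thread says the projection may reference both file columns and partition columns, and that `None` means every table-schema column is projected. A small std-only sketch of that rule for the index-based case follows. `projected_names` is a hypothetical helper, and it assumes the table schema is the file columns followed by the partition columns.

```rust
/// Hypothetical helper: resolve the projected column names for a scan.
/// Assumption: the table schema is the file columns followed by the
/// partition columns, so indices past the file columns hit partition columns.
/// Panics if an index is out of bounds for the combined schema.
fn projected_names(
    file_cols: &[&str],
    partition_cols: &[&str],
    projection: Option<&[usize]>,
) -> Vec<String> {
    let mut table: Vec<String> = Vec::new();
    table.extend(file_cols.iter().map(|s| s.to_string()));
    table.extend(partition_cols.iter().map(|s| s.to_string()));

    match projection {
        // No projection: every table-schema column is produced.
        None => table,
        // Otherwise pick columns in the order the indices are given.
        Some(indices) => indices.iter().map(|&i| table[i].clone()).collect(),
    }
}

fn main() {
    let file_cols = ["id", "value"];
    let partition_cols = ["year"];
    let indices: Vec<usize> = vec![2, 0];

    println!("{:?}", projected_names(&file_cols, &partition_cols, None));
    println!(
        "{:?}",
        projected_names(&file_cols, &partition_cols, Some(indices.as_slice()))
    );
}
```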

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/pushdown-propogation branch from 207316d to f5091c3 Compare October 27, 2025 15:40
@github-actions github-actions bot added the catalog label Oct 27, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/pushdown-propogation branch from eeb2a7f to 6d6776a Compare October 27, 2025 16:44
@github-actions github-actions bot added the documentation label Oct 27, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/pushdown-propogation branch from e93cdb0 to 8eedb93 Compare October 27, 2025 17:02
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/pushdown-propogation branch from bc9e9f8 to d578348 Compare October 27, 2025 17:07
Comment on lines +332 to +342
/// # Deprecated
/// Use [`Self::with_projection_indices`] instead. This method will be removed in a future release.
#[deprecated(since = "51.0.0", note = "Use with_projection_indices instead")]
pub fn with_projection(self, indices: Option<Vec<usize>>) -> Self {
self.with_projection_indices(indices)
}

/// Set the columns on which to project the data using column indices.
///
/// Indexes that are higher than the number of columns of `file_schema` refer to `table_partition_cols`.
pub fn with_projection_indices(mut self, indices: Option<Vec<usize>>) -> Self {
Contributor

Is the idea here to "reserve" the with_projection API to accept a &ProjectionExprs later on? Or could you elaborate on the long term vision behind the name change / deprecation?

Contributor Author

> Is the idea here to "reserve" the with_projection API to accept a &ProjectionExprs later on?

No, I'm planning to add with_projection_exprs. I just haven't wired it up yet; I'm working through this incrementally to fix the test breakages first.

Contributor

Okay makes sense, followup it is then!
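
The deprecation pattern in this thread, where the old builder method forwards to the newly named one, can be sketched in isolation as follows. `ScanConfigBuilder` is a hypothetical stand-in for `FileScanConfigBuilder` with a single field, so this shows only the forwarding shape, not the real builder.

```rust
// Hypothetical stand-in for the real builder; only the projection is modeled.
#[derive(Debug, Default, Clone, PartialEq)]
struct ScanConfigBuilder {
    projection_indices: Option<Vec<usize>>,
}

impl ScanConfigBuilder {
    /// Old name, kept so downstream code keeps compiling (with a warning).
    #[deprecated(note = "Use with_projection_indices instead")]
    fn with_projection(self, indices: Option<Vec<usize>>) -> Self {
        self.with_projection_indices(indices)
    }

    /// New name that states what kind of projection is being set.
    fn with_projection_indices(mut self, indices: Option<Vec<usize>>) -> Self {
        self.projection_indices = indices;
        self
    }
}

fn main() {
    // Existing callers still work; the lint is silenced here on purpose.
    #[allow(deprecated)]
    let old = ScanConfigBuilder::default().with_projection(Some(vec![0, 2]));
    let new = ScanConfigBuilder::default().with_projection_indices(Some(vec![0, 2]));

    assert_eq!(old, new);
    println!("{:?}", new);
}
```

Because the old method only delegates, both names produce the same configuration, and a later `with_projection_exprs` can be added without touching either of them.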

@adriangb adriangb added this pull request to the merge queue Oct 27, 2025
Merged via the queue into apache:main with commit 6eb8d45 Oct 27, 2025
33 checks passed
@adriangb adriangb deleted the friendlymatthew/pushdown-propogation branch October 27, 2025 18:48
@alamb
Contributor

alamb commented Oct 27, 2025

🚀

tobixdev pushed a commit to tobixdev/datafusion that referenced this pull request Nov 2, 2025
## Which issue does this PR close?

- Related to apache#14993

## Rationale for this change

To enable expression pushdown to file sources, we need to plumb
expressions through the `FileScanConfig` layer. Currently,
`FileScanConfig` only tracks column indices for projection, which limits
us to simple and naive column selection.

This PR begins expression pushdown implementation by having
`FileScanConfig` own a list of `ProjectionExpr`s, instead of column
indices. This allows file sources to eventually receive and optimize
based on the actual expressions being projected.


## Notes about this PR
- The first commit is based off of
apache#18231
- To avoid a super large diff and a harder review, I've decided to break
  (apache#14993) into 2 tasks:
  - Have the `DataSource` (`FileScanConfig`) actually hold projection
    expressions (this PR)
  - Flow the projection expressions from `DataSourceExec` all the way to
    the `FileSource`
---------

Co-authored-by: Adrian Garcia Badaracco <[email protected]>
codetyri0n pushed a commit to codetyri0n/datafusion that referenced this pull request Nov 11, 2025
Labels

- api change: Changes the API exposed to users of the crate
- catalog: Related to the catalog crate
- core: Core DataFusion crate
- datasource: Changes to the datasource crate
- documentation: Improvements or additions to documentation
- physical-expr: Changes to the physical-expr crates
- physical-plan: Changes to the physical-plan crate
- proto: Related to proto crate
- substrait: Changes to the substrait crate
