-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Let FileScanConfig own a list of ProjectionExprs
#18253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Let FileScanConfig own a list of ProjectionExprs
#18253
Conversation
01f0446 to
248d1b9
Compare
248d1b9 to
9e4344a
Compare
adriangb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally looks great @friendlymatthew ! Let's have one more reviewer go through it though.
| /// methods to manipulate and analyze the projection as a whole. | ||
| #[derive(Debug, Clone)] | ||
| pub struct Projection { | ||
| pub struct ProjectionExprs { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think this name is better and it's not a breaking change since Projection was introduced after the last release. To make sure we get this through as fast as possible (in particular before it does become a breaking change) could you make this it's own PR?
| let projection = projection_indices.as_ref().map(|indices| { | ||
| ProjectionExprs::from_indices(indices, table_schema.table_schema()) | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this is because we are not changing FileScanConfigBuilder. I think it makes sense to change it at some point but it's not necessary yet and we can keep things as backwards compatible as possible for as long as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we deprecate with_projection and add a new function with_projection_indices to help downstream users prepare for the change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. The main place where this API is used by actual users is in TableProvider::scan_with_args. The projection pushed down into TableProvider is a Vec<usize> so that's all you have. I'd argue that we should change that so that the full expression tree gets pushed down there as well. But for now we can either:
- Put a convenience method on
FileScanConfigBuilderlike you say. - Make users call
ProjectionExprs::from_indices(&projection, &schema)or something like that which seems like it isn't too bad and keeps us from adding a method we know we probably want to remove later.
| Some(proj) => proj.clone(), | ||
| Some(proj) => proj.ordered_column_indices(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I similarly think long term we want to get rid of projection_indices and some of these other helper functions (a lot of it drops away once we have the projection expressions because we can re-use the same machinery that ProjectionExec uses) but for now we do what we can to keep it backwards compatible and minimize churn.
| } | ||
| } | ||
|
|
||
| pub fn from_indices(indices: &[usize], schema: &SchemaRef) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add documentation, including handling of duplicates, ordering, etc.
| .collect_vec() | ||
| } | ||
|
|
||
| /// Extract the ordered column indices for a column-only projection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add more detailed documentation, e.g. what happens if the projection contains non-column expressions, what if the column expressions are nested within other expressions, etc.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me too -- thank you @friendlymatthew
Let's address @adriangb 's comments and get this merged in
| /// Each expression in the projection can reference columns from both the file | ||
| /// schema and table partition columns. If `None`, all columns from the table | ||
| /// schema are projected. | ||
| pub projection: Option<ProjectionExprs>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is a pub field, I think it qualifies as an API change so I marked the PR as such
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, I'll rename the field to indicate it holds a list of projections
| let projection = projection_indices.as_ref().map(|indices| { | ||
| ProjectionExprs::from_indices(indices, table_schema.table_schema()) | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we deprecate with_projection and add a new function with_projection_indices to help downstream users prepare for the change?
207316d to
f5091c3
Compare
eeb2a7f to
6d6776a
Compare
e93cdb0 to
8eedb93
Compare
bc9e9f8 to
d578348
Compare
| /// # Deprecated | ||
| /// Use [`Self::with_projection_indices`] instead. This method will be removed in a future release. | ||
| #[deprecated(since = "51.0.0", note = "Use with_projection_indices instead")] | ||
| pub fn with_projection(self, indices: Option<Vec<usize>>) -> Self { | ||
| self.with_projection_indices(indices) | ||
| } | ||
|
|
||
| /// Set the columns on which to project the data using column indices. | ||
| /// | ||
| /// Indexes that are higher than the number of columns of `file_schema` refer to `table_partition_cols`. | ||
| pub fn with_projection_indices(mut self, indices: Option<Vec<usize>>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the idea here to "reserve" the with_projection API to accept a &ProjectionExprs later on? Or could you elaborate on the long term vision behind the name change / deprecation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the idea here to "reserve" the with_projection API to accept a &ProjectionExprs later on?
No, I'm planning to add with_projection_exprs. I just haven't wired it up yet-- I'm working through this incrementally to fix the test breakages first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay makes sense, followup it is then!
|
🚀 |
## Which issue does this PR close? - Related to apache#14993 ## Rationale for this change To enable expression pushdown to file sources, we need to plumb expressions through the `FileScanConfig` layer. Currently, `FileScanConfig` only tracks column indices for projection, which limits us to simple and naive column selection. This PR begins expression pushdown implementation by having `FileScanConfig` own a list of `ProjectionExpr`s, instead of column indices. This allows file sources to eventually receive and optimize based on the actual expressions being projected. ## Notes about this PR - The first commit is based off of apache#18231 - To avoid a super large diff and a harder review, I've decided to break (apache#14993) into 2 tasks: - Have the `DataSource` (`FileScanConfig`) actually hold projection expressions (this PR) - Flow the projection expressions from `DataSourceExec` all the way to the `FileSource` --------- Co-authored-by: Adrian Garcia Badaracco <[email protected]>
## Which issue does this PR close? - Related to apache#14993 ## Rationale for this change To enable expression pushdown to file sources, we need to plumb expressions through the `FileScanConfig` layer. Currently, `FileScanConfig` only tracks column indices for projection, which limits us to simple and naive column selection. This PR begins expression pushdown implementation by having `FileScanConfig` own a list of `ProjectionExpr`s, instead of column indices. This allows file sources to eventually receive and optimize based on the actual expressions being projected. ## Notes about this PR - The first commit is based off of apache#18231 - To avoid a super large diff and a harder review, I've decided to break (apache#14993) into 2 tasks: - Have the `DataSource` (`FileScanConfig`) actually hold projection expressions (this PR) - Flow the projection expressions from `DataSourceExec` all the way to the `FileSource` --------- Co-authored-by: Adrian Garcia Badaracco <[email protected]>
Which issue does this PR close?
TableProviders#14993Rationale for this change
To enable expression pushdown to file sources, we need to plumb expressions through the
FileScanConfiglayer. Currently,FileScanConfigonly tracks column indices for projection, which limits us to simple and naive column selection.This PR begins expression pushdown implementation by having
FileScanConfigown a list ofProjectionExprs, instead of column indices. This allows file sources to eventually receive and optimize based on the actual expressions being projected.Notes about this PR
TableProviders#14993) into 2 tasks:DataSource(FileScanConfig) actually hold projection expressions (this PR)DataSourceExecall the way to theFileSource