FilePartition and PartitionedFile for scanning flexibility #932
Conversation
cc @houqp @alamb @andygrove for review
houqp left a comment
Left a couple of minor comments, the rest looks good to me!
ballista/rust/scheduler/src/lib.rs (Outdated)
.collect(),
schema: Some(parquet_desc.schema().as_ref().into()),
partitions: vec![FilePartitionMetadata {
    filename: vec![path],
I remember we discussed this in the original PR. After taking a second look at the code, I am still not fully following the change here. The old behavior set FilePartitionMetadata.filename to the vector of file paths returned from a directory listing, while the new behavior here always sets filename to a vector with a single entry: the root path of the table.
Shouldn't we use parquet_desc.descriptor.descriptor to build the filename vector here instead?
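A rough sketch of what houqp is suggesting, assuming the nested descriptor holds the per-file entries and that each entry carries the `file_path` field shown elsewhere in this diff (the exact field layout is a guess from the discussion):

```rust
// Hypothetical sketch: build the filename vector from the per-file
// descriptors instead of the table root path. `file_path` is the field
// visible on `PartitionedFile` in this PR; `descriptor.descriptor`
// follows the nesting mentioned in the comment above.
let filenames: Vec<String> = parquet_desc
    .descriptor
    .descriptor
    .iter()
    .map(|partitioned_file| partitioned_file.file_path.clone())
    .collect();
```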
I changed it to a vector of all the files.
However, after searching the project for a while, I found that this method may not actually be used, and it's hard to understand the RPC's intention as well. Perhaps it's deprecated and we should remove it later?
rpc GetFileMetadata (GetFileMetadataParams) returns (GetFileMetadataResult) {}
I had the same question when I was going through the code base yesterday, I noticed it's only mentioned in ballista/docs/architecture.md. @andygrove do you know if this RPC method is still needed?
houqp left a comment
Great refactor @yjshen!
alamb left a comment
This is looking great @yjshen -- thank you for persevering. I think this PR looks great other than the addition of `filters` to `LogicalPlanBuilder::scan` (see comments on that).
I didn't review the ballista changes, but I assume they are mostly mechanical
Again, thank you so much and sorry for the long review cycle
pub file_path: String,
/// Statistics of the file
pub statistics: Statistics,
// Values of partition columns to be appended to each row
I think in order to take full advantage of partition values (which might span multiple columns, for example), more information about the partitioning scheme will be needed (e.g. what expression is used to generate the partitioning values). Adding partitioning support to DataFusion's planning / execution is probably worth its own discussion.
(that is to say I agree with postponing adding anything partition specific)
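For illustration only, here is one hypothetical shape such partitioning metadata could take once that discussion happens; none of these names exist in this PR:

```rust
use arrow::datatypes::SchemaRef;
use datafusion::logical_plan::Expr;

/// Hypothetical: metadata a future partitioned-table design might carry
/// so the scan can reconstruct partition column values for each file.
pub struct PartitionSpec {
    /// Expressions used to generate the partitioning values
    /// (possibly spanning multiple columns)
    pub partition_exprs: Vec<Expr>,
    /// Schema of the partition columns appended to each row
    pub partition_schema: SchemaRef,
}
```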
logical_plan::{Column, Expr},
physical_optimizer::pruning::{PruningPredicate, PruningStatistics},
physical_plan::{
    common, DisplayFormatType, ExecutionPlan, Partitioning, SendableRecordBatchStream,
I really like how the statistics and schema related code has been moved out of physical_plan and into datasource
table_name: impl Into<String>,
provider: Arc<dyn TableProvider>,
projection: Option<Vec<usize>>,
filters: Option<Vec<Expr>>,
I think this argument is likely going to be confusing to users and it should be removed.
For example, as a user of LogicalPlanBuilder I would probably assume that the following plan would return only rows where a < 5:
```rust
// Build a plan that looks like it would filter out all rows with `a < 5`
let plan = builder.scan("table", provider, None, Some(vec![col("a").lt(lit(5))]));
```

However, I am pretty sure it could (and often would) return rows with a >= 5. This is because filters added to a TableScan node are optional (in the sense that the provider might not filter rows that do not pass the predicate, but is not required to). Indeed, even for the parquet provider, the filters are only used for row group pruning, which may or may not be able to filter rows.
I think we could solve this with:
- Leave the scan signature alone and rely on the predicate pushdown optimization to push filters appropriately down to the scan (my preference, as it is simpler for users; see the sketch below).
- Rename this argument to something like optional_filters_for_performance and document what it does more carefully. I think it would be challenging to explain, as it might or might not do anything depending on how the data was laid out.
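A minimal sketch of the first option, assuming the LogicalPlanBuilder API of this era (`scan` without a filters argument, plus an explicit filter node that the predicate pushdown optimizer may later push into the scan):

```rust
use std::sync::Arc;
use datafusion::datasource::TableProvider;
use datafusion::error::Result;
use datafusion::logical_plan::{col, lit, LogicalPlan, LogicalPlanBuilder};

fn plan_scan(provider: Arc<dyn TableProvider>) -> Result<LogicalPlan> {
    // No filters are passed to `scan`; the explicit `filter` node below is
    // guaranteed to remove rows with `a >= 5`, while the optimizer remains
    // free to also push the predicate down to the provider for pruning.
    LogicalPlanBuilder::scan("table", provider, None)?
        .filter(col("a").lt(lit(5)))?
        .build()
}
```

This keeps the correctness guarantee ("rows with a >= 5 never appear") in the Filter node, while any pushdown into the scan is purely an optimization.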
Removed it, and kept the filters un-deserialized for Ballista as before.
houqp left a comment
LGTM!
alamb left a comment
Thanks @yjshen!
Thank you @yjshen for being patient and driving through this big change step by step :)
Which issue does this PR close?
Closes #946.
Rationale for this change
Currently we rely on `ParquetExec::try_from_path` to get planning-related metadata.

What changes are included in this PR?
PartitionedFile -> a single file (for the moment) or part of a file (later: part of the row groups or rows); we may even extend this to include partition values and a partition schema to support partitioned tables, e.g.:
/path/to/table/root/p_date=20210813/p_hour=1200/xxxxx.parquet
FilePartition -> the basic unit for parallel processing; each task is responsible for processing one FilePartition, which is composed of several PartitionedFiles (see the condensed sketch after this list).
Update ballista protocol as well as the serdes to use the new abstraction.
Telling apart the planning-related code from ParquetExec.
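A condensed sketch of the two new abstractions, assembled from the fields visible in this PR's diff; the FilePartition field names and the import path for Statistics are assumptions:

```rust
use datafusion::datasource::datasource::Statistics;

/// A single file (for now) or, later, part of a file; may eventually be
/// extended with partition values / schema for partitioned tables.
pub struct PartitionedFile {
    /// Path of the file
    pub file_path: String,
    /// Statistics of the file
    pub statistics: Statistics,
    // Values of partition columns to be appended to each row (future work)
}

/// The basic unit of parallel processing: each task is responsible for
/// one `FilePartition`, composed of several `PartitionedFile`s.
pub struct FilePartition {
    /// Index of this partition among all partitions of the scan (assumed)
    pub index: usize,
    /// The files this task will scan
    pub files: Vec<PartitionedFile>,
}
```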
Are there any user-facing changes?
No.