FilePartition and PartitionedFile for scanning flexibility #932
Conversation
cc @houqp @alamb @andygrove for review
houqp left a comment
Left a couple of minor comments, the rest looks good to me!
ballista/rust/scheduler/src/lib.rs (Outdated)
.collect(),
schema: Some(parquet_desc.schema().as_ref().into()),
partitions: vec![FilePartitionMetadata {
    filename: vec![path],
I remember we discussed this in the original PR. After taking a second look at the code, I am still not fully following the change here. The old behavior set FilePartitionMetadata.filename to the vector of file paths returned from a directory listing, while the new behavior here always sets filename to a vector with a single entry: the root path of the table.
Shouldn't we use parquet_desc.descriptor.descriptor to build the filename vector here instead?
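A rough sketch of what houqp is suggesting, assuming the nested descriptor holds the per-file entries and that each entry carries the `file_path` field shown elsewhere in this diff (the exact field layout is a guess from the discussion):

```rust
// Hypothetical sketch: build the filename vector from the per-file
// descriptors instead of the table root path. `file_path` is the field
// visible on `PartitionedFile` in this PR; `descriptor.descriptor`
// follows the nesting mentioned in the comment above.
let filenames: Vec<String> = parquet_desc
    .descriptor
    .descriptor
    .iter()
    .map(|partitioned_file| partitioned_file.file_path.clone())
    .collect();
```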
I changed it to a vector of all the files.
However, after searching the project for a while, I found that this method may not actually be used, and it's hard to understand the RPC's intention as well. Perhaps it's deprecated and we should remove it later?
rpc GetFileMetadata (GetFileMetadataParams) returns (GetFileMetadataResult) {}
I had the same question when I was going through the code base yesterday, I noticed it's only mentioned in ballista/docs/architecture.md. @andygrove do you know if this RPC method is still needed?
houqp left a comment
Great refactor @yjshen!
alamb left a comment
This is looking great @yjshen -- thank you for persevering. I think this PR looks great other than the addition of `filters` to `LogicalPlanBuilder::scan` (see comments on that).
I didn't review the ballista changes, but I assume they are mostly mechanical
Again, thank you so much and sorry for the long review cycle
pub file_path: String,
/// Statistics of the file
pub statistics: Statistics,
// Values of partition columns to be appended to each row
I think in order to take full advantage of partition values (which might span multiple columns, for example), more information about the partitioning scheme will be needed (e.g. what expression is used to generate the partitioning values). Adding partitioning support to DataFusion's planning / execution is probably worth its own discussion.
(that is to say I agree with postponing adding anything partition specific)
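For illustration only, here is one hypothetical shape such partitioning metadata could take once that discussion happens; none of these names exist in this PR:

```rust
use arrow::datatypes::SchemaRef;
use datafusion::logical_plan::Expr;

/// Hypothetical: metadata a future partitioned-table design might carry
/// so the scan can reconstruct partition column values for each file.
pub struct PartitionSpec {
    /// Expressions used to generate the partitioning values
    /// (possibly spanning multiple columns)
    pub partition_exprs: Vec<Expr>,
    /// Schema of the partition columns appended to each row
    pub partition_schema: SchemaRef,
}
```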
logical_plan::{Column, Expr},
physical_optimizer::pruning::{PruningPredicate, PruningStatistics},
physical_plan::{
    common, DisplayFormatType, ExecutionPlan, Partitioning, SendableRecordBatchStream,
I really like how the statistics and schema related code has been moved out of physical_plan and into datasource
table_name: impl Into<String>,
provider: Arc<dyn TableProvider>,
projection: Option<Vec<usize>>,
filters: Option<Vec<Expr>>,
I think this argument is likely going to be confusing to users and it should be removed.
For example, as a user of LogicalPlanBuilder I would probably assume that the following plan would return only rows where a < 5:
```rust
// Build a plan that looks like it would filter out all rows with `a < 5`
let plan = builder.scan("table", provider, None, Some(vec![col("a").lt(lit(5))]));
```

However, I am pretty sure it could (and often would) return rows with a >= 5. This is because filters added to a TableScan node are optional (in the sense that the provider might not filter rows that do not pass the predicate, but is not required to). Indeed, even for the parquet provider, the filters are only used for row group pruning, which may or may not be able to filter rows.
I think we could solve this with:
- Leave the scan signature alone and rely on the predicate pushdown optimization to push filters appropriately down to the scan (my preference, as it is simpler for users; see the sketch below).
- Rename this argument to something like optional_filters_for_performance and document what it does more carefully. I think it would be challenging to explain, as it might or might not do anything depending on how the data was laid out.
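A minimal sketch of the first option, assuming the LogicalPlanBuilder API of this era (`scan` without a filters argument, plus an explicit filter node that the predicate pushdown optimizer may later push into the scan):

```rust
use std::sync::Arc;
use datafusion::datasource::TableProvider;
use datafusion::error::Result;
use datafusion::logical_plan::{col, lit, LogicalPlan, LogicalPlanBuilder};

fn plan_scan(provider: Arc<dyn TableProvider>) -> Result<LogicalPlan> {
    // No filters are passed to `scan`; the explicit `filter` node below is
    // guaranteed to remove rows with `a >= 5`, while the optimizer remains
    // free to also push the predicate down to the provider for pruning.
    LogicalPlanBuilder::scan("table", provider, None)?
        .filter(col("a").lt(lit(5)))?
        .build()
}
```

This keeps the correctness guarantee ("rows with a >= 5 never appear") in the Filter node, while any pushdown into the scan is purely an optimization.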
Removed it, and kept the filters un-deserialized for Ballista as before.
houqp left a comment
LGTM!
alamb left a comment
Thanks @yjshen!
Thank you @yjshen for being patient and driving through this big change step by step :)
Which issue does this PR close?
Closes #946.
Rationale for this change
Currently we rely on `ParquetExec::try_from_path` to get planning-related metadata.

What changes are included in this PR?
PartitionedFile -> a single file (for the moment) or part of a file (later: part of the row groups or rows); we may even extend this to include partition values and a partition schema to support partitioned tables, e.g.:
/path/to/table/root/p_date=20210813/p_hour=1200/xxxxx.parquet
FilePartition -> the basic unit for parallel processing; each task is responsible for processing one FilePartition, which is composed of several PartitionedFiles (see the condensed sketch after this list).
Update ballista protocol as well as the serdes to use the new abstraction.
Telling apart the planning-related code from ParquetExec.
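A condensed sketch of the two new abstractions, assembled from the fields visible in this PR's diff; the FilePartition field names and the import path for Statistics are assumptions:

```rust
use datafusion::datasource::datasource::Statistics;

/// A single file (for now) or, later, part of a file; may eventually be
/// extended with partition values / schema for partitioned tables.
pub struct PartitionedFile {
    /// Path of the file
    pub file_path: String,
    /// Statistics of the file
    pub statistics: Statistics,
    // Values of partition columns to be appended to each row (future work)
}

/// The basic unit of parallel processing: each task is responsible for
/// one `FilePartition`, composed of several `PartitionedFile`s.
pub struct FilePartition {
    /// Index of this partition among all partitions of the scan (assumed)
    pub index: usize,
    /// The files this task will scan
    pub files: Vec<PartitionedFile>,
}
```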
Are there any user-facing changes?
No.