[SPARK-33673][SQL] Avoid push down partition filters to ParquetScan for DataSourceV2 #30652
Conversation
cc @wangyum

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #132381 has finished for PR 30652 at commit

Thanks for fixing this so quickly.
```scala
val pushDownInFilterThreshold = sqlConf.parquetFilterPushDownInFilterThreshold
val isCaseSensitive = sqlConf.caseSensitiveAnalysis
val parquetSchema =
  new SparkToParquetSchemaConverter(sparkSession.sessionState.conf).convert(schema)
```
What if a partition column is also present in the file? I guess this might be why we push down partition filters.
This is a good question, but it seems that the filters pushed down in DataSource V1 do not contain filters related to partition columns either.
The `dataFilters` used to construct `FileSourceScanExec` and passed to `ParquetFileFormat.buildReaderWithPartitionValues` to build the pushed filters also have partition filters filtered out, am I right?
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
Lines 184 to 193 in e4d1c10
```scala
// Partition keys are not available in the statistics of the files.
// `dataColumns` might have partition columns, we need to filter them out.
val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionColumns.contains)
val dataFilters = normalizedFiltersWithoutSubqueries.flatMap { f =>
  if (f.references.intersect(partitionSet).nonEmpty) {
    extractPredicatesWithinOutputSet(f, AttributeSet(dataColumnsWithoutPartitionCols))
  } else {
    Some(f)
  }
}
```
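The DSv1 logic quoted above can be sketched without any Spark dependencies. In this sketch, `SimpleFilter`, `extractDataOnlyPredicate`, and the column sets are illustrative stand-ins for Spark's real expression types, not its actual API:

```scala
object DataFilterSketch {
  // A predicate modeled only by its text and the column names it references.
  final case class SimpleFilter(expr: String, references: Set[String])

  // Stand-in for `extractPredicatesWithinOutputSet`: keep a filter only if
  // every column it references is a plain data column.
  def extractDataOnlyPredicate(f: SimpleFilter, dataCols: Set[String]): Option[SimpleFilter] =
    if (f.references.subsetOf(dataCols)) Some(f) else None

  // Mirror of the quoted flatMap: a filter touching a partition column is
  // reduced to its data-column-only part (dropped entirely in this sketch),
  // so partition predicates never reach the pushed data filters.
  def dataFilters(
      filters: Seq[SimpleFilter],
      partitionCols: Set[String],
      dataCols: Set[String]): Seq[SimpleFilter] =
    filters.flatMap { f =>
      if (f.references.intersect(partitionCols).nonEmpty) {
        extractDataOnlyPredicate(f, dataCols -- partitionCols)
      } else {
        Some(f)
      }
    }
}
```

For example, with partition column `p1`, a filter on `value` survives while `p1 = 1` is dropped from the data filters (it is still used for partition pruning elsewhere).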
Maybe another way is to refer to Parquet's `FileMetaData#getSchema` to determine which filters should be pushed down.
Maybe this is a correct fix. I have a PR to fix the case where a partition column is present in the file:
#30670
Maybe we should use `readDataSchema()` instead of `dataSchema`.
Addressed in b9f8eb2: use `readDataSchema()` instead of `dataSchema`.
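Why does building the pushed filters from the read data schema help? Since `readDataSchema()` excludes partition columns, any predicate referencing a partition column simply fails to resolve against it and is not a pushdown candidate. A hedged, dependency-free sketch of that idea (names here are illustrative, not Spark's API):

```scala
object PushdownSketch {
  // A predicate modeled by its text and the columns it references.
  final case class Pred(expr: String, references: Set[String])

  // Only predicates whose every referenced column exists in the given
  // schema are candidates for pushdown to the file reader.
  def pushableFilters(filters: Seq[Pred], schemaCols: Set[String]): Seq[Pred] =
    filters.filter(_.references.subsetOf(schemaCols))
}
```

With the full schema (data plus partition columns), a filter like `p1 = 1` would qualify for pushdown; with only the read data schema, it is excluded, which matches the fix in this PR.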
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #132402 has finished for PR 30652 at commit

retest this please.

Test build #132424 has finished for PR 30652 at commit

Kubernetes integration test starting

Kubernetes integration test status success

Local test

Test build #132456 has finished for PR 30652 at commit

Waiting for SPARK-33705 to fix 3 failed UTs in the thriftserver module
gengliangwang left a comment:
LGTM
Thanks all. Merged to master.

Thanks for your review @wangyum @viirya @cloud-fan @gengliangwang ~

Test build #132749 has finished for PR 30652 at commit
### What changes were proposed in this pull request?
Do not push down partition filters to `ORCScan` for DSv2.

### Why are the changes needed?
Seems to me that the partition filter is only used for partition pruning and shouldn't be pushed down to `ORCScan`. We don't push down partition filters to ORCScan in DSv1:

```
== Physical Plan ==
*(1) Filter (isnotnull(value#19) AND NOT (value#19 = a))
+- *(1) ColumnarToRow
   +- FileScan orc [value#19,p1#20,p2#21] Batched: true, DataFilters: [isnotnull(value#19), NOT (value#19 = a)], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/pt/_5f4sxy56x70dv9zpz032f0m0000gn/T/spark-c1..., PartitionFilters: [isnotnull(p1#20), isnotnull(p2#21), (p1#20 = 1), (p2#21 = 2)], PushedFilters: [IsNotNull(value), Not(EqualTo(value,a))], ReadSchema: struct<value:string>
```

Also, we don't push down partition filters for Parquet in DSv2. #30652

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites

Closes #33680 from huaxingao/orc_filter.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
As described in SPARK-33673, some test suites in `ParquetV2SchemaPruningSuite` will fail when `parquet.version` is set to 1.11.1, because Parquet returns empty results for non-existent columns since PARQUET-1765. This PR changes `ParquetScanBuilder` to build `pushedParquetFilters` from `readDataSchema()` instead of `schema`, so that partition filters are not pushed down to `ParquetScan` for `DataSourceV2`.
Why are the changes needed?
Prepare for upgrading to Parquet 1.11.1.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Pass the Jenkins or GitHub Actions checks.
Manual test as follows:
Before
After
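The net effect of this change can be sketched as a split of the query's predicates into partition filters (used for pruning) and pushed filters (handed to the reader), matching the `PartitionFilters` / `PushedFilters` lists in the DSv1 plan shown earlier. `Pred` and `split` below are hypothetical names for illustration, not Spark's API:

```scala
object ScanFilterSplit {
  // A predicate modeled by its text and the columns it references.
  final case class Pred(expr: String, references: Set[String])

  // Predicates touching any partition column go to PartitionFilters;
  // the rest are candidates for PushedFilters.
  def split(preds: Seq[Pred], partitionCols: Set[String]): (Seq[Pred], Seq[Pred]) =
    preds.partition(_.references.intersect(partitionCols).nonEmpty)
}
```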