Conversation

@LuciferYang (Contributor) commented Dec 7, 2020

What changes were proposed in this pull request?

As described in SPARK-33673, some tests in ParquetV2SchemaPruningSuite fail when parquet.version is set to 1.11.1, because since PARQUET-1765 Parquet returns empty results for non-existent columns.

This PR changes ParquetScanBuilder to build pushedParquetFilters from readDataSchema() instead of schema, so that partition filters are no longer pushed down to ParquetScan for DataSource V2.
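
For reference, a minimal sketch of the idea (not the exact diff; it assumes the surrounding members of ParquetScanBuilder such as sparkSession and readDataSchema()):

```scala
// Sketch only: inside ParquetScanBuilder.
// Before, the full `schema` (which may include partition columns) was converted:
//   new SparkToParquetSchemaConverter(sparkSession.sessionState.conf).convert(schema)
// Converting `readDataSchema()` instead keeps partition columns out of the
// Parquet schema, so partition filters can never be translated into
// pushedParquetFilters.
val parquetSchema =
  new SparkToParquetSchemaConverter(sparkSession.sessionState.conf)
    .convert(readDataSchema())
```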

Why are the changes needed?

Prepares for the upgrade to Parquet 1.11.1.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Pass the Jenkins or GitHub Actions checks

  • Manual test as follows:

mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.parquet.ParquetV2SchemaPruningSuite -Dparquet.version=1.11.1 test -pl sql/core -am

Before

Run completed in 3 minutes, 13 seconds.
Total number of tests run: 134
Suites: completed 2, aborted 0
Tests: succeeded 120, failed 14, canceled 0, ignored 0, pending 0
*** 14 TESTS FAILED ***

After

Run completed in 3 minutes, 46 seconds.
Total number of tests run: 134
Suites: completed 2, aborted 0
Tests: succeeded 134, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

@LuciferYang (Contributor Author)

cc @wangyum

@SparkQA commented Dec 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36981/

@github-actions bot added the SQL label Dec 7, 2020
@SparkQA commented Dec 7, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36981/

@SparkQA commented Dec 7, 2020

Test build #132381 has finished for PR 30652 at commit be6cfc8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member) commented Dec 8, 2020

Thanks for fixing this so quickly.

```scala
val pushDownInFilterThreshold = sqlConf.parquetFilterPushDownInFilterThreshold
val isCaseSensitive = sqlConf.caseSensitiveAnalysis
val parquetSchema =
  new SparkToParquetSchemaConverter(sparkSession.sessionState.conf).convert(schema)
```

What if the partition column is in the file? I guess this might be why we push down partition filters.

@LuciferYang (Contributor Author) commented Dec 8, 2020

This is a good question, but it seems that the filters pushed down in DataSource V1 do not contain partition-column filters either.

The dataFilters used to construct FileSourceScanExec and passed to ParquetFileFormat.buildReaderWithPartitionValues to build the pushed filters also have partition filters filtered out, am I right?

```scala
// Partition keys are not available in the statistics of the files.
// `dataColumns` might have partition columns, we need to filter them out.
val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionColumns.contains)
val dataFilters = normalizedFiltersWithoutSubqueries.flatMap { f =>
  if (f.references.intersect(partitionSet).nonEmpty) {
    extractPredicatesWithinOutputSet(f, AttributeSet(dataColumnsWithoutPartitionCols))
  } else {
    Some(f)
  }
}
```
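
For illustration, a small self-contained snippet (the path and data are made up) showing this from the user side in DSv1: the partition predicate ends up under PartitionFilters while only the data predicate appears under PushedFilters:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Write a tiny Parquet table partitioned by `p` (hypothetical path).
Seq(("a", 1), ("b", 2)).toDF("value", "p")
  .write.partitionBy("p").mode("overwrite").parquet("/tmp/spark_33673_demo")

// The printed plan should list `p = 1` under PartitionFilters and only the
// `value` predicate under PushedFilters.
spark.read.parquet("/tmp/spark_33673_demo")
  .where($"p" === 1 && $"value" === "a")
  .explain()
```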

@LuciferYang (Contributor Author) commented Dec 8, 2020

Maybe another way is to refer to Parquet's FileMetaData#getSchema to determine which filters should be pushed down.
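
A rough, purely illustrative sketch of that alternative (the file path is hypothetical, and this ignores case sensitivity and nested fields):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

import scala.collection.JavaConverters._

// Read one file's footer and collect its top-level column names.
val conf = new Configuration()
val reader = ParquetFileReader.open(
  HadoopInputFile.fromPath(new Path("/tmp/part-00000.parquet"), conf))
val fileColumns =
  try reader.getFooter.getFileMetaData.getSchema.getFields.asScala.map(_.getName).toSet
  finally reader.close()

// A filter would be pushed down only if every column it references is present.
def canPushDown(referencedColumns: Set[String]): Boolean =
  referencedColumns.subsetOf(fileColumns)
```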

@wangyum (Member) commented Dec 8, 2020

Maybe this is the correct fix. I have a PR to fix the case where the partition column is in the file:
#30670

@LuciferYang (Contributor Author)

Maybe we should use readDataSchema() instead of dataSchema

@LuciferYang (Contributor Author)

Addressed in b9f8eb2: use readDataSchema() instead of dataSchema.

@SparkQA commented Dec 8, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37002/

@SparkQA commented Dec 8, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37002/

@SparkQA commented Dec 8, 2020

Test build #132402 has finished for PR 30652 at commit fb01c3f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member) commented Dec 8, 2020

retest this please.

@SparkQA commented Dec 8, 2020

Test build #132424 has finished for PR 30652 at commit fb01c3f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 8, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37024/

@SparkQA commented Dec 8, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37024/

@LuciferYang (Contributor Author)

Local test of org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite: all passed.

Discovery starting.
Discovery completed in 906 milliseconds.
Run starting. Expected test count is: 3
HiveThriftHttpServerSuite:
20:03:14.455 WARN org.apache.hive.jdbc.Utils: ***** JDBC param deprecation *****
20:03:14.455 WARN org.apache.hive.jdbc.Utils: The use of hive.server2.transport.mode is deprecated.
20:03:14.455 WARN org.apache.hive.jdbc.Utils: Please use transportMode like so: jdbc:hive2://<host>:<port>/dbName;transportMode=<transport_mode_value>
20:03:14.456 WARN org.apache.hive.jdbc.Utils: ***** JDBC param deprecation *****
20:03:14.456 WARN org.apache.hive.jdbc.Utils: The use of hive.server2.thrift.http.path is deprecated.
20:03:14.460 WARN org.apache.hive.jdbc.Utils: Please use httpPath like so: jdbc:hive2://<host>:<port>/dbName;httpPath=<http_path_value>
20:03:38.415 WARN org.apache.hive.jdbc.Utils: ***** JDBC param deprecation *****
- JDBC query execution
20:03:38.417 WARN org.apache.hive.jdbc.Utils: The use of hive.server2.transport.mode is deprecated.
20:03:38.417 WARN org.apache.hive.jdbc.Utils: Please use transportMode like so: jdbc:hive2://<host>:<port>/dbName;transportMode=<transport_mode_value>
20:03:38.417 WARN org.apache.hive.jdbc.Utils: ***** JDBC param deprecation *****
20:03:38.417 WARN org.apache.hive.jdbc.Utils: The use of hive.server2.thrift.http.path is deprecated.
20:03:38.417 WARN org.apache.hive.jdbc.Utils: Please use httpPath like so: jdbc:hive2://<host>:<port>/dbName;httpPath=<http_path_value>
- Checks Hive version
20:03:38.901 WARN org.apache.hive.jdbc.Utils: ***** JDBC param deprecation *****
20:03:38.901 WARN org.apache.hive.jdbc.Utils: The use of hive.server2.transport.mode is deprecated.
20:03:38.901 WARN org.apache.hive.jdbc.Utils: Please use transportMode like so: jdbc:hive2://<host>:<port>/dbName;transportMode=<transport_mode_value>
20:03:38.901 WARN org.apache.hive.jdbc.Utils: ***** JDBC param deprecation *****
20:03:38.901 WARN org.apache.hive.jdbc.Utils: The use of hive.server2.thrift.http.path is deprecated.
20:03:38.901 WARN org.apache.hive.jdbc.Utils: Please use httpPath like so: jdbc:hive2://<host>:<port>/dbName;httpPath=<http_path_value>
- SPARK-24829 Checks cast as float
Run completed in 52 seconds, 636 milliseconds.
Total number of tests run: 3
Suites: completed 2, aborted 0
Tests: succeeded 3, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

@SparkQA commented Dec 9, 2020

Test build #132456 has finished for PR 30652 at commit b9f8eb2.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LuciferYang (Contributor Author) commented Dec 9, 2020

Waiting for SPARK-33705 to fix the 3 failed UTs in the thriftserver module.

@gengliangwang (Member) left a comment

LGTM

@wangyum closed this in cd0356d Dec 14, 2020
@wangyum (Member) commented Dec 14, 2020

Thanks all. Merged to master.

@LuciferYang (Contributor Author)

Thanks for your review @wangyum @viirya @cloud-fan @gengliangwang ~

@SparkQA commented Dec 14, 2020

Test build #132749 has finished for PR 30652 at commit 13bfda3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait CheckAnalysis extends PredicateHelper with LookupCatalog
  • case class UnresolvedView(
  • case class Decode(params: Seq[Expression], child: Expression) extends RuntimeReplaceable
  • case class StringDecode(bin: Expression, charset: Expression)
  • case class NoopCommand(
  • case class ShowTableExtended(
  • case class DropView(
  • case class RepairTable(child: LogicalPlan) extends Command
  • case class AlterViewSetProperties(
  • case class AlterViewUnsetProperties(
  • case class CacheTable(
  • case class CacheTableAsSelect(
  • trait BaseCacheTableExec extends V2CommandExec
  • case class CacheTableExec(
  • case class CacheTableAsSelectExec(

dongjoon-hyun pushed a commit that referenced this pull request Aug 9, 2021
### What changes were proposed in this pull request?
not push down partition filter to `ORCScan` for DSv2

### Why are the changes needed?
It seems to me that partition filters are only used for partition pruning and shouldn't be pushed down to `ORCScan`. We don't push down partition filters to `ORCScan` in DSv1:
```
== Physical Plan ==
*(1) Filter (isnotnull(value#19) AND NOT (value#19 = a))
+- *(1) ColumnarToRow
   +- FileScan orc [value#19,p1#20,p2#21] Batched: true, DataFilters: [isnotnull(value#19), NOT (value#19 = a)], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/pt/_5f4sxy56x70dv9zpz032f0m0000gn/T/spark-c1..., PartitionFilters: [isnotnull(p1#20), isnotnull(p2#21), (p1#20 = 1), (p2#21 = 2)], PushedFilters: [IsNotNull(value), Not(EqualTo(value,a))], ReadSchema: struct<value:string>
```
Also, we don't push down partition filters for Parquet in DSv2:
#30652

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites

Closes #33680 from huaxingao/orc_filter.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Aug 9, 2021

Same commit message as above (Closes #33680 from huaxingao/orc_filter); cherry picked from commit b04330c.
Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@LuciferYang deleted the SPARK-33673 branch June 6, 2022 03:45