[SPARK-15616][SQL] Hive table supports partition pruning in JoinSelection #25919
Conversation
cc @cloud-fan.

ok to test

add to whitelist
  predicate.references.subsetOf(partitionSet)
}
val conf = session.sessionState.conf
if (pruningPredicates.nonEmpty && conf.fallBackToHdfsForStatsEnabled &&
Why do we need to check conf.fallBackToHdfsForStatsEnabled?
We should only get the size from HDFS if conf.fallBackToHdfsForStatsEnabled is enabled, since it could be a time-consuming operation. Though, this condition should probably be pushed down to just before the CommandUtils.calculateLocationSize call.
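For illustration, a minimal sketch of that reordering, assuming the names from the diff above (the helper itself is hypothetical, not the PR's code); it only gates the expensive HDFS walk behind the flag:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition
import org.apache.spark.sql.execution.command.CommandUtils

// Hypothetical helper: compute the size of the pruned partitions, but only
// touch the file system when spark.sql.statistics.fallBackToHdfs is enabled.
def prunedPartitionsSizeInBytes(
    session: SparkSession,
    table: TableIdentifier,
    partitions: Seq[CatalogTablePartition]): BigInt = {
  val conf = session.sessionState.conf
  if (conf.fallBackToHdfsForStatsEnabled) {
    // Potentially slow: one file-system listing per pruned partition.
    partitions.map { p =>
      BigInt(CommandUtils.calculateLocationSize(
        session.sessionState, table, p.storage.locationUri))
    }.sum
  } else {
    // Skip the HDFS scan entirely and keep the cheap default estimate.
    BigInt(conf.defaultSizeInBytes)
  }
}
```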
Test build #111297 has finished for PR 25919 at commit
c054d22 to e744da5
@cloud-fan Now pruned partitions are cached in HiveTableRelation; what do you think about the current approach?
Test build #112331 has finished for PR 25919 at commit

Test build #112332 has finished for PR 25919 at commit

Test build #112334 has finished for PR 25919 at commit

Test build #112337 has finished for PR 25919 at commit

Test build #112356 has finished for PR 25919 at commit
    tableStats: Option[Statistics] = None) extends LeafNode with MultiInstanceRelation {
    tableStats: Option[Statistics] = None,
    @transient normalizedFilters: Seq[Expression] = Nil,
    @transient prunedPartitions: Seq[CatalogTablePartition] = Nil)
How can we distinguish between 0 partitions after pruning and not being partition-pruned at all?
We have another field called normalizedFilters. When it's empty (Nil), the prunedPartitions have not been pruned; otherwise, prunedPartitions = Nil means there are 0 partitions left after pruning.
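As an illustration of that convention (field names taken from the diff above; the helper itself is hypothetical):

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition
import org.apache.spark.sql.catalyst.expressions.Expression

// Hypothetical helper: an empty normalizedFilters means the relation was never
// partition-pruned, so an empty prunedPartitions is only meaningful when
// pruning filters are present.
def partitionsAfterPruning(
    normalizedFilters: Seq[Expression],
    prunedPartitions: Seq[CatalogTablePartition]): Option[Seq[CatalogTablePartition]] = {
  if (normalizedFilters.isEmpty) {
    None // not pruned at all
  } else {
    Some(prunedPartitions) // may legitimately be empty: pruned down to 0 partitions
  }
}
```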
case class PruneHiveTablePartitions(
    session: SparkSession) extends Rule[LogicalPlan] with PredicateHelper {
  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
    case filter @ Filter(condition, relation: HiveTableRelation) if relation.isPartitioned =>
Can we follow PruneFileSourcePartitions? I think we should also support Filter(Project(HiveScan)).
Will do.
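For reference, a rough sketch of the match shape PruneFileSourcePartitions uses, adapted to HiveTableRelation. The actual pruning and stats update from this PR are elided, so this rule is a no-op that only demonstrates how Filter(Project(HiveTableRelation)) would also be covered:

```scala
import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
import org.apache.spark.sql.catalyst.expressions.AttributeSet
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object PruneHiveTablePartitionsSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    // PhysicalOperation collapses adjacent Project/Filter nodes, so both
    // Filter(HiveTableRelation) and Filter(Project(HiveTableRelation)) match here.
    case op @ PhysicalOperation(_, filters, relation: HiveTableRelation)
        if relation.isPartitioned && filters.nonEmpty =>
      val partitionSet = AttributeSet(relation.partitionCols)
      val pruningPredicates = filters.filter { p =>
        !p.references.isEmpty && p.references.subsetOf(partitionSet)
      }
      logDebug(s"Partition pruning predicates: $pruningPredicates")
      // The real rule would prune partitions and update stats here.
      op
  }
}
```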
  normalizedFilters)
val isFiltersEqual = normalizedFilters.zip(relation.normalizedFilters)
  .forall { case (e1, e2) => e1.semanticEquals(e2) }
if (isFiltersEqual) {
What are we doing here?
Only when the pruning filters match exactly can we simply get the partitions from HiveTableRelation.
val withStats = relation.tableMeta.copy(
  stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
val prunedHiveTableRelation = relation.copy(tableMeta = withStats,
  normalizedFilters = pruningPredicates, prunedPartitions = prunedPartitions)
Why do we need to keep pruningPredicates? IIUC the approach should be very simple:
- this rule only changes HiveTableRelation to hold an optional partition list.
- HiveTableScanExec will get the partition list from HiveTableRelation or call listPartitionsByFilter.
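A rough sketch of the lookup HiveTableScanExec would do under that simpler proposal (all names here are illustrative, not the code in this PR):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition
import org.apache.spark.sql.catalyst.expressions.Expression

// If the optimizer rule cached a pruned partition list on the relation, use it;
// otherwise fall back to asking the metastore via listPartitionsByFilter.
def partitionsForScan(
    session: SparkSession,
    table: TableIdentifier,
    cachedPartitions: Option[Seq[CatalogTablePartition]],
    partitionPruningPred: Seq[Expression]): Seq[CatalogTablePartition] = {
  cachedPartitions.getOrElse {
    session.sessionState.catalog.listPartitionsByFilter(table, partitionPruningPred)
  }
}
```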
Due to SPARK-24085, the pruningPredicates (we eliminate the subquery) could be different from the filters passed to HiveTableScan. So I keep the pruningPredicates, and only retrieve the prunedPartitions when HiveTableScanExec's partitionPruningPred matches exactly with HiveTableRelation's normalizedFilters.
The simplified solution occurred to me first, but then I thought the filters could be different for some reason, and SPARK-24085 is an example, hence the proposed solution here.
1. Don't store pruningFilters in HiveTableRelation.
2. Follow PruneFileSourcePartitions' style to extract projections, predicates and the Hive relation.
3. Skip partition pruning if a scalar subquery is involved.
Test build #112679 has finished for PR 25919 at commit

Test build #112707 has finished for PR 25919 at commit
retest it please

Gently ping @cloud-fan

retest this please

still

Test build #114331 has finished for PR 25919 at commit

I think it's ready for review.
val normalizedFilters = partitionPruningPred.map(_.transform {
  case a: AttributeReference => originalAttributes(a)
})
sparkSession.sessionState.catalog.listPartitionsByFilter(
@cloud-fan @maropu @advancedxy
Since rawPartitions is consumed by prunePartitions(rawPartitions) in the doExecute method, it seems prunePartitions will filter out all irrelevant partitions using boundPruningPred. Then why do we still need to call listPartitionsByFilter here?
Could you please help me understand this? Thanks a lot in advance.
  !predicate.references.isEmpty && predicate.references.subsetOf(partitionSet)
}
// SPARK-24085: scalar subquery should be skipped for partition pruning
val hasScalarSubquery = pruningPredicates.exists(SubqueryExpression.hasSubquery)
This skips all subqueries, not only scalar subqueries.
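For comparison, a hedged sketch of how the two checks differ (ScalarSubquery and SubqueryExpression are Catalyst expression classes; whether only scalar subqueries should be skipped is exactly the point raised here):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, ScalarSubquery, SubqueryExpression}

// Matches any subquery expression, which is what the code in the diff above does.
def hasAnySubquery(e: Expression): Boolean =
  SubqueryExpression.hasSubquery(e)

// Matches only scalar subqueries, i.e. the narrower SPARK-24085 case the
// comment in the diff refers to.
def hasScalarSubquery(e: Expression): Boolean =
  e.find(_.isInstanceOf[ScalarSubquery]).isDefined
```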
  rawDataSize.get
} else if (totalSize.isDefined && totalSize.get > 0L) {
  totalSize.get
} else if (conf.fallBackToHdfsForStatsEnabled) {
Per the doc of the conf "spark.sql.statistics.fallBackToHdfs", it is only for non-partitioned Hive tables:
"This flag is effective only for non-partitioned Hive tables."
closed in favor of #26805
### What changes were proposed in this pull request?
Add an optimizer rule, PruneHiveTablePartitions, that prunes Hive table partitions based on filters on partition columns. Doing so, the total size of the pruned partitions may be small enough for a broadcast join in the JoinSelection strategy.

### Why are the changes needed?
In the JoinSelection strategy, Spark uses "plan.stats.sizeInBytes" to decide whether a plan is suitable for a broadcast join. Currently, "plan.stats.sizeInBytes" does not take pruned partitions into account, so it may miss some broadcast joins and take sort-merge joins instead, which will definitely impact join performance. This PR aims at taking pruned partitions into account for Hive tables in "plan.stats.sizeInBytes" and thereby improving performance by using broadcast joins where possible.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added unit tests.

This is based on #25919; credits should go to lianhuiwang and advancedxy.

Closes #26805 from fuwhu/SPARK-15616.

Authored-by: fuwhu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
A new optimizer strategy called PruneHiveTablePartitions is added, which calculates the table size as the total size of the pruned partitions. Thus, the Spark planner can pick up BroadcastJoin if the size of the pruned partitions is under the broadcast join threshold.

### Why are the changes needed?
This is a performance improvement.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added unit tests.

This is based on #18193; credits should go to @lianhuiwang.
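As a hedged, end-to-end illustration of the intended effect (table and column names below are made up; the exact outcome depends on partition sizes and your broadcast threshold):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-pruning-broadcast-demo")
  .enableHiveSupport()
  .getOrCreate()

// 10 MB is the default broadcast threshold; set explicitly here only for clarity.
spark.sql("SET spark.sql.autoBroadcastJoinThreshold=10485760")

// With pruned partitions reflected in the size estimate, the filtered fact
// table can fall under the threshold and be broadcast instead of sort-merge
// joined. Expect BroadcastHashJoin in the printed plan when it does.
spark.sql(
  """
    |SELECT f.*, d.name
    |FROM fact_events f            -- large Hive table partitioned by dt
    |JOIN dim_users d ON f.user_id = d.id
    |WHERE f.dt = '2019-10-01'     -- partition filter: prunes most of fact_events
  """.stripMargin).explain()
```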