Conversation

fuwhu (Contributor) commented Dec 9, 2019

What changes were proposed in this pull request?

Add an optimizer rule, PruneHiveTablePartitions, that prunes Hive table partitions based on filters on partition columns.
After pruning, the total size of the remaining partitions may be small enough to qualify for a broadcast join in the JoinSelection strategy.

Why are the changes needed?

In the JoinSelection strategy, Spark uses "plan.stats.sizeInBytes" to decide whether a plan is suitable for a broadcast join.
Currently, "plan.stats.sizeInBytes" does not take pruned partitions into account, so Spark may miss broadcast-join opportunities and fall back to sort-merge join, which hurts join performance.
This PR makes "plan.stats.sizeInBytes" account for pruned partitions of Hive tables, improving performance by enabling broadcast joins where possible.
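The decision this PR targets can be sketched as follows. This is a minimal illustration in Python, not Spark's actual JoinSelection code; the 10 MB value is the default of spark.sql.autoBroadcastJoinThreshold, and the partition sizes are made up:

```python
# Default value of spark.sql.autoBroadcastJoinThreshold (10 MB).
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024

def choose_join_strategy(size_in_bytes: int) -> str:
    """Pick broadcast-hash join when one side's estimated size is small enough."""
    if size_in_bytes <= AUTO_BROADCAST_JOIN_THRESHOLD:
        return "broadcast-hash join"
    return "sort-merge join"

# Without partition pruning, sizeInBytes covers every partition:
all_partitions = [8 * 1024 * 1024] * 4            # four 8 MB partitions (illustrative)
print(choose_join_strategy(sum(all_partitions)))  # 32 MB > 10 MB -> sort-merge join

# After pruning to the single matching partition, broadcast becomes possible:
print(choose_join_strategy(all_partitions[0]))    # 8 MB <= 10 MB -> broadcast-hash join
```

The rule does not change how JoinSelection decides; it only makes the size estimate reflect the partitions that actually survive the partition filters.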

Does this PR introduce any user-facing change?

no

How was this patch tested?

Added unit tests.

This is based on #25919; credit should go to @lianhuiwang and @advancedxy.

fuwhu (Contributor Author) commented Dec 9, 2019

@wangyum @cloud-fan
Could you please help review?

cloud-fan (Contributor): ok to test

SparkQA commented Dec 10, 2019

Test build #115119 has finished for PR 26805 at commit 0b60978.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PruneHiveTablePartitions(session: SparkSession)

fuwhu (Contributor Author) commented Dec 13, 2019

gently cc: @cloud-fan @maropu

SparkQA commented Dec 30, 2019

Test build #115934 has finished for PR 26805 at commit 4e1aba9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PruneHiveTablePartitions(session: SparkSession)

fuwhu (Contributor Author) commented Dec 30, 2019

retest this please

SparkQA commented Dec 30, 2019

Test build #115946 has finished for PR 26805 at commit 4e1aba9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PruneHiveTablePartitions(session: SparkSession)

fuwhu (Contributor Author) commented Dec 31, 2019

@cloud-fan @maropu
Could you help review? Or I can close it if this change is not needed.

Contributor: let's move it to a separate file.

Contributor Author: sure, thanks.

Contributor: nit:

def func(
    para1: T,
    para2: T...

Contributor Author: sure, thanks.

Contributor: use 2-space indentation.

Contributor Author: sure, thanks.

Contributor: does the data source table have the same problem?

fuwhu (Contributor Author), Jan 4, 2020: Yes, PruneFileSourcePartitions may also end up computing the sizes of a large number of partitions through HDFS.
I will create a follow-up PR to refine it after this PR is finished.

Contributor: If this is a common problem, let's leave it here and open a new PR to fix it completely later.

Contributor Author: yes, will create a new PR for it; it is already removed from this PR. Thanks.

Contributor: We should call SessionCatalog.listPartitionsByFilter

Contributor Author: yea, updated.

Contributor: shall we do it in DetermineTableStats?

fuwhu (Contributor Author), Jan 4, 2020: DetermineTableStats is an Analyzer rule, while the pruned partitions and their total size must be calculated after the filter push-down optimizer rules have run, so we cannot put this logic in DetermineTableStats for now.
But I will check whether DetermineTableStats can be moved into the optimization phase and placed after PruneHiveTablePartitions. If you have any ideas or suggestions, please share. Thanks.
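The ordering constraint exists because the rule consumes predicates that filter push-down has placed next to the relation: only the predicates on partition columns can prune. A rough sketch of that step (hypothetical data model in Python, not Catalyst):

```python
# Hedged sketch of the rule's core idea (illustrative structures only):
# a partition is a dict mapping partition column -> value, and a
# predicate is a (column, op, value) triple. Only predicates on
# partition columns participate in pruning; the rest stay in the plan.

def prune_partitions(partitions, partition_cols, predicates):
    """Return the partitions that satisfy every partition-column predicate."""
    partition_preds = [pred for pred in predicates if pred[0] in partition_cols]
    ops = {"=": lambda a, b: a == b, ">": lambda a, b: a > b}

    def matches(part):
        return all(ops[op](part[col], val) for col, op, val in partition_preds)

    return [part for part in partitions if matches(part)]

parts = [{"p": i} for i in range(1, 5)]
# The predicate on non-partition column "i" is ignored for pruning:
print(prune_partitions(parts, {"p"}, [("p", "=", 1), ("i", ">", 0)]))
```

This is why the rule must run after push-down: before push-down, the partition-column predicates may still sit above joins or projections where the relation cannot see them.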

SparkQA commented Jan 4, 2020

Test build #116107 has finished for PR 26805 at commit a51e946.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

fuwhu (Contributor Author) commented Jan 4, 2020

retest this please

SparkQA commented Jan 4, 2020

Test build #116111 has finished for PR 26805 at commit a51e946.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 6, 2020

Test build #116130 has finished for PR 26805 at commit d715a04.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 6, 2020

Test build #116141 has finished for PR 26805 at commit 4b8d39d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fuwhu fuwhu changed the title [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions [SPARK-15616][SQL][WIP] Add optimizer rule PruneHiveTablePartitions Jan 8, 2020
Contributor: is it safe to keep other stats after we prune partitions?

fuwhu (Contributor Author), Jan 9, 2020: They could become inconsistent; e.g. rowCount and sizeInBytes may disagree after this rule runs.
So I restored creating a new CatalogStatistics instance. Some statistics may be lost by doing so, but that should not affect accuracy.
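The resolution can be sketched like this (hypothetical structures in Python, not CatalogStatistics itself): recompute sizeInBytes from the retained partitions, and drop table-level stats such as rowCount, which described the unpruned table and would now disagree with the recomputed size:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Stats:
    size_in_bytes: int
    row_count: Optional[int] = None  # None means "unknown"

def stats_after_pruning(pruned_partition_sizes: List[int]) -> Stats:
    # Recompute the size from the partitions that survived pruning, and
    # deliberately leave row_count unset: the old table-level rowCount
    # counted rows across all partitions and would be inconsistent with
    # the new, smaller size.
    return Stats(size_in_bytes=sum(pruned_partition_sizes))

s = stats_after_pruning([1024, 2048])
print(s.size_in_bytes, s.row_count)  # 3072 None
```

Losing rowCount only weakens later estimates (it becomes "unknown"); keeping a stale one could make them actively wrong.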

@fuwhu fuwhu changed the title [SPARK-15616][SQL][WIP] Add optimizer rule PruneHiveTablePartitions [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions Jan 9, 2020
fuwhu (Contributor Author) commented Jan 9, 2020

Did some refining; @cloud-fan, please help review again. Thanks.

Contributor: does it have to be an external table?

fuwhu (Contributor Author), Jan 21, 2020: actually it is not necessary; already changed to a managed table. Thanks.

Contributor: checking the exact file size can be flaky. Can we just check that the first size is more than 3 times larger than the second size?

Contributor Author: sure, will change it.
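The suggested change amounts to asserting a ratio rather than an exact byte count (illustrative sketch in Python with made-up sizes, not the suite's Scala):

```python
# Sketch of the flake-resistant assertion: exact file sizes can vary a
# little across environments (compression, file-format metadata), but
# the ratio between "all 4 partitions" and "1 pruned partition" stays
# comfortably above 3, so the test remains stable.

def relative_size_check(size_all_partitions: int, size_one_partition: int) -> bool:
    # "the first size is more than 3 times larger than the second size"
    return size_all_partitions > 3 * size_one_partition

print(relative_size_check(4 * 1033, 1033))  # True (hypothetical sizes)
```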

fuwhu added 14 commits January 21, 2020 10:27
…ions based on filters on partition columns.

Doing so, the total size of pruned partitions may be small enough for broadcast join in JoinSelection strategy.
…firstly

if HIVE_METASTORE_PARTITION_PRUNING enabled, and then prune again using partition filters.
… any more

since HiveExternalCatalog.listPartitionsByFilter can already return exactly what we want.
SparkQA commented Jan 21, 2020

Test build #117147 has finished for PR 26805 at commit ce20439.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor) left a comment: LGTM except a few code style issues.

0L
}
}
if (sizeOfPartitions.forall(s => s>0)) {
Contributor: nit: forall(_ > 0)


for (part <- Seq(1, 2, 3, 4)) {
sql(s"""
|INSERT OVERWRITE TABLE test PARTITION (p='$part')
Contributor Author: updated to use two-space indentation like PruneFileSourcePartitionsSuite.

|INSERT OVERWRITE TABLE test PARTITION (p='$part')
|select col from temp""".stripMargin)
}
val analyzed1 = sql("select i from test where p>0").queryExecution.analyzed
Contributor: nit: p > 0

|select col from temp""".stripMargin)
}
val analyzed1 = sql("select i from test where p>0").queryExecution.analyzed
val analyzed2 = sql("select i from test where p=1").queryExecution.analyzed
Contributor: ditto

}
val analyzed1 = sql("select i from test where p>0").queryExecution.analyzed
val analyzed2 = sql("select i from test where p=1").queryExecution.analyzed
assert(Optimize.execute(analyzed1).stats.sizeInBytes/4 ===
Contributor: ditto

Contributor Author: updated the code style, thanks a lot. :)

SparkQA commented Jan 21, 2020

Test build #117166 has finished for PR 26805 at commit b1798d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor): thanks, merging to master!

@cloud-fan cloud-fan closed this in cfb1706 Jan 21, 2020
fuwhu (Contributor Author) commented Jan 22, 2020

Thank you all for review and help.

@fuwhu fuwhu deleted the SPARK-15616 branch January 22, 2020 02:31
@fuwhu fuwhu restored the SPARK-15616 branch January 22, 2020 02:31
@fuwhu fuwhu deleted the SPARK-15616 branch January 22, 2020 02:31
import org.apache.spark.sql.internal.SQLConf

/**
* TODO: merge this with PruneFileSourcePartitions after we completely make hive as a data source.
Member: @fuwhu We need a description of the rule. Could you submit a follow-up PR to add descriptions to both PruneHiveTablePartitions and PruneFileSourcePartitions?

Contributor Author: sure, do you mean just adding a class description in the PruneHiveTablePartitions.scala and PruneFileSourcePartitions.scala files, or do we also need to add a comment in some doc?

Contributor: classdoc is good enough

Contributor Author: @gatorsmile @cloud-fan classdoc added in #27535, please help review, thanks.

fuwhu added a commit to fuwhu/spark that referenced this pull request Oct 26, 2020
Closes apache#26805 from fuwhu/SPARK-15616.

Authored-by: fuwhu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>