[SPARK-15616] [SQL] CatalogRelation should fallback to HDFS size of partitions that are involved in Query for JoinSelection. #18193
Conversation
Test build #77711 has finished for PR 18193 at commit
```scala
fsRelation.copy(location = prunedFileIndex)(sparkSession)
val prunedLogicalRelation = logicalRelation.copy(relation = prunedFsRelation)
val withStats = logicalRelation.catalogTable.map(_.copy(
  stats = Some(CatalogStatistics(sizeInBytes = BigInt(prunedFileIndex.sizeInBytes)))))
```
Good catch! I think this is a bug and worth a separate PR to fix.
Yes, I have created SPARK-20986. Thanks.
```scala
  customCheckRules
}

/**
```
nit: Indentation.
Yes, thanks.
Test build #77742 has finished for PR 18193 at commit
Test build #77745 has finished for PR 18193 at commit
Test build #77764 has finished for PR 18193 at commit
```scala
    session: SparkSession) extends Rule[LogicalPlan] with PredicateHelper {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    case filter @ Filter(condition, relation: CatalogRelation)
        if DDLUtils.isHiveTable(relation.tableMeta) && relation.isPartitioned =>
```
Is this only for Hive tables? What about data source tables?
I think PruneFileSourcePartitions already handles data source tables now.
```scala
  }
}

case class DeterminePartitionedTableStats(
```
Is it kind of a PruneFileSourcePartitions rule for Hive tables?
Yes, DeterminePartitionedTableStats is kind of a PruneFileSourcePartitions rule for Hive tables.
Then shall we give it a better name, like PruneHiveTablePartitions?
Yes, I will rename it to PruneHiveTablePartitions.
```scala
  session.sessionState.conf.sessionLocalTimeZone)
val hiveTable = HiveClientImpl.toHiveTable(relation.tableMeta)
val partitions = prunedPartitions.map(HiveClientImpl.toHivePartition(_, hiveTable))
val sizeInBytes = try {
```
What if we already have partition-level statistics in the Hive metastore?
Even if we already have partition-level statistics, we cannot know the total number of partitions, so we cannot compute the statistics for only the pruned partitions from them.
```scala
partitions.map { partition =>
  val fs: FileSystem = partition.getDataLocation.getFileSystem(hadoopConf)
  fs.getContentSummary(partition.getDataLocation).getLength
}.sum
```
If there are too many partitions, this will be very slow. Can you add a check on whether the running sum is already larger than the threshold, and break early if it is?
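The suggested early exit might look like this minimal sketch (a hypothetical helper, not the PR's actual code; it assumes the broadcast threshold is passed in, since once the running total exceeds it the exact size no longer changes the join choice):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: sum the HDFS sizes of the pruned partitions'
// locations, but stop scanning once the total already exceeds the
// broadcast-join threshold.
def sizeInBytesWithEarlyExit(
    locations: Seq[Path],
    threshold: Long,
    hadoopConf: Configuration): Long = {
  var total = 0L
  val it = locations.iterator
  while (it.hasNext && total <= threshold) {
    val location = it.next()
    val fs = location.getFileSystem(hadoopConf)        // resolve the filesystem for this path
    total += fs.getContentSummary(location).getLength  // on-disk bytes under this location
  }
  total
}
```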
The logic looks similar to
ping @lianhuiwang
Test build #80991 has finished for PR 18193 at commit
@cloud-fan PruneFileSourcePartitions is a rule for data source tables, but we cannot yet treat Hive as a data source.
Test build #80993 has finished for PR 18193 at commit
retest it please.
Test build #80994 has finished for PR 18193 at commit
```scala
    session: SparkSession) extends Rule[LogicalPlan] with PredicateHelper {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    case filter @ Filter(condition, relation: HiveTableRelation)
        if DDLUtils.isHiveTable(relation.tableMeta) && relation.isPartitioned =>
```
DDLUtils.isHiveTable(relation.tableMeta) is no longer needed, since matching on HiveTableRelation already guarantees a Hive table.
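A minimal sketch of the simplified rule skeleton, assuming the match is on HiveTableRelation (the pruning body itself is elided; this is an illustration, not the PR's final code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
import org.apache.spark.sql.catalyst.expressions.PredicateHelper
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch: matching on HiveTableRelation already guarantees a Hive table,
// so no DDLUtils.isHiveTable guard is needed.
case class PruneHiveTablePartitions(session: SparkSession)
  extends Rule[LogicalPlan] with PredicateHelper {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    case filter @ Filter(_, relation: HiveTableRelation) if relation.isPartitioned =>
      filter // partition pruning and statistics computation elided
  }
}
```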
```scala
  }
}

case class PruneHiveTablePartitions(
```
Add a TODO that we should merge this rule with PruneFileSourcePartitions once we completely make Hive a data source.
```scala
  pruningPredicates,
  session.sessionState.conf.sessionLocalTimeZone)
val hiveTable = HiveClientImpl.toHiveTable(relation.tableMeta)
val partitions = prunedPartitions.map(HiveClientImpl.toHivePartition(_, hiveTable))
```
Do we need to do this? All we need is the partition data location, and we can get it from CatalogTablePartition.storage.locationUri.
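A sketch of that suggestion (hypothetical helper): the location comes straight from the catalog metadata, with no Hive client conversion.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition

// Hypothetical helper: read the partition's data location directly from
// the catalog partition metadata instead of converting to a Hive partition.
def dataLocation(part: CatalogTablePartition): Option[Path] =
  part.storage.locationUri.map(uri => new Path(uri))
```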
@cloud-fan I have addressed your comments. Thanks.
Test build #81134 has finished for PR 18193 at commit
```scala
  pruningPredicates,
  session.sessionState.conf.sessionLocalTimeZone)
val sizeInBytes = try {
  prunedPartitions.map { part =>
```
I think we should first check whether partition.parameters contains StatsSetupConst.RAW_DATA_SIZE or StatsSetupConst.TOTAL_SIZE. If partition.parameters contains the size of the partition, use it instead of calling getContentSummary on HDFS.
@cenyuhai Yes, good idea. I will add it. Thanks.
Test build #81937 has finished for PR 18193 at commit
Test build #81939 has finished for PR 18193 at commit
```scala
val sizeInBytes = try {
  prunedPartitions.map { part =>
    val totalSize = part.parameters.get(StatsSetupConst.TOTAL_SIZE).map(_.toLong)
    val rawDataSize = part.parameters.get(StatsSetupConst.RAW_DATA_SIZE).map(_.toLong)
```
I think we should use rawDataSize first, because a 1MB ORC file can hold as much data as a 5MB text file...
@cenyuhai Yes, I think what you said is right. Thanks.
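The agreed-on lookup order could be sketched like this (hypothetical helper; `fallbackToHdfsSize` stands in for the getContentSummary scan above): prefer rawDataSize because it is uncompressed and therefore comparable across file formats, fall back to totalSize, and only touch HDFS when neither statistic is present.

```scala
import org.apache.hadoop.hive.common.StatsSetupConst
import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition

// Hypothetical helper sketching the discussed priority:
// rawDataSize > totalSize > HDFS content summary.
def partitionSizeInBytes(
    part: CatalogTablePartition,
    fallbackToHdfsSize: CatalogTablePartition => Long): Long = {
  val rawDataSize = part.parameters.get(StatsSetupConst.RAW_DATA_SIZE).map(_.toLong)
  val totalSize = part.parameters.get(StatsSetupConst.TOTAL_SIZE).map(_.toLong)
  rawDataSize.filter(_ > 0)              // uncompressed size, format-independent
    .orElse(totalSize.filter(_ > 0))     // on-disk (compressed) size
    .getOrElse(fallbackToHdfsSize(part)) // last resort: scan HDFS
}
```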
Test build #81968 has finished for PR 18193 at commit
Test build #88712 has finished for PR 18193 at commit
@cloud-fan @lianhuiwang @gatorsmile This fix is useful; is there any update on this?
@advancedxy do you want to take it over? The PR looks good, but we probably need to fix some tests.
OK, I will add it to my backlog. I will take a look at this once #18324 is resolved.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Currently, if only some partitions of a partitioned table are involved in a join, we still rely on the Metastore-reported size of the whole table to decide whether the join can be converted to a broadcast join.
When a Filter can prune partitions, Hive prunes them before deciding whether to use a broadcast join, based on the HDFS size of only the partitions involved in the query. Spark SQL needs the same behavior, which can improve join performance for partitioned tables.
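To illustrate why this matters (a sketch of the decision, not code from this PR): JoinSelection broadcasts a side only when its estimated size is at or below spark.sql.autoBroadcastJoinThreshold, so estimating with just the pruned partitions' size instead of the whole table's size can enable broadcast joins that would otherwise be missed.

```scala
// Sketch of the broadcast check JoinSelection effectively performs:
// with pruned-partition statistics, estimatedSizeInBytes reflects only
// the partitions the query actually touches.
def canBroadcast(estimatedSizeInBytes: BigInt, autoBroadcastJoinThreshold: Long): Boolean =
  estimatedSizeInBytes >= 0 && estimatedSizeInBytes <= autoBroadcastJoinThreshold
```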
How was this patch tested?
Added unit tests.