[SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions #26805
Conversation
@wangyum @cloud-fan

ok to test

Test build #115119 has finished for PR 26805 at commit

gently cc: @cloud-fan @maropu

Test build #115934 has finished for PR 26805 at commit

retest this please

Test build #115946 has finished for PR 26805 at commit

@cloud-fan @maropu
let's move it to a separate file.
sure, thanks.
nit:
def func(
    para1: T,
    para2: T...
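For reference, a minimal sketch of the multi-line declaration style the nit refers to; `func` and the parameter types here are made up for illustration, assuming Spark's Scala style of a 4-space indent for wrapped parameters:

```scala
// Hypothetical example of the multi-line parameter style (names are
// illustrative, not from the PR): parameters get a 4-space indent,
// the method body keeps the usual 2-space indent.
def func(
    para1: Int,
    para2: Int): Int = {
  para1 + para2
}
```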
sure, thanks.
2 space indentation.
sure, thanks.
does the data source table have the same problem?
Yes, PruneFileSourcePartitions may also end up calculating the size of a large number of partitions through HDFS.
I will create a follow-up PR to refine it after this PR is finished.
If this is a common problem, let's leave it here and open a new PR to fix it completely later.
yea, will create a new PR for it. Already removed from this PR. thanks.
We should call SessionCatalog.listPartitionsByFilter
yea, updated.
shall we do it in DetermineTableStats?
DetermineTableStats is an Analyzer rule, while the pruned partitions and their total size must be calculated
after the filter push-down optimizer rules have run. So we cannot put this part in DetermineTableStats for now.
But I will check whether DetermineTableStats can be moved to the optimization phase and placed after PruneHiveTablePartitions. If you have any ideas/suggestions, please share. thanks.
Test build #116107 has finished for PR 26805 at commit

retest this please

Test build #116111 has finished for PR 26805 at commit

Test build #116130 has finished for PR 26805 at commit

Test build #116141 has finished for PR 26805 at commit
is it safe to keep other stats after we prune partitions?
They could be inconsistent; e.g. rowCount and sizeInBytes may disagree after this rule.
So I restored creating a new CatalogStatistics instance. By doing so some statistics may be lost, but that should not impact accuracy.
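To illustrate the concern, a hedged sketch (this is not Spark's actual CatalogStatistics; `Stats` and `statsAfterPruning` are made-up names): after pruning, sizeInBytes is recomputed from the kept partitions only, while the old rowCount still describes the un-pruned table, so a fresh instance drops the stats that can no longer be trusted:

```scala
// Hypothetical simplified stand-in for the relevant CatalogStatistics fields.
case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt])

// sizeInBytes is recomputed from the pruned partitions; rowCount still
// describes the whole table, so it is dropped rather than kept inconsistent.
def statsAfterPruning(prunedSize: BigInt): Stats =
  Stats(sizeInBytes = prunedSize, rowCount = None)

val pruned = statsAfterPruning(BigInt(1000))
```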
Did some refining, @cloud-fan please help review again. thanks.
does it have to be an external table?
actually it is not necessary; already changed to a managed table. thanks.
checking the exact file size can be flaky. Can we just check that the first size is more than 3 times larger than the second size?
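A sketch of the relaxed assertion being suggested (the helper name is illustrative): instead of comparing exact byte sizes, which can vary with file-format overhead, only the ratio between the two estimates is checked:

```scala
// With 4 equally-sized partitions, the unpruned estimate should be roughly
// 4x the single-partition estimate; requiring only ">3x" tolerates
// per-file overhead without pinning an exact byte count.
def sizeCheckPasses(sizeAllPartitions: BigInt, sizeOnePartition: BigInt): Boolean =
  sizeAllPartitions > sizeOnePartition * 3
```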
sure, will change it.
…ions based on filters on partition columns. Doing so, the total size of pruned partitions may be small enough for broadcast join in JoinSelection strategy.
… to new PR as it is a common problem.
…firstly if HIVE_METASTORE_PARTITION_PRUNING enabled, and then prune again using partition filters.
… any more since HiveExternalCatalog.listPartitionsByFilter can already return exactly what we want.
…eSourcePartitions.getPartitionKeyFilters.
… available in metadata.
Test build #117147 has finished for PR 26805 at commit
cloud-fan left a comment
LGTM except a few code style issues.
      0L
    }
  }
  if (sizeOfPartitions.forall(s => s>0)) {
nit: forall(_ > 0)
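The two forms are equivalent; the underscore placeholder is just shorthand for a single-use lambda parameter (the sample values below are illustrative):

```scala
val sizeOfPartitions = Seq(1024L, 2048L, 512L)

// Explicit lambda vs. placeholder syntax; both check every element.
val explicit = sizeOfPartitions.forall(s => s > 0)
val shorthand = sizeOfPartitions.forall(_ > 0)
```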
  for (part <- Seq(1, 2, 3, 4)) {
    sql(s"""
      |INSERT OVERWRITE TABLE test PARTITION (p='$part')
please keep the style of multiline strings consistent. https://github.com/apache/spark/pull/26805/files#diff-90836f7778a704901d7d3df02846aa55R39 is correct.
updated to use two-space indentation like PruneFileSourcePartitionsSuite.
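For context, a minimal self-contained sketch of the stripMargin convention the review comment is about (the query text matches the test; actually running it against Spark is not shown here):

```scala
// stripMargin removes leading whitespace up to and including the '|'
// on each line, so the pipes can be aligned under the opening triple
// quote for readability without leaking indentation into the SQL text.
val part = 1
val query =
  s"""
     |INSERT OVERWRITE TABLE test PARTITION (p='$part')
     |select col from temp
     |""".stripMargin
```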
      |INSERT OVERWRITE TABLE test PARTITION (p='$part')
      |select col from temp""".stripMargin)
  }
  val analyzed1 = sql("select i from test where p>0").queryExecution.analyzed
nit: p > 0
      |select col from temp""".stripMargin)
  }
  val analyzed1 = sql("select i from test where p>0").queryExecution.analyzed
  val analyzed2 = sql("select i from test where p=1").queryExecution.analyzed
ditto
  }
  val analyzed1 = sql("select i from test where p>0").queryExecution.analyzed
  val analyzed2 = sql("select i from test where p=1").queryExecution.analyzed
  assert(Optimize.execute(analyzed1).stats.sizeInBytes/4 ===
ditto
updated the code style, thanks a lot. :)
Test build #117166 has finished for PR 26805 at commit
thanks, merging to master!

Thank you all for review and help.
import org.apache.spark.sql.internal.SQLConf

/**
 * TODO: merge this with PruneFileSourcePartitions after we completely make hive as a data source.
@fuwhu We need a description of the rule. Could you submit a follow-up PR to add descriptions to both PruneHiveTablePartitions and PruneFileSourcePartitions?
sure, so you mean just add a class description in the PruneHiveTablePartitions.scala and PruneFileSourcePartitions.scala files? Or do we need to add a comment in some doc?
classdoc is good enough
@gatorsmile @cloud-fan classdoc added in #27535 , please help review, thanks.
Closes apache#26805 from fuwhu/SPARK-15616. Authored-by: fuwhu <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Add optimizer rule PruneHiveTablePartitions to prune Hive table partitions based on filters on partition columns.
With this, the total size of the pruned partitions may be small enough for a broadcast join in the JoinSelection strategy.
Why are the changes needed?
In the JoinSelection strategy, Spark uses "plan.stats.sizeInBytes" to decide whether a plan is suitable for broadcast join.
Currently "plan.stats.sizeInBytes" does not take pruned partitions into account, so it may miss broadcast-join opportunities and fall back to sort-merge join, which hurts join performance.
This PR takes pruned partitions into account when computing "plan.stats.sizeInBytes" for Hive tables, improving performance by using broadcast join where possible.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit tests.
This is based on #25919; credits should go to @lianhuiwang and @advancedxy.
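As background for the size-based join choice described above, a hedged sketch of the kind of check JoinSelection performs (simplified and not Spark's actual code; the real rule also considers join hints and join types, and the threshold comes from spark.sql.autoBroadcastJoinThreshold, 10 MB by default):

```scala
// Simplified stand-in for the broadcast-eligibility size check.
val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024 // default 10 MB

def canBroadcast(sizeInBytes: BigInt): Boolean =
  sizeInBytes >= 0 && sizeInBytes <= BigInt(autoBroadcastJoinThreshold)

// With partition pruning reflected in sizeInBytes, a table whose kept
// partitions total well under the threshold becomes broadcastable.
```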