[SPARK-27739][SQL] df.persist should save stats from optimized plan #24623

jzhuge · 2019-05-16T01:33:36Z

What changes were proposed in this pull request?

CacheManager.cacheQuery saves the stats from the optimized plan to cache.

How was this patch tested?

Existing tests.

dongjoon-hyun · 2019-05-16T01:42:37Z

sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala

+        qe.executedPlan,
        tableName,
-        planToCache)
+        qe.optimizedPlan)


Hi, @jzhuge. Could you add a test case for this, too?

Thanks @dongjoon-hyun for the review. Unfortunately, I couldn't find a good way to unit test the accuracy of InMemoryRelation.statsOfPlanToCache, and there is no existing unit test for this field. Any suggestions are welcome.

I'm not sure that we can test this without filter pushdown in stats calculation.

We've added filter pushdown, so the stats methods use PhysicalOperation to detect scans paired with projections and filters. When getting the stats for a scan, those filters are passed so that we can get accurate stats. Without that optimization, the stats would be identical between the analyzed plan and the optimized plan.
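To make the point above concrete, here is a minimal toy model of why the two plans can report different stats. It is only a sketch: `Plan`, `Scan`, `Filter`, `sizeInBytes`, and `optimize` are hypothetical names invented for illustration and are not Spark's classes. It assumes that a filter alone does not shrink the size estimate and that the optimizer is the only place where the filter reaches the scan.

```scala
// Toy model only: these types are hypothetical and are not Spark's classes.
object StatsSketch {
  sealed trait Plan
  // A scan over partitions (partition value -> size in bytes), optionally pruned.
  final case class Scan(partitions: Map[Int, Long], keep: Option[Int] = None) extends Plan
  // A filter on the partition column sitting above a child plan.
  final case class Filter(partition: Int, child: Plan) extends Plan

  // Size estimate: a scan sums the partitions it will read; a filter on its own
  // is assumed not to shrink the estimate.
  def sizeInBytes(plan: Plan): Long = plan match {
    case Scan(parts, None)    => parts.values.sum
    case Scan(parts, Some(p)) => parts.getOrElse(p, 0L)
    case Filter(_, child)     => sizeInBytes(child)
  }

  // "Optimizer": push a partition filter into the scan directly beneath it.
  def optimize(plan: Plan): Plan = plan match {
    case Filter(p, Scan(parts, None)) => Filter(p, Scan(parts, Some(p)))
    case Filter(p, child)             => Filter(p, optimize(child))
    case other                        => other
  }

  def main(args: Array[String]): Unit = {
    val analyzed = Filter(0, Scan(Map(0 -> 100L, 1 -> 900L)))
    val optimized = optimize(analyzed)
    println(s"analyzed estimate:  ${sizeInBytes(analyzed)}")  // 1000
    println(s"optimized estimate: ${sizeInBytes(optimized)}") // 100
  }
}
```

In this model, a cache that records the estimate from the analyzed plan would keep 1000, while one that records it from the optimized plan would keep 100. If `optimize` never pushed the filter down, both estimates would be identical, which is the situation described in the comment above.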

dongjoon-hyun · 2019-05-16T01:43:12Z

ok to test

SparkQA · 2019-05-16T03:04:17Z

Test build #105439 has finished for PR 24623 at commit bb94d88.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-05-16T03:25:49Z

retest this please.

SparkQA · 2019-05-16T04:35:21Z

Test build #105438 has finished for PR 24623 at commit bb94d88.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-16T04:41:40Z

Test build #105440 has finished for PR 24623 at commit bb94d88.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya

When recacheByCondition re-caches entries in the cache, it still uses the analyzed plan's stats.

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala

Lines 188 to 190 in 3e30a98

    
    val newCache = InMemoryRelation(
      cacheBuilder = cd.cachedRepresentation.cacheBuilder.copy(cachedPlan = plan),
      logicalPlan = cd.plan)

maropu · 2019-07-31T23:54:33Z

How about renaming logicalPlan -> optimizedPlan in InMemoryRelation?

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala

Line 149 in 70ef906

logicalPlan: LogicalPlan): InMemoryRelation = {

SparkQA · 2019-08-01T00:23:33Z

Test build #108493 has finished for PR 24623 at commit e0e678b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jzhuge · 2019-08-01T01:13:02Z

Like the idea @maropu

SparkQA · 2019-08-02T05:22:16Z

Test build #108538 has finished for PR 24623 at commit 07f48fb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jzhuge · 2019-08-04T19:12:01Z

@maropu @dongjoon-hyun @viirya Could you please take another look at this PR? I believe all comments have been addressed.

gatorsmile

We do have leaf nodes that can provide stats, so I'm not sure why we are unable to test this PR. Can anybody dig deeper? This does not require any DSv2 change.

gatorsmile · 2019-08-07T05:28:03Z

cc @cloud-fan @gengliangwang

gatorsmile · 2019-08-07T05:31:20Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala

-      cacheBuilder.cachedPlan.output, cacheBuilder, logicalPlan.outputOrdering)
-    relation.statsOfPlanToCache = logicalPlan.stats
+      cacheBuilder.cachedPlan.output, cacheBuilder, optimizedPlan.outputOrdering)
+    relation.statsOfPlanToCache = optimizedPlan.stats


If the stats of the analyzed plan are the same as those of the optimized plan, please hold this PR until they become different.

They can be different if we do DS v2 operator pushdown in the optimizer, but AFAIK it's not done yet.

Stats already differ in the v1 path because partition pruning happens in the optimizer for data source tables (PrunedInMemoryFileIndex).

I see no reason not to get this in.

ah that's a good point, I agree with it. @jzhuge can we write a test using file source partition pruning? I'm not comfortable merging a fix without tests. The change itself LGTM.
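A check along the lines suggested above might be sketched as follows. This is not the test that was added to the PR. The table name `stats_check_tbl` and the object name are hypothetical, and `statsOfPlanToCache` and `sessionState` are internal Spark APIs that may differ between versions. The sketch assumes Spark is on the classpath, that it can write to the default local warehouse directory, and that the cached stats are taken from the optimized plan.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.columnar.InMemoryRelation

// Sketch only: table name and data are hypothetical; statsOfPlanToCache and
// sessionState are internal Spark APIs and may change between versions.
object CachedStatsCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cached-stats-check")
      .getOrCreate()

    // A small partitioned file source table, so the optimizer can prune partitions.
    spark.range(4)
      .selectExpr("id", "id % 2 AS p")
      .write
      .mode("overwrite")
      .partitionBy("p")
      .saveAsTable("stats_check_tbl")

    val query = "SELECT * FROM stats_check_tbl WHERE p = 0"
    spark.sql(query).persist()

    // Plan a fresh DataFrame so that it picks up the cached relation.
    val planWithCache = spark.sql(query).queryExecution.withCachedData
    val cachedStats = planWithCache.collectFirst {
      case relation: InMemoryRelation => relation.statsOfPlanToCache
    }

    // Fallback size estimate Spark uses when nothing better is known.
    val defaultSize = spark.sessionState.conf.defaultSizeInBytes
    cachedStats match {
      case Some(stats) =>
        println(s"cached sizeInBytes = ${stats.sizeInBytes}, default = $defaultSize")
      case None =>
        println("no InMemoryRelation found in the plan")
    }

    spark.sql("DROP TABLE IF EXISTS stats_check_tbl")
    spark.stop()
  }
}
```

One possible assertion for a real test would be that the cached size differs from the default size, on the assumption that the pruned scan reports an actual file size while the unpruned one falls back to the default.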

SparkQA · 2019-08-14T04:05:21Z

Test build #109069 has finished for PR 24623 at commit f6312c1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-08-14T11:50:12Z

thanks, merging to master!

jzhuge · 2019-08-15T03:23:18Z

thanks all for the reviews. thanks @cloud-fan for the merge.

dongjoon-hyun reviewed May 16, 2019

viirya reviewed May 16, 2019

dongjoon-hyun added the SQL label Jun 14, 2019

maropu approved these changes Aug 6, 2019

gatorsmile requested changes Aug 7, 2019

gatorsmile reviewed Aug 7, 2019

jzhuge added 4 commits August 12, 2019 22:26

[SPARK-27739][SQL] df.persist should save stats from optimized plan

efc6509

Liang-Chi's comment

9a16d8a

Rename parameter logicalPlan to optimizedPlan

42d5ff4

Add unit test

f6312c1

jzhuge force-pushed the SPARK-27739 branch from 07f48fb to f6312c1 Compare August 14, 2019 00:32

cloud-fan closed this in 391c7e8 Aug 14, 2019

dongjoon-hyun May 16, 2019

jzhuge May 16, 2019

rdblue May 16, 2019

gatorsmile Aug 7, 2019

cloud-fan Aug 7, 2019

rdblue Aug 12, 2019

cloud-fan Aug 13, 2019

8 participants