[SPARK-27739][SQL] df.persist should save stats from optimized plan #24623
        qe.executedPlan,
        tableName,
-       planToCache)
+       qe.optimizedPlan)
Hi, @jzhuge . Could you add a test case for this, too?
Thanks @dongjoon-hyun for the review. Unfortunately I couldn't find a good way to unit test the accuracy of InMemoryRelation.statsOfPlanToCache and there is no existing unit test for this field. Any suggestion is welcome.
I'm not sure that we can test this without filter pushdown in stats calculation.
We've added filter pushdown, so the stats methods use PhysicalOperation to detect scans paired with projections and filters. When getting the stats for a scan, those filters are passed so that we can get accurate stats. Without that optimization, the stats would be identical between the analyzed plan and the optimized plan.
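A minimal sketch of the idea described above, using a toy plan model rather than Spark's classes. `Plan`, `Scan`, `Filter`, `Project`, `collectScan`, `naiveSize`, and `pushdownSize` are all hypothetical names introduced for illustration. It shows how a stats method that collects the filters sitting above a scan (in the spirit of PhysicalOperation) can report a smaller size than one that ignores them.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark classes.
sealed trait Plan
final case class Scan(partitionSizes: Map[Int, Long]) extends Plan
final case class Filter(partition: Int, child: Plan) extends Plan
final case class Project(columns: Seq[String], child: Plan) extends Plan

// Walks through projections and filters down to the scan, collecting the
// partition-equality filters found on the way.
def collectScan(plan: Plan, filters: List[Int] = Nil): (List[Int], Scan) = plan match {
  case Filter(p, child)  => collectScan(child, p :: filters)
  case Project(_, child) => collectScan(child, filters)
  case s: Scan           => (filters, s)
}

// Estimate that ignores filters: every partition is counted.
def naiveSize(plan: Plan): Long =
  collectScan(plan)._2.partitionSizes.values.sum

// Estimate that hands the collected filters to the scan: only partitions
// satisfying every filter are counted.
def pushdownSize(plan: Plan): Long = {
  val (filters, scan) = collectScan(plan)
  scan.partitionSizes.iterator.collect {
    case (p, size) if filters.forall(_ == p) => size
  }.sum
}

val scan = Scan(Map(0 -> 100L, 1 -> 300L))
val plan = Project(Seq("id"), Filter(0, scan))
```

In this toy setup `naiveSize(plan)` counts both partitions, while `pushdownSize(plan)` counts only partition 0. Without some step like this, the two estimates would coincide, which is the testing difficulty raised in the thread.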
ok to test
Test build #105439 has finished for PR 24623 at commit
retest this please.
Test build #105438 has finished for PR 24623 at commit
Test build #105440 has finished for PR 24623 at commit
viirya
left a comment
When recacheByCondition re-caches entries in the cache, it still uses the analyzed plan's stats.
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
Lines 188 to 190 in 3e30a98
    val newCache = InMemoryRelation(
      cacheBuilder = cd.cachedRepresentation.cacheBuilder.copy(cachedPlan = plan),
      logicalPlan = cd.plan)
How about renaming logicalPlan -> optimizedPlan in InMemoryRelation? See spark/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala, line 149 in 70ef906.
Test build #108493 has finished for PR 24623 at commit
Like the idea @maropu
Test build #108538 has finished for PR 24623 at commit
@maropu @dongjoon-hyun @viirya Could you please take another look at this PR? I believe all comments have been addressed.
gatorsmile
left a comment
We do have leaf nodes that can provide stats, so I'm not sure why we are unable to test this PR. Can anybody dig deeper? This does not require any DSV2 change.
-      cacheBuilder.cachedPlan.output, cacheBuilder, logicalPlan.outputOrdering)
-    relation.statsOfPlanToCache = logicalPlan.stats
+      cacheBuilder.cachedPlan.output, cacheBuilder, optimizedPlan.outputOrdering)
+    relation.statsOfPlanToCache = optimizedPlan.stats
If the stats of the analyzed plan are the same as those of the optimized plan, please hold this PR until they become different.
They can be different if we do DS v2 operator pushdown in the optimizer, but AFAIK it's not done yet.
Stats already differ in the v1 path because partition pruning happens in the optimizer for data source tables (PrunedInMemoryFileIndex).
I see no reason not to get this in.
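To make the point about optimizer-side partition pruning concrete, here is a small sketch under stated assumptions. `Node`, `FileScan`, `PartitionFilter`, and `prunePartitions` are hypothetical stand-ins, not Spark's plan classes or rules. It only illustrates why a plan before pruning and the same plan after pruning can report different sizes.

```scala
// Hypothetical miniature of a logical plan; not Spark's classes.
sealed trait Node { def sizeInBytes: Long }

final case class FileScan(partitions: Map[String, Long]) extends Node {
  // A scan reports the total size of the partitions it will read.
  def sizeInBytes: Long = partitions.values.sum
}

final case class PartitionFilter(key: String, child: Node) extends Node {
  // In this toy model a filter has no selectivity estimate,
  // so it simply reports its child's size.
  def sizeInBytes: Long = child.sizeInBytes
}

// Stand-in for an optimizer rule: drop partitions the filter can never match.
def prunePartitions(plan: Node): Node = plan match {
  case PartitionFilter(key, FileScan(parts)) =>
    PartitionFilter(key, FileScan(parts.filter { case (k, _) => k == key }))
  case PartitionFilter(key, child) =>
    PartitionFilter(key, prunePartitions(child))
  case other => other
}

val analyzed = PartitionFilter("p=0", FileScan(Map("p=0" -> 64L, "p=1" -> 192L)))
val optimized = prunePartitions(analyzed)
```

Here `analyzed.sizeInBytes` still covers both partitions, while `optimized.sizeInBytes` covers only `p=0`. A test along these lines, written against a real partitioned table, would be one way to observe the difference the PR relies on.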
ah that's a good point, I agree with it. @jzhuge can we write a test using file source partition pruning? I'm not comfortable merging a fix without tests. The change itself LGTM.
Test build #109069 has finished for PR 24623 at commit
thanks, merging to master!
thanks all for the reviews. thanks @cloud-fan for the merge.
What changes were proposed in this pull request?
CacheManager.cacheQuery saves the stats from the optimized plan to cache.
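The change can be pictured with a toy cache, sketched here under assumptions. `Stats`, `ToyPlan`, `ToyQueryExecution`, `CachedEntry`, and `ToyCacheManager` are hypothetical names, and keying entries by the analyzed plan is an assumption of this sketch, not a statement about Spark's internals. The only point illustrated is that the stats recorded with a cache entry come from the optimized plan.

```scala
// Hypothetical stand-ins; names are illustrative and not Spark's API.
final case class Stats(sizeInBytes: Long)
final case class ToyPlan(description: String, stats: Stats)
final case class ToyQueryExecution(analyzed: ToyPlan, optimizedPlan: ToyPlan)
final case class CachedEntry(key: ToyPlan, statsOfPlanToCache: Stats)

final class ToyCacheManager {
  private var entries = Vector.empty[CachedEntry]

  // Assumption of this sketch: the analyzed plan is the lookup key,
  // while the recorded stats are taken from the optimized plan.
  def cacheQuery(qe: ToyQueryExecution): CachedEntry = {
    val entry = CachedEntry(qe.analyzed, qe.optimizedPlan.stats)
    entries :+= entry
    entry
  }

  def lookup(plan: ToyPlan): Option[CachedEntry] = entries.find(_.key == plan)
}

val qe = ToyQueryExecution(
  analyzed = ToyPlan("Filter(p = 0, Scan(a))", Stats(256L)),
  optimizedPlan = ToyPlan("Scan(a, partitions = [p=0])", Stats(64L)))
val manager = new ToyCacheManager
val entry = manager.cacheQuery(qe)
```

Looking the entry up by `qe.analyzed` returns stats of 64 bytes rather than 256, so later planning decisions that consult the cached stats see the tighter estimate.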
How was this patch tested?
Existing tests.