Skip to content

Conversation

@jzhuge
Copy link
Member

@jzhuge jzhuge commented May 16, 2019

What changes were proposed in this pull request?

CacheManager.cacheQuery saves the stats from the optimized plan to cache.

How was this patch tested?

Existing testss.

qe.executedPlan,
tableName,
planToCache)
qe.optimizedPlan)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @jzhuge . Could you add a test case for this, too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dongjoon-hyun for the review. Unfortunately I couldn't find a good way to unit test the accuracy of InMemoryRelation.statsOfPlanToCache and there is no existing unit test for this field. Any suggestion is welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that we can test this without filter pushdown in stats calculation.

We've added filter pushdown, so the stats methods use PhysicalOperation to detect scans paired with projections and filters. When getting the stats for a scan, those filters are passed so that we can get accurate stats. Without that optimization, the stats would be identical between the analyzed plan and the optimized plan.

@dongjoon-hyun
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented May 16, 2019

Test build #105439 has finished for PR 24623 at commit bb94d88.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented May 16, 2019

retest this please.

@SparkQA
Copy link

SparkQA commented May 16, 2019

Test build #105438 has finished for PR 24623 at commit bb94d88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 16, 2019

Test build #105440 has finished for PR 24623 at commit bb94d88.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When re-caches some entries in the cache in recacheByCondition, it still uses analyzed plan's stats.

val newCache = InMemoryRelation(
cacheBuilder = cd.cachedRepresentation.cacheBuilder.copy(cachedPlan = plan),
logicalPlan = cd.plan)

@maropu
Copy link
Member

maropu commented Jul 31, 2019

How about renaming logicalPlan->optimizedPlan in InMemoryRelation?;

@SparkQA
Copy link

SparkQA commented Aug 1, 2019

Test build #108493 has finished for PR 24623 at commit e0e678b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jzhuge
Copy link
Member Author

jzhuge commented Aug 1, 2019

Like the idea @maropu

@SparkQA
Copy link

SparkQA commented Aug 2, 2019

Test build #108538 has finished for PR 24623 at commit 07f48fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jzhuge
Copy link
Member Author

jzhuge commented Aug 4, 2019

@maropu @dongjoon-hyun @viirya Could you please take another look at this PR? I believe all comments have been addressed.

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have the leaf nodes that can provide stats. Not sure why we are unable to test the PR? Anybody can dig it deeper? This does not require any DSV2 change.

@gatorsmile
Copy link
Member

cc @cloud-fan @gengliangwang

cacheBuilder.cachedPlan.output, cacheBuilder, logicalPlan.outputOrdering)
relation.statsOfPlanToCache = logicalPlan.stats
cacheBuilder.cachedPlan.output, cacheBuilder, optimizedPlan.outputOrdering)
relation.statsOfPlanToCache = optimizedPlan.stats
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the stats of analyzed plan is the same as the optimized plan, please hold this PR until they become different.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can be different if we do DS v2 operator pushdown in the optimizer, but AFAIK it's not done yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stats already differ in the v1 path because partition pruning happens in the optimizer for data source tables (PrunedInMemoryFileIndex).

I see no reason not to get this in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah that's a good point, I agree with it. @jzhuge can we write a test using file source partition pruning? I'm not comfortable merging a fix without tests. The change itself LGTM.

@SparkQA
Copy link

SparkQA commented Aug 14, 2019

Test build #109069 has finished for PR 24623 at commit f6312c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 391c7e8 Aug 14, 2019
@jzhuge
Copy link
Member Author

jzhuge commented Aug 15, 2019

thanks all for the reviews. thanks @cloud-fan for the merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants