[SPARK-19350] [SQL] Cardinality estimation of Limit and Sample #16696

wzhfy · 2017-01-25T02:57:53Z

What changes were proposed in this pull request?

Before this PR, LocalLimit, GlobalLimit, and Sample propagated the same row count and column stats as their child, which is incorrect.
For GlobalLimit and Sample, we can compute the correct rowCount in Statistics whether or not CBO is enabled.
We can't know the rowCount for LocalLimit because the number of partitions is unknown at that point. Column stats should not be propagated because we don't know the distribution of columns after Limit or Sample.
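As a rough illustration of the Sample case, the estimated row count can be taken as the child's row count scaled by the sampling fraction and rounded up. This is a minimal sketch in plain Scala, not the code from the patch: the object name, the `fraction` parameter, and the `Option[BigInt]` row count are assumptions made for the example.

```scala
import scala.math.BigDecimal.RoundingMode

// Hypothetical, standalone sketch; it does not use Spark's classes.
object SampleEstimationSketch {
  // Round a scaled count up to the next whole row.
  def ceiling(value: BigDecimal): BigInt =
    value.setScale(0, RoundingMode.CEILING).toBigInt

  // Estimated row count after sampling `fraction` of the child's rows.
  // Returns None when the child has no row count to scale.
  def sampledRowCount(childRowCount: Option[BigInt], fraction: Double): Option[BigInt] =
    childRowCount.map(c => ceiling(BigDecimal(c) * BigDecimal(fraction)))
}
```

Rounding up rather than down keeps a small sample of a non-empty child from being estimated at zero rows.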

How was this patch tested?

Added test cases.

wzhfy · 2017-01-25T03:04:01Z

cc @cloud-fan @gatorsmile please review

SparkQA · 2017-01-25T04:45:42Z

Test build #71963 has finished for PR 16696 at commit 62013f5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-25T04:51:26Z

Test build #71964 has finished for PR 16696 at commit b88fac5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class LimitNode extends UnaryNode
case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends LimitNode
case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends LimitNode

SparkQA · 2017-01-25T15:27:34Z

Test build #71988 has finished for PR 16696 at commit 05fcd81.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-01-26T03:57:44Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

Why don't we use childStats.rowCount? If childStats.rowCount is less than limit number, I think we should use it instead of limit.

I think @wzhfy is just keeping the existing code logic. Sure, we can improve it.

We should. Otherwise the rowCount is not correct.

Agreed. We can pick the smaller value between the child node's row count and the limit number.

gatorsmile · 2017-01-26T04:53:28Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

ceil -> ceiling

gatorsmile · 2017-01-26T04:55:07Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

sampledNumber -> sampledRowCount

gatorsmile · 2017-01-26T05:03:06Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

To make the stats more accurate, yes, we can use the smaller of childStats.rowCount and limit as the outputRowCount of getOutputSize.

gatorsmile · 2017-01-26T05:17:52Z

...lyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala

I still prefer adding a comment above this line:

// rowCount * (overhead + column size)
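The formula in that comment can be spelled out as a small helper. This is only a sketch under stated assumptions: the function name, the per-row overhead value, and the column sizes are hypothetical inputs chosen for illustration, not values taken from the test suite.

```scala
// Hypothetical, standalone sketch of: rowCount * (overhead + column size)
object OutputSizeSketch {
  // Estimated output size in bytes: each row costs a fixed overhead
  // plus the sum of the sizes of its columns.
  def outputSize(rowCount: BigInt, rowOverhead: Int, columnSizes: Seq[Int]): BigInt =
    rowCount * (rowOverhead + columnSizes.sum)
}
```

For example, 10 rows with an assumed 8-byte overhead and a single 4-byte column come to 10 * (8 + 4) = 120 bytes.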

gatorsmile · 2017-01-26T05:21:34Z

...lyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala

Could you replace the above three lines by checkStats?

gatorsmile · 2017-01-26T05:23:10Z

...lyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala

You know, this is a utility function. We can make it more general by having two expected stats values

gatorsmile · 2017-01-26T05:25:10Z

...lyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala

rename plan2 to childPlan

gatorsmile · 2017-01-26T05:25:29Z

...lyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala

The same here

For limit estimation test cases, we may add a test with a limit number greater than the child node's row count. This test can show whether we properly select the smaller value between the limit number and the child node's row count.

gatorsmile · 2017-01-26T05:29:34Z

Overall looks good to me. : ) Could you add a few more test cases?

One is where the child has a smaller row count than the limit.
Another is where the row count is zero but sizeInBytes is not zero.

viirya · 2017-01-26T14:26:56Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

Actually the rowCount for LocalLimit and GlobalLimit should be different. For LocalLimit, the limit is just the row count for one partition. But we can't get the number of partitions here, I think. As the actual row count might be much larger than the limit, maybe we should set it to None.

This is a good point, maybe we should still separate the stats calculation of GlobalLimit and LocalLimit.

ron8hu · 2017-02-04T20:27:47Z

...lyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala

For limit estimation test cases, we may add a test with a limit number greater than the child node's row count. This test can show whether we properly select the smaller value between the limit number and the child node's row count.

cloud-fan · 2017-02-07T14:47:46Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

Do we assume limitExpr is foldable? It seems there is no type-checking logic for it.

SparkQA · 2017-02-24T01:10:37Z

Test build #73372 has finished for PR 16696 at commit 5692939.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-02-24T03:27:07Z

retest this please

SparkQA · 2017-02-24T05:29:40Z

Test build #73391 has finished for PR 16696 at commit 5692939.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-02-24T06:41:47Z

@cloud-fan @gatorsmile I've updated this pr and also added test cases, please review.

cloud-fan · 2017-03-03T07:23:37Z

retest this please

cloud-fan · 2017-03-03T07:27:02Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsConfSuite.scala

why remove this test suite?

I didn't remove it, just renamed it.

Can you use git mv? Then, it will keep the change history.

How do I use git mv now? Do I need to revert to the unchanged version, run git mv, and then redo all the changes?

SparkQA · 2017-03-03T07:27:34Z

Test build #73824 has started for PR 16696 at commit 5692939.

cloud-fan · 2017-03-03T07:27:37Z

...lyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationSuite.scala

BasicStatsEstimationSuite?

Good name:)

cloud-fan · 2017-03-03T07:29:51Z

sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala

is this test duplicated with the newly added limit test?

yea I think so, let me remove it.

SparkQA · 2017-03-03T15:19:57Z

Test build #73845 has finished for PR 16696 at commit 516b114.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class BasicStatsEstimationSuite extends StatsEstimationTestBase

cloud-fan · 2017-03-03T20:08:32Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

but I think the max/min should still be corrected?

How can we make sure max/min values are still there after limit? Otherwise it will be a very loose bound of max/min.

hmm, what's the strategy here? is a loose bound better than nothing?

A loose bound can lead to significant under-estimation. E.g. for the filter a > 50: if the actual range after a local limit is [40, 60] while the min and max in the stats are still [0, 60], then the filter factor will be 1/6 instead of 1/2.
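The arithmetic behind that example can be checked with a short sketch. It assumes values are uniformly distributed over [min, max], which is a simplification made for illustration; the object and function names are hypothetical and this is not Spark's filter estimation code.

```scala
// Hypothetical, standalone sketch of range-based selectivity for `a > threshold`.
object FilterFactorSketch {
  // Fraction of the [min, max] range that lies above the threshold,
  // assuming a uniform distribution of values.
  def greaterThanSelectivity(min: Double, max: Double, threshold: Double): Double =
    if (threshold >= max) 0.0
    else if (threshold < min) 1.0
    else (max - threshold) / (max - min)
}
```

With the stale bounds [0, 60] the factor is (60 - 50) / (60 - 0) = 1/6, while the actual range [40, 60] gives (60 - 50) / (60 - 40) = 1/2, so the stale stats under-estimate the filter output by a factor of three.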

wzhfy · 2017-03-06T05:11:42Z

@cloud-fan Does this look good to you now?

cloud-fan · 2017-03-06T21:25:32Z

...src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala

nit:

val attr = ... val colStat = ...

cloud-fan · 2017-03-06T21:27:11Z

LGTM except one minor comment

gatorsmile · 2017-03-06T22:58:30Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

Nit: the whole sentence does not have a period. How about rewriting it like?

The output row count of LocalLimit should be the sum of row counts from each partition. However, since the number of partitions is not available here, we just use statistics of the child. Because the distribution after a limit operation is unknown, we do not propagate the column stats.

gatorsmile · 2017-03-06T23:16:55Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

Nit: val rowCount: BigInt = childStats.rowCount.map(_.min(limit)).getOrElse(limit)

gatorsmile · 2017-03-06T23:18:10Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

Nit: childStats.copy(attributeStats = AttributeMap(Nil))
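Taken together, the review comments in this thread suggest the following shape for the two limit operators. This is a standalone sketch, not the merged code: `Stats`, `rowSize`, and the string-valued `columnStats` map are hypothetical stand-ins for Spark's types, and flooring the size at 1 byte is an assumption that mirrors the request for a test where the row count is zero but sizeInBytes is not.

```scala
// Hypothetical, standalone sketch; it does not use Spark's Statistics or AttributeMap.
object LimitEstimationSketch {
  final case class Stats(
      sizeInBytes: BigInt,
      rowCount: Option[BigInt],
      columnStats: Map[String, String])

  // GlobalLimit: the output has at most `limit` rows, and no more than the child has.
  def globalLimitStats(child: Stats, limit: BigInt, rowSize: BigInt): Stats = {
    val rowCount = child.rowCount.map(_.min(limit)).getOrElse(limit)
    // Assumption: keep the size non-zero even when the row count is zero.
    val sizeInBytes = (rowCount * rowSize).max(BigInt(1))
    Stats(sizeInBytes, Some(rowCount), Map.empty)
  }

  // LocalLimit: the partition count is unknown, so keep the child's size and
  // row count, but drop column stats because the distribution is unknown.
  def localLimitStats(child: Stats): Stats =
    child.copy(columnStats = Map.empty)
}
```

Dropping the column stats in both cases follows the reasoning given in the PR description: the value distribution after a limit is not known.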

gatorsmile · 2017-03-06T23:21:25Z

LGTM except minor comments.

SparkQA · 2017-03-07T04:32:16Z

Test build #74060 has finished for PR 16696 at commit 0c42ea2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-03-07T05:45:50Z

Thanks! Merging to master.

wzhfy force-pushed the limitEstimation branch from 62013f5 to b88fac5 on January 25, 2017 03:03

viirya reviewed Jan 26, 2017

gatorsmile reviewed Jan 26, 2017

gatorsmile reviewed Jan 26, 2017

viirya reviewed Jan 26, 2017

ron8hu suggested changes Feb 4, 2017

cloud-fan reviewed Feb 7, 2017

wzhfy force-pushed the limitEstimation branch from 05fcd81 to 5692939 on February 24, 2017 00:04

cloud-fan reviewed Mar 3, 2017

wzhfy force-pushed the limitEstimation branch from 5692939 to 516b114 on March 3, 2017 13:19

cloud-fan reviewed Mar 3, 2017

cloud-fan reviewed Mar 6, 2017

gatorsmile reviewed Mar 6, 2017

wangzhenhua added 3 commits March 7, 2017 10:07

rename (410c75e)

limit and sample estimation (b96aa49)

fix comments (0c42ea2)

wzhfy force-pushed the limitEstimation branch from 516b114 to 0c42ea2 on March 7, 2017 02:27

asfgit closed this in 9909f6d Mar 7, 2017
