Conversation

@wzhfy
Contributor

@wzhfy wzhfy commented Jan 25, 2017

What changes were proposed in this pull request?

Before this PR, LocalLimit/GlobalLimit/Sample propagate the same row count and column stats as their child, which is incorrect.
We can get the correct rowCount in Statistics for GlobalLimit/Sample whether CBO is enabled or not.
We don't know the rowCount for LocalLimit because we don't know the number of partitions at that time. Column stats should not be propagated because we don't know the distribution of columns after Limit or Sample.
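The Sample case described above can be sketched as follows. This is a simplified, self-contained illustration, not the PR's exact code: `Statistics` here is a stand-in for Spark's class, and the names follow the review suggestions below (`ceiling`, `sampledRowCount`).

```scala
// Simplified stand-in for Spark's Statistics class.
case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt])

def sampleStats(lowerBound: Double, upperBound: Double, child: Statistics): Statistics = {
  val fraction = upperBound - lowerBound
  // Scale the child's row count by the sampling fraction, rounding up.
  val sampledRowCount = child.rowCount.map { n =>
    BigInt(math.ceil(n.toDouble * fraction).toLong)
  }
  // Scale the size estimate by the same fraction, keeping it at least 1.
  val sampledSize = (BigDecimal(child.sizeInBytes) * BigDecimal(fraction))
    .setScale(0, BigDecimal.RoundingMode.CEILING)
    .toBigInt
  Statistics(sampledSize.max(1), sampledRowCount)
}
```

Note that no column stats appear here at all: per the description above, they are simply not propagated.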

How was this patch tested?

Added test cases.

@wzhfy
Contributor Author

wzhfy commented Jan 25, 2017

cc @cloud-fan @gatorsmile please review

@SparkQA

SparkQA commented Jan 25, 2017

Test build #71963 has finished for PR 16696 at commit 62013f5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2017

Test build #71964 has finished for PR 16696 at commit b88fac5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class LimitNode extends UnaryNode
  • case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends LimitNode
  • case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends LimitNode

@SparkQA

SparkQA commented Jan 25, 2017

Test build #71988 has finished for PR 16696 at commit 05fcd81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Why don't we use childStats.rowCount? If childStats.rowCount is less than the limit number, I think we should use it instead of the limit.

Member

I think @wzhfy is just keeping the existing code logics. Sure, we can improve it.

Member

We should. Otherwise the rowCount is not correct.

Contributor

Agreed. We can pick the smaller value between the child node's row count and the limit number.

Member

ceil -> ceiling

Member

sampledNumber -> sampledRowCount

Member

To make the stats more accurate, yes, we can use the smaller of childStats.rowCount and limit as the outputRowCount for getOutputSize.

Member

I still prefer adding a comment above this line:

      // rowCount * (overhead + column size)
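The suggested comment corresponds to a size formula along these lines. This is an illustrative sketch only: the 8-byte per-row overhead and the helper's signature are assumptions for this sketch, not Spark's exact `getOutputSize` code.

```scala
// rowCount * (overhead + column size): estimate the output size from the
// row count, a fixed per-row overhead (assumed 8 bytes here), and the
// summed per-column sizes, never reporting less than 1 byte.
def getOutputSize(rowCount: BigInt, columnSizes: Seq[Long], rowOverhead: Long = 8L): BigInt =
  (rowCount * (rowOverhead + columnSizes.sum)).max(1)
```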

Member

Could you replace the above three lines by checkStats?

Member

You know, this is a utility function. We can make it more general by having two expected stats values

Member

rename plan2 to childPlan

Member

The same here

Contributor

For the limit estimation test cases, we may add a test with a limit number greater than the child node's row count. This test can show whether we properly select the smaller value between the limit number and the child node's row count.

@gatorsmile
Member

Overall looks good to me. : ) Could you add a few more test cases?

  • One where the child has a smaller row count than the limit.
  • Another where the row count is zero but sizeInBytes is not zero.

Member

Actually the rowCount for LocalLimit and GlobalLimit should be different. For LocalLimit, limit is just the row count for one partition. But we can't get the number of partitions here, I think. As the actual row count might be much bigger than the limit, maybe we should set it as None.
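One way to handle the LocalLimit case discussed above can be sketched as follows. This is a simplified stand-in for Spark's classes, not the PR's exact code: since `limit` only bounds each partition's output and the number of partitions is unknown during estimation, LocalLimit keeps the child's row count and size but drops the per-column stats.

```scala
// Simplified stand-ins for Spark's ColumnStat and Statistics classes.
case class ColumnStat(distinctCount: BigInt)
case class Statistics(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt],
    attributeStats: Map[String, ColumnStat] = Map.empty)

// LocalLimit: keep the child's estimates, but drop column stats whose
// post-limit distribution is unknown.
def localLimitStats(child: Statistics): Statistics =
  child.copy(attributeStats = Map.empty)
```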

Contributor

This is a good point, maybe we should still separate the stats calculation of GlobalLimit and LocalLimit.


Contributor

Do we assume limitExpr is foldable? It seems there is no type checking logic for it.

@SparkQA

SparkQA commented Feb 24, 2017

Test build #73372 has finished for PR 16696 at commit 5692939.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Feb 24, 2017

retest this please

@SparkQA

SparkQA commented Feb 24, 2017

Test build #73391 has finished for PR 16696 at commit 5692939.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Feb 24, 2017

@cloud-fan @gatorsmile I've updated this PR and also added test cases, please review.

@cloud-fan
Contributor

retest this please

Contributor

why remove this test suite?

Contributor Author

I didn't remove it, just renamed it.

Member

Can you use git mv? Then, it will keep the change history.

Contributor Author

How can I use git mv now? Do I need to revert to the unchanged version, run git mv, and then redo all the changes?

Member

yes. :)

@SparkQA

SparkQA commented Mar 3, 2017

Test build #73824 has started for PR 16696 at commit 5692939.

Contributor

BasicStatsEstimationSuite?

Contributor Author

Good name:)

Contributor

is this test duplicated with the newly added limit test?

Contributor Author

@wzhfy wzhfy Mar 3, 2017

yea I think so, let me remove it.

@wzhfy wzhfy force-pushed the limitEstimation branch from 5692939 to 516b114 Compare March 3, 2017 13:19
@SparkQA

SparkQA commented Mar 3, 2017

Test build #73845 has finished for PR 16696 at commit 516b114.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class BasicStatsEstimationSuite extends StatsEstimationTestBase

Contributor

but I think the max/min should still be correct?

Contributor Author

How can we make sure the max/min values are still there after a limit? Otherwise it will be a very loose bound on max/min.

Contributor

hmm, what's the strategy here? is a loose bound better than nothing?

Contributor Author

@wzhfy wzhfy Mar 4, 2017

A loose bound can lead to significant under-estimation. E.g. for a > 50, after a local limit the actual range is [40, 60], while max and min in the stats are still [0, 60], so the filter factor will be 1/6 instead of 1/2.
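The arithmetic in this example can be made explicit. Assuming a uniform distribution of values (Spark's usual estimation assumption), the selectivity of `a > v` over range [min, max] is (max - v) / (max - min); the function name here is illustrative, not Spark's:

```scala
// Estimated selectivity of the predicate `a > v`, assuming values of `a`
// are uniformly distributed over [min, max].
def greaterThanSelectivity(min: Double, max: Double, v: Double): Double =
  if (v >= max) 0.0
  else if (v < min) 1.0
  else (max - v) / (max - min)

// Stale bounds [0, 60] give 10/60 = 1/6 for a > 50, while the true
// post-limit range [40, 60] gives 10/20 = 1/2.
```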

@wzhfy
Contributor Author

wzhfy commented Mar 6, 2017

@cloud-fan Does this look good to you now?

Contributor

nit:

val attr = ...
val colStat = ...

@cloud-fan
Contributor

LGTM except one minor comment

Member

Nit: the whole sentence does not have a period. How about rewriting it like this?

The output row count of LocalLimit should be the sum of the row counts from each partition. However, since the number of partitions is not available here, we just use the statistics of the child. Because the distribution after a limit operation is unknown, we do not propagate the column stats.

Member

Nit: val rowCount: BigInt = childStats.rowCount.map(_.min(limit)).getOrElse(limit)

Member

Nit: childStats.copy(attributeStats = AttributeMap(Nil))
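Taken together, the two nits above amount to a GlobalLimit estimation along these lines. This is a standalone sketch: `Statistics` and the attribute-stats map are simplified stand-ins for Spark's classes (the real code uses `AttributeMap`).

```scala
// Simplified stand-in for Spark's Statistics class.
case class Statistics(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt],
    attributeStats: Map[String, Long] = Map.empty)

// GlobalLimit: cap the row count at the limit (using the child's row count
// when it is known and smaller), and drop column stats whose post-limit
// distribution is unknown.
def globalLimitStats(limit: BigInt, childStats: Statistics): Statistics = {
  val rowCount: BigInt = childStats.rowCount.map(_.min(limit)).getOrElse(limit)
  childStats.copy(rowCount = Some(rowCount), attributeStats = Map.empty)
}
```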

@gatorsmile
Member

LGTM except minor comments.

@wzhfy wzhfy force-pushed the limitEstimation branch from 516b114 to 0c42ea2 Compare March 7, 2017 02:27
@SparkQA

SparkQA commented Mar 7, 2017

Test build #74060 has finished for PR 16696 at commit 0c42ea2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 9909f6d Mar 7, 2017