[SPARK-19350] [SQL] Cardinality estimation of Limit and Sample #16696
62013f5 to b88fac5
cc @cloud-fan @gatorsmile please review
Test build #71963 has finished for PR 16696 at commit
Test build #71964 has finished for PR 16696 at commit
Test build #71988 has finished for PR 16696 at commit
Why don't we use childStats.rowCount? If childStats.rowCount is less than limit number, I think we should use it instead of limit.
I think @wzhfy is just keeping the existing code logic. Sure, we can improve it.
We should. Otherwise the rowCount is not correct.
Agreed. We can pick the smaller value between the child node's row count and the limit number.
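A minimal sketch of the rule agreed on here, written as plain Scala with no Spark dependency. `LimitRowCount` and `estimate` are hypothetical names used only for illustration; `childRowCount` is `None` when the child has no row-count statistic.

```scala
object LimitRowCount {
  // Hypothetical helper, not Spark's API.
  // If the child's row count is known, the limit cannot produce more rows
  // than the child has; otherwise the limit itself is the best upper bound.
  def estimate(childRowCount: Option[BigInt], limit: BigInt): BigInt =
    childRowCount.map(_.min(limit)).getOrElse(limit)
}
```

With a child of 10 rows and a limit of 100 this yields 10, and with a child of 10 rows and a limit of 2 it yields 2.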
ceil -> ceiling
sampledNumber -> sampledRowCount
To make the stats more accurate, yes, we can use the smaller of childStats.rowCount and limit as the outputRowCount of getOutputSize.
I'd still prefer adding a comment above this line:
// rowCount * (overhead + column size)
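For illustration, the formula in that comment could look roughly like the sketch below. This is not the PR's code: `OutputSize`, `estimate`, and `columnSizes` are hypothetical, and the 8-byte per-row overhead is an assumed placeholder, since the real overhead depends on the row format.

```scala
object OutputSize {
  // Assumed per-row overhead in bytes (placeholder value for this sketch).
  val RowOverhead: BigInt = BigInt(8)

  // columnSizes: hypothetical average width in bytes of each output column.
  def estimate(rowCount: BigInt, columnSizes: Seq[Long]): BigInt = {
    val sizePerRow = RowOverhead + columnSizes.map(BigInt(_)).sum
    // rowCount * (overhead + column size)
    rowCount * sizePerRow
  }
}
```

For example, 10 rows with columns of 4 and 8 bytes give 10 * (8 + 12) = 200 bytes under these assumptions.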
Could you replace the above three lines by checkStats?
You know, this is a utility function. We can make it more general by having two expected stats values
rename plan2 to childPlan
The same here
For the limit estimation test cases, we may add a test with a limit number greater than the child node's row count. This test can show whether we properly select the smaller value between the limit number and the child node's row count.
Overall looks good to me. :) Could you add a few more test cases?
Actually the rowCount for LocalLimit and GlobalLimit should be different. For LocalLimit, the limit is just the row count for one partition, but I don't think we can get the number of partitions here. As the actual row count might be much larger than the limit, maybe we should set it to None.
This is a good point, maybe we should still separate the stats calculation of GlobalLimit and LocalLimit.
Do we assume limitExpr is foldable? There seems to be no type-checking logic for it.
05fcd81 to 5692939
Test build #73372 has finished for PR 16696 at commit
retest this please
Test build #73391 has finished for PR 16696 at commit
@cloud-fan @gatorsmile I've updated this pr and also added test cases, please review.
retest this please
why remove this test suite?
I didn't remove it, just renamed it.
Can you use git mv? Then, it will keep the change history.
How do I use git mv now? Do I need to revert to the unchanged version, run git mv, and then redo all the changes?
yes. :)
Test build #73824 has started for PR 16696 at commit
BasicStatsEstimationSuite?
Good name:)
Does this test duplicate the newly added limit test?
yea I think so, let me remove it.
Test build #73845 has finished for PR 16696 at commit
but I think the max/min should still be corrected?
How can we make sure max/min values are still there after limit? Otherwise it will be a very loose bound of max/min.
hmm, what's the strategy here? is a loose bound better than nothing?
A loose bound can lead to significant under-estimation. E.g. for the filter a > 50, if the actual range after the local limit is [40, 60] while the min and max in the stats are still [0, 60], then the filter factor will be 1/6 instead of 1/2.
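The arithmetic in this example can be reproduced with a small sketch. It assumes values are uniformly distributed between min and max, which is a simplifying assumption of this illustration; `FilterFactor` and `greaterThan` are hypothetical names, not Spark's estimation code.

```scala
object FilterFactor {
  // Estimated fraction of rows satisfying `a > threshold`, assuming the
  // values of `a` are spread uniformly over [min, max].
  def greaterThan(threshold: Double, min: Double, max: Double): Double =
    if (threshold >= max) 0.0
    else if (threshold < min) 1.0
    else (max - threshold) / (max - min)
}
```

With the true range [40, 60] the factor for a > 50 is (60 - 50) / (60 - 40) = 1/2, while the stale range [0, 60] gives (60 - 50) / (60 - 0) = 1/6.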
@cloud-fan Does this look good to you now?
nit:
val attr = ...
val colStat = ...
LGTM except one minor comment
Nit: the whole sentence does not have a period. How about rewriting it like this?
The output row count of LocalLimit should be the sum of row counts from each partition. However, since the number of partitions is not available here, we just use statistics of the child. Because the distribution after a limit operation is unknown, we do not propagate the column stats.
Nit: val rowCount: BigInt = childStats.rowCount.map(_.min(limit)).getOrElse(limit)
Nit: childStats.copy(attributeStats = AttributeMap(Nil))
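The LocalLimit behaviour described in these comments (reuse the child's statistics, drop the per-column ones) can be sketched with a simplified model. `Stats` and `localLimit` are hypothetical stand-ins written for this illustration; they are not Spark's `Statistics` or `AttributeMap`.

```scala
object LocalLimitStats {
  // Simplified stand-in for a statistics record (not Spark's class).
  // columnStats maps a column name to its (min, max) range.
  final case class Stats(
      sizeInBytes: BigInt,
      rowCount: Option[BigInt],
      columnStats: Map[String, (Double, Double)])

  // The number of partitions is unknown, so size and row count are taken
  // from the child unchanged. Column stats are dropped because the value
  // distribution after a limit is unknown.
  def localLimit(child: Stats): Stats =
    child.copy(columnStats = Map.empty)
}
```

Dropping the column stats rather than keeping stale min/max values avoids the loose-bound problem discussed earlier in the thread.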
LGTM except minor comments.
Test build #74060 has finished for PR 16696 at commit
Thanks! Merging to master.
What changes were proposed in this pull request?
Before this PR, LocalLimit/GlobalLimit/Sample propagate the same row count and column stats from their child, which is incorrect.
We can get the correct rowCount in Statistics for GlobalLimit/Sample whether CBO is enabled or not.
We don't know the rowCount for LocalLimit because we don't know the number of partitions at that time. Column stats should not be propagated because we don't know the distribution of columns after Limit or Sample.
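As a rough sketch of the Sample case, assume the sampled row count is the child's row count scaled by the sampling fraction (upper bound minus lower bound) and rounded up. That reading is an assumption based on the review comments about ceiling and sampledRowCount; `SampleRowCount` and `estimate` are hypothetical names, not the PR's code.

```scala
object SampleRowCount {
  // lowerBound and upperBound are assumed to lie in [0, 1] with
  // lowerBound <= upperBound. Returns None when the child's row count
  // is unknown.
  def estimate(
      childRowCount: Option[BigInt],
      lowerBound: Double,
      upperBound: Double): Option[BigInt] = {
    val ratio = BigDecimal(upperBound) - BigDecimal(lowerBound)
    childRowCount.map { c =>
      // Round up so that a non-empty sample is never estimated as 0 rows.
      (BigDecimal(c) * ratio).setScale(0, BigDecimal.RoundingMode.CEILING).toBigInt
    }
  }
}
```

For instance, a 25% sample of a 10-row child is estimated at 3 rows (2.5 rounded up).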
How was this patch tested?
Added test cases.