-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Support centroids config for approx_percentile_cont_with_weight
#17003
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2b2c70a to
141c98f
Compare
141c98f to
7370b2a
Compare
7370b2a to
0cd3d8e
Compare
| // Merge new TDigests into this accumulator. Public for approx_percentile_cont_with_weight. | ||
| // | ||
| // Important: max_size Preservation | ||
| // TDigest::merge_digests uses the max_size from the first digest in the iterator. | ||
| // By putting self.digest first, we ensure the accumulator's configured max_size | ||
| // is preserved rather than being overridden by the new digests' max_size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Merge new TDigests into this accumulator. Public for approx_percentile_cont_with_weight. | |
| // | |
| // Important: max_size Preservation | |
| // TDigest::merge_digests uses the max_size from the first digest in the iterator. | |
| // By putting self.digest first, we ensure the accumulator's configured max_size | |
| // is preserved rather than being overridden by the new digests' max_size. | |
| /// Merge new TDigests into this accumulator. Public for `approx_percentile_cont_with_weight`. | |
| /// | |
| /// Important: `max_size` Preservation | |
| /// [`TDigest::merge_digests`] uses the `max_size` from the first digest in the iterator. | |
| /// By putting self.digest first, we ensure the accumulator's configured `max_size` | |
| /// is preserved rather than being overridden by the new digests' `max_size`. |
Should we make this a real docstring, so that IDEs and rustdoc can see it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Behavior-wise, I wonder if we should change it to "uses the max of max_size over all inputs" instead of "uses the first input". The reason is that orders within the planner and in queries are often not that well-defined, and depending on it easily leads to nondeterministic results that are hard to debug.
|
|
||
| ```sql | ||
| approx_percentile_cont(percentile, centroids) WITHIN GROUP (ORDER BY expression) | ||
| approx_percentile_cont(percentile [, centroids]) WITHIN GROUP (ORDER BY expression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that change. I was really confused when I read the docs the first time but saw queries with 1 argument. I think this is easier to understand 👍
| // Important: max_size Preservation | ||
| // TDigest::merge_digests uses the max_size from the first digest in the iterator. | ||
| // By putting self.digest first, we ensure the accumulator's configured max_size | ||
| // is preserved rather than being overridden by the new digests' max_size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ApproxPercentileWithWeightAccumulator::update_batch continues to use DEFAULT_MAX_SIZE to create new TDigests. Shouldn't it instead be updated to use the configured max_size value?
Then merging digests will not override max_size, and the ordering dependency goes away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the order of the iteration so DEFAULT_MAX_SIZE no longer needs to change in update_batch. Which approach should we take?
-
Keep the current approach
- Use the
max_sizefrom the firstTDigest. - Re-order merges so inner configs aren’t overwritten.
- Use the
-
Caller sets
max_size(from @jcsherin)- Have
ApproxPercentileWithWeightAccumulator::update_batchassign a uniformmax_sizeto everyTDigest. - Roll back the change in this function.
- Have
-
Take the max
max_size(from @crepererum)- Update
TDigest::merge_digeststo take the largestmax_sizeamong inputs. - Safer if callers forget to align configs, but might not be that performant than taking the first config?
- A very large
max_sizecould override user-defined config?
- Update
Would appreciate ideas on different approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3. Update
TDigest::merge_digeststo take the largestmax_sizeamong inputs.
It looks like there is an assumption implicit in the merging code. The instances which are being merged are created with the same max_size value.
If the above is not true, then the following problems come up as rightly called out here:
Safer if callers forget to align configs, but might not be that performant than taking the first config?
A very large
max_sizecould override user-defined config?
But if the max_size configuration is guaranteed to remain the same between t-digest instances then we do not have an ordering problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some more thinking, I think we have to guarantee that in the context of a single approx_percentile_cont_with_weight function call all the partial aggregates are created with the same max_size value.
Therefore the max_size should be user provided value, or the default argument if omitted.
The underlying t-digest sketch algorithm is flexible enough to allow merging digests with different max_size values into one, where the final precision depends on how merge is implemented internally.
Regardless of the algorithm's flexibility, we must define clear semantics at the function call level for predictability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to use same max_size within single function call. Went with options2 to unify the max_size config for all TDigest instances
|
|
||
| query TI | ||
| SELECT c1, approx_percentile_cont_with_weight(c2, 0.95) WITHIN GROUP (ORDER BY c3) AS c3_p95 FROM aggregate_test_100 GROUP BY 1 ORDER BY 1 | ||
| SELECT c1, approx_percentile_cont_with_weight(c2, 0.95, 200) WITHIN GROUP (ORDER BY c3) AS c3_p95 FROM aggregate_test_100 GROUP BY 1 ORDER BY 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this test pass if the third argument is 100?
| pub const DEFAULT_MAX_SIZE: usize = 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100 and 200 makes no difference in terms of this test result. I changed it here just to make sure the function with new arg can compile. I can add more tests here tho
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. Maybe keep the original test intact and then add this test with the centroids argument as a new one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
0cbb97f to
b5b6386
Compare
|
Thanks, the changes LGTM. FYI, PR #16999 will add back support for the older syntax where the function works without the WITHIN GROUP clause. Since the docs and tests will need to be updated again to reflect this, it's best to wait until that PR is merged. |
|
|
||
| /// Computes the approximate percentile continuous with weight of a set of numbers | ||
| pub fn approx_percentile_cont_with_weight( | ||
| order_by: Sort, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first argument has changed from Expr in the current API to Sort.
pub fn approx_percentile_cont_with_weight(
expression: Expr,
weight: Expr,
percentile: Expr,
) -> ExprShouldn't the order_by be an Expr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind 👍, I see that you have made the type narrower.
datafusion/datafusion/expr/src/expr.rs
Lines 903 to 911 in c6d5520
| #[derive(Clone, PartialEq, Eq, PartialOrd, Hash, Debug)] | |
| pub struct Sort { | |
| /// The expression to sort on | |
| pub expr: Expr, | |
| /// The direction of the sort | |
| pub asc: bool, | |
| /// Whether to put Nulls before all other data values | |
| pub nulls_first: bool, | |
| } |
|
Merging from main should fix the CI error. |
crepererum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you!
I'm merging this PR here since it is ready. The other one needs another round and doesn't touch the docs yet. |
…pache#17003) * Support centroids config for `approx_percentile_cont_with_weight` * Match two functions' signature * Update docs * Address comments and unify centroids config
…pache#17003) * Support centroids config for `approx_percentile_cont_with_weight` * Match two functions' signature * Update docs * Address comments and unify centroids config
…pache#17003) * Support centroids config for `approx_percentile_cont_with_weight` * Match two functions' signature * Update docs * Address comments and unify centroids config
…pache#17003) * Support centroids config for `approx_percentile_cont_with_weight` * Match two functions' signature * Update docs * Address comments and unify centroids config
…pache#17003) * Support centroids config for `approx_percentile_cont_with_weight` * Match two functions' signature * Update docs * Address comments and unify centroids config
…pache#17003) * Support centroids config for `approx_percentile_cont_with_weight` * Match two functions' signature * Update docs * Address comments and unify centroids config
Which issue does this PR close?
approx_percentile_cont_with_weight#16990.Rationale for this change
What changes are included in this PR?
Introduce new arg
centroidsforapprox_percentile_cont_with_weightand align the signature withapprox_percentile_contAre these changes tested?
Yes
Are there any user-facing changes?
Yes, API change, updated the docs