
Conversation

@liamzwbao
Contributor

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Introduce a new `centroids` argument for approx_percentile_cont_with_weight and align its signature with approx_percentile_cont.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes. This is an API change, and the docs have been updated.

@github-actions github-actions bot added the documentation, sqllogictest, proto, and functions labels Aug 1, 2025
@liamzwbao liamzwbao force-pushed the issue-16990-percentile-cont branch from 2b2c70a to 141c98f Compare August 1, 2025 02:17
@liamzwbao liamzwbao marked this pull request as ready for review August 1, 2025 02:17
@liamzwbao liamzwbao force-pushed the issue-16990-percentile-cont branch from 141c98f to 7370b2a Compare August 1, 2025 02:18
@liamzwbao liamzwbao force-pushed the issue-16990-percentile-cont branch from 7370b2a to 0cd3d8e Compare August 1, 2025 02:50
Comment on lines 369 to 374
// Merge new TDigests into this accumulator. Public for approx_percentile_cont_with_weight.
//
// Important: max_size Preservation
// TDigest::merge_digests uses the max_size from the first digest in the iterator.
// By putting self.digest first, we ensure the accumulator's configured max_size
// is preserved rather than being overridden by the new digests' max_size.
Contributor

Suggested change
// Merge new TDigests into this accumulator. Public for approx_percentile_cont_with_weight.
//
// Important: max_size Preservation
// TDigest::merge_digests uses the max_size from the first digest in the iterator.
// By putting self.digest first, we ensure the accumulator's configured max_size
// is preserved rather than being overridden by the new digests' max_size.
/// Merge new TDigests into this accumulator. Public for `approx_percentile_cont_with_weight`.
///
/// Important: `max_size` Preservation
/// [`TDigest::merge_digests`] uses the `max_size` from the first digest in the iterator.
/// By putting self.digest first, we ensure the accumulator's configured `max_size`
/// is preserved rather than being overridden by the new digests' `max_size`.

Should we make this a real docstring, so that IDEs and rustdoc can see it?
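
For readers following along, here is a minimal sketch of the ordering trick the comment describes. It assumes a toy `ToyDigest` type whose merge copies `max_size` from the first input, as the comment says of `TDigest::merge_digests`. `ToyDigest`, `ToyAccumulator`, and their methods are invented for illustration and are not DataFusion's actual implementation.

```rust
use std::iter;

/// Toy stand-in for a t-digest. It tracks only the configured size and a
/// value count, which is enough to show which `max_size` survives a merge.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyDigest {
    pub max_size: usize,
    pub count: u64,
}

impl ToyDigest {
    /// Assumed behavior: the merged digest takes `max_size` from the first
    /// digest in the iterator and accumulates the counts of the rest.
    pub fn merge_digests<'a>(digests: impl IntoIterator<Item = &'a ToyDigest>) -> ToyDigest {
        let mut iter = digests.into_iter();
        let mut merged = iter.next().expect("at least one digest").clone();
        for d in iter {
            merged.count += d.count;
        }
        merged
    }
}

/// Hypothetical accumulator that owns a digest with a configured `max_size`.
pub struct ToyAccumulator {
    digest: ToyDigest,
}

impl ToyAccumulator {
    pub fn new(max_size: usize) -> Self {
        Self { digest: ToyDigest { max_size, count: 0 } }
    }

    /// Merge new digests into this accumulator. `self.digest` is chained
    /// first, so its `max_size` is the one that is preserved.
    pub fn merge_digests(&mut self, digests: &[ToyDigest]) {
        let all = iter::once(&self.digest).chain(digests.iter());
        self.digest = ToyDigest::merge_digests(all);
    }

    pub fn max_size(&self) -> usize {
        self.digest.max_size
    }

    pub fn count(&self) -> u64 {
        self.digest.count
    }
}
```

In this sketch, an accumulator created with `ToyAccumulator::new(200)` keeps `max_size == 200` even when every incoming digest was built with 100.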

Contributor

Behavior-wise, I wonder if we should change it to "uses the max of max_size over all inputs" instead of "uses the first input". The reason is that ordering within the planner and in queries is often not well-defined, and depending on it easily leads to nondeterministic results that are hard to debug.
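
A rough sketch of what "max of max_size over all inputs" could look like, using a toy digest type rather than the real `TDigest`. The names `ToyDigest` and `merge_max_wins` are hypothetical. The point is only that the chosen `max_size` no longer depends on input order.

```rust
/// Toy stand-in for a t-digest; not DataFusion's `TDigest`.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyDigest {
    pub max_size: usize,
    pub count: u64,
}

/// Merge that picks the largest `max_size` among all inputs, so any
/// permutation of the same inputs yields the same result.
/// Returns `None` for an empty slice.
pub fn merge_max_wins(digests: &[ToyDigest]) -> Option<ToyDigest> {
    let max_size = digests.iter().map(|d| d.max_size).max()?;
    let count: u64 = digests.iter().map(|d| d.count).sum();
    Some(ToyDigest { max_size, count })
}
```

Merging digests with sizes 100 and 200 gives a `max_size` of 200 in either order, which is the determinism property being asked for here.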


```sql
approx_percentile_cont(percentile, centroids) WITHIN GROUP (ORDER BY expression)
approx_percentile_cont(percentile [, centroids]) WITHIN GROUP (ORDER BY expression)
```
Contributor

I like that change. I was really confused when I read the docs the first time but saw queries with 1 argument. I think this is easier to understand 👍

Comment on lines 371 to 374
// Important: max_size Preservation
// TDigest::merge_digests uses the max_size from the first digest in the iterator.
// By putting self.digest first, we ensure the accumulator's configured max_size
// is preserved rather than being overridden by the new digests' max_size.
Contributor

@jcsherin jcsherin Aug 1, 2025

The ApproxPercentileWithWeightAccumulator::update_batch continues to use DEFAULT_MAX_SIZE to create new TDigests. Shouldn't it instead be updated to use the configured max_size value?

Then merging digests will not override max_size, and the ordering dependency goes away.

Contributor Author

I changed the order of the iteration so DEFAULT_MAX_SIZE no longer needs to change in update_batch. Which approach should we take?

  1. Keep the current approach

    • Use the max_size from the first TDigest.
    • Re-order merges so inner configs aren’t overwritten.
  2. Caller sets max_size (from @jcsherin)

    • Have ApproxPercentileWithWeightAccumulator::update_batch assign a uniform max_size to every TDigest.
    • Roll back the change in this function.
  3. Take the max max_size (from @crepererum)

    • Update TDigest::merge_digests to take the largest max_size among inputs.
    • Safer if callers forget to align configs, but might be less performant than taking the first config?
    • A very large max_size could override user-defined config?

I would appreciate ideas on other approaches.

Contributor

@jcsherin jcsherin Aug 1, 2025

3. Update TDigest::merge_digests to take the largest max_size among inputs.

The merging code appears to assume implicitly that the instances being merged were all created with the same max_size value.

https://github.com/apache/datafusion/blob/6ea01d13362f33aca5434b18c632fe2f43e60ab9/datafusion/functions-aggregate-common/src/tdigest.rs#L393C9-L393C58

If the above is not true, then the following problems come up as rightly called out here:

  • Safer if callers forget to align configs, but might be less performant than taking the first config?

  • A very large max_size could override user-defined config?

But if the max_size configuration is guaranteed to remain the same between t-digest instances then we do not have an ordering problem.

Contributor

After some more thinking, I think we have to guarantee that, in the context of a single approx_percentile_cont_with_weight function call, all the partial aggregates are created with the same max_size value.

Therefore, max_size should be the user-provided value, or the default if the argument is omitted.

The underlying t-digest sketch algorithm is flexible enough to allow merging digests with different max_size values into one, where the final precision depends on how merge is implemented internally.

Regardless of the algorithm's flexibility, we must define clear semantics at the function call level for predictability.

Contributor Author

It makes sense to use the same max_size within a single function call. I went with option 2 to unify the max_size config for all TDigest instances.
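
As a rough illustration of option 2, the sketch below has the accumulator resolve `max_size` once (user value or default) and build every per-batch digest with it, so a merge never sees mismatched configurations. All types here (`ToyDigest`, `ToyWeightedAccumulator`) are invented stand-ins and weights are ignored. Only `DEFAULT_MAX_SIZE = 100` is taken from the code quoted in this thread.

```rust
/// Default number of centroids, matching the constant quoted in this thread.
pub const DEFAULT_MAX_SIZE: usize = 100;

/// Toy stand-in for a t-digest: records its configuration and how many
/// values it has absorbed. Not DataFusion's `TDigest`.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyDigest {
    pub max_size: usize,
    pub count: usize,
}

impl ToyDigest {
    pub fn new(max_size: usize) -> Self {
        Self { max_size, count: 0 }
    }

    pub fn with_values(max_size: usize, values: &[f64]) -> Self {
        Self { max_size, count: values.len() }
    }

    /// Merge that relies on both sides sharing one `max_size`.
    pub fn merge(&mut self, other: &ToyDigest) {
        assert_eq!(self.max_size, other.max_size, "digests must share max_size");
        self.count += other.count;
    }
}

/// Hypothetical accumulator: `max_size` is fixed at construction time.
pub struct ToyWeightedAccumulator {
    max_size: usize,
    digest: ToyDigest,
}

impl ToyWeightedAccumulator {
    /// `centroids` is the optional user-provided argument.
    pub fn new(centroids: Option<usize>) -> Self {
        let max_size = centroids.unwrap_or(DEFAULT_MAX_SIZE);
        Self { max_size, digest: ToyDigest::new(max_size) }
    }

    /// Every per-batch digest uses the configured `max_size`, never the
    /// default, so merge order cannot change the resulting configuration.
    pub fn update_batch(&mut self, values: &[f64]) {
        let batch = ToyDigest::with_values(self.max_size, values);
        self.digest.merge(&batch);
    }

    pub fn digest(&self) -> &ToyDigest {
        &self.digest
    }
}
```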


query TI
SELECT c1, approx_percentile_cont_with_weight(c2, 0.95) WITHIN GROUP (ORDER BY c3) AS c3_p95 FROM aggregate_test_100 GROUP BY 1 ORDER BY 1
SELECT c1, approx_percentile_cont_with_weight(c2, 0.95, 200) WITHIN GROUP (ORDER BY c3) AS c3_p95 FROM aggregate_test_100 GROUP BY 1 ORDER BY 1
Contributor

Shouldn't this test pass if the third argument is 100?

pub const DEFAULT_MAX_SIZE: usize = 100;

Contributor Author

Using 100 or 200 makes no difference to this test's result. I changed it here just to make sure the function compiles with the new argument. I can add more tests here, though.

Contributor

That makes sense. Maybe keep the original test intact and then add this test with the centroids argument as a new one?

Contributor Author

done

@github-actions github-actions bot added the sql, logical-expr, physical-expr, core, substrait, common, execution, datasource, physical-plan, and spark labels Aug 4, 2025
@liamzwbao liamzwbao force-pushed the issue-16990-percentile-cont branch from 0cbb97f to b5b6386 Compare August 4, 2025 23:03
@github-actions github-actions bot removed the sql, logical-expr, physical-expr, and core labels Aug 4, 2025
@github-actions github-actions bot removed the substrait, common, execution, datasource, physical-plan, and spark labels Aug 4, 2025
@jcsherin
Contributor

jcsherin commented Aug 5, 2025

@liamzwbao

Thanks, the changes LGTM.

FYI, PR #16999 will add back support for the older syntax where the function works without the WITHIN GROUP clause. Since the docs and tests will need to be updated again to reflect this, it's best to wait until that PR is merged.


/// Computes the approximate percentile continuous with weight of a set of numbers
pub fn approx_percentile_cont_with_weight(
    order_by: Sort,
Contributor

@jcsherin jcsherin Aug 5, 2025

https://docs.rs/datafusion/latest/datafusion/functions_aggregate/approx_percentile_cont_with_weight/fn.approx_percentile_cont_with_weight.html

The first argument has changed from Expr in the current API to Sort.

pub fn approx_percentile_cont_with_weight(
    expression: Expr,
    weight: Expr,
    percentile: Expr,
) -> Expr

Shouldn't the order_by be an Expr?

Contributor

@jcsherin jcsherin Aug 5, 2025

Never mind 👍, I see that you have made the type narrower.

#[derive(Clone, PartialEq, Eq, PartialOrd, Hash, Debug)]
pub struct Sort {
    /// The expression to sort on
    pub expr: Expr,
    /// The direction of the sort
    pub asc: bool,
    /// Whether to put Nulls before all other data values
    pub nulls_first: bool,
}
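
To make the "narrower type" point concrete, here is a small self-contained sketch. `ToyExpr`, `ToySort`, and `describe_order_by` are stand-ins invented for this illustration that mirror the shape of the struct above. They are not DataFusion's types. The idea is that a parameter of the sort type forces callers to state direction and null placement, which a bare expression would leave unspecified.

```rust
/// Toy stand-in for an expression; invented for this illustration.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub enum ToyExpr {
    Column(String),
}

/// Mirrors the shape of the `Sort` struct quoted above.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct ToySort {
    pub expr: ToyExpr,
    pub asc: bool,
    pub nulls_first: bool,
}

impl ToyExpr {
    /// Wrap an expression together with its ordering options.
    pub fn sort(self, asc: bool, nulls_first: bool) -> ToySort {
        ToySort { expr: self, asc, nulls_first }
    }
}

/// Accepts only a fully specified ordering, never a bare expression.
pub fn describe_order_by(order_by: &ToySort) -> String {
    let ToyExpr::Column(name) = &order_by.expr;
    format!(
        "{} {} NULLS {}",
        name,
        if order_by.asc { "ASC" } else { "DESC" },
        if order_by.nulls_first { "FIRST" } else { "LAST" },
    )
}
```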

@jcsherin
Contributor

jcsherin commented Aug 5, 2025

Merging from main should fix the CI error.

Contributor

@crepererum crepererum left a comment

thank you!

@crepererum
Contributor

FYI, PR #16999 will add back support for the older syntax where the function works without the WITHIN GROUP clause. Since the docs and tests will need to be updated again to reflect this, it's best to wait until that PR is merged.

I'm merging this PR here since it is ready. The other one needs another round and doesn't touch the docs yet.

@crepererum crepererum merged commit 183ff66 into apache:main Aug 6, 2025
28 checks passed
@liamzwbao liamzwbao deleted the issue-16990-percentile-cont branch August 6, 2025 22:29
hknlof pushed a commit to hknlof/datafusion that referenced this pull request Aug 20, 2025
…pache#17003)

* Support centroids config for `approx_percentile_cont_with_weight`

* Match two functions' signature

* Update docs

* Address comments and unify centroids config
crepererum pushed a commit to influxdata/arrow-datafusion that referenced this pull request Aug 25, 2025
crepererum pushed a commit to influxdata/arrow-datafusion that referenced this pull request Sep 5, 2025
erratic-pattern pushed a commit to influxdata/arrow-datafusion that referenced this pull request Oct 6, 2025
erratic-pattern pushed a commit to influxdata/arrow-datafusion that referenced this pull request Oct 21, 2025

Development

Successfully merging this pull request may close these issues.

Support "number of centroids" for approx_percentile_cont_with_weight

3 participants