Support `centroids` config for `approx_percentile_cont_with_weight` #17003

liamzwbao · 2025-08-01T02:13:45Z

Which issue does this PR close?

Closes Support "number of centroids" for approx_percentile_cont_with_weight #16990.

Rationale for this change

What changes are included in this PR?

Introduce a new `centroids` argument for `approx_percentile_cont_with_weight` and align its signature with `approx_percentile_cont`.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes. This is an API change, and the docs have been updated.

crepererum · 2025-08-01T09:41:17Z

datafusion/functions-aggregate/src/approx_percentile_cont.rs

+    // Merge new TDigests into this accumulator. Public for approx_percentile_cont_with_weight.
+    //
+    // Important: max_size Preservation
+    // TDigest::merge_digests uses the max_size from the first digest in the iterator.
+    // By putting self.digest first, we ensure the accumulator's configured max_size
+    // is preserved rather than being overridden by the new digests' max_size.


Suggested change

-    // Merge new TDigests into this accumulator. Public for approx_percentile_cont_with_weight.
-    //
-    // Important: max_size Preservation
-    // TDigest::merge_digests uses the max_size from the first digest in the iterator.
-    // By putting self.digest first, we ensure the accumulator's configured max_size
-    // is preserved rather than being overridden by the new digests' max_size.
+    /// Merge new TDigests into this accumulator. Public for `approx_percentile_cont_with_weight`.
+    ///
+    /// Important: `max_size` Preservation
+    /// [`TDigest::merge_digests`] uses the `max_size` from the first digest in the iterator.
+    /// By putting self.digest first, we ensure the accumulator's configured `max_size`
+    /// is preserved rather than being overridden by the new digests' `max_size`.

Should we make this a real docstring, so that IDEs and rustdoc can see it?

Behavior-wise, I wonder if we should change it to "uses the max of max_size over all inputs" instead of "uses the first input". The reason is that orders within the planner and in queries are often not that well-defined, and depending on it easily leads to nondeterministic results that are hard to debug.
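To make the ordering concern concrete, here is a minimal sketch of the two policies under discussion. `ToyDigest`, `merged_max_size_first`, and `merged_max_size_max` are hypothetical names invented for illustration; this is not the real `TDigest::merge_digests`, and it models only the `max_size` bookkeeping, not the centroid merge itself.

```rust
/// Hypothetical stand-in for a t-digest: it only carries the centroid budget.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ToyDigest {
    pub max_size: usize,
}

/// Policy described in the code comment: the first digest decides `max_size`,
/// so the result depends on the order of the inputs.
pub fn merged_max_size_first(digests: &[ToyDigest]) -> Option<usize> {
    digests.first().map(|d| d.max_size)
}

/// Alternative raised in review: take the largest `max_size` over all inputs,
/// which gives the same answer for any input order.
pub fn merged_max_size_max(digests: &[ToyDigest]) -> Option<usize> {
    digests.iter().map(|d| d.max_size).max()
}
```

With inputs of size 100 and 200, the first policy yields 100 or 200 depending on which digest comes first, while the second always yields 200.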

crepererum · 2025-08-01T09:44:33Z

docs/source/user-guide/sql/aggregate_functions.md


```sql
-approx_percentile_cont(percentile, centroids) WITHIN GROUP (ORDER BY expression)
+approx_percentile_cont(percentile [, centroids]) WITHIN GROUP (ORDER BY expression)
```


I like that change. I was really confused when I read the docs the first time but saw queries with 1 argument. I think this is easier to understand 👍

jcsherin · 2025-08-01T10:09:23Z

datafusion/functions-aggregate/src/approx_percentile_cont.rs

+    // Important: max_size Preservation
+    // TDigest::merge_digests uses the max_size from the first digest in the iterator.
+    // By putting self.digest first, we ensure the accumulator's configured max_size
+    // is preserved rather than being overridden by the new digests' max_size.


The ApproxPercentileWithWeightAccumulator::update_batch continues to use DEFAULT_MAX_SIZE to create new TDigests. Shouldn't it instead be updated to use the configured max_size value?

Then merging digests will not override max_size, and the ordering dependency goes away.

I changed the order of the iteration so DEFAULT_MAX_SIZE no longer needs to change in update_batch. Which approach should we take?

1. Keep the current approach
   - Use the max_size from the first TDigest.
   - Re-order merges so inner configs aren’t overwritten.
2. Caller sets max_size (from @jcsherin)
   - Have ApproxPercentileWithWeightAccumulator::update_batch assign a uniform max_size to every TDigest.
   - Roll back the change in this function.
3. Take the max max_size (from @crepererum)
   - Update TDigest::merge_digests to take the largest max_size among inputs.
   - Safer if callers forget to align configs, but might be less performant than taking the first config?
   - A very large max_size could override user-defined config?

I would appreciate ideas on a different approach.

> 3. Update TDigest::merge_digests to take the largest max_size among inputs.

The merging code appears to assume implicitly that the instances being merged were all created with the same max_size value.

https://github.com/apache/datafusion/blob/6ea01d13362f33aca5434b18c632fe2f43e60ab9/datafusion/functions-aggregate-common/src/tdigest.rs#L393C9-L393C58

If the above is not true, then the following problems come up as rightly called out here:

> Safer if callers forget to align configs, but might be less performant than taking the first config?
>
> A very large max_size could override user-defined config?

But if the max_size configuration is guaranteed to remain the same between t-digest instances then we do not have an ordering problem.

After some more thinking, I think we have to guarantee that in the context of a single approx_percentile_cont_with_weight function call all the partial aggregates are created with the same max_size value.

Therefore, max_size should be the user-provided value, or the default if the argument is omitted.

The underlying t-digest sketch algorithm is flexible enough to allow merging digests with different max_size values into one, where the final precision depends on how merge is implemented internally.

Regardless of the algorithm's flexibility, we must define clear semantics at the function call level for predictability.

It makes sense to use the same max_size within a single function call. I went with option 2 to unify the max_size config for all TDigest instances.
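The approach settled on here can be sketched roughly as follows, assuming a toy accumulator that records one digest per batch. `BatchDigest` and `ToyWeightedAccumulator` are hypothetical stand-ins, not the actual `ApproxPercentileWithWeightAccumulator`; the constant mirrors the `DEFAULT_MAX_SIZE` value quoted elsewhere in this thread.

```rust
/// Mirrors the default quoted in this thread; redefined here so the sketch is self-contained.
pub const DEFAULT_MAX_SIZE: usize = 100;

/// Hypothetical per-batch digest: keeps its centroid budget and the raw values.
#[derive(Debug, Clone, PartialEq)]
pub struct BatchDigest {
    pub max_size: usize,
    pub values: Vec<f64>,
}

/// Hypothetical accumulator that fixes `max_size` once, at construction time.
pub struct ToyWeightedAccumulator {
    max_size: usize,
    digests: Vec<BatchDigest>,
}

impl ToyWeightedAccumulator {
    /// Use the caller's `centroids` value, or fall back to the default when it is omitted.
    pub fn new(centroids: Option<usize>) -> Self {
        Self {
            max_size: centroids.unwrap_or(DEFAULT_MAX_SIZE),
            digests: Vec::new(),
        }
    }

    /// Every batch digest is created with the accumulator's own `max_size`,
    /// so a later merge never sees mixed configurations.
    pub fn update_batch(&mut self, values: &[f64]) {
        self.digests.push(BatchDigest {
            max_size: self.max_size,
            values: values.to_vec(),
        });
    }

    pub fn max_size(&self) -> usize {
        self.max_size
    }

    /// True when all recorded digests share the configured `max_size`.
    pub fn is_uniform(&self) -> bool {
        self.digests.iter().all(|d| d.max_size == self.max_size)
    }
}
```

Because every digest shares one budget, the "first digest wins" rule and the "largest wins" rule give the same result, and the merge order stops mattering for `max_size`.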

jcsherin · 2025-08-01T10:20:00Z

datafusion/sqllogictest/test_files/aggregate.slt


 query TI
-SELECT c1, approx_percentile_cont_with_weight(c2, 0.95) WITHIN GROUP (ORDER BY c3) AS c3_p95 FROM aggregate_test_100 GROUP BY 1 ORDER BY 1
+SELECT c1, approx_percentile_cont_with_weight(c2, 0.95, 200) WITHIN GROUP (ORDER BY c3) AS c3_p95 FROM aggregate_test_100 GROUP BY 1 ORDER BY 1


Shouldn't this test pass if the third argument is 100?

datafusion/datafusion/functions-aggregate-common/src/tdigest.rs

Line 40 in e718c1a

pub const DEFAULT_MAX_SIZE: usize = 100;

100 and 200 make no difference to this test result. I changed it here just to make sure the function compiles with the new argument. I can add more tests here, though.

That makes sense. Maybe keep the original test intact and then add this test with the centroids argument as a new one?

jcsherin · 2025-08-05T08:54:48Z

@liamzwbao

Thanks, the changes LGTM.

FYI, PR #16999 will add back support for the older syntax where the function works without the WITHIN GROUP clause. Since the docs and tests will need to be updated again to reflect this, it's best to wait until that PR is merged.

jcsherin · 2025-08-05T08:58:57Z

datafusion/functions-aggregate/src/approx_percentile_cont_with_weight.rs


+/// Computes the approximate percentile continuous with weight of a set of numbers
+pub fn approx_percentile_cont_with_weight(
+    order_by: Sort,


https://docs.rs/datafusion/latest/datafusion/functions_aggregate/approx_percentile_cont_with_weight/fn.approx_percentile_cont_with_weight.html

The first argument has changed from Expr in the current API to Sort.

    pub fn approx_percentile_cont_with_weight(
        expression: Expr,
        weight: Expr,
        percentile: Expr,
    ) -> Expr

Shouldn't the order_by be an Expr?

Nevermind 👍, I see that you have made the type narrower.

datafusion/datafusion/expr/src/expr.rs

Lines 903 to 911 in c6d5520

    #[derive(Clone, PartialEq, Eq, PartialOrd, Hash, Debug)]
    pub struct Sort {
        /// The expression to sort on
        pub expr: Expr,
        /// The direction of the sort
        pub asc: bool,
        /// Whether to put Nulls before all other data values
        pub nulls_first: bool,
    }
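As a rough illustration of what the narrower type buys, the sketch below uses toy stand-ins for `Expr` and `Sort`. `ToyExpr`, `ToySort`, and `order_by_clause` are hypothetical names, not DataFusion APIs; only the three field names are taken from the struct quoted above.

```rust
/// Hypothetical stand-in for an expression; only column references are modeled.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyExpr {
    Column(String),
}

/// Mirrors the three fields of the `Sort` struct quoted above.
#[derive(Debug, Clone, PartialEq)]
pub struct ToySort {
    pub expr: ToyExpr,
    pub asc: bool,
    pub nulls_first: bool,
}

impl ToyExpr {
    /// Wrap an expression with an explicit direction and null placement.
    pub fn sort(self, asc: bool, nulls_first: bool) -> ToySort {
        ToySort { expr: self, asc, nulls_first }
    }
}

/// Render the ordering carried by a `ToySort`. A function that accepts
/// `ToySort` rather than `ToyExpr` cannot be called without this information.
pub fn order_by_clause(sort: &ToySort) -> String {
    let name = match &sort.expr {
        ToyExpr::Column(name) => name.as_str(),
    };
    let direction = if sort.asc { "ASC" } else { "DESC" };
    let nulls = if sort.nulls_first { "NULLS FIRST" } else { "NULLS LAST" };
    format!("{} {} {}", name, direction, nulls)
}
```

A caller that previously passed a bare expression would now state the direction and null placement explicitly, for example `ToyExpr::Column("c3".to_string()).sort(true, false)`.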

jcsherin · 2025-08-05T09:07:01Z

Merging from main should fix the CI error.

crepererum

thank you!

crepererum · 2025-08-06T10:00:09Z

> FYI, PR #16999 will add back support for the older syntax where the function works without the WITHIN GROUP clause. Since the docs and tests will need to be updated again to reflect this, it's best to wait until that PR is merged.

I'm merging this PR here since it is ready. The other one needs another round and doesn't touch the docs yet.

…pache#17003) * Support centroids config for `approx_percentile_cont_with_weight` * Match two functions' signature * Update docs * Address comments and unify centroids config

liamzwbao added 2 commits July 31, 2025 22:11

Support centroids config for approx_percentile_cont_with_weight

7d0962c

Match two functions' signature

44a4fd9

github-actions bot added the documentation, sqllogictest, proto, and functions labels Aug 1, 2025

liamzwbao force-pushed the issue-16990-percentile-cont branch from 2b2c70a to 141c98f Compare August 1, 2025 02:17

liamzwbao marked this pull request as ready for review August 1, 2025 02:17

liamzwbao force-pushed the issue-16990-percentile-cont branch from 141c98f to 7370b2a Compare August 1, 2025 02:18

Update docs

0cd3d8e

liamzwbao force-pushed the issue-16990-percentile-cont branch from 7370b2a to 0cd3d8e Compare August 1, 2025 02:50

crepererum reviewed Aug 1, 2025


jcsherin reviewed Aug 1, 2025


Address comments and unify centroids config

b5b6386

liamzwbao force-pushed the issue-16990-percentile-cont branch from 0cbb97f to b5b6386 Compare August 4, 2025 23:03

github-actions bot removed the sql, logical-expr, physical-expr, and core labels Aug 4, 2025

github-actions bot removed the substrait, common, execution, datasource, physical-plan, and spark labels Aug 4, 2025

Merge branch 'main' into issue-16990-percentile-cont

ef738ae

liamzwbao requested review from crepererum and jcsherin August 4, 2025 23:11

jcsherin reviewed Aug 5, 2025


Merge branch 'main' into issue-16990-percentile-cont

a1691ff

crepererum approved these changes Aug 6, 2025


crepererum merged commit 183ff66 into apache:main Aug 6, 2025
28 checks passed

liamzwbao deleted the issue-16990-percentile-cont branch August 6, 2025 22:29

crepererum mentioned this pull request Aug 26, 2025

Patched DF 49.0.1 (take 1) influxdata/arrow-datafusion#72

Closed

crepererum mentioned this pull request Sep 5, 2025

Patched DF 49.0.2 (take 1) influxdata/arrow-datafusion#73

Closed
