fix: Support varying quantile values per group in group_by aggregation#25606
fix: Support varying quantile values per group in group_by aggregation#25606wtn wants to merge 3 commits intopola-rs:mainfrom
Conversation
1269250 to
ea1dbcd
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #25606 +/- ##
========================================
Coverage 81.43% 81.43%
========================================
Files 1801 1801
Lines 246750 246930 +180
Branches 3081 3081
========================================
+ Hits 200936 201084 +148
- Misses 45028 45060 +32
Partials 786 786 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
652fb47 to
2fffe21
Compare
|
@orlp @alexander-beedie @ritchie46 @reswqa @c-peters @MarcoGorelli can we get the review here, we have alot of duplicate issues arising for this one. thanks |
8251863 to
9b2f5dc
Compare
| return None; | ||
| } | ||
| let take = { ca.take_unchecked(idx) }; | ||
| take._quantile(quantile, method).unwrap_unchecked() |
There was a problem hiding this comment.
I know this was in the original code, but can you make this just a normal unwrap()?
There was a problem hiding this comment.
Changed unwrap_unchecked() → unwrap().
| let ca = ca.rechunk(); | ||
| agg_helper_idx_on_all_with_idx::<K, _>(groups, |(group_idx, idx)| { | ||
| debug_assert!(idx.len() <= ca.len()); | ||
| if idx.is_empty() { |
There was a problem hiding this comment.
Why not match len here for a single-element fast path like below?
There was a problem hiding this comment.
Added single-element fast path with match idx.len().
| assert result["group"].to_list() == ["a", "b"] | ||
|
|
||
|
|
||
| def test_quantile_varying_by_group_float32() -> None: |
There was a problem hiding this comment.
I feel like these are way too many individual tests. You can search around for pytest.mark.parametrize to see how to parametrize the tests. You can verify against a naive Python implementation which groups into lists then manually calls quantile on each list with a specific quantile.
There was a problem hiding this comment.
Consolidated into test_group_by_varying_quantile, parametrized over dtype and method, with numpy reference verification.
0c543cc to
cf61dd8
Compare
|
OK, I've pushed my changes. 🏓 |
eb2a7b3 to
5f80e32
Compare
da3b78e to
e8a3b81
Compare
3df53e3 to
2027e97
Compare
65d114c to
32cd2d3
Compare
c40ea73 to
fc86da2
Compare
Co-authored-by: Claude <noreply@anthropic.com>
Fixes #20951 and its duplicate #25888.
When using
group_by()with a quantile parameter that varies per group (e.g.,pl.col.quantile.first()), all groups incorrectly received the same quantile value instead of each group using its own.Reproduction
Cause
AggQuantileExpr::evaluate_on_groups()always calledget_quantile()which evaluates the quantile expression against the full dataframe, returning a single scalar. This worked for literal quantile values but failed when the quantile expression varied per group (e.g.,first()aggregation).Fix
Added
agg_varying_quantilewhich accepts a slice of quantile values (one per group) and computes quantile per group using the existing aggregation helpers.polars-core changes:
agg_helper_idx_on_all_with_idxand_agg_helper_slice_with_idxhelpers that pass the group index to closuresagg_varying_quantile_genericthat iterates over groups with their corresponding quantile valuesagg_varying_quantilemethods toFloat32Chunked,Float64Chunked, integerChunkedArray,Series, andColumnpolars-expr changes:
AggQuantileExpr::evaluate_on_groups()now detects whether the quantile is uniform (literal/scalar) or varies per group, and dispatches to the appropriate path