[SPARK-55038][SQL] Fix wrong results for array_agg(DISTINCT) with AQE… #54021
What changes were proposed in this pull request?
Sort the output array in CollectList.eval() and CollectSet.eval() using
PhysicalDataType.ordering(child.dataType) to ensure deterministic element ordering. This
follows the same pattern used by CollectTopK.eval().
The change affects two methods in collect.scala: CollectList.eval() and CollectSet.eval().
Golden SQL test result files (group-by.sql.out, scalar-subquery-select.sql.out) are
updated to reflect the now-sorted output.
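The effect of sorting in eval() can be illustrated with a small language-neutral sketch. This is not the Spark implementation: `collect_list_eval`, `partial_buffers`, and `sort_output` are hypothetical names, and Python's built-in `sorted` stands in for PhysicalDataType.ordering(child.dataType). It assumes the final array is built by concatenating partial buffers whose merge order is not guaranteed.

```python
from itertools import permutations


def collect_list_eval(partial_buffers, sort_output):
    """Concatenate partial buffers and return the final array.

    Hypothetical stand-in for an eval() step; sorted() plays the role
    of the data type's ordering.
    """
    merged = [x for buf in partial_buffers for x in buf]
    return sorted(merged) if sort_output else merged


# Two partial buffers; the order in which they are merged may vary.
buffers = [[1, 2], [1]]

# Without sorting, each merge order yields a different array.
unsorted_results = {
    tuple(collect_list_eval(p, sort_output=False)) for p in permutations(buffers)
}

# With sorting, every merge order yields the same array.
sorted_results = {
    tuple(collect_list_eval(p, sort_output=True)) for p in permutations(buffers)
}

print(len(unsorted_results))  # 2 distinct orderings
print(sorted_results)         # {(1, 1, 2)}
```

Sorting at the very end keeps the buffering logic untouched and only normalizes the value that callers observe.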
Why are the changes needed?
When AQE is enabled with spark.sql.objectHashAggregate.sortBased.fallbackThreshold=1,
array_agg(DISTINCT) produces wrong results in correlated subqueries. The root cause:
distinct columns to grouping keys in AggUtils.planAggregateWithOneDistinct.
buffers, causing non-deterministic element order across runs.
compare array results for equality.
Arrays containing the same elements in different orders are treated as unequal, causing
the join to produce zero matches instead of the correct result.
Both collect_list and collect_set explicitly document that their output order is
non-deterministic, so sorting does not violate their contract.
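Why element order matters for the join can be shown with a minimal sketch. It assumes only what is stated above, namely that array equality is order-sensitive; `equi_join`, `normalize`, and the sample rows are hypothetical and do not reflect Spark's join operators.

```python
def equi_join(left, right):
    """Join rows whose 'arr' values are equal (order-sensitive comparison)."""
    return [(l, r) for l in left for r in right if l["arr"] == r["arr"]]


def normalize(rows):
    """Return copies of the rows with 'arr' sorted, as a sorted eval() would."""
    return [{**row, "arr": sorted(row["arr"])} for row in rows]


# Both sides aggregated the same elements, but in different orders.
outer = [{"id": 1, "arr": [1, 2, 3]}]
inner = [{"id": 1, "arr": [3, 1, 2]}]

no_match = equi_join(outer, inner)
matches = equi_join(normalize(outer), normalize(inner))

print(len(no_match))  # 0: same elements, different order, no match
print(len(matches))   # 1: sorted arrays compare equal
```

Once both sides produce sorted arrays, the equality comparison no longer depends on how the aggregation buffers were filled.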
Does this PR introduce any user-facing change?
Yes. collect_list and collect_set now return elements in sorted order instead of
insertion order. Both functions already document that their output order is
non-deterministic and should not be relied upon, so this is a bug fix rather than a
behavior change in terms of the documented API contract.
Before: SELECT collect_list(col) FROM VALUES (1), (2), (1) AS tab(col) → [1,2,1]
After: SELECT collect_list(col) FROM VALUES (1), (2), (1) AS tab(col) → [1,1,2]
How was this patch tested?
consistent equality results with AQE enabled and sortBased.fallbackThreshold=1.
updated expected output.
Was this patch authored or co-authored using generative AI tooling?
No