
Conversation

@drexler-sky
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Adds support for array_distinct.

What changes are included in this PR?

How are these changes tested?

New test case.

@codecov-commenter

codecov-commenter commented Jun 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.85%. Comparing base (f09f8af) to head (d8c0b1e).
Report is 285 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1923      +/-   ##
============================================
+ Coverage     56.12%   58.85%   +2.72%     
- Complexity      976     1141     +165     
============================================
  Files           119      130      +11     
  Lines         11743    12858    +1115     
  Branches       2251     2393     +142     
============================================
+ Hits           6591     7567     +976     
- Misses         4012     4073      +61     
- Partials       1140     1218      +78     

☔ View full report in Codecov by Sentry.

@andygrove
Member

The CI test failure is unrelated to the changes in this PR and is now fixed in the main branch.


test("array_distinct") {
Contributor

Can we add a case with nulls (more than one) in the array?
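For illustration, here is a sketch of the kind of test being requested, in the style of the other tests in this PR (the t1 table, _4 column, and checkSparkAnswerAndOperator helper are assumptions taken from snippets elsewhere in this thread):

      // Hypothetical case: duplicates plus more than one null in the array
      checkSparkAnswerAndOperator(spark.sql(
        "SELECT array_distinct(array(_4, null, _4, null, _4)) FROM t1"))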

Contributor Author

Thanks for the comment. While testing with nulls, I found that DataFusion's array_distinct doesn't behave the same as Spark's: DataFusion first sorts and then removes duplicates, while Spark preserves the original order. I therefore changed the code to implement IncompatExpr.
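To make the difference concrete, a minimal sketch (hypothetical input values; the outputs follow the behavior described above):

      // Spark's array_distinct preserves first-occurrence order:
      //   array_distinct(array(3, 1, 2, 1, 3)) => [3, 1, 2]
      // DataFusion's array_distinct sorts before deduplicating, so the
      // same input would come back as [1, 2, 3] instead.
      spark.sql("SELECT array_distinct(array(3, 1, 2, 1, 3))").show()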

Member

Could you add array_distinct to the list of supported array expressions in docs/source/user-guide/expressions.md and add a note about the compatibility issue?

Contributor Author

Done.

@drexler-sky
Contributor Author

@andygrove @parthchandra @comphead Could you please take another look? The CI failure doesn't seem to be related to this PR.

Contributor
@comphead left a comment

Thanks @drexler-sky for the nice contribution.
I think it is LGTM; one more thing: please add tests with an empty array.

DataFusion and Spark sometimes treat array functions with an empty array differently.

@drexler-sky
Contributor Author

> ...1 more thing please add tests with empty array.

I tested array_distinct with an empty array.

SELECT array_distinct(array()) FROM t1;

== Optimized Logical Plan ==
Project [[] AS array_distinct(array())#240]
+- Relation [_1#121,_2#122,_3#123,_4#124,_5#125L,_6#126,_7#127,_8#128,_9#129,_10#130,_11#131L,_12#132,_13#133,_14#134,_15#135,_16#136,_17#137,_18#138,_19#139,_20#140,_21#141,_id#142] parquet

== Physical Plan ==
*(1) Project [[] AS array_distinct(array())#240]
+- *(1) CometColumnarToRow
   +- CometScan parquet [] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/2r/znvj4hhd3t1cp22pmw4m3h_40000gn/T/spark-f9..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

Spark uses an alias, [] AS array_distinct(array()), so the expression never reaches case _: ArrayDistinct => convert(CometArrayDistinct, ...)

@andygrove
Member

> I tested array_distinct with an empty array. [...] Spark uses an alias, [] AS array_distinct(array()), so it doesn't reach case _: ArrayDistinct => convert(CometArrayDistinct, ...)

In this case, Spark is replacing the array_distinct expression with a literal at planning time. To test with an empty array, you would need to force the evaluation to happen at query execution time. You can do this using a CASE WHEN expression, similar to other tests in this PR.

@drexler-sky
Contributor Author

@andygrove Thanks for the suggestion! I tried the following:

      checkSparkAnswerAndOperator(spark.sql("""
        SELECT array_distinct(
          CASE WHEN _2 = _3
            THEN array(_4)
            ELSE array()
          END
        )
        FROM t1
      """))

However, Spark still constant-folds the array_distinct(array()) in the ELSE branch down to [].

== Physical Plan ==
*(1) Project [CASE WHEN (cast(_2#1 as smallint) = _3#2) THEN array_distinct(array(_4#3)) ELSE [] END AS array_distinct(CASE WHEN (_2 = _3) THEN array(_4) ELSE array() END)#44]
+- *(1) CometColumnarToRow
   +- CometScan parquet [_2#1,_3#2,_4#3] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/2r/znvj4hhd3t1cp22pmw4m3h_40000gn/T/spark-39..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_2:tinyint,_3:smallint,_4:int>
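
One hypothetical way around this (not part of the PR) would be to derive the empty array from a column so that ConstantFolding cannot reduce it to a literal; filter with an always-false lambda is standard Spark SQL, though whether Comet supports it natively is an assumption:

      // Hypothetical sketch: filter(array(_4), x -> false) yields an empty
      // array that still depends on column _4, so it is not constant-folded
      // at planning time.
      checkSparkAnswerAndOperator(spark.sql("""
        SELECT array_distinct(
          CASE WHEN _2 = _3
            THEN array(_4)
            ELSE filter(array(_4), x -> false)
          END
        )
        FROM t1
      """))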

Contributor
@parthchandra left a comment

LGTM, pending CI.

@comphead
Contributor

@drexler-sky what if

    "spark.sql.optimizer.excludedRules" -> "org.apache.spark.sql.catalyst.optimizer.ConstantFolding",

?
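
For reference, a sketch of how that config is usually applied in a test (withSQLConf comes from Spark's SQLTestUtils; its availability in this suite is an assumption):

      // Hypothetical usage: disable ConstantFolding for just this query
      withSQLConf(
        "spark.sql.optimizer.excludedRules" ->
          "org.apache.spark.sql.catalyst.optimizer.ConstantFolding") {
        checkSparkAnswerAndOperator(
          spark.sql("SELECT array_distinct(array()) FROM t1"))
      }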

@drexler-sky
Contributor Author

"spark.sql.optimizer.excludedRules" -> "org.apache.spark.sql.catalyst.optimizer.ConstantFolding"

I tried this, but it didn't work for me.

@drexler-sky
Contributor Author

I stepped into the code. The reason Comet falls back to Spark for the literal [] is that it goes to https://github.com/apache/datafusion-comet/blob/main/spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala#L865. Maybe we can log a separate issue to address the supported DataType problem for complex types.
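
For context, a simplified and entirely hypothetical sketch of the kind of check being described (the real logic lives in QueryPlanSerde.scala; the name and shape here are assumptions):

      import org.apache.spark.sql.types._

      // Hypothetical simplification: literals of complex types are not yet
      // supported for serialization, so a literal [] (an ArrayType value)
      // causes Comet to fall back to Spark.
      def isSupportedLiteralType(dt: DataType): Boolean = dt match {
        case _: ArrayType | _: MapType | _: StructType => false
        case _ => true
      }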

@andygrove
Member

andygrove commented Jun 26, 2025

> I stepped into the code. [...] Maybe we can log a separate issue to address the supported DataType problem for complex types.

That's a good find (cc @comphead)

edit: We do have issue #1929 for tracking this

@andygrove merged commit 235b69d into apache:main Jun 26, 2025
96 checks passed
@andygrove
Member

Thanks @drexler-sky @comphead @parthchandra

coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025