feat: supports array_distinct #1923
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #1923 +/- ##
============================================
+ Coverage 56.12% 58.85% +2.72%
- Complexity 976 1141 +165
============================================
Files 119 130 +11
Lines 11743 12858 +1115
Branches 2251 2393 +142
============================================
+ Hits 6591 7567 +976
- Misses 4012 4073 +61
- Partials 1140 1218 +78
The CI test failure is unrelated to the changes in this PR and is now fixed on the main branch.
test("array_distinct") {
Can we add a case with more than one null in the array?
Thanks for the comment. While testing nulls, I found that DataFusion's array_distinct doesn't behave the same as Spark's array_distinct: DataFusion sorts first and then removes duplicates, while Spark preserves the original order. I therefore changed the code to implement IncompatExpr.
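To illustrate the difference, here is a minimal sketch in plain Scala. It only models the two behaviors described above and is not the actual Spark or DataFusion code. `ArrayDistinctSketch`, `preserveOrder`, and `sortThenDedup` are hypothetical names, and sorting nulls (modeled as `None`) first is an assumption made only for this sketch.

```scala
// Hypothetical helpers that model the two behaviors described in the comment.
// Neither is the actual Spark or DataFusion implementation.
object ArrayDistinctSketch {
  // Spark-like behavior: keep the first occurrence of each element, in input order.
  def preserveOrder(xs: Seq[Option[Int]]): Seq[Option[Int]] =
    xs.distinct

  // DataFusion-like behavior as described above: sort, then drop duplicates.
  // Assumption for this sketch: None (null) sorts before any Some value.
  def sortThenDedup(xs: Seq[Option[Int]]): Seq[Option[Int]] =
    xs.sorted.distinct

  def main(args: Array[String]): Unit = {
    // An array with duplicates and more than one null.
    val input: Seq[Option[Int]] = Seq(Some(3), None, Some(1), Some(3), None, Some(2))

    println(preserveOrder(input)) // elements in first-seen order: 3, null, 1, 2
    println(sortThenDedup(input)) // elements in sorted order: null, 1, 2, 3

    // An empty array stays empty under both strategies.
    println(preserveOrder(Seq.empty))
    println(sortThenDedup(Seq.empty))
  }
}
```

Both results contain the same set of elements, but in a different order, so a strict comparison of the output arrays can fail even though neither side has lost or duplicated a value.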
Could you add array_distinct to the list of supported array expressions in docs/source/user-guide/expressions.md and add a note about the compatibility issue?
Done.
6f2c4be to
69d169d
@andygrove @parthchandra @comphead Could you please take another look? The CI failure doesn't seem to be related to this PR.
comphead left a comment
Thanks @drexler-sky for the nice contribution.
I think it LGTM. One more thing: please add tests with an empty array.
DataFusion and Spark sometimes treat array functions differently when the array is empty.
I tested array_distinct with an empty array. Spark uses an alias,
In this case, Spark is replacing the
@andygrove Thanks for the suggestion! I have tried However, Spark still appears to replace the second array_distinct with
94e577a to
d8c0b1e
parthchandra left a comment
lgtm. pending ci
@drexler-sky what if ?
I tried this, but it didn't work for me.
I stepped through the code. Comet falls back to Spark for the literal [] because execution reaches https://github.com/apache/datafusion-comet/blob/main/spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala#L865. Maybe we can file a separate issue to address the supported DataType problem for complex types.
That's a good find (cc @comphead). Edit: we do have issue #1929 for tracking this.
Thanks @drexler-sky @comphead @parthchandra
Which issue does this PR close?
Closes #.
Rationale for this change
Adds support for array_distinct.
What changes are included in this PR?
How are these changes tested?
New test case.