-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-28280][SQL][PYTHON][TESTS] Convert and port 'group-by.sql' into UDF test base #25098
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment has been minimized.
This comment has been minimized.
|
@HyukjinKwon |
|
if I set: it validates against the golden file's first query, which means when golden file is not generated test suite picks up "--set" configs and results change... So there is a bug here about expected behavior. |
|
Ok I found out what is wrong. The current golden file is the output of the pandas udf testcases which come at the end. Each type of udf overwrites the file: diff --git a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
index 97c831aec4..58ed37fd56 100644
--- a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
@@ -37,7 +37,7 @@ struct<udf(a):string,count(udf(b)):bigint>
1 2
2 2
3 3
-nan 2
+null 2
-- !query 4
@@ -101,7 +101,7 @@ struct<udf((a + b)):string,udf(count(b)):string>
3 2
4 2
5 1
-nan 1
+null 1
-- !query 11
@@ -121,7 +121,7 @@ struct<udf(((a + 1) + 1)):string,udf(count(b)):string>
3 2
4 2
5 2
-nan 1
+null 1
-- !query 13
@@ -130,7 +130,7 @@ FROM testData
-- !query 13 schema
struct<skewness(CAST(udf(a) AS DOUBLE)):double,udf(kurtosis(cast(a as double))):string,udf(min(a)):string,max(udf(a)):string,udf(avg(cast(udf(a) as double))):string,udf(var_samp(cast(a as double))):string,stddev_samp(CAST(udf(a) AS DOUBLE)):double,udf(sum(cast(a as bigint))):string,udf(count(a)):string>
-- !query 13 output
--0.2723801058145729 -1.5069204152249134 1 nan 2.142857142857143 0.8095238095238094 0.8997354108424372 15 7
+-0.2723801058145729 -1.5069204152249134 1 null 2.142857142857143 0.8095238095238094 0.8997354108424372 15 7
-- !query 14
@@ -295,7 +295,7 @@ SELECT udf(every(v)), udf(some(v)), any(v) FROM test_agg WHERE 1 = 0
-- !query 31 schema
struct<udf(every(v)):string,udf(some(v)):string,any(v):boolean>
-- !query 31 output
-None None NULL
+null null NULL
-- !query 32
@@ -303,7 +303,7 @@ SELECT udf(every(v)), some(v), any(v) FROM test_agg WHERE k = 4
-- !query 32 schema
struct<udf(every(v)):string,some(v):boolean,any(v):boolean>
-- !query 32 output
-None NULL NULL
+null NULL NULL
-- !query 33
@@ -311,7 +311,7 @@ SELECT every(v), udf(some(v)), any(v) FROM test_agg WHERE k = 5
-- !query 33 schema
struct<every(v):boolean,udf(some(v)):string,any(v):boolean>
-- !query 33 output
-false True true
+false true true
-- !query 34
@@ -319,11 +319,11 @@ SELECT k, every(v), udf(some(v)), any(v) FROM test_agg GROUP BY k
-- !query 34 schema
struct<k:int,every(v):boolean,udf(some(v)):string,any(v):boolean>
-- !query 34 output
-1 false True true
-2 true True true
-3 false False false
-4 NULL None NULL
-5 false True true
+1 false true true
+2 true true true
+3 false false false
+4 NULL null NULL
+5 false true true
-- !query 35
@@ -356,7 +356,7 @@ GROUP BY k
-- !query 37 schema
struct<k:int,every:string>
-- !query 37 output
-2 True
+2 true
-- !query 38
@@ -432,16 +432,16 @@ SELECT k, udf(udf(v)), some(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg
-- !query 44 schema
struct<k:int,udf(udf(v)):string,some(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean>
-- !query 44 output
-1 False false
-1 True true
-2 True true
-3 False false
-3 None NULL
-4 None NULL
-4 None NULL
-5 False false
-5 None NULL
-5 True true
+1 false false
+1 true true
+2 true true
+3 false false
+3 null NULL
+4 null NULL
+4 null NULL
+5 false false
+5 null NULL
+5 true true
-- !query 45
@@ -474,9 +474,9 @@ SELECT k, udf(max(udf(v))) FROM test_agg GROUP BY k HAVING max(v) = true
-- !query 47 schema
struct<k:int,udf(max(udf(v))):string>
-- !query 47 output
-1 True
-2 True
-5 True
+1 true
+2 true
+5 true
-- !query 48and here is the diff with python udf: diff --git a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
index 97c831aec4..487fbc86f7 100644
--- a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
@@ -37,7 +37,7 @@ struct<udf(a):string,count(udf(b)):bigint>
1 2
2 2
3 3
-nan 2
+None 2
-- !query 4
@@ -101,7 +101,7 @@ struct<udf((a + b)):string,udf(count(b)):string>
3 2
4 2
5 1
-nan 1
+None 1
-- !query 11
@@ -121,7 +121,7 @@ struct<udf(((a + 1) + 1)):string,udf(count(b)):string>
3 2
4 2
5 2
-nan 1
+None 1
-- !query 13
@@ -130,7 +130,7 @@ FROM testData
-- !query 13 schema
struct<skewness(CAST(udf(a) AS DOUBLE)):double,udf(kurtosis(cast(a as double))):string,udf(min(a)):string,max(udf(a)):string,udf(avg(cast(udf(a) as double))):string,udf(var_samp(cast(a as double))):string,stddev_samp(CAST(udf(a) AS DOUBLE)):double,udf(sum(cast(a as bigint))):string,udf(count(a)):string>
-- !query 13 output
--0.2723801058145729 -1.5069204152249134 1 nan 2.142857142857143 0.8095238095238094 0.8997354108424372 15 7
+-0.2723801058145729 -1.5069204152249134 1 None 2.142857142857143 0.8095238095238094 0.8997354108424372 15 7
-- !query 14
|
|
@HyukjinKwon why udf test suite assumes str methods for python and scala for example should return the same value? Should I just make scala udf toString logic compatible and avoid applying the udf to columns with nulls? |
|
Yes, there is the difference about that. I noted it here - #25069 (comment) Making it compatible might be one solution because we're not testing Scala's Another way is that we explicitly wrap it with |
|
As a workaround, maybe we can also wrap the |
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this tricky one and keeping focused on each diff.
sql/core/src/test/resources/sql-tests/inputs/udf/udf-group-by.sql
Outdated
Show resolved
Hide resolved
sql/core/src/test/resources/sql-tests/inputs/udf/udf-group-by.sql
Outdated
Show resolved
Hide resolved
sql/core/src/test/resources/sql-tests/inputs/udf/udf-group-by.sql
Outdated
Show resolved
Hide resolved
|
@HyukjinKwon ready for review. I used CASTs that work at the end of the day for all udfs. |
This comment has been minimized.
This comment has been minimized.
|
Yea .. I actually tried to avoid those workarounds at all .. sorry for forth and back. I made this PR #25130 |
|
@HyukjinKwon when the PR is merged I will refactor this one. |
sql/core/src/test/resources/sql-tests/inputs/udf/udf-group-by.sql
Outdated
Show resolved
Hide resolved
sql/core/src/test/resources/sql-tests/inputs/udf/udf-group-by.sql
Outdated
Show resolved
Hide resolved
sql/core/src/test/resources/sql-tests/inputs/udf/udf-group-by.sql
Outdated
Show resolved
Hide resolved
|
Looks fine in general but let's focus on testing GROUP BY clause with UDFs. |
This comment has been minimized.
This comment has been minimized.
|
@HyukjinKwon sure im on it. |
|
jenkins test this please |
sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
Outdated
Show resolved
Hide resolved
|
@skonto, do you know why we have such diff? -- !query 2
-SELECT COUNT(udf(a)), udf(COUNT(b)) FROM testData
+SELECT COUNT(a), COUNT(b) FROM testData
-- !query 2 schema
-struct<count(udf(a)):bigint,udf(count(b)):string>
+struct<count(a):bigint,count(b):bigint>
-- !query 2 output
-9 7
+7 7 -- !query 3
-SELECT CAST(udf(a) as int), COUNT(udf(b)) FROM testData GROUP BY a
+SELECT a, COUNT(b) FROM testData GROUP BY a
-- !query 3 schema
-struct<CAST(udf(a) AS INT):int,count(udf(b)):bigint>
+struct<a:int,count(b):bigint>
-- !query 3 output
1 2
2 2
-3 3
-NULL 2
+3 2
+NULL 1 -- !query 5
-SELECT COUNT(udf(a)), COUNT(udf(b)) FROM testData GROUP BY udf(a)
+SELECT COUNT(a), COUNT(b) FROM testData GROUP BY a
-- !query 5 schema
-struct<count(udf(a)):bigint,count(udf(b)):bigint>
+struct<count(a):bigint,count(b):bigint>
-- !query 5 output
+0 1
2 2
2 2
-2 2
-3 3
+3 2 -- !query 6
-SELECT 'foo', COUNT(udf(a)) FROM testData GROUP BY 1
+SELECT 'foo', COUNT(a) FROM testData GROUP BY 1
-- !query 6 schema
-struct<foo:string,count(udf(a)):bigint>
+struct<foo:string,count(a):bigint>
-- !query 6 output
-foo 9
+foo 7 -- !query 13
-SELECT SKEWNESS(udf(a)), udf(KURTOSIS(a)), udf(MIN(a)), CAST(MAX(udf(a)) as int), udf(AVG(udf(a))), udf(VARIANCE(a)), STDDEV(udf(a)), udf(SUM(a)), udf(COUNT(a))
+SELECT SKEWNESS(a), KURTOSIS(a), MIN(a), MAX(a), AVG(a), VARIANCE(a), STDDEV(a), SUM(a), COUNT(a)
FROM testData
-- !query 13 schema
-struct<skewness(CAST(udf(a) AS DOUBLE)):double,udf(kurtosis(cast(a as double))):string,udf(min(a)):string,CAST(max(udf(a)) AS INT):int,udf(avg(cast(udf(a) as double))):string,udf(var_samp(cast(a as double))):string,stddev_samp(CAST(udf(a) AS DOUBLE)):double,udf(sum(cast(a as bigint))):string,udf(count(a)):string>
+struct<skewness(CAST(a AS DOUBLE)):double,kurtosis(CAST(a AS DOUBLE)):double,min(a):int,max(a):int,avg(a):double,var_samp(CAST(a AS DOUBLE)):double,stddev_samp(CAST(a AS DOUBLE)):double,sum(a):bigint,count(a):bigint>
-- !query 13 output
--0.2723801058145729 -1.5069204152249134 1 NULL 2.142857142857143 0.8095238095238094 0.8997354108424372 15 7
+-0.2723801058145729 -1.5069204152249134 1 3 2.142857142857143 0.8095238095238094 0.8997354108424372 15 7
-- !query 15
-SELECT a AS k, COUNT(udf(b)) FROM testData GROUP BY k
+SELECT a AS k, COUNT(b) FROM testData GROUP BY k
-- !query 15 schema
-struct<k:int,count(udf(b)):bigint>
+struct<k:int,count(b):bigint>
-- !query 15 output
1 2
2 2
-3 3
-NULL 2
+3 2
+NULL 1 |
|
@HyukjinKwon I will update the diff, because this is before my update... however what I noticed before your PR is that udf created strings changed the count results,eg NULL strings will be counted as separate results and will not be skipped. |
05c1b9e to
6a5da53
Compare
|
@HyukjinKwon I updated the diff the issue I see is: which is expected when you add udf at the |
|
Test build #107845 has finished for PR 25098 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following original group-by.sql, I think this should be SELECT udf(a + 1) + 1, udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it could be, both should work though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this query is grouping by udf(a + 1), udf(a + 1 + 1) is an expression the analyzer will complain about. (an AnalysisException)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes you are right.. its based on group value. Will change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya @HyukjinKwon If I do that scala udf will generate:
SELECT udf(a + 1) + 1, udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)
-- !query 12 schema
-struct<>
+struct<(CAST(udf(cast((a + 1) as string)) AS INT) + 1):int,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
-- !query 12 output
-org.apache.spark.sql.AnalysisException
-expression 'testdata.`a`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
+3 2
+4 2
+5 2
+NULL 1
but then python and pandas udfs will fail with that exception and will over-write the generated file. So then next time I run without SPARK_GENERATE_GOLDEN_FILES=1 scala tests will fail. This does not look good from a first glance. I will try test it outside tests in spark shell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's one example when I met such case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool I will create the jira thanks and will update the comment pointing to the jira.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon Jira, also added a pointer to it in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Submitted #25215 to fix it.
|
Test build #107846 has finished for PR 25098 at commit
|
|
Test build #107854 has finished for PR 25098 at commit
|
aedd4ec to
67c381e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't we want to comment out this query before the issue is fixed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, let's change it to
SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1);as @viirya pointed out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the problem is that git diff --no-index isn't pretty,
- run without commenting this
- save the diff by
git diff --no-index - manually remove the diff related to this query
- update PR description
- rerun the test after commenting this query back
- push it to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I will do the above.
|
Test build #107902 has finished for PR 25098 at commit
|
|
Test build #107904 has finished for PR 25098 at commit
|
|
@skonto, can you address #25098 (comment) comment? then I will get this in. |
|
@HyukjinKwon done. |
|
Test build #108006 has finished for PR 25098 at commit
|
|
Merged to master. |
What changes were proposed in this pull request?
This PR adds some tests converted from
group-by.sqlto test UDFs. Please see contribution guide of this umbrella ticket - SPARK-27921.Diff comparing to 'group-by.sql'
How was this patch tested?
Tested as guided in SPARK-27921.
Verified pandas & pyarrow versions:
From the sql output it seems that sql statements are evaluated correctly given that udf returns a string and may change results as Null will be returned as None and will be counted in returned values.