[SPARK-28445][SQL][PYTHON] Fix error when PythonUDF is used in both group by and aggregate expression #25215

viirya · 2019-07-21T03:09:10Z

What changes were proposed in this pull request?

When a PythonUDF is used in the group by clause and the same PythonUDF also appears in the aggregate expressions, as in

SELECT pyUDF(a + 1), COUNT(b) FROM testData GROUP BY pyUDF(a + 1)

it causes an analysis exception in CheckAnalysis:

org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is neither present in the group by, nor is it an aggregate function.

First, CheckAnalysis can't check semantic equality between PythonUDFs.
Second, even if we make that possible, a runtime exception is thrown:

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF1#8615
...
Cause: java.lang.RuntimeException: Couldn't find pythonUDF1#8615 in [cast(pythonUDF0#8614 as int)#8617,count(b#8599)#8607L]

The cause is that ExtractPythonUDFs extracts the PythonUDFs in the group by and in the aggregate expressions separately, so they become two different aliases in the logical aggregate. At runtime, the resulting expression in the aggregate can't be bound to its grouping and aggregate attributes.
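To make the binding failure concrete, here is a toy Python model. It is not Spark code: `extract` and `bind` are hypothetical helpers that only mimic the naming and lookup behaviour described above, assuming each extraction of a UDF occurrence mints a fresh attribute.

```python
from itertools import count

_ids = count()


def extract(udf_expr):
    """Mimic a per-occurrence extraction: every call mints a fresh attribute name."""
    return f"pythonUDF{next(_ids)}"


# The same UDF call appears once in GROUP BY and once in the SELECT list.
udf_call = "pyUDF(a + 1)"
grouping_attr = extract(udf_call)  # attribute used by the grouping key
select_attr = extract(udf_call)    # a different attribute for the same call

# In this model the aggregate only exposes its grouping key and aggregate function.
aggregate_output = [grouping_attr, "count(b)"]


def bind(attr, available):
    """Resolve an attribute against the available ones, failing if it is absent."""
    if attr not in available:
        raise RuntimeError(f"Couldn't find {attr} in {available}")
    return available.index(attr)


try:
    bind(select_attr, aggregate_output)
    error = None
except RuntimeError as exc:
    error = str(exc)
```

In this sketch the grouping attribute binds fine, while the second alias for the same UDF call cannot be found, which has the same shape as the "Couldn't find pythonUDF1" failure quoted above.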

This patch proposes a rule, ExtractGroupingPythonUDFFromAggregate, that extracts PythonUDFs used in the group by and evaluates them before the aggregate. The group-by PythonUDF in the aggregate expressions is then replaced with the aliased result.
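The rewrite can be sketched on a toy expression tree. This is only an illustration under stated assumptions, not the Scala rule itself: expressions are nested tuples, `extract_grouping_udfs` is a hypothetical name, and the numbered `groupingPythonUDF0` attribute stands in for Spark's alias plus expression ID.

```python
def is_udf(expr):
    return isinstance(expr, tuple) and bool(expr) and expr[0] == "pyudf"


def collect_udfs(expr):
    """Collect the outermost UDF calls inside an expression."""
    if is_udf(expr):
        return [expr]
    if isinstance(expr, tuple):
        return [u for e in expr for u in collect_udfs(e)]
    return []


def replace(expr, mapping):
    """Replace every sub-expression found in `mapping`, top-down."""
    if expr in mapping:
        return mapping[expr]
    if isinstance(expr, tuple):
        return tuple(replace(e, mapping) for e in expr)
    return expr


def extract_grouping_udfs(grouping, aggregates):
    """Move UDFs used in grouping keys into a projection evaluated first."""
    mapping = {}
    projection = []
    for key in grouping:
        for udf in collect_udfs(key):
            if udf not in mapping:
                attr = ("attr", f"groupingPythonUDF{len(mapping)}")
                mapping[udf] = attr
                projection.append((udf, attr))
    new_grouping = [replace(g, mapping) for g in grouping]
    new_aggregates = [replace(a, mapping) for a in aggregates]
    return projection, new_grouping, new_aggregates


udf = ("pyudf", ("add", ("col", "a"), 1))
projection, new_grouping, new_aggregates = extract_grouping_udfs(
    grouping=[udf],
    aggregates=[udf, ("count", ("col", "b"))],
)
```

After the rewrite, both the grouping key and the matching aggregate expression refer to the same attribute, so only one evaluation of the UDF is needed below the aggregate.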

The query plan of the query SELECT pyUDF(a + 1), pyUDF(COUNT(b)) FROM testData GROUP BY pyUDF(a + 1) looks like:

== Optimized Logical Plan ==
Project [CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, cast(pythonUDF0#8616 as bigint) AS CAST(pyUDF(cast(count(b) as string)) AS BIGINT)#8610L]
+- BatchEvalPython [pyUDF(cast(agg#8613L as string))], [pythonUDF0#8616]
   +- Aggregate [cast(groupingPythonUDF#8614 as int)], [cast(groupingPythonUDF#8614 as int) AS CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, count(b#8599) AS agg#8613L]
      +- Project [pythonUDF0#8615 AS groupingPythonUDF#8614, b#8599]
         +- BatchEvalPython [pyUDF(cast((a#8598 + 1) as string))], [pythonUDF0#8615]
            +- LocalRelation [a#8598, b#8599]

== Physical Plan ==
*(3) Project [CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, cast(pythonUDF0#8616 as bigint) AS CAST(pyUDF(cast(count(b) as string)) AS BIGINT)#8610L]
+- BatchEvalPython [pyUDF(cast(agg#8613L as string))], [pythonUDF0#8616]
   +- *(2) HashAggregate(keys=[cast(groupingPythonUDF#8614 as int)#8617], functions=[count(b#8599)], output=[CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, agg#8613L])
      +- Exchange hashpartitioning(cast(groupingPythonUDF#8614 as int)#8617, 5), true
         +- *(1) HashAggregate(keys=[cast(groupingPythonUDF#8614 as int) AS cast(groupingPythonUDF#8614 as int)#8617], functions=[partial_count(b#8599)], output=[cast(groupingPythonUDF#8614 as int)#8617, count#8619L])
            +- *(1) Project [pythonUDF0#8615 AS groupingPythonUDF#8614, b#8599]
               +- BatchEvalPython [pyUDF(cast((a#8598 + 1) as string))], [pythonUDF0#8615]
                  +- LocalTableScan [a#8598, b#8599]
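The shape of the plans above can be followed by hand in plain Python. This is a sketch under assumptions: `py_udf` is a stand-in that returns its string input unchanged (the real test UDF is not shown here), the rows are made up, and `to_str`/`to_int` model the casts around the UDF calls.

```python
from collections import defaultdict


def py_udf(s):
    """Stand-in for the test UDF: returns its string input unchanged (assumption)."""
    return s


def to_str(value):
    return None if value is None else str(value)


def to_int(text):
    return None if text is None else int(text)


# Hypothetical (a, b) rows; b may be NULL.
rows = [(1, 10), (1, None), (2, 20), (2, 21), (2, 22)]

# Step 1 (BatchEvalPython + Project): evaluate the grouping UDF once per row.
projected = [(py_udf(to_str(a + 1)), b) for a, b in rows]

# Step 2 (Aggregate): group by the cast UDF result; count(b) skips NULLs.
counts = defaultdict(int)
for grouping_udf, b in projected:
    counts[to_int(grouping_udf)] += 0 if b is None else 1

# Step 3 (BatchEvalPython + Project): apply the UDF to the aggregate result.
result = sorted((key, to_int(py_udf(to_str(n)))) for key, n in counts.items())
```

The point of the sketch is the ordering: the grouping UDF runs before the aggregate, and the UDF over COUNT(b) runs after it.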

How was this patch tested?

Added tests.

viirya · 2019-07-21T07:29:27Z

* checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : 
  dims [product 24] do not match the length of object [0]

The SparkR CRAN feasibility check (SPARK-24152) fails again. Emailed CRAN for help.

viirya · 2019-07-21T09:02:59Z

retest this please

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

HyukjinKwon · 2019-07-22T07:37:26Z

retest this please

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

SparkQA · 2019-07-22T11:52:18Z

Test build #108001 has finished for PR 25215 at commit 33a5e0d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class PythonUDFSuite extends QueryTest with SharedSQLContext

SparkQA · 2019-07-24T20:40:18Z

Test build #108114 has finished for PR 25215 at commit 3b3472e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-07-27T05:59:24Z

cc @cloud-fan and @mgaido91 too

cloud-fan · 2019-07-29T06:02:12Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

+ * before aggregate.
+ * This must be executed after `ExtractPythonUDFFromAggregate` rule and before `ExtractPythonUDFs`.
+ */
+object ExtractGroupingPythonUDFFromAggregate extends Rule[LogicalPlan] {


what's the difference between this rule and ExtractPythonUDFFromAggregate?

ExtractPythonUDFFromAggregate pulls out Python UDFs that take an aggregate expression or a grouping key as input, like udf(sum(c)), and Python UDFs that have no input. The UDFs it pulls out are evaluated after the aggregate.

This rule, ExtractGroupingPythonUDFFromAggregate, pulls out Python UDFs that are used as grouping keys, as in SELECT count(*) FROM table GROUP BY udf(id). This kind of Python UDF is evaluated before the aggregate.
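The split between the two rules can be paraphrased as a small classifier. This is a rough restatement of the explanation above on toy tuples, not the actual matching logic in either rule; `placement`, `AGG_FUNCS` and the returned labels are all hypothetical.

```python
AGG_FUNCS = {"sum", "count", "min", "max", "avg"}


def contains_aggregate(expr):
    """True if the toy expression contains an aggregate function call."""
    if isinstance(expr, tuple):
        return expr[0] in AGG_FUNCS or any(contains_aggregate(e) for e in expr[1:])
    return False


def placement(udf, grouping_keys):
    """Say on which side of the aggregate a Python UDF runs in this toy model."""
    assert udf[0] == "pyudf"
    if udf in grouping_keys:
        # The case described for ExtractGroupingPythonUDFFromAggregate.
        return "before aggregate"
    args = udf[1:]
    if not args or any(contains_aggregate(a) or a in grouping_keys for a in args):
        # The case described for ExtractPythonUDFFromAggregate.
        return "after aggregate"
    return "left for other rules"


group_by_udf = ("pyudf", ("col", "id"))
udf_over_sum = ("pyudf", ("sum", ("col", "c")))
```

For example, `placement(group_by_udf, [group_by_udf])` falls in the first branch, while `placement(udf_over_sum, [("col", "k")])` falls in the second.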

cloud-fan · 2019-07-30T12:23:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala

+  override lazy val canonicalized: Expression = {
+    val canonicalizedChildren = children.map(_.canonicalized)
+    // `resultId` can be seen as cosmetic variation in PythonUDF, as it doesn't affect the result.
+    Canonicalize.execute(this.copy(resultId = ExprId(-1)).withNewChildren(canonicalizedChildren))


do we still need to run Canonicalize.execute?

this can be saved, yes.
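The idea behind the `canonicalized` override, treating `resultId` as a cosmetic difference, can be sketched with a frozen dataclass. This is a Python analogue under assumptions, not the Scala implementation: `PyUDF` and `semantic_equals` are hypothetical names, and `-1` plays the role of the placeholder ID.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PyUDF:
    name: str
    children: tuple
    result_id: int

    def canonicalized(self):
        """Canonical form: canonical children and a fixed placeholder result_id."""
        kids = tuple(
            c.canonicalized() if isinstance(c, PyUDF) else c for c in self.children
        )
        return replace(self, children=kids, result_id=-1)

    def semantic_equals(self, other):
        return self.canonicalized() == other.canonicalized()


u1 = PyUDF("pyUDF", ("a + 1",), 8614)
u2 = PyUDF("pyUDF", ("a + 1",), 8615)
u3 = PyUDF("pyUDF", ("a + 2",), 8614)
```

Here `u1` and `u2` differ as plain values because of their IDs, yet compare equal once canonicalized, which is what lets a canonicalized UDF serve as a map key.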

cloud-fan · 2019-07-30T12:30:19Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

+        // in its arguments. Such PythonUDF was pull out by ExtractPythonUDFFromAggregate, too.
+        case p: PythonUDF if p.udfDeterministic =>
+          val canonicalized = p.canonicalized.asInstanceOf[PythonUDF]
+          attributeMap.get(canonicalized).map(_.toAttribute).getOrElse(p)


We can put attributes in attributeMap, instead of Alias.

cloud-fan · 2019-07-30T12:32:20Z

sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonUDFSuite.scala

+      .agg(scalaTestUDF(base("a") + 1), scalaTestUDF(count(base("b"))))
+    val df2 = base.groupBy(pythonTestUDF(base("a") + 1))
+      .agg(pythonTestUDF(base("a") + 1), pythonTestUDF(count(base("b"))))
+    checkAnswer(df, df2)


can we create a test case for each of these checks? We can move the scalaTestUDF, pythonTestUDF and base to the class body.

cloud-fan · 2019-08-02T06:05:27Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

+              "in grouping expression")
+            val alias = Alias(p, "groupingPythonUDF")()
+            projList += alias
+            attributeMap += ((p.canonicalized.asInstanceOf[PythonUDF], alias.toAttribute))


nit: if a python udf is handled before, this will replace it with a new attribute?

ok. let's replace it. since they are deterministic, it should be fine.

cloud-fan · 2019-08-02T06:54:04Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

+        // in its arguments. Such PythonUDF was pull out by ExtractPythonUDFFromAggregate, too.
+        case p: PythonUDF if p.udfDeterministic =>
+          val canonicalized = p.canonicalized.asInstanceOf[PythonUDF]
+          attributeMap.getOrElse(canonicalized, p)


nit: if we can't replace python udf here, we can't run the query. Maybe it's better to do attributeMap.get(...).getOrElse(fail)?

If we can't replace the Python UDF here, the query can still run. For example:

val df = base.groupBy(pythonTestUDF(base("a"))) .agg(sum(pythonTestUDF(base("a") + 1)))

ExtractPythonUDFs will extract such udfs.
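The fallback behaviour discussed in this thread amounts to a map lookup with a default. A minimal sketch on toy tuples, assuming a plain dict in place of `attributeMap`; `lookup` is a hypothetical helper.

```python
def lookup(attribute_map, udf):
    """Return the pre-aggregate attribute for `udf`, or `udf` itself when unknown."""
    return attribute_map.get(udf, udf)


# Only pyudf(a) was used as a grouping key, so only it has an attribute.
attribute_map = {("pyudf", ("col", "a")): ("attr", "groupingPythonUDF")}

grouped = lookup(attribute_map, ("pyudf", ("col", "a")))
other = lookup(attribute_map, ("pyudf", ("add", ("col", "a"), 1)))
```

The UDF over `a + 1` is not in the map, so it is returned unchanged and left in the plan for a later rule to extract instead of causing a failure.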

SparkQA · 2019-08-02T07:05:02Z

Test build #108546 has finished for PR 25215 at commit c852142.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-08-02T07:05:02Z

Test build #108551 has finished for PR 25215 at commit 24c6744.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-08-02T07:09:09Z

retest this please

HyukjinKwon · 2019-08-02T09:25:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

+        //    CheckAnalysis guarantees the arguments are deterministic.
+        // 2. PythonUDF in grouping key. Grouping key must be deterministic.
+        // 3. PythonUDF not in grouping key. It is either no arguments or with grouping key
+        // in its arguments. Such PythonUDF was pull out by ExtractPythonUDFFromAggregate, too.


tiny nit: spacing.

HyukjinKwon · 2019-08-02T09:29:35Z

sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonUDFSuite.scala

+    (None, Some(1)), (Some(3), None), (None, None)).toDF("a", "b")
+
+  test("SPARK-28445: PythonUDF as grouping key and aggregate expressions") {
+    val df1 = base.groupBy(scalaTestUDF(base("a") + 1))


BTW, thanks for changing this into DSL. It was rather a nit that needs some efforts.

SparkQA · 2019-08-02T10:46:00Z

Test build #108554 has finished for PR 25215 at commit 24c6744.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-08-02T10:47:02Z

Merged to master.

HyukjinKwon · 2019-08-02T11:02:12Z

cc @skonto, @Udbhav30, @shivusondur
When you guys are available, mind making each followup PR to add udf in group-by clause in each JIRA you guys took?

SPARK-28279 https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/udf/udf-group-analytics.sql @skonto

SPARK-28280 https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/udf/udf-group-by.sql @skonto

SPARK-28391 https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-select_implicit.sql - @Udbhav30

SPARK-28390 https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-select_having.sql @shivusondur

The PR title should usually be [SPARK-XXXXX][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'xxx.sql' and the procedure to describe PR description will be the same as you guys did before as described in SPARK-27921

… by clause in 'udf-group-analytics.sql' ## What changes were proposed in this pull request? This PR is a followup of a fix as described in here: apache#25215 (comment) <details><summary>Diff comparing to 'group-analytics.sql'</summary> <p> ```diff diff --git a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out index 3439a05..de297ab 100644 --- a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out -13,9 +13,9 struct<> -- !query 1 -SELECT a + b, b, SUM(a - b) FROM testData GROUP BY a + b, b WITH CUBE +SELECT udf(a + b), b, udf(SUM(a - b)) FROM testData GROUP BY udf(a + b), b WITH CUBE -- !query 1 schema -struct<(a + b):int,b:int,sum((a - b)):bigint> +struct<CAST(udf(cast((a + b) as string)) AS INT):int,b:int,CAST(udf(cast(sum(cast((a - b) as bigint)) as string)) AS BIGINT):bigint> -- !query 1 output 2 1 0 2 NULL 0 -33,9 +33,9 NULL NULL 3 -- !query 2 -SELECT a, b, SUM(b) FROM testData GROUP BY a, b WITH CUBE +SELECT udf(a), udf(b), SUM(b) FROM testData GROUP BY udf(a), b WITH CUBE -- !query 2 schema -struct<a:int,b:int,sum(b):bigint> +struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(b as string)) AS INT):int,sum(b):bigint> -- !query 2 output 1 1 1 1 2 2 -52,9 +52,9 NULL NULL 9 -- !query 3 -SELECT a + b, b, SUM(a - b) FROM testData GROUP BY a + b, b WITH ROLLUP +SELECT udf(a + b), b, SUM(a - b) FROM testData GROUP BY a + b, b WITH ROLLUP -- !query 3 schema -struct<(a + b):int,b:int,sum((a - b)):bigint> +struct<CAST(udf(cast((a + b) as string)) AS INT):int,b:int,sum((a - b)):bigint> -- !query 3 output 2 1 0 2 NULL 0 -70,9 +70,9 NULL NULL 3 -- !query 4 -SELECT a, b, SUM(b) FROM testData GROUP BY a, b WITH ROLLUP +SELECT udf(a), b, udf(SUM(b)) FROM testData GROUP BY udf(a), b WITH ROLLUP -- !query 4 schema -struct<a:int,b:int,sum(b):bigint> 
+struct<CAST(udf(cast(a as string)) AS INT):int,b:int,CAST(udf(cast(sum(cast(b as bigint)) as string)) AS BIGINT):bigint> -- !query 4 output 1 1 1 1 2 2 -97,7 +97,7 struct<> -- !query 6 -SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year) ORDER BY course, year +SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year) ORDER BY udf(course), year -- !query 6 schema struct<course:string,year:int,sum(earnings):bigint> -- !query 6 output -111,7 +111,7 dotNET 2013 48000 -- !query 7 -SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year) ORDER BY course, year +SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year) ORDER BY course, udf(year) -- !query 7 schema struct<course:string,year:int,sum(earnings):bigint> -- !query 7 output -127,9 +127,9 dotNET 2013 48000 -- !query 8 -SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course, year) +SELECT course, udf(year), SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course, year) -- !query 8 schema -struct<course:string,year:int,sum(earnings):bigint> +struct<course:string,CAST(udf(cast(year as string)) AS INT):int,sum(earnings):bigint> -- !query 8 output Java NULL 50000 NULL 2012 35000 -138,26 +138,26 dotNET NULL 63000 -- !query 9 -SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course) +SELECT course, year, udf(SUM(earnings)) FROM courseSales GROUP BY course, year GROUPING SETS(course) -- !query 9 schema -struct<course:string,year:int,sum(earnings):bigint> +struct<course:string,year:int,CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint> -- !query 9 output Java NULL 50000 dotNET NULL 63000 -- !query 10 -SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(year) +SELECT udf(course), year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(year) -- !query 10 
schema -struct<course:string,year:int,sum(earnings):bigint> +struct<CAST(udf(cast(course as string)) AS STRING):string,year:int,sum(earnings):bigint> -- !query 10 output NULL 2012 35000 NULL 2013 78000 -- !query 11 -SELECT course, SUM(earnings) AS sum FROM courseSales -GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, sum +SELECT course, udf(SUM(earnings)) AS sum FROM courseSales +GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, udf(sum) -- !query 11 schema struct<course:string,sum:bigint> -- !query 11 output -173,7 +173,7 dotNET 63000 -- !query 12 SELECT course, SUM(earnings) AS sum, GROUPING_ID(course, earnings) FROM courseSales -GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, sum +GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY udf(course), sum -- !query 12 schema struct<course:string,sum:bigint,grouping_id(course, earnings):int> -- !query 12 output -188,10 +188,10 dotNET 63000 1 -- !query 13 -SELECT course, year, GROUPING(course), GROUPING(year), GROUPING_ID(course, year) FROM courseSales +SELECT udf(course), udf(year), GROUPING(course), GROUPING(year), GROUPING_ID(course, year) FROM courseSales GROUP BY CUBE(course, year) -- !query 13 schema -struct<course:string,year:int,grouping(course):tinyint,grouping(year):tinyint,grouping_id(course, year):int> +struct<CAST(udf(cast(course as string)) AS STRING):string,CAST(udf(cast(year as string)) AS INT):int,grouping(course):tinyint,grouping(year):tinyint,grouping_id(course, year):int> -- !query 13 output Java 2012 0 0 0 Java 2013 0 0 0 -205,7 +205,7 dotNET NULL 0 1 1 -- !query 14 -SELECT course, year, GROUPING(course) FROM courseSales GROUP BY course, year +SELECT course, udf(year), GROUPING(course) FROM courseSales GROUP BY course, udf(year) -- !query 14 schema struct<> -- !query 14 output -214,7 +214,7 grouping() can only be used with 
GroupingSets/Cube/Rollup; -- !query 15 -SELECT course, year, GROUPING_ID(course, year) FROM courseSales GROUP BY course, year +SELECT course, udf(year), GROUPING_ID(course, year) FROM courseSales GROUP BY udf(course), year -- !query 15 schema struct<> -- !query 15 output -223,7 +223,7 grouping_id() can only be used with GroupingSets/Cube/Rollup; -- !query 16 -SELECT course, year, grouping__id FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, year +SELECT course, year, grouping__id FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, udf(year) -- !query 16 schema struct<course:string,year:int,grouping__id:int> -- !query 16 output -240,7 +240,7 NULL NULL 3 -- !query 17 SELECT course, year FROM courseSales GROUP BY CUBE(course, year) -HAVING GROUPING(year) = 1 AND GROUPING_ID(course, year) > 0 ORDER BY course, year +HAVING GROUPING(year) = 1 AND GROUPING_ID(course, year) > 0 ORDER BY course, udf(year) -- !query 17 schema struct<course:string,year:int> -- !query 17 output -250,7 +250,7 dotNET NULL -- !query 18 -SELECT course, year FROM courseSales GROUP BY course, year HAVING GROUPING(course) > 0 +SELECT course, udf(year) FROM courseSales GROUP BY udf(course), year HAVING GROUPING(course) > 0 -- !query 18 schema struct<> -- !query 18 output -259,7 +259,7 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup; -- !query 19 -SELECT course, year FROM courseSales GROUP BY course, year HAVING GROUPING_ID(course) > 0 +SELECT course, udf(udf(year)) FROM courseSales GROUP BY course, year HAVING GROUPING_ID(course) > 0 -- !query 19 schema struct<> -- !query 19 output -268,9 +268,9 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup; -- !query 20 -SELECT course, year FROM courseSales GROUP BY CUBE(course, year) HAVING grouping__id > 0 +SELECT udf(course), year FROM courseSales GROUP BY CUBE(course, year) HAVING grouping__id > 0 -- !query 20 schema -struct<course:string,year:int> 
+struct<CAST(udf(cast(course as string)) AS STRING):string,year:int> -- !query 20 output Java NULL NULL 2012 -281,7 +281,7 dotNET NULL -- !query 21 SELECT course, year, GROUPING(course), GROUPING(year) FROM courseSales GROUP BY CUBE(course, year) -ORDER BY GROUPING(course), GROUPING(year), course, year +ORDER BY GROUPING(course), GROUPING(year), course, udf(year) -- !query 21 schema struct<course:string,year:int,grouping(course):tinyint,grouping(year):tinyint> -- !query 21 output -298,7 +298,7 NULL NULL 1 1 -- !query 22 SELECT course, year, GROUPING_ID(course, year) FROM courseSales GROUP BY CUBE(course, year) -ORDER BY GROUPING(course), GROUPING(year), course, year +ORDER BY GROUPING(course), GROUPING(year), course, udf(year) -- !query 22 schema struct<course:string,year:int,grouping_id(course, year):int> -- !query 22 output -314,7 +314,7 NULL NULL 3 -- !query 23 -SELECT course, year FROM courseSales GROUP BY course, year ORDER BY GROUPING(course) +SELECT course, udf(year) FROM courseSales GROUP BY course, udf(year) ORDER BY GROUPING(course) -- !query 23 schema struct<> -- !query 23 output -323,7 +323,7 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup; -- !query 24 -SELECT course, year FROM courseSales GROUP BY course, year ORDER BY GROUPING_ID(course) +SELECT course, udf(year) FROM courseSales GROUP BY course, udf(year) ORDER BY GROUPING_ID(course) -- !query 24 schema struct<> -- !query 24 output -332,7 +332,7 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup; -- !query 25 -SELECT course, year FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, year +SELECT course, year FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, udf(course), year -- !query 25 schema struct<course:string,year:int> -- !query 25 output -348,7 +348,7 NULL NULL -- !query 26 -SELECT a + b AS k1, b AS k2, SUM(a - b) FROM testData GROUP BY CUBE(k1, k2) +SELECT udf(a + b) AS k1, udf(b) AS k2, SUM(a - b) FROM 
testData GROUP BY CUBE(k1, k2) -- !query 26 schema struct<k1:int,k2:int,sum((a - b)):bigint> -- !query 26 output -368,7 +368,7 NULL NULL 3 -- !query 27 -SELECT a + b AS k, b, SUM(a - b) FROM testData GROUP BY ROLLUP(k, b) +SELECT udf(udf(a + b)) AS k, b, SUM(a - b) FROM testData GROUP BY ROLLUP(k, b) -- !query 27 schema struct<k:int,b:int,sum((a - b)):bigint> -- !query 27 output -386,9 +386,9 NULL NULL 3 -- !query 28 -SELECT a + b, b AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING SETS(k) +SELECT udf(a + b), udf(udf(b)) AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING SETS(k) -- !query 28 schema -struct<(a + b):int,k:int,sum((a - b)):bigint> +struct<CAST(udf(cast((a + b) as string)) AS INT):int,k:int,sum((a - b)):bigint> -- !query 28 output NULL 1 3 NULL 2 0 ``` </p> </details> ## How was this patch tested? Tested as instructed in SPARK-27921. Closes apache#25362 from skonto/group-analytics-followup. Authored-by: Stavros Kontopoulos <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>

shivusondur · 2019-08-06T12:51:44Z

@HyukjinKwon
I am still getting the error below after adding udf to the group-by values. I also tried it for individual values and hit the same issue.
'''
-- !query 11
SELECT udf(b), udf(c) FROM test_having
GROUP BY udf(b), udf(c) HAVING udf(count(*)) = 1 ORDER BY udf(b), udf(c)
-- !query 11 schema
struct<>
-- !query 11 output
org.apache.spark.sql.AnalysisException
cannot resolve 'b' given input columns: [CAST(udf(cast(b as string)) AS INT), CAST(udf(cast(c as string)) AS STRING)]; line 2 pos 63
'''

viirya · 2019-08-06T23:51:46Z

@shivusondur @HyukjinKwon The analysis exception from adding udf to the group by is caused by SPARK-28386 and SPARK-26741.

== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: cannot resolve '`b`' given input columns: [CAST(udf(cast(b as string)) AS INT), CAST(udf(cast(c as string)) AS STRING)]; line 2 pos 63;
'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
+- Project [CAST(udf(cast(b as string)) AS INT)#x, CAST(udf(cast(c as string)) AS STRING)#x]
   +- Filter (cast(udf(cast(count(1)#xL as string)) as bigint) = cast(1 as bigint))
      +- Aggregate [cast(udf(cast(b#x as string)) as int), cast(udf(cast(c#x as string)) as string)], [cast(udf(cast(b#x as string)) as int) AS CAST(udf(cast(b as string)) AS INT)#x, cast(udf(cast(c#x as string)) as string) AS CAST(udf(cast(c as string)) AS STRING)#x, count(1) AS count(1)#xL]
         +- SubqueryAlias `default`.`test_having`
            +- Relation[a#x,b#x,c#x,d#x] parquet

HyukjinKwon · 2019-08-07T01:00:50Z

Thanks, @viirya. @shivusondur Can you comment the test out with those JIRA numbers?

fix error when PythonUDF is used in group by.

b3293fc

viirya mentioned this pull request Jul 22, 2019

[SPARK-28280][SQL][PYTHON][TESTS] Convert and port 'group-by.sql' into UDF test base #25098

Closed

HyukjinKwon reviewed Jul 22, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala

HyukjinKwon reviewed Jul 22, 2019

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

HyukjinKwon reviewed Jul 22, 2019

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

Address comments.

34531b4

viirya commented Jul 22, 2019

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

HyukjinKwon reviewed Jul 22, 2019

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

Move tests to PythonUDFSuite.

33a5e0d

dongjoon-hyun added PYSPARK SQL labels Jul 22, 2019

skonto reviewed Jul 22, 2019

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

Address comment.

3b3472e

cloud-fan reviewed Jul 29, 2019

cloud-fan reviewed Jul 30, 2019

Address comments.

c852142

cloud-fan reviewed Aug 2, 2019

For comment.

24c6744

cloud-fan reviewed Aug 2, 2019

cloud-fan approved these changes Aug 2, 2019

HyukjinKwon reviewed Aug 2, 2019

HyukjinKwon approved these changes Aug 2, 2019

HyukjinKwon changed the title ~~[SPARK-28445][SQL][Python] Fix error when PythonUDF is used in both group by and aggregate expression~~ [SPARK-28445][SQL][PYTHON] Fix error when PythonUDF is used in both group by and aggregate expression Aug 2, 2019

HyukjinKwon closed this in 77c7e91 Aug 2, 2019

This was referenced Aug 5, 2019

[SPARK-28280][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'udf-group-by.sql' #25360

Closed

[SPARK-28279][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'udf-group-analytics.sql' #25362

Closed

shivusondur mentioned this pull request Aug 12, 2019

[SPARK-28390][SQL][PYTHON][TESTS] [FOLLOW-UP] Update the TODO with actual blocking JIRA IDs #25415

Closed

viirya deleted the SPARK-28445 branch December 27, 2023 18:22
