[SPARK-38118][SQL] Func(wrong data type) in HAVING clause should throw data mismatch error #35404

amaliujia · 2022-02-06T07:14:47Z

What changes were proposed in this pull request?

with t as (select true c)
select t.c
from t
group by t.c
having mean(t.c) > 0

This query throws: Column 't.c' does not exist. Did you mean one of the following? [t.c]

However, mean(boolean) is not a supported function signature, so the error should instead be: cannot resolve 'mean(t.c)' due to data type mismatch: function average requires numeric or interval types, not boolean

This happens because:

The mean(boolean) in HAVING is not marked as resolved by the ResolveFunctions rule.
As a result, in ResolveAggregationFunctions, the TempResolvedColumn wrapper in mean(TempResolvedColumn(t.c)) cannot be removed (only a resolved aggregate can have its TempResolvedColumn removed).
When a later batch is applied, TempResolvedColumn is reverted and the expression becomes mean(t.c), so mean loses the information about t.c.
At the last step, the analyzer can therefore only report that t.c was not found.

mean(boolean) in HAVING is not marked as resolved by the ResolveFunctions rule because:

It uses the default Expression code for populating the resolved field:
lazy val resolved: Boolean = childrenResolved && checkInputDataTypes().isSuccess
During analysis, mean(boolean) is mean(TempResolvedColumn(boolean)), so childrenResolved is true.
However, checkInputDataTypes() fails (see Average.scala#L55).
So Average's resolved ends up false, which in turn leads to the wrong error message.
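
To make the interaction concrete, here is a minimal sketch of the default resolved rule quoted above. It is a toy model, not Spark code: Expr, Column, Mean and inputTypesOk are hypothetical stand-ins for Expression, a resolved column, Average and checkInputDataTypes().isSuccess.

```scala
// Toy model only: these are NOT Spark's classes, just stand-ins with similar names.
sealed trait DataType
case object BooleanType extends DataType
case object DoubleType extends DataType

trait Expr {
  def children: Seq[Expr]
  // Stand-in for checkInputDataTypes().isSuccess; by default any input is fine.
  def inputTypesOk: Boolean = true
  def childrenResolved: Boolean = children.forall(_.resolved)
  // Same shape as the default rule: children resolved AND the type check passes.
  lazy val resolved: Boolean = childrenResolved && inputTypesOk
}

// A column whose name has already been looked up successfully.
case class Column(name: String, dataType: DataType) extends Expr {
  val children: Seq[Expr] = Nil
}

// Stand-in for Average: in this toy it only accepts DoubleType input.
case class Mean(child: Column) extends Expr {
  val children: Seq[Expr] = Seq(child)
  override def inputTypesOk: Boolean = child.dataType == DoubleType
}

object ResolvedDemo {
  val overBoolean: Mean = Mean(Column("t.c", BooleanType))
  val overDouble: Mean = Mean(Column("t.d", DoubleType))

  def main(args: Array[String]): Unit = {
    println(overBoolean.childrenResolved) // true: the column itself was found
    println(overBoolean.resolved)         // false: the type check fails
    println(overDouble.resolved)          // true
  }
}
```

In this model the column under Mean is found, yet Mean itself is unresolved purely because of the type check, which is the state that later surfaces as a misleading "column does not exist" error.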

Why are the changes needed?

Improve error message so users can better debug their query.

Does this PR introduce any user-facing change?

Yes. This will change user-facing error message.

How was this patch tested?

Unit Test

amaliujia · 2022-02-06T07:18:14Z

R: @cloud-fan

AmplabJenkins · 2022-02-06T07:38:55Z

Can one of the admins verify this patch?

amaliujia · 2022-02-06T19:18:10Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala

Before this change, the resolved field of Average was set based on both all children being resolved and the input data types matching (the default Expression implementation). However, because of how TempResolvedColumn works, this leaves the column name under Average in HAVING unresolved, which leads to a column-not-found error even though the column was found.

If we set this field based only on whether all children are resolved, CheckAnalysis will later check the expression's input data types and throw the right data type mismatch error:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

Line 189 in 0d56c94

case e: Expression if e.checkInputDataTypes().isFailure =>

It's intentional that Expression.resolved also checks the input data types. It seems weird to me that we only change it for Average...

HyukjinKwon · 2022-02-07T01:51:20Z

cc @allisonwang-db FYI

allisonwang-db · 2022-02-07T19:35:08Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

nit typo: claus -> clause

amaliujia · 2022-02-09T23:55:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@cloud-fan

I updated this PR based on our discussion.

Right now my implementation could fail a few queries (for example SQLQuerySuite.normalize special floating numbers in subquery).

Invalid call to dataType on unresolved object
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:137)
  at org.apache.spark.sql.catalyst.expressions.BinaryOperator.checkInputDataTypes(Expression.scala:713)
  at org.apache.spark.sql.catalyst.expressions.BinaryComparison.checkInputDataTypes(predicates.scala:908)
  at org.apache.spark.sql.catalyst.analysis.RemoveTempResolvedColumn$.$anonfun$apply$82(Analyzer.scala:4262)

I believe this is because I called checkInputDataTypes() on a binary comparison that has an UnresolvedAttribute on at least one side. Do you know how to filter out such cases in my implementation? Basically, how can I filter out an Expression if any of its nested children is an UnresolvedAttribute?

I am not familiar with Scala or our codebase. Any help would be much appreciated.

I tried functions like transformWithPruning, but unfortunately couldn't make them work.

A straightforward fix is to change it to case e: Expression if e.childrenResolved && e.checkInputDataTypes().isFailure.
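
As an illustration of why the childrenResolved guard helps, here is a small self-contained sketch. It is a toy model under stated assumptions, not Spark code: Expr, Unresolved, Literal, GreaterThan and typeCheckFails are hypothetical stand-ins for Expression, UnresolvedAttribute, a literal, a binary comparison and checkInputDataTypes().isFailure.

```scala
// Toy model only: these are NOT Spark's classes.
trait Expr {
  def children: Seq[Expr]
  def resolved: Boolean
  def dataType: String
  def childrenResolved: Boolean = children.forall(_.resolved)
  // Stand-in for checkInputDataTypes().isFailure.
  def typeCheckFails: Boolean = false
}

case class Unresolved(name: String) extends Expr {
  val children: Seq[Expr] = Nil
  val resolved = false
  // Like the stack trace above: asking an unresolved name for its type is an error.
  def dataType: String =
    throw new IllegalStateException(s"Invalid call to dataType on unresolved object: $name")
}

case class Literal(dataType: String) extends Expr {
  val children: Seq[Expr] = Nil
  val resolved = true
}

case class GreaterThan(left: Expr, right: Expr) extends Expr {
  val children: Seq[Expr] = Seq(left, right)
  def resolved: Boolean = childrenResolved && !typeCheckFails
  val dataType = "boolean"
  // Reads both children's types, so it is only safe once both are resolved.
  override def typeCheckFails: Boolean = left.dataType != right.dataType
}

object GuardDemo {
  // && short-circuits, so typeCheckFails is never evaluated for unresolved children.
  def typeMismatch(e: Expr): Boolean = e match {
    case x if x.childrenResolved && x.typeCheckFails => true
    case _ => false
  }

  def main(args: Array[String]): Unit = {
    println(typeMismatch(GreaterThan(Unresolved("a"), Literal("int"))))   // false, no exception
    println(typeMismatch(GreaterThan(Literal("boolean"), Literal("int")))) // true
  }
}
```

The order of the two conditions matters here: with the type check first, the unresolved case would throw before the guard had a chance to skip it.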

A new idea is to not strip TempResolvedColumn, as TempResolvedColumn always means a failure. We can update CheckAnalysis to handle it, i.e. adding a new case after case e: Expression if e.checkInputDataTypes().isFailure =>

case t: TempResolvedColumn => val a = UnresolvedAttribute(t.nameParts), followed by the same code that handles "case a: Attribute if !a.resolved"

@cloud-fan

Thank you! e.childrenResolved is a handy call and it indeed solves the problem!

I am still checking the error in RemoveTempResolvedColumn. If you would prefer to check it in CheckAnalysis, let me know and I can make the change.

In fact, I am not sure about the consequences of not stripping TempResolvedColumn. I would guess there was a reason to add the RemoveTempResolvedColumn rule/batch? Otherwise, why not deal with TempResolvedColumn in CheckAnalysis when TempResolvedColumn was introduced?

cloud-fan · 2022-02-11T02:33:27Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

Shall we strip TempResolvedColumn before generating the error message?

I overrode def sql: String in TempResolvedColumn to strip TempResolvedColumn from the error message. Let me know if there is a better way to do so.

cloud-fan · 2022-02-11T06:19:03Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

seems the only difference between this test and the first one is we replace mean with abs. How about

Seq("mean", "abs").foreach { func =>
  val e1 = intercept...
  ...
  |HAVING $func(t.c)
  ...
  ...
  val e2 = ...
}

Good idea! Done!

cloud-fan · 2022-02-11T06:56:03Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

we don't need to check the full error message

assert(e1.message.contains(s"cannot resolve '$func(t.c)' due to data type mismatch"))

I see. Done!

allisonwang-db · 2022-02-11T21:27:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

How about handling TempResolvedColumn in CheckAnalysis? For example:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

Lines 161 to 166 in 25a4c5f

// Check argument data types of higher-order functions downwards first.
// If the arguments of the higher-order functions are resolved but the type check fails,
// the argument functions will not get resolved, but we should report the argument type
// check failure instead of claiming the argument functions are unresolved.
operator transformExpressionsDown {
  case hof: HigherOrderFunction

Are there new issues if we keep TempResolvedColumn during the analysis?

See #35404 (comment).

Unfortunately, I don't know enough about why the removal of TempResolvedColumn was introduced (it was introduced in #32470).

allisonwang-db · 2022-02-11T21:28:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala

child.sql?

seems just equivalent?

yea equivalent, but child.sql is simpler.

amaliujia · 2022-02-14T23:51:30Z

@cloud-fan friendly ping

cloud-fan

LGTM, please fix the conflicts.

amaliujia · 2022-02-16T21:47:05Z

Conflicts were resolved.

dongjoon-hyun

Thank you for resolving conflicts, @amaliujia .

dongjoon-hyun

+1, LGTM. Thank you for updates, @amaliujia .
(Pending CIs)

dongjoon-hyun · 2022-02-18T00:01:13Z

Merged to master for Apache Spark 3.3.0.

@amaliujia . I added you to the Apache Spark contributor group and assigned SPARK-38118 to you.
Welcome to the Apache Spark community.

amaliujia · 2022-02-18T00:13:30Z

Thank you! @dongjoon-hyun

### What changes were proposed in this pull request?

This is a followup of #35404 and #36746, to simplify the error handling of `TempResolvedColumn`. The idea is:

1. The rule `ResolveAggregationFunctions` in the main resolution batch creates `TempResolvedColumn` and only removes it if the aggregate expression is fully resolved. It either strips `TempResolvedColumn` if it's inside an aggregate function or group expression, or restores `TempResolvedColumn` to `UnresolvedAttribute` otherwise, hoping other rules can resolve it.
2. The rule `RemoveTempResolvedColumn` in a later batch can still hit `TempResolvedColumn` if the aggregate expression is unresolved (due to an input type mismatch, e.g. `avg(bool_col)`, `date_add(int_group_col, 1)`). At this stage, there is no way to restore `TempResolvedColumn` to `UnresolvedAttribute` and resolve it differently. The query will fail and we should blindly strip `TempResolvedColumn` to provide a better error message.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #36809 from cloud-fan/error.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
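
The two-stage handling described in this followup can be sketched on a toy expression tree. This is a simplified illustration, not Spark's implementation: Attr, UnresolvedAttr, Temp, Agg, Gt, Lit and the function names are hypothetical, group expressions are left out, and "fully resolved" is reduced to a boolean flag.

```scala
// Toy model only: these are NOT Spark's classes.
sealed trait Expr
case class Attr(name: String) extends Expr            // a resolved column
case class UnresolvedAttr(name: String) extends Expr  // a name still to be resolved
case class Temp(child: Attr) extends Expr             // stand-in for TempResolvedColumn
case class Agg(func: String, arg: Expr) extends Expr  // an aggregate call such as avg(...)
case class Gt(left: Expr, right: Expr) extends Expr
case class Lit(value: Int) extends Expr

object TempColumnDemo {
  // Strip the wrapper inside an aggregate call; restore it to an unresolved name elsewhere.
  private def stripOrRestore(e: Expr, insideAgg: Boolean = false): Expr = e match {
    case Temp(a) if insideAgg => a
    case Temp(a)              => UnresolvedAttr(a.name)
    case Agg(f, arg)          => Agg(f, stripOrRestore(arg, insideAgg = true))
    case Gt(l, r)             => Gt(stripOrRestore(l, insideAgg), stripOrRestore(r, insideAgg))
    case other                => other
  }

  // Stage 1 (main batch): only touch the wrappers once the aggregate is fully resolved.
  def mainBatch(having: Expr, aggResolved: Boolean): Expr =
    if (aggResolved) stripOrRestore(having) else having

  // Stage 2 (later batch): the query will fail anyway, so strip every remaining wrapper
  // to keep the real column name available for the error message.
  def stripAll(e: Expr): Expr = e match {
    case Temp(a)     => a
    case Agg(f, arg) => Agg(f, stripAll(arg))
    case Gt(l, r)    => Gt(stripAll(l), stripAll(r))
    case other       => other
  }
}
```

For example, with `val having = Gt(Agg("avg", Temp(Attr("c"))), Lit(0))` and an unresolved aggregate, `TempColumnDemo.mainBatch(having, aggResolved = false)` leaves the tree untouched, and `TempColumnDemo.stripAll(having)` then yields `Gt(Agg("avg", Attr("c")), Lit(0))`, so a later type check can still name the column.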

github-actions bot added the SQL label Feb 6, 2022

amaliujia changed the title ~~[SPARK-38118][SQL] MEAN(Boolean) in the HAVING claus should throw data mismatch error~~ [SPARK-38118][SQL] Func(wrong data type) in the HAVING claus should throw data mismatch error Feb 9, 2022

cloud-fan approved these changes Feb 16, 2022

amaliujia added 2 commits February 16, 2022 13:20

[SPARK-38118] func(wrong type) in the HAVING claus should throw data …

23699ce

…mismatch error.

address comments

faa12dd

amaliujia force-pushed the meanboolean branch from 2351590 to faa12dd Compare February 16, 2022 21:46

dongjoon-hyun changed the title ~~[SPARK-38118][SQL] Func(wrong data type) in the HAVING claus should throw data mismatch error~~ [SPARK-38118][SQL] Func(wrong data type) in HAVING clause should throw data mismatch error Feb 17, 2022

update

850e500

dongjoon-hyun approved these changes Feb 17, 2022

dongjoon-hyun closed this in 4070ea8 Feb 17, 2022

amaliujia deleted the meanboolean branch February 18, 2022 00:13

cloud-fan mentioned this pull request Jun 14, 2022

[SPARK-39488][SQL] Simplify the error handling of TempResolvedColumn #36809

Closed
