-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-38118][SQL] Func(wrong data type) in HAVING clause should throw data mismatch error #35404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
R: @cloud-fan |
|
Can one of the admins verify this patch? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this change, this resolved field of Average is set based on both children all resolved and input data type match (the default Expression implementation). However this will lead to un-resolved column name with Average in HAVING due to the logic of TempResolveColumn, which lead to column not found (but the column was found).
If we set this field only based on if all children are resolved, later in CheckAnalysis it will check expression input data type, and throw right data type mismatch error:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
Line 189 in 0d56c94
| case e: Expression if e.checkInputDataTypes().isFailure => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's intentional that Expression.resolved also checks the input data types. It seems weird to me that we only change it for Average...
|
cc @allisonwang-db FYI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit typo: claus -> clause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this PR based on our discussion.
Right now my implementation could fail a few queries (for example SQLQuerySuite.normalize special floating numbers in subquery).
Invalid call to dataType on unresolved object
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:137)
at org.apache.spark.sql.catalyst.expressions.BinaryOperator.checkInputDataTypes(Expression.scala:713)
at org.apache.spark.sql.catalyst.expressions.BinaryComparison.checkInputDataTypes(predicates.scala:908)
at org.apache.spark.sql.catalyst.analysis.RemoveTempResolvedColumn$.$anonfun$apply$82(Analyzer.scala:4262)
I believe it is because I called checkInputDataTypes() on a binary comparator that has at least one side as UnresolvedAttribute. Do you know how to filter such case in my implementation? Basically how to filter out an Expression based on its nested children if there is any one is UnresolvedAttribute?
I am not familiar with scala and our codebase. Any help will be much appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried those functions like transformWithPruning, unfortunately didn't make it work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A straightforward fix is to change it to case e: Expression if e.childrenResolved && e.checkInputDataTypes().isFailure.
A new idea is to not strip TempResolvedColumn, as TempResolvedColumn always means a failure. We can update CheckAnalysis to handle it, i.e. adding a new case after case e: Expression if e.checkInputDataTypes().isFailure =>
case t: TempResolvedColumn =>
val a = UnresolvedAttribute(t.nameParts)
the same code that handles "case a: Attribute if !a.resolved"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you! e.childrenResolved is a handy call and it indeed solves problem!
I am still checking error at RemoveTempResolvedColumn. If you actually prefer to check it at CheckAnalysis, let me know and I can make a change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact I am not sure the consequence of not striping TempResolvedColumn. I would guess there was a reason to add that RemoveTempResolvedColumn rule/batch? Otherwise why not deal with TempResolvedColumn in CheckAnalysis when TempResolvedColumn was introduced?
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we strip tempresolvedcolumn before generating the error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I override def sql:string in TempResolvedColumn to strip tempresolvedcolumn in error message. let me know if there is a better way to do so.
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems the only difference between this test and the first one is we replace mean with abs. How about
Seq("mean", "abs").foreach { func =>
val e1 = intercept...
...
|HAVING $func(t.c) ...
...
val e2 = ...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need to check the full error message
assert(e1.message.contains(s"cannot resolve '$func(t.c)' due to data type mismatch"))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about handling TempResolvedColumn in CheckAnalysis? For example:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
Lines 161 to 166 in 25a4c5f
| // Check argument data types of higher-order functions downwards first. | |
| // If the arguments of the higher-order functions are resolved but the type check fails, | |
| // the argument functions will not get resolved, but we should report the argument type | |
| // check failure instead of claiming the argument functions are unresolved. | |
| operator transformExpressionsDown { | |
| case hof: HigherOrderFunction |
Are there new issues if we keep TempResolvedColumn during the analysis?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #35404 (comment).
Unfortunately I don't have enough knowledge why removing TempResolvedColumn was introduced (it was introduced in #32470)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
child.sql?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems just equivalent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea equivalent, but child.sql is simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
|
@cloud-fan friendly ping |
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, please fix the conflicts.
2351590 to
faa12dd
Compare
|
Conflicts were resolved. |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for resolving conflicts, @amaliujia .
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Outdated
Show resolved
Hide resolved
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you for updates, @amaliujia .
(Pending CIs)
|
Merged to master for Apache Spark 3.3.0. @amaliujia . I added you to the Apache Spark contributor group and assigned SPARK-38118 to you. |
|
Thank you! @dongjoon-hyun |
### What changes were proposed in this pull request? This is a followup of #35404 and #36746 , to simplify the error handling of `TempResolvedColumn`. The idea is: 1. The rule `ResolveAggregationFunctions` in the main resolution batch creates `TempResolvedColumn` and only removes it if the aggregate expression is fully resolved. It either strips `TempResolvedColumn` if it's inside aggregate function or group expression, or restores `TempResolvedColumn` to `UnresolvedAttribute` otherwise, hoping other rules can resolve it. 2. The rule `RemoveTempResolvedColumn` in a latter batch can still hit `TempResolvedColumn` if the aggregate expression is unresolved (due to input type mismatch for example, e.g. `avg(bool_col)`, `date_add(int_group_col, 1)`). At this stage, there is no way to restore `TempResolvedColumn` to `UnresolvedAttribute` and resolve it differently. The query will fail and we should blindly strip `TempResolvedColumn` to provide better error message. ### Why are the changes needed? code cleanup ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #36809 from cloud-fan/error. Lead-authored-by: Wenchen Fan <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This query throws
Column 't.c' does not exist. Did you mean one of the following? [t.c]However, mean(boolean) is not a supported function signature, thus error result should be
cannot resolve 'mean(t.c)' due to data type mismatch: function average requires numeric or interval types, not booleanThis is because
ResolveFunctionsrule.ResolveAggregationFunctions, theTempResolvedColumnas a wrapper inmean(TempResolvedColumn(t.c))cannot be removed (only resolved AGG can remove its’sTempResolvedColumn).TempResolvedColumnwas reverted and it becomes mean(t.c), so mean loses the information about t.c.mean(boolean) in HAVING is not marked as resolved in {{ResolveFunctions}} rule because
resolvedfield population code:lazy val resolved: Boolean = childrenResolved && checkInputDataTypes().isSuccessresolvedwill be false, but it leads to wrong error message.Why are the changes needed?
Improve error message so users can better debug their query.
Does this PR introduce any user-facing change?
Yes. This will change user-facing error message.
How was this patch tested?
Unit Test