Skip to content

Conversation

@amaliujia
Copy link
Contributor

@amaliujia amaliujia commented Feb 6, 2022

What changes were proposed in this pull request?

with t as (select true c)
select t.c
from t
group by t.c
having mean(t.c) > 0

This query throws Column 't.c' does not exist. Did you mean one of the following? [t.c]

However, mean(boolean) is not a supported function signature, thus error result should be  cannot resolve 'mean(t.c)' due to data type mismatch: function average requires numeric or interval types, not boolean

 

This is because

  1. The mean(boolean) in HAVING was not marked as resolved in ResolveFunctions rule.
  2. Thus in ResolveAggregationFunctions, the TempResolvedColumn as a wrapper in mean(TempResolvedColumn(t.c)) cannot be removed (only resolved AGG can remove its’s TempResolvedColumn).
  3. Thus in a later batch rule applying,  TempResolvedColumn was reverted and it becomes mean(t.c), so mean loses the information about t.c.
  4. Thus at the last step, the analyzer can only report t.c not found.

 

mean(boolean) in HAVING is not marked as resolved in {{ResolveFunctions}} rule because 

  1. It uses Expression default resolved field population code:
    lazy val resolved: Boolean = childrenResolved && checkInputDataTypes().isSuccess
  2. During the analyzing,  mean(boolean) is mean(TempResolveColumn(boolean), thus childrenResolved is true.
  3. however checkInputDataTypes() will be false Average.scala#L55
  4. Thus eventually Average's resolved  will be false, but it leads to wrong error message.

Why are the changes needed?

Improve error message so users can better debug their query.

Does this PR introduce any user-facing change?

Yes. This will change user-facing error message.

How was this patch tested?

Unit Test

@github-actions github-actions bot added the SQL label Feb 6, 2022
@amaliujia
Copy link
Contributor Author

R: @cloud-fan

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this change, this resolved field of Average is set based on both children all resolved and input data type match (the default Expression implementation). However this will lead to un-resolved column name with Average in HAVING due to the logic of TempResolveColumn, which lead to column not found (but the column was found).

If we set this field only based on if all children are resolved, later in CheckAnalysis it will check expression input data type, and throw right data type mismatch error:

case e: Expression if e.checkInputDataTypes().isFailure =>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's intentional that Expression.resolved also checks the input data types. It seems weird to me that we only change it for Average...

@HyukjinKwon
Copy link
Member

cc @allisonwang-db FYI

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit typo: claus -> clause

@amaliujia amaliujia changed the title [SPARK-38118][SQL] MEAN(Boolean) in the HAVING claus should throw data mismatch error [SPARK-38118][SQL] Func(wrong data type) in the HAVING claus should throw data mismatch error Feb 9, 2022
Copy link
Contributor Author

@amaliujia amaliujia Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan

I updated this PR based on our discussion.

Right now my implementation could fail a few queries (for example SQLQuerySuite.normalize special floating numbers in subquery).

Invalid call to dataType on unresolved object
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object
	at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:137)
	at org.apache.spark.sql.catalyst.expressions.BinaryOperator.checkInputDataTypes(Expression.scala:713)
	at org.apache.spark.sql.catalyst.expressions.BinaryComparison.checkInputDataTypes(predicates.scala:908)
	at org.apache.spark.sql.catalyst.analysis.RemoveTempResolvedColumn$.$anonfun$apply$82(Analyzer.scala:4262)

I believe it is because I called checkInputDataTypes() on a binary comparator that has at least one side as UnresolvedAttribute. Do you know how to filter such case in my implementation? Basically how to filter out an Expression based on its nested children if there is any one is UnresolvedAttribute?

I am not familiar with scala and our codebase. Any help will be much appreciated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried those functions like transformWithPruning, unfortunately didn't make it work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A straightforward fix is to change it to case e: Expression if e.childrenResolved && e.checkInputDataTypes().isFailure.

A new idea is to not strip TempResolvedColumn, as TempResolvedColumn always means a failure. We can update CheckAnalysis to handle it, i.e. adding a new case after case e: Expression if e.checkInputDataTypes().isFailure =>

case t: TempResolvedColumn =>
  val a = UnresolvedAttribute(t.nameParts)
  the same code that handles "case a: Attribute if !a.resolved"

Copy link
Contributor Author

@amaliujia amaliujia Feb 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan

thank you! e.childrenResolved is a handy call and it indeed solves problem!

I am still checking error at RemoveTempResolvedColumn. If you actually prefer to check it at CheckAnalysis, let me know and I can make a change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact I am not sure the consequence of not striping TempResolvedColumn. I would guess there was a reason to add that RemoveTempResolvedColumn rule/batch? Otherwise why not deal with TempResolvedColumn in CheckAnalysis when TempResolvedColumn was introduced?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we strip tempresolvedcolumn before generating the error message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I override def sql:string in TempResolvedColumn to strip tempresolvedcolumn in error message. let me know if there is a better way to do so.

Copy link
Contributor

@cloud-fan cloud-fan Feb 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems the only difference between this test and the first one is we replace mean with abs. How about

Seq("mean", "abs").foreach {  func =>
  val e1 = intercept...
    ...
    |HAVING $func(t.c) ...
    ...

  val e2 = ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! Done!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need to check the full error message

assert(e1.message.contains(s"cannot resolve '$func(t.c)' due to data type mismatch"))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Done!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about handling TempResolvedColumn in CheckAnalysis? For example:

// Check argument data types of higher-order functions downwards first.
// If the arguments of the higher-order functions are resolved but the type check fails,
// the argument functions will not get resolved, but we should report the argument type
// check failure instead of claiming the argument functions are unresolved.
operator transformExpressionsDown {
case hof: HigherOrderFunction

Are there new issues if we keep TempResolvedColumn during the analysis?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #35404 (comment).

Unfortunately I don't have enough knowledge why removing TempResolvedColumn was introduced (it was introduced in #32470)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

child.sql?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems just equivalent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea equivalent, but child.sql is simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@amaliujia
Copy link
Contributor Author

@cloud-fan friendly ping

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please fix the conflicts.

@amaliujia
Copy link
Contributor Author

Conflicts were resolved.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-38118][SQL] Func(wrong data type) in the HAVING claus should throw data mismatch error [SPARK-38118][SQL] Func(wrong data type) in HAVING clause should throw data mismatch error Feb 17, 2022
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for resolving conflicts, @amaliujia .

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you for updates, @amaliujia .
(Pending CIs)

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Feb 18, 2022

Merged to master for Apache Spark 3.3.0.

@amaliujia . I added you to the Apache Spark contributor group and assigned SPARK-38118 to you.
Welcome to the Apache Spark community.

@amaliujia amaliujia deleted the meanboolean branch February 18, 2022 00:13
@amaliujia
Copy link
Contributor Author

Thank you! @dongjoon-hyun

cloud-fan added a commit that referenced this pull request Jun 16, 2022
### What changes were proposed in this pull request?

This is a followup of #35404 and #36746 , to simplify the error handling of `TempResolvedColumn`.

The idea is:
1. The rule `ResolveAggregationFunctions` in the main resolution batch creates `TempResolvedColumn` and only removes it if the aggregate expression is fully resolved. It either strips `TempResolvedColumn` if it's inside aggregate function or group expression, or restores `TempResolvedColumn` to `UnresolvedAttribute` otherwise, hoping other rules can resolve it.
2. The rule `RemoveTempResolvedColumn` in a latter batch can still hit `TempResolvedColumn` if the aggregate expression is unresolved (due to input type mismatch for example, e.g. `avg(bool_col)`, `date_add(int_group_col, 1)`). At this stage, there is no way to restore `TempResolvedColumn` to `UnresolvedAttribute`  and resolve it differently. The query will fail and we should blindly strip `TempResolvedColumn` to provide better error message.
### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #36809 from cloud-fan/error.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants