-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-19017][SQL] NOT IN subquery with more than one column may return incorrect results #16467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…rrect results ## What changes were proposed in this pull request? This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase. ## How was this patch tested? ./dev/run-tests a new unit test on the problematic pattern.
…rrect results ## What changes were proposed in this pull request? This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase. ## How was this patch tested? ./dev/run-tests a new unit test on the problematic pattern.
| struct<a1:int,b1:int> | ||
| -- !query 2 output | ||
| 2 1 | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It returns an empty set without this fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is (null, 2) missing? There is no tuple in t2 for which b2=2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's consider this:
(null, 2) NOT IN { (1, 1), (null, 3), (1, null) }
which is equal to
.... AND (null <> 1 OR 2 <> null) => ... AND (unknown OR unknown)
=> ... AND unknown
=> unknown
Therefore (null, 2) is not part of the result set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok yeah you are right. I was confusing this with the or rules.
| struct<a1:int,b1:int> | ||
| -- !query 3 output | ||
| 1 1 | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It returns an empty set without this fix.
| -- !query 4 schema | ||
| struct<a1:int,b1:int> | ||
| -- !query 4 output | ||
| 1 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It returns an empty set without this fix.
| if (!parent.exists()) { | ||
| assert(parent.mkdirs(), "Could not create directory: " + parent) | ||
| } | ||
| stringToFile(resultFile, goldenOutput) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This newly added code is to address an issue, when test files are located in a hierarchy of sub-directories, at the time the golden result files are generated it could happen that the structure of those sub-directories are not yet created. The code will create the required sub-directories.
| sql("select * from l where a not in (select c from t where b < d)"), | ||
| Row(1, 2.0) :: Row(1, 2.0) :: Row(3, 3.0) :: Nil) | ||
| sql("select * from l where (a, b) not in (select c, d from t) and a < 4"), | ||
| Row(1, 2.0) :: Row(1, 2.0) :: Row(2, 1.0) :: Row(2, 1.0) :: Row(3, 3.0) :: Nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Query with correlated predicates in NOT IN subquery could generate incorrect results (this problem is tracked by SPARK-18966). With this fix, it reveals the problem. Here I modify the test case to cover the code path for multiple columns instead.
| // Empty sub-query | ||
| checkAnswer( | ||
| sql("select * from l where a not in (select c from r where c > 10 and b < d)"), | ||
| sql("select * from l where (a, b) not in (select c, d from r where c > 10)"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the predicate c > 10 (which filters all the rows in the subquery), it covers up the correlated predicate problem. Instead of removing the test case completely, I just modify to have a different coverage for multiple columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we also should test an empty subquery :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case is effectively covering the case of empty subquery.
|
The new test file is placed in the directory structure similar to the one proposed in SPARK-18871 (#16337). |
|
Test build #70849 has finished for PR 16467 at commit
|
|
I left a comment on the JIRA. It feel it is important to reach consensus on the semantics first. Other than that the PR looks good. |
|
Thank you for taking time to review the PR. I agree we need to reach consensus on the semantics. |
hvanhovell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good. I have one question regarding the result of a test case. Otherwise only minor things.
| // Empty sub-query | ||
| checkAnswer( | ||
| sql("select * from l where a not in (select c from r where c > 10 and b < d)"), | ||
| sql("select * from l where (a, b) not in (select c, d from r where c > 10)"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we also should test an empty subquery :)
| // to | ||
| // (a1=a2 OR isnull(a1=a2)) AND (b1=b2 OR isnull(b1=b2)) AND ... | ||
| val joinConds = splitConjunctivePredicates(joinCond.get) | ||
| val isNulls = joinConds.map(IsNull) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor - We can write this more directly:
val pairs = joinConds.map(c => Or(c, IsNull(c))).reduceLeft(And)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. I have changed the code.
| struct<a1:int,b1:int> | ||
| -- !query 2 output | ||
| 2 1 | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is (null, 2) missing? There is no tuple in t2 for which b2=2.
|
Test build #70948 has finished for PR 16467 at commit
|
|
@hvanhovell, please let me know if you are waiting for my replies to your comments. I believe I have addressed all of them. thanks! |
|
Test build #71283 has finished for PR 16467 at commit
|
|
@hvanhovell Would there be anything left that I have not addressed in this PR? |
|
LGTM - merging to master/2.1. Thanks! Sorry for the long wait. |
…rn incorrect results
## What changes were proposed in this pull request?
This PR fixes the code in Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in Rule `RewritePredicateSubquery`.
Example:
The query
select a1,b1
from t1
where (a1,b1) not in (select a2,b2
from t2);
has the (a1, b1) = (a2, b2) rewritten from (before this fix):
Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))
to (after this fix):
Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))
## How was this patch tested?
sql/test, catalyst/test and new test cases in SQLQueryTestSuite.
Author: Nattavut Sutyanyong <[email protected]>
Closes #16467 from nsyca/19017.
(cherry picked from commit cdb691e)
Signed-off-by: Herman van Hovell <[email protected]>
…rn incorrect results
## What changes were proposed in this pull request?
This PR fixes the code in Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in Rule `RewritePredicateSubquery`.
Example:
The query
select a1,b1
from t1
where (a1,b1) not in (select a2,b2
from t2);
has the (a1, b1) = (a2, b2) rewritten from (before this fix):
Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))
to (after this fix):
Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))
## How was this patch tested?
sql/test, catalyst/test and new test cases in SQLQueryTestSuite.
Author: Nattavut Sutyanyong <[email protected]>
Closes apache#16467 from nsyca/19017.
…rn incorrect results
## What changes were proposed in this pull request?
This PR fixes the code in Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in Rule `RewritePredicateSubquery`.
Example:
The query
select a1,b1
from t1
where (a1,b1) not in (select a2,b2
from t2);
has the (a1, b1) = (a2, b2) rewritten from (before this fix):
Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))
to (after this fix):
Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))
## How was this patch tested?
sql/test, catalyst/test and new test cases in SQLQueryTestSuite.
Author: Nattavut Sutyanyong <[email protected]>
Closes apache#16467 from nsyca/19017.
…rn incorrect results
## What changes were proposed in this pull request?
This PR fixes the code in Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in Rule `RewritePredicateSubquery`.
Example:
The query
select a1,b1
from t1
where (a1,b1) not in (select a2,b2
from t2);
has the (a1, b1) = (a2, b2) rewritten from (before this fix):
Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))
to (after this fix):
Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))
## How was this patch tested?
sql/test, catalyst/test and new test cases in SQLQueryTestSuite.
Author: Nattavut Sutyanyong <[email protected]>
Closes apache#16467 from nsyca/19017.
What changes were proposed in this pull request?
This PR fixes the code in Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in Rule
RewritePredicateSubquery.Example:
The query
select a1,b1
from t1
where (a1,b1) not in (select a2,b2
from t2);
has the (a1, b1) = (a2, b2) rewritten from (before this fix):
Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))
to (after this fix):
Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))
How was this patch tested?
sql/test, catalyst/test and new test cases in SQLQueryTestSuite.