Skip to content

Conversation

@nsyca
Copy link
Contributor

@nsyca nsyca commented Jan 4, 2017

What changes were proposed in this pull request?

This PR fixes the code in Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in Rule RewritePredicateSubquery.

Example:
The query

select a1,b1
from t1
where (a1,b1) not in (select a2,b2
from t2);

has the (a1, b1) = (a2, b2) rewritten from (before this fix):

Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))

to (after this fix):

Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))

How was this patch tested?

sql/test, catalyst/test and new test cases in SQLQueryTestSuite.

nsyca added 16 commits July 29, 2016 17:43
…rrect results

## What changes were proposed in this pull request?

This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase.

## How was this patch tested?
./dev/run-tests
a new unit test on the problematic pattern.
…rrect results

## What changes were proposed in this pull request?

This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase.

## How was this patch tested?
./dev/run-tests
a new unit test on the problematic pattern.
struct<a1:int,b1:int>
-- !query 2 output
2 1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns an empty set without this fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is (null, 2) missing? There is no tuple in t2 for which b2=2.

Copy link
Contributor Author

@nsyca nsyca Jan 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's consider this:

(null, 2) NOT IN { (1, 1), (null, 3), (1, null) }

which is equal to

.... AND (null <> 1 OR 2 <> null) => ... AND (unknown OR unknown)
                                  => ... AND unknown
                                  => unknown

Therefore (null, 2) is not part of the result set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok yeah you are right. I was confusing this with the or rules.

struct<a1:int,b1:int>
-- !query 3 output
1 1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns an empty set without this fix.

-- !query 4 schema
struct<a1:int,b1:int>
-- !query 4 output
1 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns an empty set without this fix.

if (!parent.exists()) {
assert(parent.mkdirs(), "Could not create directory: " + parent)
}
stringToFile(resultFile, goldenOutput)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This newly added code is to address an issue, when test files are located in a hierarchy of sub-directories, at the time the golden result files are generated it could happen that the structure of those sub-directories are not yet created. The code will create the required sub-directories.

sql("select * from l where a not in (select c from t where b < d)"),
Row(1, 2.0) :: Row(1, 2.0) :: Row(3, 3.0) :: Nil)
sql("select * from l where (a, b) not in (select c, d from t) and a < 4"),
Row(1, 2.0) :: Row(1, 2.0) :: Row(2, 1.0) :: Row(2, 1.0) :: Row(3, 3.0) :: Nil)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Query with correlated predicates in NOT IN subquery could generate incorrect results (this problem is tracked by SPARK-18966). With this fix, it reveals the problem. Here I modify the test case to cover the code path for multiple columns instead.

// Empty sub-query
checkAnswer(
sql("select * from l where a not in (select c from r where c > 10 and b < d)"),
sql("select * from l where (a, b) not in (select c, d from r where c > 10)"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the predicate c > 10 (which filters all the rows in the subquery), it covers up the correlated predicate problem. Instead of removing the test case completely, I just modify to have a different coverage for multiple columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we also should test an empty subquery :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case is effectively covering the case of empty subquery.

@nsyca
Copy link
Contributor Author

nsyca commented Jan 4, 2017

The new test file is placed in the directory structure similar to the one proposed in SPARK-18871 (#16337).

@SparkQA
Copy link

SparkQA commented Jan 4, 2017

Test build #70849 has finished for PR 16467 at commit de655d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

I left a comment on the JIRA. It feel it is important to reach consensus on the semantics first.

Other than that the PR looks good.

@nsyca
Copy link
Contributor Author

nsyca commented Jan 4, 2017

Thank you for taking time to review the PR. I agree we need to reach consensus on the semantics.

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. I have one question regarding the result of a test case. Otherwise only minor things.

// Empty sub-query
checkAnswer(
sql("select * from l where a not in (select c from r where c > 10 and b < d)"),
sql("select * from l where (a, b) not in (select c, d from r where c > 10)"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we also should test an empty subquery :)

// to
// (a1=a2 OR isnull(a1=a2)) AND (b1=b2 OR isnull(b1=b2)) AND ...
val joinConds = splitConjunctivePredicates(joinCond.get)
val isNulls = joinConds.map(IsNull)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor - We can write this more directly:

val pairs = joinConds.map(c => Or(c, IsNull(c))).reduceLeft(And)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I have changed the code.

struct<a1:int,b1:int>
-- !query 2 output
2 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is (null, 2) missing? There is no tuple in t2 for which b2=2.

@SparkQA
Copy link

SparkQA commented Jan 6, 2017

Test build #70948 has finished for PR 16467 at commit f3f7773.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nsyca
Copy link
Contributor Author

nsyca commented Jan 9, 2017

@hvanhovell, please let me know if you are waiting for my replies to your comments. I believe I have addressed all of them. thanks!

@SparkQA
Copy link

SparkQA commented Jan 13, 2017

Test build #71283 has finished for PR 16467 at commit 6a1a415.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nsyca
Copy link
Contributor Author

nsyca commented Jan 16, 2017

@hvanhovell Would there be anything left that I have not addressed in this PR?

@hvanhovell
Copy link
Contributor

LGTM - merging to master/2.1. Thanks!

Sorry for the long wait.

asfgit pushed a commit that referenced this pull request Jan 24, 2017
…rn incorrect results

## What changes were proposed in this pull request?

This PR fixes the code in Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in Rule `RewritePredicateSubquery`.

Example:
The query

 select a1,b1
 from   t1
 where  (a1,b1) not in (select a2,b2
                        from   t2);

has the (a1, b1) = (a2, b2) rewritten from (before this fix):

Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))

to (after this fix):

Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))

## How was this patch tested?

sql/test, catalyst/test and new test cases in SQLQueryTestSuite.

Author: Nattavut Sutyanyong <[email protected]>

Closes #16467 from nsyca/19017.

(cherry picked from commit cdb691e)
Signed-off-by: Herman van Hovell <[email protected]>
@asfgit asfgit closed this in cdb691e Jan 24, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…rn incorrect results

## What changes were proposed in this pull request?

This PR fixes the code in Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in Rule `RewritePredicateSubquery`.

Example:
The query

 select a1,b1
 from   t1
 where  (a1,b1) not in (select a2,b2
                        from   t2);

has the (a1, b1) = (a2, b2) rewritten from (before this fix):

Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))

to (after this fix):

Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))

## How was this patch tested?

sql/test, catalyst/test and new test cases in SQLQueryTestSuite.

Author: Nattavut Sutyanyong <[email protected]>

Closes apache#16467 from nsyca/19017.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…rn incorrect results

## What changes were proposed in this pull request?

This PR fixes the code in Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in Rule `RewritePredicateSubquery`.

Example:
The query

 select a1,b1
 from   t1
 where  (a1,b1) not in (select a2,b2
                        from   t2);

has the (a1, b1) = (a2, b2) rewritten from (before this fix):

Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))

to (after this fix):

Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))

## How was this patch tested?

sql/test, catalyst/test and new test cases in SQLQueryTestSuite.

Author: Nattavut Sutyanyong <[email protected]>

Closes apache#16467 from nsyca/19017.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 16, 2017
…rn incorrect results

## What changes were proposed in this pull request?

This PR fixes the code in Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in Rule `RewritePredicateSubquery`.

Example:
The query

 select a1,b1
 from   t1
 where  (a1,b1) not in (select a2,b2
                        from   t2);

has the (a1, b1) = (a2, b2) rewritten from (before this fix):

Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))

to (after this fix):

Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))

## How was this patch tested?

sql/test, catalyst/test and new test cases in SQLQueryTestSuite.

Author: Nattavut Sutyanyong <[email protected]>

Closes apache#16467 from nsyca/19017.
@nsyca nsyca deleted the 19017 branch March 14, 2017 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants