[SPARK-45580][SQL] Handle case where a nested subquery becomes an existence join #44193

bersprockets · 2023-12-05T23:34:52Z

What changes were proposed in this pull request?

In RewritePredicateSubquery, prune existence flags from the final join when rewriteExistentialExpr returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a Project node.

For example:

Join LeftSemi, ((a#13 = c1#15) OR exists#19)
:- Join ExistenceJoin(exists#19), (a#13 = col1#17)
:  :- LocalRelation [a#13]
:  +- LocalRelation [col1#17]
+- LocalRelation [c1#15]

becomes

Project [a#13]
+- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]

This change always adds the Project node, whether or not rewriteExistentialExpr returns an existence join. When it does not, RemoveNoopOperators removes the unneeded Project node.
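The effect of the extra Project can be sketched with a small toy model. This is only an illustration: the classes below are hypothetical stand-ins written for this example, not Spark's Catalyst classes, and `pruneFlags` and `removeNoop` are made-up helper names that mimic the behavior described above.

```scala
// Toy model of the plan shapes above. These are NOT Spark's Catalyst classes.
sealed trait JoinType
case object LeftSemi extends JoinType
final case class ExistenceJoin(exists: String) extends JoinType

sealed trait Plan { def output: Seq[String] }
final case class LocalRelation(output: Seq[String]) extends Plan
final case class Join(left: Plan, right: Plan, joinType: JoinType) extends Plan {
  // A semi join emits only its left side's columns; an existence join appends its flag.
  def output: Seq[String] = joinType match {
    case LeftSemi          => left.output
    case ExistenceJoin(ex) => left.output :+ ex
  }
}
final case class Project(projectList: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = projectList
}

object PruneExistsDemo {
  // Hypothetical helper: always wrap the rewritten join in a Project over the original output.
  def pruneFlags(originalOutput: Seq[String], rewritten: Plan): Plan =
    Project(originalOutput, rewritten)

  // Hypothetical stand-in for the no-op cleanup: drop a Project that changes nothing.
  def removeNoop(plan: Plan): Plan = plan match {
    case Project(list, child) if list == child.output => child
    case other                                        => other
  }

  val t1 = LocalRelation(Seq("a#13"))
  val t2 = LocalRelation(Seq("c1#15"))
  val t3 = LocalRelation(Seq("col1#17"))

  // The nested IN subquery became an existence join, so the flag leaks through the semi join.
  val leaky: Plan = Join(Join(t1, t3, ExistenceJoin("exists#19")), t2, LeftSemi)
  val fixed: Plan = pruneFlags(t1.output, leaky)

  // Without an existence join the Project is redundant and gets removed again.
  val plain: Plan = Join(t1, t2, LeftSemi)
  val cleaned: Plan = removeNoop(pruneFlags(t1.output, plain))
}
```

In this model `leaky.output` contains both `a#13` and `exists#19`, while `fixed.output` contains only `a#13`. For the plain semi join, the added Project matches its child's output, so the cleanup step returns the join unchanged.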

Why are the changes needed?

This query returns an extraneous boolean column when run in spark-sql:

create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);

select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1	false
2	false
3	true

(Note: the above query does not show the extraneous boolean column when run from the Dataset API, because the Dataset API truncates rows to the schema of the analyzed plan. The bug occurs during optimization.)

This query fails when run in either spark-sql or using the Dataset API:

select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
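Both symptoms can be modeled with plain Scala collections. This is a rough sketch under stated assumptions: rows are represented as `Seq[Any]`, and `datasetView`, `cliView`, and `scalarValue` are hypothetical names invented here to mirror the behavior described above. None of this is Spark code.

```scala
object ExtraColumnSymptoms {
  // Rows as the buggy optimized plan would produce them: the value of `a` plus the leaked flag.
  val optimizedRows: Seq[Seq[Any]] = Seq(Seq(1, false), Seq(2, false), Seq(3, true))

  // The analyzed plan promises a single column.
  val analyzedSchema: Seq[String] = Seq("a")

  // Models the note above: a consumer that keeps only the fields in the analyzed schema.
  def datasetView(rows: Seq[Seq[Any]]): Seq[Seq[Any]] =
    rows.map(_.take(analyzedSchema.length))

  // Models a consumer that prints every field it receives, tab-separated.
  def cliView(rows: Seq[Seq[Any]]): Seq[String] =
    rows.map(_.mkString("\t"))

  // Models a scalar-subquery consumer that insists on exactly one field per row.
  def scalarValue(rows: Seq[Seq[Any]]): Any = {
    val first = rows.head
    assert(first.length == 1, s"Expects 1 field, but got ${first.length}")
    first.head
  }
}
```

Here `cliView(optimizedRows)` shows the extra flag, `datasetView(optimizedRows)` hides it, and `scalarValue(optimizedRows)` fails its assertion because the first row carries two fields.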

Does this PR introduce any user-facing change?

No, except for the removal of the extraneous boolean flag and the fix to the error condition.

How was this patch tested?

New unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

bersprockets · 2023-12-06T01:54:52Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+      Seq((1), (2), (3)).toDF("c1").persist().createOrReplaceTempView("t2")
+      Seq((3), (9)).toDF("col1").persist().createOrReplaceTempView("t3")
+
+      val query1 =


I mentioned in the description that this particular query will not show the extraneous boolean column when executed via the Dataset API. However, when Utils.isTesting is true, the RuleExecutor will notice that the rule has changed the query's schema and throw an exception.
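As a rough illustration of why the test catches this, here is a minimal sketch of a rule runner that rejects output changes only when a testing flag is set. `ToyPlan`, `applyRule`, and `leakyRewrite` are hypothetical names made up for this example. Spark's actual RuleExecutor check is more involved and is not reproduced here.

```scala
import scala.util.Try

object SchemaCheckDemo {
  // Hypothetical stand-in for a logical plan: only its output column names matter here.
  final case class ToyPlan(output: Seq[String])

  // Apply one rule. When `isTesting` is set, reject any rule that changes the plan's output.
  def applyRule(
      ruleName: String,
      rule: ToyPlan => ToyPlan,
      plan: ToyPlan,
      isTesting: Boolean): ToyPlan = {
    val result = rule(plan)
    if (isTesting && result.output != plan.output) {
      throw new IllegalStateException(
        s"Rule $ruleName changed the output from ${plan.output} to ${result.output}")
    }
    result
  }

  // A rule with the bug described above: it leaks the existence flag into the output.
  val leakyRewrite: ToyPlan => ToyPlan = p => p.copy(output = p.output :+ "exists")

  val start = ToyPlan(Seq("a"))

  // Outside of tests the changed output passes through silently.
  val inProduction: ToyPlan = applyRule("leakyRewrite", leakyRewrite, start, isTesting = false)

  // Under test the same rule application is rejected.
  val inTests: Try[ToyPlan] = Try(applyRule("leakyRewrite", leakyRewrite, start, isTesting = true))
}
```

With the flag off, the leaked column goes unnoticed. With it on, the same rule application throws, which is what lets a unit test exercise the bug even through the Dataset API.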

Thank you, @bersprockets!

dongjoon-hyun

+1, LGTM.

dongjoon-hyun · 2023-12-06T18:56:23Z

Merged to master. The code changes conflict with the older branches.

Could you please open back-port PRs for all live release branches, @bersprockets?

branch-3.5
branch-3.4
branch-3.3

BTW, I raised SPARK-45580 as a blocker for Apache Spark 3.3.4 (which is targeting next Monday).

bersprockets · 2023-12-06T20:11:49Z

@dongjoon-hyun working on the back-ports.

dongjoon-hyun · 2023-12-06T20:12:30Z

Thank you!

Closes apache#44193 from bersprockets/schema_change.

Authored-by: Bruce Robbins
Signed-off-by: Dongjoon Hyun

Closes #44215 from bersprockets/schema_change_br35.

Closes #44219 from bersprockets/schema_change_br34.

Closes #44223 from bersprockets/schema_change_br33.

bersprockets added 2 commits December 5, 2023 13:22

Tests and Fix

87a1ff5

Update test name

098fb1e

github-actions bot added the SQL label Dec 5, 2023

Empty commit for new SHA

2222ed2

bersprockets commented Dec 6, 2023

dongjoon-hyun approved these changes Dec 6, 2023

dongjoon-hyun closed this in c96fef2 Dec 6, 2023

bersprockets mentioned this pull request Dec 6, 2023

[SPARK-45580][SQL][3.5] Handle case where a nested subquery becomes an existence join #44215

Closed

bersprockets mentioned this pull request Dec 6, 2023

[SPARK-45580][SQL][3.4] Handle case where a nested subquery becomes an existence join #44219

Closed

bersprockets mentioned this pull request Dec 7, 2023

[SPARK-45580][SQL][3.3] Handle case where a nested subquery becomes an existence join #44223

Closed

bersprockets deleted the schema_change branch December 17, 2023 15:44
