Skip to content

Conversation

@bersprockets
Copy link
Contributor

What changes were proposed in this pull request?

This is a back-port of #44193.

In RewritePredicateSubquery, prune existence flags from the final join when rewriteExistentialExpr returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a Project node.

For example:

Join LeftSemi, ((a#13 = c1#15) OR exists#19)
:- Join ExistenceJoin(exists#19), (a#13 = col1#17)
:  :- LocalRelation [a#13]
:  +- LocalRelation [col1#17]
+- LocalRelation [c1#15]

becomes

Project [a#13]
+- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]

This change always adds the Project node, whether rewriteExistentialExpr returns an existence join or not. In the case when rewriteExistentialExpr does not return an existence join, RemoveNoopOperators will remove the unneeded Project node.

Why are the changes needed?

This query returns an extraneous boolean column when run in spark-sql:

create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);

select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1	false
2	false
3	true

(Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization).

This query fails when run in either spark-sql or using the Dataset API:

select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis

Does this PR introduce any user-facing change?

No, except for the removal of the extraneous boolean flag and the fix to the error condition.

How was this patch tested?

New unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

…stence join

In `RewritePredicateSubquery`, prune existence flags from the final join when `rewriteExistentialExpr` returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a `Project` node.

For example:
```
Join LeftSemi, ((a#13 = c1#15) OR exists#19)
:- Join ExistenceJoin(exists#19), (a#13 = col1#17)
:  :- LocalRelation [a#13]
:  +- LocalRelation [col1#17]
+- LocalRelation [c1#15]
```
becomes
```
Project [a#13]
+- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]
```
This change always adds the `Project` node, whether `rewriteExistentialExpr` returns an existence join or not. In the case when `rewriteExistentialExpr` does not return an existence join, `RemoveNoopOperators` will remove the unneeded `Project` node.

This query returns an extraneous boolean column when run in spark-sql:
```
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);

select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1	false
2	false
3	true
```
(Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization).

This query fails when run in either spark-sql or using the Dataset API:
```
select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
```

No, except for the removal of the extraneous boolean flag and the fix to the error condition.

New unit test.

No.

Closes apache#44193 from bersprockets/schema_change.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@github-actions github-actions bot added the SQL label Dec 6, 2023
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (Pending CIs)

dongjoon-hyun pushed a commit that referenced this pull request Dec 6, 2023
…n existence join

### What changes were proposed in this pull request?

This is a back-port of #44193.

In `RewritePredicateSubquery`, prune existence flags from the final join when `rewriteExistentialExpr` returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a `Project` node.

For example:
```
Join LeftSemi, ((a#13 = c1#15) OR exists#19)
:- Join ExistenceJoin(exists#19), (a#13 = col1#17)
:  :- LocalRelation [a#13]
:  +- LocalRelation [col1#17]
+- LocalRelation [c1#15]
```
becomes
```
Project [a#13]
+- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]
```
This change always adds the `Project` node, whether `rewriteExistentialExpr` returns an existence join or not. In the case when `rewriteExistentialExpr` does not return an existence join, `RemoveNoopOperators` will remove the unneeded `Project` node.

### Why are the changes needed?

This query returns an extraneous boolean column when run in spark-sql:
```
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);

select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1	false
2	false
3	true
```
(Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization).

This query fails when run in either spark-sql or using the Dataset API:
```
select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
```

### Does this PR introduce _any_ user-facing change?

No, except for the removal of the extraneous boolean flag and the fix to the error condition.

### How was this patch tested?

New unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44215 from bersprockets/schema_change_br35.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

Merged to branch-3.5 for Apache Spark 3.5.1.

@dongjoon-hyun
Copy link
Member

Could you make branch-3.4 PR and branch-3.3 PR, @bersprockets ?

bersprockets added a commit to bersprockets/spark that referenced this pull request Dec 6, 2023
…n existence join

This is a back-port of apache#44193.

In `RewritePredicateSubquery`, prune existence flags from the final join when `rewriteExistentialExpr` returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a `Project` node.

For example:
```
Join LeftSemi, ((a#13 = c1#15) OR exists#19)
:- Join ExistenceJoin(exists#19), (a#13 = col1#17)
:  :- LocalRelation [a#13]
:  +- LocalRelation [col1#17]
+- LocalRelation [c1#15]
```
becomes
```
Project [a#13]
+- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]
```
This change always adds the `Project` node, whether `rewriteExistentialExpr` returns an existence join or not. In the case when `rewriteExistentialExpr` does not return an existence join, `RemoveNoopOperators` will remove the unneeded `Project` node.

This query returns an extraneous boolean column when run in spark-sql:
```
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);

select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1	false
2	false
3	true
```
(Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization).

This query fails when run in either spark-sql or using the Dataset API:
```
select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
```

No, except for the removal of the extraneous boolean flag and the fix to the error condition.

New unit test.

No.

Closes apache#44215 from bersprockets/schema_change_br35.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
bersprockets added a commit to bersprockets/spark that referenced this pull request Dec 7, 2023
…n existence join

This is a back-port of apache#44193.

In `RewritePredicateSubquery`, prune existence flags from the final join when `rewriteExistentialExpr` returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a `Project` node.

For example:
```
Join LeftSemi, ((a#13 = c1#15) OR exists#19)
:- Join ExistenceJoin(exists#19), (a#13 = col1#17)
:  :- LocalRelation [a#13]
:  +- LocalRelation [col1#17]
+- LocalRelation [c1#15]
```
becomes
```
Project [a#13]
+- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]
```
This change always adds the `Project` node, whether `rewriteExistentialExpr` returns an existence join or not. In the case when `rewriteExistentialExpr` does not return an existence join, `RemoveNoopOperators` will remove the unneeded `Project` node.

This query returns an extraneous boolean column when run in spark-sql:
```
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);

select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1	false
2	false
3	true
```
(Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization).

This query fails when run in either spark-sql or using the Dataset API:
```
select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
```

No, except for the removal of the extraneous boolean flag and the fix to the error condition.

New unit test.

No.

Closes apache#44215 from bersprockets/schema_change_br35.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@bersprockets bersprockets deleted the schema_change_br35 branch December 17, 2023 15:45
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…n existence join (apache#362)

### What changes were proposed in this pull request?

This is a back-port of apache#44193.

In `RewritePredicateSubquery`, prune existence flags from the final join when `rewriteExistentialExpr` returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a `Project` node.

For example:
```
Join LeftSemi, ((a#13 = c1#15) OR exists#19)
:- Join ExistenceJoin(exists#19), (a#13 = col1#17)
:  :- LocalRelation [a#13]
:  +- LocalRelation [col1#17]
+- LocalRelation [c1#15]
```
becomes
```
Project [a#13]
+- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]
```
This change always adds the `Project` node, whether `rewriteExistentialExpr` returns an existence join or not. In the case when `rewriteExistentialExpr` does not return an existence join, `RemoveNoopOperators` will remove the unneeded `Project` node.

### Why are the changes needed?

This query returns an extraneous boolean column when run in spark-sql:
```
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);

select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1	false
2	false
3	true
```
(Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization).

This query fails when run in either spark-sql or using the Dataset API:
```
select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
```

### Does this PR introduce _any_ user-facing change?

No, except for the removal of the extraneous boolean flag and the fix to the error condition.

### How was this patch tested?

New unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#44215 from bersprockets/schema_change_br35.

Authored-by: Bruce Robbins <[email protected]>

Signed-off-by: Dongjoon Hyun <[email protected]>
Co-authored-by: Bruce Robbins <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants