Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Jan 22, 2021

What changes were proposed in this pull request?

This is a long-standing bug that exists since we have the ambiguous self-join check. A column reference is not ambiguous if it can only come from one join side (e.g. the other side has a project to only pick a few columns). An example is

Join(b#1 = 3)
  TableScan(t, [a#0, b#1])
  Project(a#2)
    TableScan(t, [a#2, b#3])

It's a self-join, but b#1 is not ambiguous because it can't come from the right side, which only has column a.

Why are the changes needed?

to not fail valid self-join queries.

Does this PR introduce any user-facing change?

yea as a bug fix

How was this patch tested?

a new test

@github-actions github-actions bot added the SQL label Jan 22, 2021
@cloud-fan
Copy link
Contributor Author

cc @Ngone51 @viirya @maropu

Copy link
Member

@Ngone51 Ngone51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@viirya
Copy link
Member

viirya commented Jan 22, 2021

BTW, the Join example in the PR description is not aligned properly.

@SparkQA
Copy link

SparkQA commented Jan 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38939/

@cloud-fan
Copy link
Contributor Author

Ah actually this fixes a 3.1 regression. Previously we merged a bug fix: #30488 , which makes sure ambiguous self-join check is always applied. That said, in 3.0 the ambiguous self-join check is skipped under some cases, which hides the bug this PR is fixing.

For the query below, it works in 3.0, but fails in 3.1. After this PR, it works again.

sql("create table t1 using json as select 1 key, 1 value")
sql("create table t2 using json as select 1 key, 2 value")
val t1 = spark.table("t1")
val t2 = spark.table("t2")
val t3 = t1.join(t2, t1("key") === t2("key")).select(t1("value"))
t1.join(t3, t1("key") > 1)

cc @HyukjinKwon @dongjoon-hyun I think it's a 3.1.1 blocker.

@SparkQA
Copy link

SparkQA commented Jan 22, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38939/

@SparkQA
Copy link

SparkQA commented Jan 22, 2021

Test build #134353 has finished for PR 31287 at commit 2c34ade.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon pushed a commit that referenced this pull request Jan 22, 2021
…te availability

### What changes were proposed in this pull request?

This is a long-standing bug that exists since we have the ambiguous self-join check. A column reference is not ambiguous if it can only come from one join side (e.g. the other side has a project to only pick a few columns). An example is
```
Join(b#1 = 3)
  TableScan(t, [a#0, b#1])
  Project(a#2)
    TableScan(t, [a#2, b#3])
```
It's a self-join, but `b#1` is not ambiguous because it can't come from the right side, which only has column `a`.

### Why are the changes needed?

to not fail valid self-join queries.

### Does this PR introduce _any_ user-facing change?

yea as a bug fix

### How was this patch tested?

a new test

Closes #31287 from cloud-fan/self-join.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit b8a6906)
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jan 22, 2021
…te availability

This is a long-standing bug that exists since we have the ambiguous self-join check. A column reference is not ambiguous if it can only come from one join side (e.g. the other side has a project to only pick a few columns). An example is
```
Join(b#1 = 3)
  TableScan(t, [a#0, b#1])
  Project(a#2)
    TableScan(t, [a#2, b#3])
```
It's a self-join, but `b#1` is not ambiguous because it can't come from the right side, which only has column `a`.

to not fail valid self-join queries.

yea as a bug fix

a new test

Closes #31287 from cloud-fan/self-join.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit b8a6906)
Signed-off-by: HyukjinKwon <[email protected]>
@HyukjinKwon
Copy link
Member

Merged to master, branch-3.1 and branch-3.0.

skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
…te availability

### What changes were proposed in this pull request?

This is a long-standing bug that exists since we have the ambiguous self-join check. A column reference is not ambiguous if it can only come from one join side (e.g. the other side has a project to only pick a few columns). An example is
```
Join(b#1 = 3)
  TableScan(t, [a#0, b#1])
  Project(a#2)
    TableScan(t, [a#2, b#3])
```
It's a self-join, but `b#1` is not ambiguous because it can't come from the right side, which only has column `a`.

### Why are the changes needed?

to not fail valid self-join queries.

### Does this PR introduce _any_ user-facing change?

yea as a bug fix

### How was this patch tested?

a new test

Closes apache#31287 from cloud-fan/self-join.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants