[QST] Remove SQL compatibility null handling from join code?

**What is your question?**

_TLDR - there's some costly null handing in our join code to maintain SQL compatibility - should we remove this like we did with groupby null handling?_

In #290, we had some discussion around the extent to which Dask-SQL should perform costly workarounds to maintain SQL compatibility - in that case, it was null-splitting operations happening for groupbys to maintain a specific null ordering, and we opted to remove this in general cases in #273.

Following up on that, another area where I notice we are doing costly workarounds to maintain SQL null handling is our `_join_on_columns` utility - in particular, we run an `isna` on all the join-by columns, picking out all null values so they won't return in the result:

https://github.com/dask-contrib/dask-sql/blob/5beaf35c7bdebe6626d3cf8d7217a00672b0bdbe/dask_sql/physical/rel/logical/join.py#L215-L226

Removing this check reduces our HLG for a standard join by a decent amount:

```python
import cudf
import dask.dataframe as dd
from dask_sql import Context

c = Context()

df1 = cudf.DataFrame({"a": [1, 2, 3, 4, 5], "b": [6, None, 7, None, 8]})
df2 = cudf.DataFrame({"c": [6, 7, None, None, None], "d": [11, 12, 13, 14, 15]})

c.create_table("lhs", df1)
c.create_table("rhs", df2)

c.sql("SELECT * FROM lhs JOIN rhs ON b = c")
```

With null handling:

![Image of Dask task graph laying out a join operation in Dask-SQL - this graph includes several null checks before performing the join operation](https://user-images.githubusercontent.com/20627856/157893480-b3ccb76f-2f96-4eb2-8efa-a87e7c08f4e1.png)

And without:

![Similar Dask task graph to the one above, with the null checks missing](https://user-images.githubusercontent.com/20627856/157893654-5fe160e3-ff59-406d-98ef-36e510406f78.png)

And similar to GROUP BY statements, there is a way for users to achieve a SQL compatible result, by performing a follow up WHERE filter after the initial JOIN to remove the null rows.

I'm interested in thoughts on removing this code, and modifying our compatibility tests to use the user workaround.

	if join_type in ["inner", "right"]:
	df_lhs_filter = reduce(
	operator.and_,
	[~df_lhs_renamed.iloc[:, index].isna() for index in lhs_on],
	)
	df_lhs_renamed = df_lhs_renamed[df_lhs_filter]
	if join_type in ["inner", "left"]:
	df_rhs_filter = reduce(
	operator.and_,
	[~df_rhs_renamed.iloc[:, index].isna() for index in rhs_on],
	)
	df_rhs_renamed = df_rhs_renamed[df_rhs_filter]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[QST] Remove SQL compatibility null handling from join code? #424

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[QST] Remove SQL compatibility null handling from join code? #424

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions