-
Notifications
You must be signed in to change notification settings - Fork 72
Description
What is your question?
TLDR - there's some costly null handing in our join code to maintain SQL compatibility - should we remove this like we did with groupby null handling?
In #290, we had some discussion around the extent to which Dask-SQL should perform costly workarounds to maintain SQL compatibility - in that case, it was null-splitting operations happening for groupbys to maintain a specific null ordering, and we opted to remove this in general cases in #273.
Following up on that, another area where I notice we are doing costly workarounds to maintain SQL null handling is our _join_on_columns utility - in particular, we run an isna on all the join-by columns, picking out all null values so they won't return in the result:
dask-sql/dask_sql/physical/rel/logical/join.py
Lines 215 to 226 in 5beaf35
| if join_type in ["inner", "right"]: | |
| df_lhs_filter = reduce( | |
| operator.and_, | |
| [~df_lhs_renamed.iloc[:, index].isna() for index in lhs_on], | |
| ) | |
| df_lhs_renamed = df_lhs_renamed[df_lhs_filter] | |
| if join_type in ["inner", "left"]: | |
| df_rhs_filter = reduce( | |
| operator.and_, | |
| [~df_rhs_renamed.iloc[:, index].isna() for index in rhs_on], | |
| ) | |
| df_rhs_renamed = df_rhs_renamed[df_rhs_filter] |
Removing this check reduces our HLG for a standard join by a decent amount:
import cudf
import dask.dataframe as dd
from dask_sql import Context
c = Context()
df1 = cudf.DataFrame({"a": [1, 2, 3, 4, 5], "b": [6, None, 7, None, 8]})
df2 = cudf.DataFrame({"c": [6, 7, None, None, None], "d": [11, 12, 13, 14, 15]})
c.create_table("lhs", df1)
c.create_table("rhs", df2)
c.sql("SELECT * FROM lhs JOIN rhs ON b = c")With null handling:
And without:
And similar to GROUP BY statements, there is a way for users to achieve a SQL compatible result, by performing a follow up WHERE filter after the initial JOIN to remove the null rows.
I'm interested in thoughts on removing this code, and modifying our compatibility tests to use the user workaround.

