Skip to content

[QST] Remove SQL compatibility null handling from join code? #424

@charlesbluca

Description

@charlesbluca

What is your question?

TLDR - there's some costly null handing in our join code to maintain SQL compatibility - should we remove this like we did with groupby null handling?

In #290, we had some discussion around the extent to which Dask-SQL should perform costly workarounds to maintain SQL compatibility - in that case, it was null-splitting operations happening for groupbys to maintain a specific null ordering, and we opted to remove this in general cases in #273.

Following up on that, another area where I notice we are doing costly workarounds to maintain SQL null handling is our _join_on_columns utility - in particular, we run an isna on all the join-by columns, picking out all null values so they won't return in the result:

if join_type in ["inner", "right"]:
df_lhs_filter = reduce(
operator.and_,
[~df_lhs_renamed.iloc[:, index].isna() for index in lhs_on],
)
df_lhs_renamed = df_lhs_renamed[df_lhs_filter]
if join_type in ["inner", "left"]:
df_rhs_filter = reduce(
operator.and_,
[~df_rhs_renamed.iloc[:, index].isna() for index in rhs_on],
)
df_rhs_renamed = df_rhs_renamed[df_rhs_filter]

Removing this check reduces our HLG for a standard join by a decent amount:

import cudf
import dask.dataframe as dd
from dask_sql import Context

c = Context()

df1 = cudf.DataFrame({"a": [1, 2, 3, 4, 5], "b": [6, None, 7, None, 8]})
df2 = cudf.DataFrame({"c": [6, 7, None, None, None], "d": [11, 12, 13, 14, 15]})

c.create_table("lhs", df1)
c.create_table("rhs", df2)

c.sql("SELECT * FROM lhs JOIN rhs ON b = c")

With null handling:

Image of Dask task graph laying out a join operation in Dask-SQL - this graph includes several null checks before performing the join operation

And without:

Similar Dask task graph to the one above, with the null checks missing

And similar to GROUP BY statements, there is a way for users to achieve a SQL compatible result, by performing a follow up WHERE filter after the initial JOIN to remove the null rows.

I'm interested in thoughts on removing this code, and modifying our compatibility tests to use the user workaround.

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionFurther information is requested

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions