Skip to content

Conversation

@rjzamora
Copy link
Contributor

Closes #607

This PR adds predicate-pushdown support for dd._Frame.isin(values), and uses a somewhat-ugly workaround to account for the fact that dask.dataframe will produce a distinct MaterializedLayer object for the values argument to isin.

Note that this PR also addresses "not in" support. However, dask.dataframe will produce the wrong result for these filters until dask/dask#10320 is merged.

@rjzamora rjzamora marked this pull request as draft May 26, 2023 21:02
@codecov-commenter
Copy link

codecov-commenter commented May 26, 2023

Codecov Report

Merging #1160 (5193718) into main (00aea43) will increase coverage by 0.22%.
The diff coverage is 83.78%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff             @@
##             main    #1160      +/-   ##
==========================================
+ Coverage   81.30%   81.53%   +0.22%     
==========================================
  Files          78       78              
  Lines        4414     4474      +60     
  Branches      799      817      +18     
==========================================
+ Hits         3589     3648      +59     
+ Misses        649      641       -8     
- Partials      176      185       +9     
Impacted Files Coverage Δ
dask_sql/physical/utils/filter.py 80.88% <83.33%> (+3.04%) ⬆️
dask_sql/_compat.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@rjzamora rjzamora marked this pull request as ready for review June 1, 2023 19:15
@rjzamora rjzamora changed the title [WIP] Support predicate-pushdown for isin operations Support predicate-pushdown for isin operations Jun 1, 2023
@rjzamora
Copy link
Contributor Author

rjzamora commented Jun 1, 2023

@ayushdg @randerzander - Note that I added isna support to this PR, but I'd be happy to split that change into a separate PR if you prefer.

Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes generally lgtm!
Doesn't need to be in this pr, but as a followup it might also make sense to add simpler unit tests that just use the attempt predicate pushdown method on dask dataframes and add them to https://github.com/dask-contrib/dask-sql/tree/main/tests/unit

@rjzamora
Copy link
Contributor Author

rjzamora commented Jun 6, 2023

@ayushdg - I'm thinking we should layer on something like #1140 as a follow up. Let me know what you think.

@ayushdg
Copy link
Collaborator

ayushdg commented Jun 6, 2023

Yeah it makes sense to add something that conceptually achieves what #1140 does without some of the extra logic/handling in that pr. We can address that in a followup and some of the logic from #1140 should be redundant with dask-cudf 22.06 and newer

Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for looking into this @rjzamora!

@ayushdg ayushdg merged commit 51cd0c8 into dask-contrib:main Jun 7, 2023
@rjzamora rjzamora deleted the isin-filtering branch June 7, 2023 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DF] [ENH] Add series.isin support for predicate pushdown

3 participants