Skip to content

Conversation

@ayushdg
Copy link
Collaborator

@ayushdg ayushdg commented May 8, 2023

To build on top of some of the work in #1130, this pr adds the conjunctive_filters and disjunctive_filters arguments to attempt_predicate_pushdown.
This allow applying the provided arguments in addition to everything that can be inferred from the HLG, allowing us to expand the predicate pushdown coverage to cases that cannot be inferred from the HLG.

Ideally in the future we'd expand the table_scan filter logic to generate a dnf instead of a hlg to better express more kinds of complex predicates.

@ayushdg
Copy link
Collaborator Author

ayushdg commented May 8, 2023

@rjzamora Since you'd initially worked on some of the code here it would be great if you could review as well.
I might've missed some edge cases so happy to update if you find something.

@codecov-commenter
Copy link

codecov-commenter commented May 8, 2023

Codecov Report

Merging #1138 (ca9172a) into main (3859629) will increase coverage by 0.13%.
The diff coverage is 100.00%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##             main    #1138      +/-   ##
==========================================
+ Coverage   81.42%   81.55%   +0.13%     
==========================================
  Files          78       78              
  Lines        4392     4396       +4     
  Branches      796      798       +2     
==========================================
+ Hits         3576     3585       +9     
+ Misses        639      630       -9     
- Partials      177      181       +4     
Impacted Files Coverage Δ
dask_sql/physical/utils/filter.py 78.36% <100.00%> (+0.51%) ⬆️

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure these changes make sense @ayushdg - Thanks for the ping!

I left some thoughts about possible corner cases, but there is a decent chance that those corner cases are either imagined or unimportant.

assert PluginTest1().get_plugin("some_key") == "value_2"


def test_predicate_pushdown(parquet_ddf):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems unnecessarily intimidating to me :)

Perhaps this should be broken into several different tests?

Comment on lines +98 to +102
# Expand the filter set with provided conjunctive and disjunctive filters
filters.extend([f] for f in disjunctive_filters or [])
# Add conjunctive filters to each disjunctive filter
for f in filters:
f.extend(conjunctive_filters or [])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense, and I think I (mostly) like the approach. However, it does seem possible that you will eventually want to do things that are currently prohibited. Some examples:

  1. You want to make a disjunctive addition to your filters, and the new filter contains a conjunction that should not be added to any of the other (existing) filters. For example, you want to apply the existing filters, but you also want to allow null values for a specific user-id. (does this already work?)
  2. You want to add a conjunctive filter to a subset of the pre-existing disjunctive filters.
  3. You want to drop pre-existing filters.

Not sure if any of these cases are important or meaningful at all. Either way, I'm pretty sure you could support (1) and (3) later if needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. You're right that all three cases aren't really support with the current addition and it only helps for passing additional filters that get applied to everything.
Ideally I'd like to consolidate the conjunctive and disjunctive args into a single arg in dnf form which allows some flexibility with the additional filters but doesn't allow good interactivity with existing ones.

While we're unable to return a full dnf for a dataframe rather than having some existing filters coming via the hlg and some passed in additionally, I can't think of a nice way for both to interact cleanly without some good tracking around how we represent that information.

@ayushdg
Copy link
Collaborator Author

ayushdg commented Jun 21, 2023

Closing in favor of #1173

@ayushdg ayushdg closed this Jun 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants