Skip to content

PushDownFilter incorrectly optimises filter criteria with <=, >= #18214

@m09526

Description

@m09526

Describe the bug

As of DataFusion v49.0.0, the PushDownFilter optimizer rule attempts to simplify and optimize redundant filter conditions on a column. For example filtering on "a > 10 AND a > 20" can be simpified to "a > 20". This was introduced in #16362.

However, the case when two filters with the same bounds but different relational operators is handled incorrectly.

In the example below, the conditions a > 1 AND a < 10 AND a >=1 AND a <= 10 are reduced incorrectly to a >= 1 AND a <= 10. Converting the < and > to <= and >= incorrectly match more rows than they should.

Note: We found that changing the ordering of the filter lines affects whether the bug occurs. If the two filtering lines are swapped, the correct filter of a > 1 AND a < 10 is produced.

To Reproduce

With DataFusion V50.2.0, the following code:

use datafusion::{error::DataFusionError, prelude::*};

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let df = dataframe!["a" => [1,2,3,4], "b" => ["1","2","3","4"]]?;
    let df = df.filter(col("a").gt(lit(1)).and(col("a").lt(lit(10))))?;
    let df = df.filter(col("a").gt_eq(lit(1)).and(col("a").lt_eq(lit(10))))?;
    let t = df.explain(false, false)?;
    t.show().await?;
    Ok(())
}

produces the following incorrect output:

+---------------+----------------------------------------------------------+
| plan_type     | plan                                                     |
+---------------+----------------------------------------------------------+
| logical_plan  | Filter: ?table?.a >= Int32(1) AND ?table?.a <= Int32(10) |
|               |   TableScan: ?table? projection=[a, b]                   |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192              |
|               |   FilterExec: a@0 >= 1 AND a@0 <= 10                     |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]    |
|               |                                                          |
+---------------+----------------------------------------------------------+

Expected behavior

The correct filter conditions would be:

+---------------+----------------------------------------------------------+
| plan_type     | plan                                                     |
+---------------+----------------------------------------------------------+
| logical_plan  | Filter: ?table?.a > Int32(1) AND ?table?.a < Int32(10) |
|               |   TableScan: ?table? projection=[a, b]                   |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192              |
|               |   FilterExec: a@0 > 1 AND a@0 < 10                     |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]    |
|               |                                                          |
+---------------+----------------------------------------------------------+

Additional context

In DataFusion V48 and below, we see the following output:

+---------------+-------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                        |
+---------------+-------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: ?table?.a >= Int32(1) AND ?table?.a <= Int32(10) AND ?table?.a > Int32(1) AND ?table?.a < Int32(10) |
|               |   TableScan: ?table? projection=[a, b]                                                                      |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                 |
|               |   FilterExec: a@0 >= 1 AND a@0 <= 10 AND a@0 > 1 AND a@0 < 10                                               |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]                                                       |
|               |                                                                                                             |
+---------------+-------------------------------------------------------------------------------------------------------------+

The filtering conditions are correct, if redundant.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions