-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Describe the bug
As of DataFusion v49.0.0, the PushDownFilter optimizer rule attempts to simplify and optimize redundant filter conditions on a column. For example filtering on "a > 10 AND a > 20" can be simpified to "a > 20". This was introduced in #16362.
However, the case when two filters with the same bounds but different relational operators is handled incorrectly.
In the example below, the conditions a > 1 AND a < 10 AND a >=1 AND a <= 10 are reduced incorrectly to a >= 1 AND a <= 10. Converting the < and > to <= and >= incorrectly match more rows than they should.
Note: We found that changing the ordering of the filter lines affects whether the bug occurs. If the two filtering lines are swapped, the correct filter of a > 1 AND a < 10 is produced.
To Reproduce
With DataFusion V50.2.0, the following code:
use datafusion::{error::DataFusionError, prelude::*};
#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
let df = dataframe!["a" => [1,2,3,4], "b" => ["1","2","3","4"]]?;
let df = df.filter(col("a").gt(lit(1)).and(col("a").lt(lit(10))))?;
let df = df.filter(col("a").gt_eq(lit(1)).and(col("a").lt_eq(lit(10))))?;
let t = df.explain(false, false)?;
t.show().await?;
Ok(())
}
produces the following incorrect output:
+---------------+----------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------+
| logical_plan | Filter: ?table?.a >= Int32(1) AND ?table?.a <= Int32(10) |
| | TableScan: ?table? projection=[a, b] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: a@0 >= 1 AND a@0 <= 10 |
| | DataSourceExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+----------------------------------------------------------+
Expected behavior
The correct filter conditions would be:
+---------------+----------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------+
| logical_plan | Filter: ?table?.a > Int32(1) AND ?table?.a < Int32(10) |
| | TableScan: ?table? projection=[a, b] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: a@0 > 1 AND a@0 < 10 |
| | DataSourceExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+----------------------------------------------------------+
Additional context
In DataFusion V48 and below, we see the following output:
+---------------+-------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-------------------------------------------------------------------------------------------------------------+
| logical_plan | Filter: ?table?.a >= Int32(1) AND ?table?.a <= Int32(10) AND ?table?.a > Int32(1) AND ?table?.a < Int32(10) |
| | TableScan: ?table? projection=[a, b] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: a@0 >= 1 AND a@0 <= 10 AND a@0 > 1 AND a@0 < 10 |
| | DataSourceExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+-------------------------------------------------------------------------------------------------------------+
The filtering conditions are correct, if redundant.