-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge?
SELECT .. WHERE NULL IN (1,2,3) and SELECT ... WHERE x in (NULL, 2, 3) are always NULL (and thus will filter out all rows). However, DataFusion will still try and evaluate a predicate:
DataFusion CLI v34.0.0
❯ create table t(x int) as values (1), (2), (3);
0 rows in set. Query took 0.003 seconds.Note there is a FilterExec with a non trivial expression in both of the following queries:
❯ explain select x from t where x IN (null, 2, 3);
+---------------+---------------------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------------------+
| logical_plan | Filter: t.x = Int32(NULL) OR t.x = Int32(2) OR t.x = Int32(3) |
| | TableScan: t projection=[x] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: x@0 = NULL OR x@0 = 2 OR x@0 = 3 |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+---------------------------------------------------------------+
2 rows in set. Query took 0.002 seconds.
❯ explain select x from t where null IN (x, 2, 3);
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Filter: Int64(NULL) IN ([CAST(t.x AS Int64), Int64(2), Int64(3)]) |
| | TableScan: t projection=[x] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: NULL IN ([CastExpr { expr: Column { name: "x", index: 0 }, cast_type: Int64, cast_options: CastOptions { safe: false, format_options: FormatOptions { safe: true, null: "", date_format: None, datetime_format: None, timestamp_format: None, timestamp_tz_format: None, time_format: None, duration_format: Pretty } } }, Literal { value: Int64(2) }, Literal { value: Int64(3) }]) |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.In both cases the predicate could have been reduced to a single NULL
Describe the solution you'd like
I would like to extend the ExprSimplifier rules to handle the case of NULL IN (...) and when the InList contains NULL
Here are some similar rules
https://github.com/apache/arrow-datafusion/blob/cc3042a6343457036770267f921bb3b6e726956c/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L474-L549
Describe alternatives you've considered
No response
Additional context
These types of expressions are sometimes generated programmatically during rewrites in IOx.
It also came up with discussions with @yahoNanJing on #8669
I think this would be a good first issue as the patterns exist already and the need is well defined.