Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 12 additions & 8 deletions datafusion/expr/src/type_coercion/binary.rs
Original file line number Diff line number Diff line change
Expand Up @@ -638,22 +638,26 @@ fn dictionary_coercion(
preserve_dictionaries: bool,
) -> Option<DataType> {
use arrow::datatypes::DataType::*;
let maybe_preserve_dictionaries =
|index_type: &DataType, value_type: DataType| -> DataType {
if preserve_dictionaries {
Dictionary(Box::new(index_type.clone()), Box::new(value_type))
} else {
value_type
}
};
match (lhs_type, rhs_type) {
(
Dictionary(_lhs_index_type, lhs_value_type),
Dictionary(_rhs_index_type, rhs_value_type),
) => comparison_coercion(lhs_value_type, rhs_value_type),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any particular reason why this case doesn't also preserve the dictionary type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I know of

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably because value type is enough for comparison

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should not preserve dict for comparison overall 🤔 ?

Copy link
Contributor

@jayzhan211 jayzhan211 Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we need to solve the issue is avoiding casting from value to dict for column, because casting for column is costly compare with casting constant.

Given the example, if we preserve dict, we still ends up casting column (utf8) to Dict (i32,utf8), but in this case, we can cast the const from i64 to utf8 and it is enough.

statement ok
create table test as values
  ('row1', arrow_cast(1, 'Utf8')),
  ('row2', arrow_cast(2, 'Utf8')),
  ('row3', arrow_cast(3, 'Utf8'))
;

# query using an string '1' which must be coerced into a dictionary string
query TT
SELECT * from test where column2 =  arrow_cast(2, 'Dictionary(Int32, Int64)');
----
row2 2

query TT
explain SELECT * from test where column2 = arrow_cast(2, 'Dictionary(Int32, Int64)');
----
logical_plan
01)Filter: CAST(test.column2 AS Dictionary(Int32, Utf8)) = Dictionary(Int32, Utf8("2"))
02)--TableScan: test projection=[column1, column2]
physical_plan
01)CoalesceBatchesExec: target_batch_size=8192
02)--FilterExec: CAST(column2@1 AS Dictionary(Int32, Utf8)) = 2
03)----MemoryExec: partitions=1, partition_sizes=[1]

statement ok
drop table test;

expect plan

01)Filter: test.column2 = Utf8("2")
02)--TableScan: test projection=[column1, column2]
physical_plan
01)CoalesceBatchesExec: target_batch_size=8192
02)--FilterExec: column2@1 = 2
03)----MemoryExec: partitions=1, partition_sizes=[1]

I think we should not preserve dict, but need specialization on comparing dict vs non-dict case.

Copy link
Contributor

@alamb alamb Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

02)--FilterExec: CAST(column2@1 AS Dictionary(Int32, Utf8)) = 2

Yes I agree that looks bad. It should be unwrapped.

Thank you for that example 💯

Maybe we could extend

match (left.as_mut(), right.as_mut()) {
which handles CAST(col, ..) = const for other datatypes 🤔

I can try to do so later this weekend. Or would you like to try it @erratic-pattern ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I tried the example from @jayzhan211 on main and it doesn't put a cast on the filter -- thus I agree this PR would be a regression if merged as is. I will dismiss my review

DataFusion CLI v37.1.0
> create table test as values
  ('row1', arrow_cast(1, 'Utf8')),
  ('row2', arrow_cast(2, 'Utf8')),
  ('row3', arrow_cast(3, 'Utf8'))
;
0 row(s) fetched.
Elapsed 0.050 seconds.

> explain SELECT * from test where column2 = arrow_cast(2, 'Dictionary(Int32, Int64)');

+---------------+---------------------------------------------------+
| plan_type     | plan                                              |
+---------------+---------------------------------------------------+
| logical_plan  | Filter: test.column2 = Utf8("2")                  |
|               |   TableScan: test projection=[column1, column2]   |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192       |
|               |   FilterExec: column2@1 = 2                       |
|               |     MemoryExec: partitions=1, partition_sizes=[1] |
|               |                                                   |
+---------------+---------------------------------------------------+
2 row(s) fetched.
Elapsed 0.053 seconds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jayzhan211 that we probably don't want to cast everything to dictionaries in the way that we are currently doing it, and what we really want is a way for the optimizer to avoid expensive casts of dictionary columns, and more generally to just avoid column casts in favor of casting constants and scalar expressions.

I think what we have works fine for now and fixes performance issues we're seeing on dictionary columns, but should be improved for the general case in subsequent PRs that redesign the type coercion logic.

(d @ Dictionary(_, value_type), other_type)
| (other_type, d @ Dictionary(_, value_type))
if preserve_dictionaries && value_type.as_ref() == other_type =>
{
Some(d.clone())
}
(Dictionary(_index_type, value_type), _) => {
(Dictionary(index_type, value_type), _) => {
comparison_coercion(value_type, rhs_type)
.map(|new_type| maybe_preserve_dictionaries(index_type, new_type))
}
(_, Dictionary(_index_type, value_type)) => {
(_, Dictionary(index_type, value_type)) => {
comparison_coercion(lhs_type, value_type)
.map(|new_type| maybe_preserve_dictionaries(index_type, new_type))
}
_ => None,
}
Expand Down
50 changes: 50 additions & 0 deletions datafusion/sqllogictest/test_files/dictionary.slt
Original file line number Diff line number Diff line change
Expand Up @@ -386,3 +386,53 @@ drop table m3;

statement ok
drop table m3_source;


## Test that filtering on dictionary columns coerces the filter value to the dictionary type
statement ok
create table test as values
('row1', arrow_cast('1', 'Dictionary(Int32, Utf8)')),
('row2', arrow_cast('2', 'Dictionary(Int32, Utf8)')),
('row3', arrow_cast('3', 'Dictionary(Int32, Utf8)'))
;

# query using an string '1' which must be coerced into a dictionary string
query T?
SELECT * from test where column2 = '1';
----
row1 1

# filter should not have a cast on column2
query TT
explain SELECT * from test where column2 = '1';
----
logical_plan
01)Filter: test.column2 = Dictionary(Int32, Utf8("1"))
02)--TableScan: test projection=[column1, column2]
physical_plan
01)CoalesceBatchesExec: target_batch_size=8192
02)--FilterExec: column2@1 = 1
03)----MemoryExec: partitions=1, partition_sizes=[1]


# Now query using an integer which must be coerced into a dictionary string
query T?
SELECT * from test where column2 = 1;
----
row1 1

# filter should not have a cast on column2
query TT
explain SELECT * from test where column2 = 1;
----
logical_plan
01)Filter: test.column2 = Dictionary(Int32, Utf8("1"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the key is that there is no CAST on column2 here

02)--TableScan: test projection=[column1, column2]
physical_plan
01)CoalesceBatchesExec: target_batch_size=8192
02)--FilterExec: column2@1 = 1
03)----MemoryExec: partitions=1, partition_sizes=[1]


statement ok
drop table test;
61 changes: 35 additions & 26 deletions datafusion/sqllogictest/test_files/scalar.slt
Original file line number Diff line number Diff line change
Expand Up @@ -1768,52 +1768,61 @@ SELECT make_array(1, 2, 3);
[1, 2, 3]

# coalesce static empty value
query T
SELECT COALESCE('', 'test')
query TT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated these tests to show what the coerced type was as well

SELECT COALESCE('', 'test'),
arrow_typeof(COALESCE('', 'test'))
----
(empty)
(empty) Utf8

# coalesce static value with null
query T
SELECT COALESCE(NULL, 'test')
query TT
SELECT COALESCE(NULL, 'test'),
arrow_typeof(COALESCE(NULL, 'test'))
----
test

test Utf8

# Create table with a dictionary value
statement ok
create table test1 as values (arrow_cast('foo', 'Dictionary(Int32, Utf8)')), (null);

# test coercion string
query ?
select coalesce(column1, 'none_set') from test1;
# test coercion string (should preserve the dictionary type)
query ?T
select coalesce(column1, 'none_set'),
arrow_typeof(coalesce(column1, 'none_set'))
from test1;
----
foo
none_set
foo Dictionary(Int32, Utf8)
none_set Dictionary(Int32, Utf8)

# test coercion Int
query I
select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)'));
# test coercion Int and Dictionary
query ?T
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @erratic-pattern noted the difference here is that now that the output type of cealesce is Dictionary rather than Int.

select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)')),
arrow_typeof(coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)')))
----
34
34 Dictionary(Int32, Int64)

# test with Int
query I
select coalesce(arrow_cast(123, 'Dictionary(Int32, Int8)'),34);
query ?T
select coalesce(arrow_cast(123, 'Dictionary(Int32, Int8)'),34),
arrow_typeof(coalesce(arrow_cast(123, 'Dictionary(Int32, Int8)'),34))
----
123
123 Dictionary(Int32, Int64)

# test with null
query I
select coalesce(null, 34, arrow_cast(123, 'Dictionary(Int32, Int8)'));
query ?T
select coalesce(null, 34, arrow_cast(123, 'Dictionary(Int32, Int8)')),
arrow_typeof(coalesce(null, 34, arrow_cast(123, 'Dictionary(Int32, Int8)')))
----
34
34 Dictionary(Int32, Int64)

# test with null
query T
select coalesce(null, column1, 'none_set') from test1;
query ?T
select coalesce(null, column1, 'none_set'),
arrow_typeof(coalesce(null, column1, 'none_set'))
from test1;
----
foo
none_set
foo Dictionary(Int32, Utf8)
none_set Dictionary(Int32, Utf8)

statement ok
drop table test1
Expand Down