fix: preserve more dictionaries when coercing types #10221
Changes from all commits
9585347
2051b53
7d49c0f
8db4a8e
7f64b4a
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -386,3 +386,53 @@ drop table m3; | |
|
|
||
| statement ok | ||
| drop table m3_source; | ||
|
|
||
|
|
||
| ## Test that filtering on dictionary columns coerces the filter value to the dictionary type | ||
| statement ok | ||
| create table test as values | ||
| ('row1', arrow_cast('1', 'Dictionary(Int32, Utf8)')), | ||
| ('row2', arrow_cast('2', 'Dictionary(Int32, Utf8)')), | ||
| ('row3', arrow_cast('3', 'Dictionary(Int32, Utf8)')) | ||
| ; | ||
|
|
||
| # query using an string '1' which must be coerced into a dictionary string | ||
| query T? | ||
| SELECT * from test where column2 = '1'; | ||
| ---- | ||
| row1 1 | ||
|
|
||
| # filter should not have a cast on column2 | ||
| query TT | ||
| explain SELECT * from test where column2 = '1'; | ||
| ---- | ||
| logical_plan | ||
| 01)Filter: test.column2 = Dictionary(Int32, Utf8("1")) | ||
| 02)--TableScan: test projection=[column1, column2] | ||
| physical_plan | ||
| 01)CoalesceBatchesExec: target_batch_size=8192 | ||
| 02)--FilterExec: column2@1 = 1 | ||
| 03)----MemoryExec: partitions=1, partition_sizes=[1] | ||
|
|
||
|
|
||
| # Now query using an integer which must be coerced into a dictionary string | ||
| query T? | ||
| SELECT * from test where column2 = 1; | ||
| ---- | ||
| row1 1 | ||
|
|
||
| # filter should not have a cast on column2 | ||
| query TT | ||
| explain SELECT * from test where column2 = 1; | ||
| ---- | ||
| logical_plan | ||
| 01)Filter: test.column2 = Dictionary(Int32, Utf8("1")) | ||
|
Contributor
The key is that there is no CAST on |
||
| 02)--TableScan: test projection=[column1, column2] | ||
| physical_plan | ||
| 01)CoalesceBatchesExec: target_batch_size=8192 | ||
| 02)--FilterExec: column2@1 = 1 | ||
| 03)----MemoryExec: partitions=1, partition_sizes=[1] | ||
|
|
||
|
|
||
| statement ok | ||
| drop table test; | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1768,52 +1768,61 @@ SELECT make_array(1, 2, 3); | |
| [1, 2, 3] | ||
|
|
||
| # coalesce static empty value | ||
| query T | ||
| SELECT COALESCE('', 'test') | ||
| query TT | ||
|
Contributor
I updated these tests to show what the coerced type was as well |
||
| SELECT COALESCE('', 'test'), | ||
| arrow_typeof(COALESCE('', 'test')) | ||
| ---- | ||
| (empty) | ||
| (empty) Utf8 | ||
|
|
||
| # coalesce static value with null | ||
| query T | ||
| SELECT COALESCE(NULL, 'test') | ||
| query TT | ||
| SELECT COALESCE(NULL, 'test'), | ||
| arrow_typeof(COALESCE(NULL, 'test')) | ||
| ---- | ||
| test | ||
|
|
||
| test Utf8 | ||
|
|
||
| # Create table with a dictionary value | ||
| statement ok | ||
| create table test1 as values (arrow_cast('foo', 'Dictionary(Int32, Utf8)')), (null); | ||
|
|
||
| # test coercion string | ||
| query ? | ||
| select coalesce(column1, 'none_set') from test1; | ||
| # test coercion string (should preserve the dictionary type) | ||
| query ?T | ||
| select coalesce(column1, 'none_set'), | ||
| arrow_typeof(coalesce(column1, 'none_set')) | ||
| from test1; | ||
| ---- | ||
| foo | ||
| none_set | ||
| foo Dictionary(Int32, Utf8) | ||
| none_set Dictionary(Int32, Utf8) | ||
|
|
||
| # test coercion Int | ||
| query I | ||
| select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)')); | ||
| # test coercion Int and Dictionary | ||
| query ?T | ||
|
Contributor
As @erratic-pattern noted, the difference here is that now the output type of coalesce is |
||
| select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)')), | ||
| arrow_typeof(coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)'))) | ||
| ---- | ||
| 34 | ||
| 34 Dictionary(Int32, Int64) | ||
|
|
||
| # test with Int | ||
| query I | ||
| select coalesce(arrow_cast(123, 'Dictionary(Int32, Int8)'),34); | ||
| query ?T | ||
| select coalesce(arrow_cast(123, 'Dictionary(Int32, Int8)'),34), | ||
| arrow_typeof(coalesce(arrow_cast(123, 'Dictionary(Int32, Int8)'),34)) | ||
| ---- | ||
| 123 | ||
| 123 Dictionary(Int32, Int64) | ||
|
|
||
| # test with null | ||
| query I | ||
| select coalesce(null, 34, arrow_cast(123, 'Dictionary(Int32, Int8)')); | ||
| query ?T | ||
| select coalesce(null, 34, arrow_cast(123, 'Dictionary(Int32, Int8)')), | ||
| arrow_typeof(coalesce(null, 34, arrow_cast(123, 'Dictionary(Int32, Int8)'))) | ||
| ---- | ||
| 34 | ||
| 34 Dictionary(Int32, Int64) | ||
|
|
||
| # test with null | ||
| query T | ||
| select coalesce(null, column1, 'none_set') from test1; | ||
| query ?T | ||
| select coalesce(null, column1, 'none_set'), | ||
| arrow_typeof(coalesce(null, column1, 'none_set')) | ||
| from test1; | ||
| ---- | ||
| foo | ||
| none_set | ||
| foo Dictionary(Int32, Utf8) | ||
| none_set Dictionary(Int32, Utf8) | ||
|
|
||
| statement ok | ||
| drop table test1 | ||
|
|
||
Is there any particular reason why this case doesn't also preserve the dictionary type?
Not that I know of
Probably because value type is enough for comparison
Maybe we should not preserve dict for comparison overall 🤔 ?
I think what we need to solve the issue is to avoid casting the column from its value type to a dictionary, because casting a column is costly compared with casting a constant. Given the example, if we preserve the dictionary we still end up casting the column (utf8) to Dict(i32, utf8), but in this case we can cast the constant from i64 to utf8, and that is enough.
expect plan
I think we should not preserve the dictionary type, but we do need to specialize the dict vs. non-dict comparison case.
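To make the cost argument concrete, here is a minimal sketch in plain Rust. It does not use DataFusion or Arrow; the function names and the `&[&str]` "column" are hypothetical stand-ins chosen for illustration. It counts how many casts each strategy performs when filtering a string column against an integer constant.

```rust
/// Strategy A: cast every row of the (string) column to i64 before comparing.
/// Returns the matching row indices and the number of casts performed.
fn filter_by_casting_column(rows: &[&str], needle: i64) -> (Vec<usize>, usize) {
    let mut casts = 0;
    let mut hits = Vec::new();
    for (i, value) in rows.iter().enumerate() {
        casts += 1; // one cast per row
        if value.parse::<i64>().ok() == Some(needle) {
            hits.push(i);
        }
    }
    (hits, casts)
}

/// Strategy B: cast the constant to a string once, then compare the rows
/// without casting any of them.
fn filter_by_casting_constant(rows: &[&str], needle: i64) -> (Vec<usize>, usize) {
    let needle = needle.to_string(); // the only cast
    let hits = rows
        .iter()
        .enumerate()
        .filter(|(_, value)| **value == needle.as_str())
        .map(|(i, _)| i)
        .collect();
    (hits, 1)
}
```

Strategy A performs one cast per row, while strategy B performs a single cast regardless of column length. The two are only equivalent when string equality and numeric equality agree on the data (a value such as `"01"` would match under A but not under B), which is one reason a real optimizer has to check that the constant cast is safe.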
Yes I agree that looks bad. It should be unwrapped.
Thank you for that example 💯
Maybe we could extend the rewrite in
datafusion/datafusion/optimizer/src/unwrap_cast_in_comparison.rs
(line 160 in a5ce568) to cover `CAST(col, ..) = const` for other datatypes 🤔
I can try to do so later this weekend. Or would you like to try it, @erratic-pattern?
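For readers unfamiliar with the idea, the `CAST(col, ..) = const` rewrite can be sketched roughly as follows. This is a toy model under stated assumptions: `Ty`, `Expr`, `retype_literal`, and `unwrap_cast_in_eq` are hypothetical names invented for this sketch, not DataFusion's actual types or the code in `unwrap_cast_in_comparison.rs`.

```rust
// Toy model only: `Ty` and `Expr` are hypothetical stand-ins, not DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Utf8,
    Int64,
    DictUtf8, // stands in for Dictionary(Int32, Utf8)
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column { name: String, ty: Ty },
    Literal { value: String, ty: Ty },
    Cast { input: Box<Expr>, to: Ty },
    Eq(Box<Expr>, Box<Expr>),
}

/// Re-types a literal to `to`, or returns `None` when this toy model cannot
/// show that the conversion is safe.
fn retype_literal(value: &str, to: &Ty) -> Option<Expr> {
    let ok = match to {
        Ty::Utf8 | Ty::DictUtf8 => true,
        Ty::Int64 => value.parse::<i64>().is_ok(),
    };
    if ok {
        Some(Expr::Literal { value: value.to_string(), ty: to.clone() })
    } else {
        None
    }
}

/// Rewrites `CAST(col AS t) = lit` into `col = lit'`, where `lit'` is the
/// literal re-typed to the column's own type. Anything else is returned unchanged.
fn unwrap_cast_in_eq(expr: &Expr) -> Expr {
    if let Expr::Eq(lhs, rhs) = expr {
        if let (Expr::Cast { input, .. }, Expr::Literal { value, .. }) = (&**lhs, &**rhs) {
            if let Expr::Column { ty, .. } = &**input {
                if let Some(lit) = retype_literal(value, ty) {
                    return Expr::Eq(input.clone(), Box::new(lit));
                }
            }
        }
    }
    expr.clone()
}
```

The point of the sketch is that the cast moves from the column side, where it would run for every row, to the literal side, where it runs once at planning time. When the literal cannot be re-typed safely, the expression is left as it was.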
FWIW I tried the example from @jayzhan211 on main and it doesn't put a cast on the filter -- thus I agree this PR would be a regression if merged as is. I will dismiss my review
I agree with @jayzhan211 that we probably don't want to cast everything to dictionaries in the way that we are currently doing it, and what we really want is a way for the optimizer to avoid expensive casts of dictionary columns, and more generally to just avoid column casts in favor of casting constants and scalar expressions.
I think what we have works fine for now and fixes performance issues we're seeing on dictionary columns, but should be improved for the general case in subsequent PRs that redesign the type coercion logic.