-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Support Utf8View for string function bit_length()
#13221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Utf8View for string function bit_length()
#13221
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @austin362667 🙏
Can you please add a test to https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/string/string_query.slt.part ?
The structure is explained in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/string/README.md
|
certainly, thanks for review! |
| p percent NULL pan Tadeusz ma iść w kąt pan Tadeusz ma iść w kąt NULL | ||
| NULL NULL NULL NULL NULL NULL | ||
|
|
||
| # TODO: Support Utf8View for bit_length array string function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment -- it seems like bit_length is working on Utf8View in datafusion/sqllogictest/test_files/string/string_view.slt
If you just uncomment this code out that would mean we are covering bit_length for all the string types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alamb
As I understand it, applying bit_length() to a column will currently fail because arrow::compute::kernels::length::bit_length does not yet support UTF-8 views (depends on PR in arrow-rs). However, it's still used in the following code path:
| ColumnarValue::Array(v) => Ok(ColumnarValue::Array(bit_length(v.as_ref())?)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To cover other string types, we can add this test to each string testing file before apache/arrow-rs#6671.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @austin362667
| query IIII | ||
| SELECT | ||
| BIT_LENGTH(arrow_cast('Andrew', 'Utf8View')), | ||
| BIT_LENGTH(arrow_cast('datafusion数据融合', 'Utf8View')), | ||
| BIT_LENGTH(arrow_cast('💖', 'Utf8View')), | ||
| BIT_LENGTH(arrow_cast('josé', 'Utf8View')) | ||
| ; | ||
| ---- | ||
| 48 176 32 40 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can move this test to string/string_literal.slt and then add similar tests for other string types as well (UTF8, LargeUTF8, and DictionaryString).
| p percent NULL pan Tadeusz ma iść w kąt pan Tadeusz ma iść w kąt NULL | ||
| NULL NULL NULL NULL NULL NULL | ||
|
|
||
| # TODO: Support Utf8View for bit_length array string function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To cover other string types, we can add this test to each string testing file before apache/arrow-rs#6671.
| ---- | ||
| 48 176 32 40 | ||
|
|
||
| query error DataFusion error: Arrow error: Compute error: bit_length not supported for Utf8View |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add a TODO comment here and file an issue to track apache/arrow-rs#6671. After upgrading to the corresponding arrow-rs version, we should address the TODO comment in string/string_query.slt.part#L1100.
|
@austin362667 any chance you have some time to address @goldmedal 's suggestions? |
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
fe442e3 to
82ed616
Compare
|
Thanks @goldmedal @alamb for the review. 🙏 Just addressed these issues |
Signed-off-by: Austin Liu <[email protected]>
jayzhan211
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
goldmedal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @austin362667 and @alamb @jayzhan211 for reviewing. Although I think the tests can be improved, I can help to do that in the follow-up PR. Let's merge this PR first.
* Support `Utf8View` for string function `bit_length()` Signed-off-by: Austin Liu <[email protected]> * Add scalar test case Signed-off-by: Austin Liu <[email protected]> * Refine tests Signed-off-by: Austin Liu <[email protected]> * Fix wrong format Signed-off-by: Austin Liu <[email protected]> --------- Signed-off-by: Austin Liu <[email protected]>
Which issue does this PR close?
Closes #13195
Rationale for this change
Thanks to @jayzhan211 , he noticed following issue which string scalar function
bit_length()doesn't supportUtf8Viewtype:What changes are included in this PR?
Update
bit_length()scalar function to supportUtf8ViewAre these changes tested?
Yes for scalar function. No for array function as it's still PR in apache/arrow-rs#6671
Are there any user-facing changes?