@austin362667

Which issue does this PR close?

Closes #13195

Rationale for this change

Thanks to @jayzhan211, who noticed that the string scalar function bit_length() doesn't support the Utf8View type:

select bit_length(arrow_cast('a', 'Utf8View'));

What changes are included in this PR?

Update bit_length() scalar function to support Utf8View

Are these changes tested?

Yes for the scalar function. Not yet for the array function, since that support is still pending in apache/arrow-rs#6671.

Are there any user-facing changes?

@github-actions github-actions bot added the functions Changes to functions implementation label Nov 1, 2024

@alamb alamb left a comment

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 1, 2024
@austin362667

Certainly, thanks for the review!

p percent NULL pan Tadeusz ma iść w kąt pan Tadeusz ma iść w kąt NULL
NULL NULL NULL NULL NULL NULL

# TODO: Support Utf8View for bit_length array string function

I don't understand this comment -- it seems like bit_length is working on Utf8View in datafusion/sqllogictest/test_files/string/string_view.slt

If you just uncomment this code, we would be covering bit_length for all the string types.

Thanks @alamb
As I understand it, applying bit_length() to a column currently fails because arrow::compute::kernels::length::bit_length does not yet support Utf8View (that depends on the pending PR in arrow-rs). However, that kernel is still called in the following code path:

ColumnarValue::Array(v) => Ok(ColumnarValue::Array(bit_length(v.as_ref())?)),
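
To make the scalar/array split above concrete, here is a minimal sketch using only the Rust standard library. `StringLayout`, `Value`, `kernel_bit_length`, and `invoke_bit_length` are hypothetical stand-ins invented for illustration; they are not the DataFusion or arrow-rs types, and the real kernel works on Arrow arrays rather than `Vec`s. The only behavior taken from this thread is that the array kernel rejects Utf8View while the scalar path works.

```rust
/// Hypothetical stand-in for the physical string type of a value.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum StringLayout {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Hypothetical, simplified stand-in for a scalar-or-array function argument.
#[derive(Debug)]
enum Value {
    Scalar(StringLayout, Option<String>),
    Array(StringLayout, Vec<Option<String>>),
}

/// Stand-in for the array kernel: it rejects Utf8View, mirroring the
/// error reported in this PR, and handles the other layouts.
fn kernel_bit_length(
    layout: StringLayout,
    values: &[Option<String>],
) -> Result<Vec<Option<i64>>, String> {
    if layout == StringLayout::Utf8View {
        return Err("bit_length not supported for Utf8View".to_string());
    }
    Ok(values
        .iter()
        .map(|v| v.as_ref().map(|s| s.len() as i64 * 8))
        .collect())
}

/// Scalars are computed directly (so Utf8View works), while arrays are
/// delegated to the kernel (so Utf8View still fails there).
fn invoke_bit_length(value: &Value) -> Result<Vec<Option<i64>>, String> {
    match value {
        Value::Scalar(_, v) => Ok(vec![v.as_ref().map(|s| s.len() as i64 * 8)]),
        Value::Array(layout, values) => kernel_bit_length(*layout, values),
    }
}
```

In this sketch, a `Value::Scalar` with `StringLayout::Utf8View` succeeds, while a `Value::Array` with the same layout returns the error, which is the gap the TODO comment refers to.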

To cover other string types, we can add this test to each string testing file before apache/arrow-rs#6671 lands.

@alamb alamb left a comment

Thank you @austin362667

Comment on lines 94 to 102
query IIII
SELECT
BIT_LENGTH(arrow_cast('Andrew', 'Utf8View')),
BIT_LENGTH(arrow_cast('datafusion数据融合', 'Utf8View')),
BIT_LENGTH(arrow_cast('💖', 'Utf8View')),
BIT_LENGTH(arrow_cast('josé', 'Utf8View'))
;
----
48 176 32 40
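
For reference, the expected row can be checked by hand: bit_length is the UTF-8 byte length multiplied by 8. The sketch below uses only the Rust standard library; `bit_length` and `expected_row` here are hypothetical helpers written for illustration, not the DataFusion implementation.

```rust
/// Bit length of a string: its UTF-8 byte count times 8.
fn bit_length(s: &str) -> i64 {
    s.len() as i64 * 8
}

/// Bit lengths for the four literals used in the test above.
fn expected_row() -> [i64; 4] {
    [
        "Andrew",                 // 6 ASCII bytes
        "datafusion数据融合",     // 10 ASCII bytes + 4 CJK characters at 3 bytes each
        "💖",                     // one 4-byte code point
        "jos\u{e9}",              // "josé" with precomposed é (2 bytes)
    ]
    .map(bit_length)
}
```

The byte counts are 6, 22, 4, and 5, which gives the 48, 176, 32, and 40 in the expected output.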

I think we can move this test to string/string_literal.slt and then add similar tests for the other string types as well (Utf8, LargeUtf8, and dictionary-encoded strings).

----
48 176 32 40

query error DataFusion error: Arrow error: Compute error: bit_length not supported for Utf8View

I think we can add a TODO comment here and file an issue to track apache/arrow-rs#6671. After upgrading to the corresponding arrow-rs version, we should address the TODO comment in string/string_query.slt.part#L1100.

alamb commented Nov 8, 2024

@austin362667 any chance you have some time to address @goldmedal's suggestions?

@austin362667 austin362667 force-pushed the add_utf8view_for_bit_length branch from fe442e3 to 82ed616 Compare November 9, 2024 16:06
@austin362667

Thanks @goldmedal and @alamb for the review. 🙏 I just addressed these issues.
Sorry for the late reply.

Signed-off-by: Austin Liu <[email protected]>

@jayzhan211 jayzhan211 left a comment

👍

@goldmedal goldmedal left a comment

Thanks @austin362667, and thanks @alamb and @jayzhan211 for reviewing. I think the tests can still be improved, but I can help with that in a follow-up PR. Let's merge this PR first.

@goldmedal goldmedal merged commit 1557fce into apache:main Nov 10, 2024
25 checks passed
jayzhan211 pushed a commit to jayzhan211/datafusion that referenced this pull request Nov 12, 2024
* Support `Utf8View` for string function `bit_length()`

Signed-off-by: Austin Liu <[email protected]>

* Add scalar test case

Signed-off-by: Austin Liu <[email protected]>

* Refine tests

Signed-off-by: Austin Liu <[email protected]>

* Fix wrong format

Signed-off-by: Austin Liu <[email protected]>

---------

Signed-off-by: Austin Liu <[email protected]>
Development

Successfully merging this pull request may close these issues.

Support Utf8View for string function bit_length

4 participants