[SPARK-48410][SQL] Fix InitCap expression for UTF8_BINARY_LCASE & ICU collations #46732

uros-db · 2024-05-24T07:45:41Z

What changes were proposed in this pull request?

String titlecase conversion under UTF8_BINARY_LCASE and ICU collations now works using the appropriate ICU default locale for character mapping, and uses ICU BreakIterator.getWordInstance to locate the boundaries between words.
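For readers unfamiliar with the approach, here is a rough sketch of boundary-based titlecasing. It is not the code in this PR: to stay dependency-free it uses the JDK's `java.text.BreakIterator` and `Character.toTitleCase` as stand-ins for the ICU classes named above, and the class and method names (`InitCapSketch`, `initCap`) are made up for illustration. The JDK and ICU may also disagree on some word boundaries and case mappings.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Illustrative only: a JDK-based approximation, NOT Spark's implementation.
class InitCapSketch {

    // Hypothetical helper: titlecase the first letter of every word segment
    // and lowercase the rest, leaving non-letter segments untouched.
    static String initCap(String input, Locale locale) {
        // Stand-in for ICU's BreakIterator.getWordInstance (assumption).
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(input);

        StringBuilder out = new StringBuilder(input.length());
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = input.substring(start, end);
            int first = segment.codePointAt(0);
            if (Character.isLetter(first)) {
                // Use the titlecase mapping (not uppercase) for the first code point.
                out.appendCodePoint(Character.toTitleCase(first));
                out.append(segment.substring(Character.charCount(first)).toLowerCase(locale));
            } else {
                // Whitespace and punctuation segments pass through unchanged.
                out.append(segment);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(initCap("hello WORLD, spark sql", Locale.ROOT));
    }
}
```

Using the titlecase mapping rather than the uppercase mapping matters for digraph characters such as U+01C6, whose titlecase form (U+01C5) differs from its uppercase form (U+01C4).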

Why are the changes needed?

Similar Spark expressions such as Lower & Upper use the same interface (UCharacter) to perform collation-aware string transformation, and InitCap should offer a consistent way to titlecase strings across the collation space.

Does this PR introduce any user-facing change?

Yes, InitCap should now work properly for all collations other than UTF8_BINARY.

How was this patch tested?

New and existing unit tests, as well as existing end-to-end SQL tests.

Was this patch authored or co-authored using generative AI tooling?

No.

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java

mkaravel

LGTM. Suggested a few more interesting test cases here as well.

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java

cloud-fan · 2024-06-10T16:18:00Z

thanks, merging to master!

uros-db added 2 commits May 24, 2024 09:44

Initial commit

4e0fcca

Update tests

3c58f82

github-actions bot added the SQL label May 24, 2024

uros-db added 4 commits May 24, 2024 10:06

More tests for word boundaries

05972b6

More tests for titlecase characters

2f7db12

More tests for case mapping

8c14f50

Fix Java lint

41962e4

uros-db changed the title ~~[WIP][SPARK-48410][SQL] Fix InitCap expression for UTF8_BINARY_LCASE & ICU collations~~ [SPARK-48410][SQL] Fix InitCap expression for UTF8_BINARY_LCASE & ICU collations May 27, 2024

mkaravel reviewed May 27, 2024

Merge branch 'apache:master' into initcap-icu

ec4b045

uros-db requested a review from mkaravel May 31, 2024 12:20

dbatomic reviewed Jun 4, 2024

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java (outdated, resolved)

dbatomic reviewed Jun 4, 2024

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java (resolved)

Small fixes

7691b87

uros-db requested a review from dbatomic June 5, 2024 08:24

mkaravel approved these changes Jun 7, 2024

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java (resolved)

uros-db added 3 commits June 7, 2024 12:32

Merge branch 'master' into initcap-icu

17c953a

Fix tests

18427e7

Fix lint

39285ba

cloud-fan approved these changes Jun 10, 2024

cloud-fan closed this in 3857a9d Jun 10, 2024
