
@uros-db uros-db commented May 24, 2024

What changes were proposed in this pull request?

String titlecase conversion under UTF8_BINARY_LCASE and other ICU collations now works using the appropriate ICU default locale for character mapping, and uses ICU BreakIterator.getWordInstance to locate boundaries between words.
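
As a rough illustration of the word-boundary approach described above, here is a minimal sketch that titlecases each word found by a break iterator. It is not the code in this PR: it assumes the JDK's `java.text.BreakIterator` as a stand-in for ICU's `BreakIterator` so that it runs without ICU4J on the classpath, and the class name `TitleCaseSketch` and the helper `titleCase` are hypothetical names chosen for the example.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch, not Spark's implementation. Uses the JDK BreakIterator
// as an assumed stand-in for ICU's BreakIterator.getWordInstance.
final class TitleCaseSketch {

    // Titlecases the first letter of every word segment and lowercases the rest.
    static String titleCase(String input, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(input);

        StringBuilder out = new StringBuilder(input.length());
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = input.substring(start, end);
            int firstCp = segment.codePointAt(0);
            if (Character.isLetter(firstCp)) {
                // Word segment: titlecase the first code point, lowercase the remainder.
                out.appendCodePoint(Character.toTitleCase(firstCp));
                out.append(segment.substring(Character.charCount(firstCp)).toLowerCase(locale));
            } else {
                // Whitespace, punctuation, digits: copy through unchanged.
                out.append(segment);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(titleCase("hello wORLD", Locale.ROOT)); // prints "Hello World"
    }
}
```

The point of the sketch is only the shape of the loop: boundaries come from the break iterator rather than from splitting on spaces, and case mapping is applied per segment with an explicit locale. The ICU classes named in this description may place boundaries and map characters differently from the JDK ones used here.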

Why are the changes needed?

Similar Spark expressions such as Lower & Upper use the same interface (UCharacter) to perform collation-aware string transformation, and InitCap should offer a consistent way to titlecase strings across the collation space.

Does this PR introduce any user-facing change?

Yes, InitCap should now work properly for all collations other than UTF8_BINARY.

How was this patch tested?

New and existing unit tests, as well as existing e2e sql tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label May 24, 2024
@uros-db uros-db changed the title from "[WIP][SPARK-48410][SQL] Fix InitCap expression for UTF8_BINARY_LCASE & ICU collations" to "[SPARK-48410][SQL] Fix InitCap expression for UTF8_BINARY_LCASE & ICU collations" May 27, 2024
@uros-db uros-db requested a review from mkaravel May 31, 2024 12:20
@uros-db uros-db requested a review from dbatomic June 5, 2024 08:24
@mkaravel mkaravel left a comment

LGTM. Suggested a few more interesting test cases here as well.

@cloud-fan
thanks, merging to master!

@cloud-fan cloud-fan closed this in 3857a9d Jun 10, 2024

4 participants