
Conversation


@conor-93 conor-93 commented Oct 20, 2025

Migration of text analysis from Swash → ICU4X

Overview

ICU4X enables text analysis and internationalisation. For Parley, this includes locale and language recognition,
bidirectional text evaluation, text segmentation, emoji recognition, NFC/NFD normalisation and other Unicode character information.
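For a flavour of the APIs involved, here is a minimal sketch (assuming the ICU4X 1.x crates with their default `compiled_data` feature; this is illustrative, not Parley's actual integration):

```rust
use icu::normalizer::ComposingNormalizer;
use icu::segmenter::WordSegmenter;

fn main() {
    // Word segmentation (UAX #29): yields UTF-8 byte offsets of boundaries.
    let segmenter = WordSegmenter::new_auto();
    let breaks: Vec<usize> = segmenter.segment_str("Hello, world").collect();
    assert_eq!(breaks, [0, 5, 6, 7, 12]);

    // NFC normalisation: "e" + combining acute composes to "é".
    let nfc = ComposingNormalizer::new_nfc();
    assert_eq!(nfc.normalize("e\u{0301}"), "é");
}
```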

ICU4X is developed and maintained by a trusted authority in the space of text internationalisation: the ICU4X Technical Committee (ICU4X-TC) in the Unicode Consortium. It is targeted at resource-constrained environments. For Parley, this means:

  • The potential for full locale support for complex line breaking cases (not supported by Swash).
  • Reliable and up-to-date Unicode data.
  • Reasonable performance and memory footprint (with the possibility of future improvements).
  • Full decoupling from Swash (following the decoupling of shaping behaviour earlier this year), significantly offloading maintenance effort.

Notable changes

  • Removal of first-party bidi embed level resolution logic.
  • select_font emoji detection improvements: flag emoji (e.g. "🇺🇸") and keycap sequences (e.g. 0️⃣ through 9️⃣) are now supported in cluster detection; Swash did not support these (see the sketch after this list).
  • Slightly more up-to-date Unicode data than Swash (e.g. a few more Scripts).
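To illustrate the emoji change, a small sketch (ICU4X 1.x with `compiled_data`; byte offsets are into the UTF-8 string):

```rust
use icu::segmenter::GraphemeClusterSegmenter;

fn main() {
    let segmenter = GraphemeClusterSegmenter::new();
    // "🇺🇸" (two regional-indicator scalars, 8 bytes) and "0️⃣" (digit +
    // VS16 + combining enclosing keycap, 7 bytes) each form one cluster.
    let breaks: Vec<usize> = segmenter.segment_str("🇺🇸0️⃣").collect();
    assert_eq!(breaks, [0, 8, 15]);
}
```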

Performance/binary size

  • Binary size for vello_editor is ~100kB larger (9720kB vs 9620kB).
    There is a performance regression (~7%), but with optimisations (composite trie data sources, further minimisation of allocations/iterations), it is much less significant than the original ~55%:
Default Style - arabic 20 characters               [   9.9 us ...  11.0 us ]     +10.91%*
Default Style - latin 20 characters                [   4.3 us ...   4.8 us ]     +10.88%*
Default Style - japanese 20 characters             [   8.5 us ...   9.2 us ]      +8.04%*
Default Style - arabic 1 paragraph                 [  59.1 us ...  63.7 us ]      +7.70%*
Default Style - latin 1 paragraph                  [  18.8 us ...  20.8 us ]     +10.44%*
Default Style - japanese 1 paragraph               [  74.4 us ...  78.8 us ]      +5.80%*
Default Style - arabic 4 paragraph                 [ 253.6 us ... 269.6 us ]      +6.30%*
Default Style - latin 4 paragraph                  [  79.7 us ...  86.8 us ]      +8.97%*
Default Style - japanese 4 paragraph               [ 102.8 us ... 107.9 us ]      +4.98%*
Styled - arabic 20 characters                      [  11.1 us ...  12.2 us ]      +9.79%*
Styled - latin 20 characters                       [   5.6 us ...   6.2 us ]     +10.34%*
Styled - japanese 20 characters                    [   9.4 us ...  10.1 us ]      +7.41%*
Styled - arabic 1 paragraph                        [  60.1 us ...  65.0 us ]      +8.04%*
Styled - latin 1 paragraph                         [  21.8 us ...  23.9 us ]      +9.66%*
Styled - japanese 1 paragraph                      [  84.2 us ...  87.4 us ]      +3.79%*
Styled - arabic 4 paragraph                        [ 270.5 us ... 288.5 us ]      +6.66%*
Styled - latin 4 paragraph                         [  85.4 us ...  94.1 us ]     +10.17%*
Styled - japanese 4 paragraph                      [ 117.2 us ... 123.5 us ]      +5.39%*

As noted in #436 (comment), I think that we're getting close to maximising the efficiency of the current APIs offered by ICU. This can be seen by inspecting the text analysis profile:

[screenshot: text analysis profile]

Further optimisation of text analysis may require delving into ICU/unicode-bidi internals to, for example:

  1. Combine line and word boundary calculations (rather than running them separately); Chad may have ideas for further improvement here.
  2. Pass character boundary information in from our composite properties trie; ICU4X internally performs multiple lookups for identical characters.
  3. Pass bidi class information in to unicode-bidi to prevent redundant lookups (see the sketch after this list).
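On point 3, unicode-bidi already accepts an external class source through its `BidiDataSource` trait; a hedged sketch using ICU4X's off-the-shelf adapter (requires icu_properties' `bidi` feature; a trie-backed source would implement the same trait):

```rust
use icu::properties::{bidi::BidiClassAdapter, maps};
use unicode_bidi::BidiInfo;

fn main() {
    // Bidi class lookups are served from ICU4X data rather than
    // unicode-bidi's bundled tables.
    let adapter = BidiClassAdapter::new(maps::bidi_class());
    let text = "hello עברית";
    let bidi = BidiInfo::new_with_data_source(&adapter, text, None);
    let para = &bidi.paragraphs[0];
    let (_levels, runs) = bidi.visual_runs(para, para.range.clone());
    println!("{} visual runs", runs.len()); // LTR "hello " then RTL "עברית"
}
```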

Other details

  • Swash's Language parsing is more tolerant, e.g. it permits extra, invalid subtags (like in "en-Latn-US-a-b-c-d"); see the example after this list.
  • Segmenters (line, word, grapheme) are currently content-aware and can be used without specifying a locale. However, if we plug locale data in at runtime, we can construct segmenters that target a specific locale rather than inferring from content, which would be the most correct approach for that locale.
    • The full set of locale data (even with ICU4X's deduplication) is heavy, totalling ~2.5MB (in vello_editor compilation testing). In order to potentially support correct word breaking across all languages, without seeing a huge compilation size increase, we would need a way for users to attach only the locale data they need at runtime. This locale data could be generated (with icu4x-datagen) and attached (using DataProviders) at runtime in the future.
    • Without full locale support, line and word breaking use Unicode rule-based approaches UAX #14 and #29 respectively (at parity with Swash).
  • Swash's support for alternating word break strength is maintained by breaking text into windows (which look back/forward an extra character for context) and performing segmentation on each window separately, as ICU4X doesn't natively support variable word break strength when segmenting (sketched below).
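On the locale-parsing point above, a quick illustration (assuming ICU4X 1.x paths, where `icu::locid` is the locale identifier module):

```rust
use icu::locid::Locale;

fn main() {
    assert!("en-Latn-US".parse::<Locale>().is_ok());
    // Swash tolerated the trailing junk subtags; ICU4X's BCP-47 parser rejects them.
    assert!("en-Latn-US-a-b-c-d".parse::<Locale>().is_err());
}
```

And a minimal sketch of the windowing approach (illustrative only; `boundaries_in_range` and the one-character context margin are simplifications of what Parley actually does):

```rust
use icu::segmenter::WordSegmenter;
use std::ops::Range;

/// Word boundaries inside `range`, computed on a window that extends one
/// character beyond each end so the segmenter sees context at the seams
/// between styled runs.
fn boundaries_in_range(text: &str, range: Range<usize>) -> Vec<usize> {
    // Look back one character for context, if there is one.
    let win_start = text[..range.start]
        .char_indices()
        .next_back()
        .map(|(i, _)| i)
        .unwrap_or(0);
    // Look forward one character for context, if there is one.
    let win_end = text[range.end..]
        .chars()
        .next()
        .map(|c| range.end + c.len_utf8())
        .unwrap_or(range.end);
    let segmenter = WordSegmenter::new_auto();
    segmenter
        .segment_str(&text[win_start..win_end])
        .map(|b| b + win_start) // window-relative -> absolute offsets
        .filter(|b| range.contains(b)) // keep boundaries inside the range
        .collect()
}
```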

Future Work

  • We could support bring-your-own-data for Unicode character information too, for users only interested in narrow character sets (e.g. basic Latin), for a small compilation size improvement (not sure how much exactly).
  • Feature flagging which locale data to bake into the binary.
  • Allow hot-swapping Unicode character data at runtime. For example, if you start off shaping en but then need to shape some ar, we could inform the consumer that they need to provide ar property data.

Commit messages pushed to this branch:

- condense all byte indexes to char indexes in a single loop
- track a minimal set of LineSegmenters (per LineBreakWordOption), and create as needed
- clean up tests
- add tests for multi-character graphemes
- fix incorrect start truncation for multi-style strings which aren't multi-wb style + test for this
- test naming/grouping
- compute `force_normalize`
- simplify ClusterInfo to just `is_emoji`
- more clean-up
- copyright headers
conor-93 added a commit to conor-93/parley that referenced this pull request Nov 6, 2025
@taj-p taj-p changed the title Migrate to ICU4X [ICU4X] Migrate text analysis and shaping to use ICU4x Nov 11, 2025
@taj-p taj-p changed the title [ICU4X] Migrate text analysis and shaping to use ICU4x [ICU4X] Migrate text analysis and shaping to use ICU4X Nov 11, 2025
@taj-p taj-p force-pushed the icu4x branch 2 times, most recently from 7444a92 to 984ac0b on November 17, 2025 at 21:39
github-merge-queue bot pushed a commit that referenced this pull request Nov 20, 2025
…#452)

#436 is getting too big 😅.
This PR extracts the Unicode generation crates from that PR.

## Intent

In order to migrate to ICU4X, we need to actually use its data. In our
first implementation, we simply baked data generation into Parley's
`build.rs`. This is problematic because:

1. Parley's build time increases by over 1 minute
2. Parley can only be built in `std` environments

So, this PR adopts a strategy of checking in the generated data to git
([see docs](https://arc.net/l/quote/uhjmmaae)) in order to reduce our
build time, not require `std` for building `Parley`, and reduce
`Parley`'s dependency tree. The crates we have introduced are:

1. `unicode_data_gen`: generates the Unicode data
2. `unicode_data`: exposes the Unicode data

## The Data

We expose two sets of data from `unicode_data`:

1. Re-exported ICU4X data providers for grapheme, word, and line
breaking, plus Unicode normalization tables used by Parley.
2. A locale-invariant `CompositePropsV1` provider backed by a compact
`CodePointTrie`. Essentially, this data structure allows us to look up
all properties of a given character with a single lookup (rather than N
property lookups per character), which was found to yield significant
performance savings (a minimal sketch follows below).

You can see how these data structures are used in
#436.
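To make the second point concrete, a minimal sketch of the packed single-lookup idea (the bit layout and names here are illustrative, not the actual `CompositePropsV1` schema):

```rust
use icu::collections::codepointtrie::CodePointTrie;

/// Illustrative wrapper: one trie lookup returns a packed word carrying
/// every property the analysis pass needs for a character.
struct CompositeProps<'trie> {
    trie: CodePointTrie<'trie, u32>,
}

impl CompositeProps<'_> {
    /// A single lookup instead of one per property (bidi class, line-break
    /// class, word-break class, emoji flags, script, ...).
    fn packed(&self, c: char) -> u32 {
        self.trie.get32(c as u32)
    }

    // Hypothetical bit layout for the packed value:
    fn bidi_class(packed: u32) -> u32 {
        packed & 0x1f // low 5 bits
    }
    fn line_break_class(packed: u32) -> u32 {
        (packed >> 5) & 0x3f // next 6 bits
    }
}
```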

## Extra

- As can be seen from generated code (shown below), ICU4X has an MSRV of
1.83. So, this PR also raises the MSRV of the repo to 1.83.


https://github.com/linebender/parley/blob/97cdb3bc7edafb31e9ddc3212854761fe932f522/unicode_data/src/generated/icu4x_data/normalizer_nfd_data_v1.rs.data#L60

- We test that the generated ICU4X data is current (and not corrupted)
by running it in CI. CLDR data doesn't update too often, so this
shouldn't impact our day-to-day PRs and instead should be a helpful way
to alert us of updates (if we somehow miss those updates ourselves!).
- For now, we're only targeting `en` locales for line and word breaking,
falling back to [UAX
#14](https://github.com/unicode-org/icu4x/blob/ee5399a77a6b94efb5d4b60678bb458c5eedb25d/components/segmenter/src/line.rs#L233-L255)
and
[#29](https://github.com/unicode-org/icu4x/blob/45638aba928c990c4a62360b6bec8e75000b73db/components/segmenter/src/lib.rs#L5-L21)
respectively (at parity with Swash). We will introduce a BYO Unicode
data API after the initial migration is merged (see the sketch after
this list) and, perhaps, feature flag which locales to bake (if any).
- I've introduced all the dependencies required on ICU4X in this PR (so
some may be unused here, but used in
#436).
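A rough sketch of what runtime-attached data could eventually look like, using ICU4X's blob provider (the file name and the `icu4x-datagen --format blob` step are assumptions about a future workflow, not an API Parley exposes today):

```rust
use icu::segmenter::LineSegmenter;
use icu_provider_blob::BlobDataProvider;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A blob generated ahead of time containing only the locales this
    // application needs.
    let blob = std::fs::read("segmenter_data.postcard")?;
    let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())?;
    let segmenter = LineSegmenter::try_new_auto_with_buffer_provider(&provider)?;
    let breaks: Vec<usize> = segmenter.segment_str("Hello world").collect();
    println!("line-break opportunities at {breaks:?}");
    Ok(())
}
```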

## PR Train

- #452 ◀️ 
- #436