
Conversation


@conor-93 conor-93 commented Oct 20, 2025

Migration of text analysis from Swash → ICU4X

Overview

ICU4X enables text analysis and internationalisation. For Parley, this includes locale and language recognition,
bidirectional text evaluation, text segmentation, emoji recognition, NFC/NFD normalisation and other Unicode character information.
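For a flavour of the APIs involved, here is a minimal sketch (assuming the ICU4X 1.x crates with their default `compiled_data` feature; this is illustrative, not Parley's actual integration):

```rust
use icu::normalizer::ComposingNormalizer;
use icu::segmenter::WordSegmenter;

fn main() {
    // Word segmentation (UAX #29): yields UTF-8 byte offsets of boundaries.
    let segmenter = WordSegmenter::new_auto();
    let breaks: Vec<usize> = segmenter.segment_str("Hello, world").collect();
    assert_eq!(breaks, [0, 5, 6, 7, 12]);

    // NFC normalisation: "e" + combining acute composes to "é".
    let nfc = ComposingNormalizer::new_nfc();
    assert_eq!(nfc.normalize("e\u{0301}"), "é");
}
```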

ICU4X is developed and maintained by a trusted authority in the space of text internationalisation: the ICU4X Technical Committee (ICU4X-TC) in the Unicode Consortium. It is targeted at resource-constrained environments. For Parley, this means:

  • The potential for full locale support for complex line breaking cases (not supported by Swash).
  • Reliable and up-to-date Unicode data.
  • Reasonable performance and memory footprint (with the possibility of future improvements).
  • Full decoupling from Swash (following the decoupling of shaping behaviour earlier this year), significantly offloading maintenance effort.

Notable changes

  • Removal of first-party bidi embed level resolution logic.
  • select_font emoji detection improvements: flag emoji (e.g. "🇺🇸") and keycap sequences (e.g. 0️⃣ through 9️⃣) are now supported in cluster detection; Swash did not support these (see the sketch after this list).
  • Slightly more up-to-date Unicode data than Swash (e.g. a few more Scripts).
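To illustrate the emoji change, a small sketch (ICU4X 1.x with `compiled_data`; byte offsets are into the UTF-8 string):

```rust
use icu::segmenter::GraphemeClusterSegmenter;

fn main() {
    let segmenter = GraphemeClusterSegmenter::new();
    // "🇺🇸" (two regional-indicator scalars, 8 bytes) and "0️⃣" (digit +
    // VS16 + combining enclosing keycap, 7 bytes) each form one cluster.
    let breaks: Vec<usize> = segmenter.segment_str("🇺🇸0️⃣").collect();
    assert_eq!(breaks, [0, 8, 15]);
}
```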

Performance/binary size

  • Binary size for vello_editor is ~100kB larger (9720kB vs 9620kB).
    There is a performance regression (~7%), but with optimisations (composite trie data sources, further minimisation of allocations/iterations), it is much less significant than the original ~55%:
Default Style - arabic 20 characters               [   9.9 us ...  11.0 us ]     +10.91%*
Default Style - latin 20 characters                [   4.3 us ...   4.8 us ]     +10.88%*
Default Style - japanese 20 characters             [   8.5 us ...   9.2 us ]      +8.04%*
Default Style - arabic 1 paragraph                 [  59.1 us ...  63.7 us ]      +7.70%*
Default Style - latin 1 paragraph                  [  18.8 us ...  20.8 us ]     +10.44%*
Default Style - japanese 1 paragraph               [  74.4 us ...  78.8 us ]      +5.80%*
Default Style - arabic 4 paragraph                 [ 253.6 us ... 269.6 us ]      +6.30%*
Default Style - latin 4 paragraph                  [  79.7 us ...  86.8 us ]      +8.97%*
Default Style - japanese 4 paragraph               [ 102.8 us ... 107.9 us ]      +4.98%*
Styled - arabic 20 characters                      [  11.1 us ...  12.2 us ]      +9.79%*
Styled - latin 20 characters                       [   5.6 us ...   6.2 us ]     +10.34%*
Styled - japanese 20 characters                    [   9.4 us ...  10.1 us ]      +7.41%*
Styled - arabic 1 paragraph                        [  60.1 us ...  65.0 us ]      +8.04%*
Styled - latin 1 paragraph                         [  21.8 us ...  23.9 us ]      +9.66%*
Styled - japanese 1 paragraph                      [  84.2 us ...  87.4 us ]      +3.79%*
Styled - arabic 4 paragraph                        [ 270.5 us ... 288.5 us ]      +6.66%*
Styled - latin 4 paragraph                         [  85.4 us ...  94.1 us ]     +10.17%*
Styled - japanese 4 paragraph                      [ 117.2 us ... 123.5 us ]      +5.39%*

As noted in #436 (comment), I think that we're getting close to maximising the efficiency of the current APIs offered by ICU. This can be seen by inspecting the text analysis profile:

[screenshot: text analysis profile]

Further optimisation of text analysis may require delving into ICU/unicode-bidi internals to, for example:

  1. Combine line and word boundary calculations (rather than running them separately); Chad may have ideas for further improvement here.
  2. Pass character boundary information in from our composite properties trie; ICU4X internally performs multiple lookups for identical characters.
  3. Pass bidi class information in to unicode-bidi to prevent redundant lookups (see the sketch after this list).
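On point 3, unicode-bidi already accepts an external class source through its `BidiDataSource` trait; a hedged sketch using ICU4X's off-the-shelf adapter (requires icu_properties' `bidi` feature; a trie-backed source would implement the same trait):

```rust
use icu::properties::{bidi::BidiClassAdapter, maps};
use unicode_bidi::BidiInfo;

fn main() {
    // Bidi class lookups are served from ICU4X data rather than
    // unicode-bidi's bundled tables.
    let adapter = BidiClassAdapter::new(maps::bidi_class());
    let text = "hello עברית";
    let bidi = BidiInfo::new_with_data_source(&adapter, text, None);
    let para = &bidi.paragraphs[0];
    let (_levels, runs) = bidi.visual_runs(para, para.range.clone());
    println!("{} visual runs", runs.len()); // LTR "hello " then RTL "עברית"
}
```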

Other details

  • Swash's Language parsing is more tolerant, e.g. it permits extra, invalid subtags (like in "en-Latn-US-a-b-c-d"); see the example after this list.
  • Segmenters (line, word, grapheme) are currently content-aware and can be used without specifying a locale. However, if we plug locale data in at runtime, we can construct segmenters that target a specific locale rather than inferring from content, which would be the most correct approach for that locale.
    • The full set of locale data (even with ICU4X's deduplication) is heavy, totalling ~2.5MB (in vello_editor compilation testing). In order to potentially support correct word breaking across all languages, without seeing a huge compilation size increase, we would need a way for users to attach only the locale data they need at runtime. This locale data could be generated (with icu4x-datagen) and attached (using DataProviders) at runtime in the future.
    • Without full locale support, line and word breaking use Unicode rule-based approaches UAX #14 and #29 respectively (at parity with Swash).
  • Swash's support for alternating word break strength is maintained by breaking text into windows (which look back/forward an extra character for context) and performing segmentation on each window separately, as ICU4X doesn't natively support variable word break strength when segmenting (sketched below).
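On the locale-parsing point above, a quick illustration (assuming ICU4X 1.x paths, where `icu::locid` is the locale identifier module):

```rust
use icu::locid::Locale;

fn main() {
    assert!("en-Latn-US".parse::<Locale>().is_ok());
    // Swash tolerated the trailing junk subtags; ICU4X's BCP-47 parser rejects them.
    assert!("en-Latn-US-a-b-c-d".parse::<Locale>().is_err());
}
```

And a minimal sketch of the windowing approach (illustrative only; `boundaries_in_range` and the one-character context margin are simplifications of what Parley actually does):

```rust
use icu::segmenter::WordSegmenter;
use std::ops::Range;

/// Word boundaries inside `range`, computed on a window that extends one
/// character beyond each end so the segmenter sees context at the seams
/// between styled runs.
fn boundaries_in_range(text: &str, range: Range<usize>) -> Vec<usize> {
    // Look back one character for context, if there is one.
    let win_start = text[..range.start]
        .char_indices()
        .next_back()
        .map(|(i, _)| i)
        .unwrap_or(0);
    // Look forward one character for context, if there is one.
    let win_end = text[range.end..]
        .chars()
        .next()
        .map(|c| range.end + c.len_utf8())
        .unwrap_or(range.end);
    let segmenter = WordSegmenter::new_auto();
    segmenter
        .segment_str(&text[win_start..win_end])
        .map(|b| b + win_start) // window-relative -> absolute offsets
        .filter(|b| range.contains(b)) // keep boundaries inside the range
        .collect()
}
```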

Future Work

  • We could support bring-your-own-data for Unicode character information too, for users only interested in narrow character sets (e.g. basic Latin), for a small compilation size improvement (not sure how much exactly).
  • Feature flagging which locale data to bake into the binary.
  • Allow hot-swapping Unicode character data at runtime. For example, if you start off shaping en but then need to shape some ar, we could inform the consumer that they need to provide ar property data.

Commit messages pushed to this branch:

- condense all byte indexes to char indexes in a single loop
- track a minimal set of LineSegmenters (per LineBreakWordOption), and create as needed
- clean up tests
- add tests for multi-character graphemes
- fix incorrect start truncation for multi-style strings which aren't multi-wb style + test for this
- test naming/grouping
- compute `force_normalize`
- simplify ClusterInfo to just `is_emoji`
- more clean-up
- copyright headers
conor-93 added a commit to conor-93/parley that referenced this pull request Nov 6, 2025
@taj-p taj-p changed the title Migrate to ICU4X [ICU4X] Migrate text analysis and shaping to use ICU4x Nov 11, 2025
@taj-p taj-p changed the title [ICU4X] Migrate text analysis and shaping to use ICU4x [ICU4X] Migrate text analysis and shaping to use ICU4X Nov 11, 2025
@taj-p taj-p force-pushed the icu4x branch 2 times, most recently from 7444a92 to 984ac0b on November 17, 2025 at 21:39
github-merge-queue bot pushed a commit that referenced this pull request Nov 20, 2025
…#452)

#436 is getting too big 😅.
This PR extracts the Unicode generation crates from that PR.

## Intent

In order to migrate to ICU4X, we need to actually use its data. In our
first implementation, we simply baked data generation into Parley's
`build.rs`. This is problematic because:

1. Parley's build time increases by over 1 minute
2. Parley can only be built in `std` environments

So, this PR adopts a strategy of checking in the generated data to git
([see docs](https://arc.net/l/quote/uhjmmaae)) in order to reduce our
build time, not require `std` for building `Parley`, and reduce
`Parley`'s dependency tree. The crates we have introduced are:

1. `unicode_data_gen`: generates the Unicode data
2. `unicode_data`: exposes the Unicode data

## The Data

We expose two sets of data from `unicode_data`:

1. Re-exported ICU4X data providers for grapheme, word, and line
breaking, plus Unicode normalization tables used by Parley.
2. A locale-invariant `CompositePropsV1` provider backed by a compact
`CodePointTrie`. Essentially, this data structure allows us to look up
all properties of a given character with a single lookup (rather than N
property lookups per character), which was found to yield significant
performance savings (a minimal sketch follows below).

You can see how these data structures are used in
#436.
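To make the second point concrete, a minimal sketch of the packed single-lookup idea (the bit layout and names here are illustrative, not the actual `CompositePropsV1` schema):

```rust
use icu::collections::codepointtrie::CodePointTrie;

/// Illustrative wrapper: one trie lookup returns a packed word carrying
/// every property the analysis pass needs for a character.
struct CompositeProps<'trie> {
    trie: CodePointTrie<'trie, u32>,
}

impl CompositeProps<'_> {
    /// A single lookup instead of one per property (bidi class, line-break
    /// class, word-break class, emoji flags, script, ...).
    fn packed(&self, c: char) -> u32 {
        self.trie.get32(c as u32)
    }

    // Hypothetical bit layout for the packed value:
    fn bidi_class(packed: u32) -> u32 {
        packed & 0x1f // low 5 bits
    }
    fn line_break_class(packed: u32) -> u32 {
        (packed >> 5) & 0x3f // next 6 bits
    }
}
```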

## Extra

- As can be seen from generated code (shown below), ICU4X has an MSRV of
1.83. So, this PR also raises the MSRV of the repo to 1.83.


https://github.com/linebender/parley/blob/97cdb3bc7edafb31e9ddc3212854761fe932f522/unicode_data/src/generated/icu4x_data/normalizer_nfd_data_v1.rs.data#L60

- We test that the generated ICU4X data is current (and not corrupted)
by running it in CI. CLDR data doesn't update too often, so this
shouldn't impact our day-to-day PRs and instead should be a helpful way
to alert us of updates (if we somehow miss those updates ourselves!).
- For now, we're only targeting `en` locales for line and word breaking,
falling back to [UAX
#14](https://github.com/unicode-org/icu4x/blob/ee5399a77a6b94efb5d4b60678bb458c5eedb25d/components/segmenter/src/line.rs#L233-L255)
and
[#29](https://github.com/unicode-org/icu4x/blob/45638aba928c990c4a62360b6bec8e75000b73db/components/segmenter/src/lib.rs#L5-L21)
respectively (at parity with Swash). We will introduce a BYO Unicode
data API after the initial migration is merged (see the sketch after
this list) and, perhaps, feature flag which locales to bake (if any).
- I've introduced all the dependencies required on ICU4X in this PR (so
some may be unused here, but used in
#436).
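A rough sketch of what runtime-attached data could eventually look like, using ICU4X's blob provider (the file name and the `icu4x-datagen --format blob` step are assumptions about a future workflow, not an API Parley exposes today):

```rust
use icu::segmenter::LineSegmenter;
use icu_provider_blob::BlobDataProvider;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A blob generated ahead of time containing only the locales this
    // application needs.
    let blob = std::fs::read("segmenter_data.postcard")?;
    let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())?;
    let segmenter = LineSegmenter::try_new_auto_with_buffer_provider(&provider)?;
    let breaks: Vec<usize> = segmenter.segment_str("Hello world").collect();
    println!("line-break opportunities at {breaks:?}");
    Ok(())
}
```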

## PR Train

- #452 ◀️ 
- #436