[ICU4X] Migrate text analysis and shaping to use ICU4X #436
Draft: conor-93 wants to merge 126 commits into linebender:main from conor-93:icu4x
+2,746 −1,169
Conversation
- condense all byte indexes to char indexes in a single loop
- track a minimal set of LineSegmenters (per LineBreakWordOption), and create them as needed
- clean up tests
- add tests for multi-character graphemes
- group all word boundary logic together
- fix incorrect start truncation for multi-style strings which aren't multi-wb style, plus a test for this
- test naming/grouping
- compute `force_normalize`
- simplify ClusterInfo to just `is_emoji`
- more clean-up
# Conflicts:
#   parley/src/editing/selection.rs
#   parley/src/layout/cluster.rs
#   parley/src/layout/data.rs
#   parley/src/layout/mod.rs
conor-93 added a commit to conor-93/parley that referenced this pull request on Nov 6, 2025
Force-pushed from 7444a92 to 984ac0b
github-merge-queue bot pushed a commit that referenced this pull request on Nov 20, 2025
…#452)

#436 is getting too big 😅. This PR extracts the Unicode generation crates from that PR.

## Intent

In order to migrate to ICU4X, we need to actually use its data. In our first implementation, we simply baked data generation into Parley's `build.rs`. This is problematic because:

1. Parley's build time increases by over 1 minute
2. Parley can only be built in `std` environments

So, this PR adopts a strategy of checking the generated data into git ([see docs](https://arc.net/l/quote/uhjmmaae)) in order to reduce our build time, avoid requiring `std` to build `Parley`, and reduce `Parley`'s dependency tree. The crates we have introduced are:

1. `unicode_data_gen`: generates the Unicode data
2. `unicode_data`: exposes the Unicode data

## The Data

We expose two sets of data from `unicode_data`:

1. Re-exported ICU4X data providers for grapheme, word, and line breaking, plus the Unicode normalization tables used by Parley.
2. A locale-invariant `CompositePropsV1` provider backed by a compact `CodePointTrie`. Essentially, this data structure allows us to look up all properties of a given character with a single lookup (rather than N property lookups per character). This was found to yield significant performance savings (a sketch of the idea follows below).

You can see how these data structures are used in #436.

## Extra

- As can be seen from the generated code (shown below), ICU4X has an MSRV of 1.83, so this PR also raises the MSRV of the repo to 1.83. https://github.com/linebender/parley/blob/97cdb3bc7edafb31e9ddc3212854761fe932f522/unicode_data/src/generated/icu4x_data/normalizer_nfd_data_v1.rs.data#L60
- We test in CI that the generated ICU4X data is current (and not corrupted). CLDR data doesn't update too often, so this shouldn't impact our day-to-day PRs and should instead be a helpful way to alert us of updates (if we somehow miss those updates ourselves!).
- For now, we're only targeting `en` locales for line and word breaking, falling back to [UAX #14](https://github.com/unicode-org/icu4x/blob/ee5399a77a6b94efb5d4b60678bb458c5eedb25d/components/segmenter/src/line.rs#L233-L255) and [UAX #29](https://github.com/unicode-org/icu4x/blob/45638aba928c990c4a62360b6bec8e75000b73db/components/segmenter/src/lib.rs#L5-L21) respectively (at parity with Swash). We will introduce a BYO Unicode data API after the initial migration is merged and, perhaps, feature-flag which locales to bake (if any).
- I've introduced all the dependencies required on ICU4X in this PR (so some may be unused here, but used in #436).

## PR Train

- #452 ◀️
- #436
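The single-lookup idea behind `CompositePropsV1` can be sketched roughly as follows. This is an illustration only: the bit layout, field names, and the `lookup` stand-in are assumptions, not the generated format, but they show how packing properties into one value per code point turns N property queries into one trie lookup plus a few mask operations.

```rust
/// Illustrative sketch only: pack several per-code-point properties into one
/// `u32` so analysis pays a single lookup per character. The real
/// `CompositePropsV1` layout in `unicode_data` is generated and differs.
#[derive(Clone, Copy)]
struct CompositeProps(u32);

impl CompositeProps {
    // Hypothetical bit layout: low 8 bits = script id, next 5 bits = general
    // category, then one flag bit for "is emoji".
    fn script_id(self) -> u8 {
        (self.0 & 0xFF) as u8
    }
    fn general_category(self) -> u8 {
        ((self.0 >> 8) & 0x1F) as u8
    }
    fn is_emoji(self) -> bool {
        (self.0 & (1 << 13)) != 0
    }
}

/// Stand-in for the single `CodePointTrie<u32>` lookup; in the real crate this
/// would be something like `trie.get32(c as u32)` over checked-in data.
fn lookup(c: char) -> CompositeProps {
    let packed = if c == '😀' { (1 << 13) | 0x01 } else { 0x02 };
    CompositeProps(packed)
}

fn main() {
    for c in "a😀".chars() {
        let props = lookup(c);
        println!(
            "{c}: emoji={} script={} category={}",
            props.is_emoji(),
            props.script_id(),
            props.general_category()
        );
    }
}
```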
Migration of text analysis from Swash → ICU4X
Overview
ICU4X enables text analysis and internationalisation. For Parley, this includes locale and language recognition,
bidirectional text evaluation, text segmentation, emoji recognition, NFC/NFD normalisation and other Unicode character information.
ICU4X is developed and maintained by a trusted authority in the space of text internationalisation: the ICU4X Technical Committee (ICU4X-TC) in the Unicode Consortium. It is targeted at resource-constrained environments. For Parley, this means:
Notable changes
- … `select_font` …
- Emoji detection improvements: flag emoji ("🇺🇸") and keycap sequences (e.g. 0️⃣ through 9️⃣) are now supported in cluster detection; Swash did not support these (see the sketch after this list).
- … `Scripts`).
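A minimal sketch of the cluster behaviour described in the list above, assuming the `icu_segmenter` crate (1.x) with its default compiled data; this is not Parley code, just an illustration that a flag emoji and a keycap sequence each segment as a single grapheme cluster.

```rust
// Sketch: relies on icu_segmenter's compiled data (default feature in 1.x).
use icu_segmenter::GraphemeClusterSegmenter;

/// Splits `text` into extended grapheme clusters using ICU4X segmentation.
fn clusters(text: &str) -> Vec<&str> {
    let segmenter = GraphemeClusterSegmenter::new();
    // `segment_str` yields byte offsets of cluster boundaries, including 0 and
    // `text.len()`; each adjacent pair of offsets delimits one cluster.
    let breaks: Vec<usize> = segmenter.segment_str(text).collect();
    breaks
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}

fn main() {
    // A flag emoji (two regional indicators) and a keycap sequence
    // (digit + U+FE0F + U+20E3) should each form exactly one cluster.
    assert_eq!(clusters("🇺🇸").len(), 1);
    assert_eq!(clusters("0\u{FE0F}\u{20E3}").len(), 1);
    println!("flag and keycap sequences each form a single cluster");
}
```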
Performance/binary size
- `vello_editor` is ~100kB larger (9720kB vs 9620kB).
- There is a performance regression (~7%), but with optimisations (composite trie data sources, further minimisation of allocations/iterations), the regression is much less significant than it was originally (~55%):
As noted in #436 (comment), I think that we're getting close to maximising the efficiency of the current APIs offered by ICU. This can be seen by inspecting the text analysis profile:
Further optimisation of text analysis may require delving into ICU/unicode-bidi internals to, for example:
Other details
- `Language` parsing is more tolerant, e.g. it permits extra, invalid subtags (as in "en-Latn-US-a-b-c-d").
- … `vello_editor` compilation testing). In order to potentially support correct word breaking across all languages without seeing a huge compilation size increase, we would need a way for users to attach only the locale data they need at runtime. This locale data could be generated (with `icu4x-datagen`) and attached (using `DataProvider`s) at runtime in the future (see the sketch after this list).
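A hedged sketch of what attaching locale data at runtime could look like, assuming `icu_provider_blob` and `icu_segmenter` 1.x with their buffer-provider (serde) features enabled; the constructor names, the blob path, and the overall flow are assumptions rather than a settled Parley API.

```rust
// Assumption: a postcard blob produced by `icu4x-datagen` for exactly the
// locales the application needs (the file name here is hypothetical).
use icu_provider_blob::BlobDataProvider;
use icu_segmenter::WordSegmenter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the datagen output at runtime instead of baking it into the binary.
    let blob = std::fs::read("locale_data.postcard")?;
    let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())?;

    // Construct a segmenter from the runtime-supplied data; the constructor
    // name follows the icu_segmenter 1.x `*_with_buffer_provider` convention.
    let segmenter = WordSegmenter::try_new_auto_with_buffer_provider(&provider)?;
    let breaks: Vec<usize> = segmenter.segment_str("Hello world").collect();
    println!("word boundaries at byte offsets {breaks:?}");
    Ok(())
}
```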
Future Work
- … if we only bake `en` but then need to shape some `ar` text, we could inform the consumer that they need to provide `ar` property data.