Fix line breaking for Khmer text (issue #7218) #7232
Problem
Spaces in Khmer text create isolated segments, with breaks both before and after each space:
Input: "អស់ នឹង មាន"
Segments: ['អស់', ' ', 'នឹង', ' ', 'មាន'] ❌ (isolated spaces)
Solution Overview
Fix two issues that prevented proper space handling:
Changes
components/segmenter/src/complex/language.rs
What changed: Don't split text on whitespace characters
Lines ~62 & ~105: Modified both UTF-8 and UTF-16 iterators to skip whitespace when checking for language changes
Effect: Khmer phrases with spaces stay together as one chunk: "អស់ នឹង" instead of "អស់", " ", "នឹង"
components/segmenter/src/line.rs
What changed: Handle SA×SPACE×SA (complex script + space + complex script) sequences
4 changes:
~Line 1070: Add peek_past_spaces_for_sa() helper
~Line 880: Extend complex breaking trigger from SA × SA to SA × SA OR SA × SPACE × SA
~Line 908: Suppress UAX#14 breaks at SA × SP if SA continues after space(s)
~Lines 1165 & 1198: Include spaces in text collection, e.g. "អស់ នឹង"
Effect: Complex segmenter (LSTM/dictionary) handles the entire SA×SPACE×SA sequence intelligently
Result
Before: [0, 9, 10, 19, 20, ...] (double breaks)
After: [0, 9, 19, 29, ...] (single breaks)
Segments: ['អស់', ' នឹង', ' មាន'] ✅
Impact
Fixes line breaking for Khmer, and possibly for Thai, Lao, and Myanmar scripts as well.
Matches ICU4C behavior.
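The break offsets listed under Result are consistent with UTF-8 byte indices: each Khmer code point in the example takes 3 bytes, so each three-code-point word is 9 bytes and each space is 1 byte. The sketch below slices the three-word string from the Problem section at a list of offsets so the two results can be compared. The helper name `segments` is hypothetical, and the final offset 29 is an assumption based on that string's byte length, since the lists above are elided with "...".

```rust
/// Slice `text` at consecutive break offsets (UTF-8 byte indices).
/// Each offset must fall on a character boundary, otherwise slicing panics.
fn segments<'a>(text: &'a str, breaks: &[usize]) -> Vec<&'a str> {
    breaks
        .windows(2) // each adjacent pair of offsets delimits one segment
        .map(|w| &text[w[0]..w[1]])
        .collect()
}
```

For "អស់ នឹង មាន" (29 bytes), the offsets `[0, 9, 10, 19, 20, 29]` give five segments with each space isolated, while `[0, 9, 19, 29]` give three segments with each space attached to the word that follows it.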