@kenton-r kenton-r commented Nov 9, 2025

Problem

The line segmenter isolates every space in Khmer text by placing a break both before AND after it:

  • Input: "អស់ នឹង មាន"
  • Output: ['អស់', ' ', 'នឹង', ' ', 'មាន'] ❌ (isolated spaces)

Solution Overview

Fix two issues that prevented proper space handling:

  1. language.rs: Spaces were split into separate chunks → complex segmenter never saw them
  2. line.rs: Complex segmenter wasn't triggered for SA×SPACE×SA sequences

Changes

components/segmenter/src/complex/language.rs

What changed: Don't split text on whitespace characters

Lines ~62 & ~105: Modified both UTF-8 and UTF-16 iterators to skip whitespace when checking for language changes

Effect: Khmer phrases with spaces stay together as one chunk: "អស់ នឹង" instead of "អស់", " ", "នឹង"
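
A minimal sketch of the chunking idea, not the actual code in language.rs. The `Lang` enum, `lang_of`, and `chunks_keeping_spaces` are hypothetical names, and the hard-coded Unicode block ranges are an assumption standing in for the segmenter's real language detection. It only shows how skipping whitespace during the language-change check keeps a phrase in one chunk:

```rust
/// Hypothetical language tag, used for illustration only.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Lang {
    Thai,
    Lao,
    Burmese,
    Khmer,
    Unknown,
}

/// Assumption: classify by Unicode block. The real segmenter uses its own lookup.
fn lang_of(c: char) -> Lang {
    match c as u32 {
        0x0E00..=0x0E7F => Lang::Thai,
        0x0E80..=0x0EFF => Lang::Lao,
        0x1000..=0x109F => Lang::Burmese,
        0x1780..=0x17FF => Lang::Khmer,
        _ => Lang::Unknown,
    }
}

/// Split `text` into same-language chunks. Whitespace never starts a new
/// chunk: it is skipped when checking for a language change, so it stays
/// attached to the chunk that precedes it.
fn chunks_keeping_spaces(text: &str) -> Vec<(&str, Lang)> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut current: Option<Lang> = None;
    for (i, c) in text.char_indices() {
        if c.is_whitespace() {
            continue; // whitespace is not a language change
        }
        let lang = lang_of(c);
        if let Some(prev) = current {
            if prev != lang {
                out.push((&text[start..i], prev));
                start = i;
            }
        }
        current = Some(lang);
    }
    if start < text.len() {
        out.push((&text[start..], current.unwrap_or(Lang::Unknown)));
    }
    out
}
```

With this sketch, `chunks_keeping_spaces("អស់ នឹង")` yields a single Khmer chunk, while a switch to Latin text still splits, with the space staying on the preceding chunk.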


components/segmenter/src/line.rs

What changed: Handle SA×SPACE×SA (complex script + space + complex script) sequences

4 changes:

  1. ~Line 1070: Add peek_past_spaces_for_sa() helper

    • Looks ahead past consecutive spaces to check if SA continues
  2. ~Line 880: Extend complex breaking trigger

    • Changed from: only trigger for SA × SA
    • Changed to: trigger for SA × SA OR SA × SPACE × SA
  3. ~Line 908: Suppress UAX#14 breaks

    • Don't break at SA × SP if SA continues after space(s)
  4. ~Lines 1165 & 1198: Include spaces in text collection

    • Complex segmenter sees full phrases with spaces: "អស់ នឹង"

Effect: Complex segmenter (LSTM/dictionary) handles the entire SA×SPACE×SA sequence intelligently
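
The lookahead and the extended trigger can be sketched as below. This is a standalone illustration under stated assumptions, not the code in line.rs: only the name `peek_past_spaces_for_sa` comes from this PR, while its signature, `is_sa_approx`, and `defer_to_complex_segmenter` are hypothetical. The block ranges only approximate the UAX#14 SA class, which the real segmenter reads from property data.

```rust
/// Assumption: approximate Line_Break=SA by Unicode block (Thai, Lao,
/// Myanmar, Khmer). Not every character in these blocks is actually SA.
fn is_sa_approx(c: char) -> bool {
    matches!(
        c as u32,
        0x0E00..=0x0E7F | 0x0E80..=0x0EFF | 0x1000..=0x109F | 0x1780..=0x17FF
    )
}

/// Look past any run of consecutive spaces at the start of `rest` and
/// report whether the first non-space character is (approximately) SA.
fn peek_past_spaces_for_sa(rest: &str) -> bool {
    rest.chars().find(|&c| c != ' ').map_or(false, is_sa_approx)
}

/// Extended trigger: hand the run to the complex segmenter for SA × SA
/// and for SA × SPACE × SA. `prev` is the character just consumed and
/// `rest` is the text that follows it.
fn defer_to_complex_segmenter(prev: char, rest: &str) -> bool {
    // Zero spaces covers SA × SA; one or more covers SA × SPACE × SA.
    is_sa_approx(prev) && peek_past_spaces_for_sa(rest)
}
```

The same predicate can serve change 3: when it returns true at an SA × SP boundary, the UAX#14 break at that position would be suppressed and left to the complex segmenter.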


Result

  • Before: [0, 9, 10, 19, 20, ...] (double breaks)
  • After: [0, 9, 19, 29, ...] (single breaks)
  • Spaces properly included with words: ['អស់', ' នឹង', ' មាន']
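
To make the offsets concrete, the helper below (a hypothetical `segments` function, not part of ICU4X) slices the sample input at a list of UTF-8 byte breakpoints. Each Khmer word in the example is three code points of three bytes each, so the input is 29 bytes long; the final offset 29 in the "before" list is that length, filled in for the sketch.

```rust
/// Turn a sorted list of byte breakpoints into the segments between them.
/// Panics if a breakpoint is out of range or not on a char boundary.
fn segments<'a>(text: &'a str, breaks: &[usize]) -> Vec<&'a str> {
    breaks.windows(2).map(|w| &text[w[0]..w[1]]).collect()
}

/// Sample input from the PR description.
const INPUT: &str = "អស់ នឹង មាន";

/// Breakpoints before the change: every space is isolated.
const BEFORE: [usize; 6] = [0, 9, 10, 19, 20, 29];

/// Breakpoints after the change: each space stays with the following word.
const AFTER: [usize; 4] = [0, 9, 19, 29];
```

`segments(INPUT, &BEFORE)` gives the five isolated pieces from the Problem section, and `segments(INPUT, &AFTER)` gives the three pieces listed above.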

Impact

Fixes line breaking for Khmer, and possibly also for Thai, Lao, and Myanmar scripts.
The new behavior matches ICU4C.

CLAassistant commented Nov 9, 2025

CLA assistant check
All committers have signed the CLA.

Member

sffc commented Nov 10, 2025

Thanks for the contribution!

Please add tests for this behavior. Also, please assert that the word segmenter continues to break around spaces, and that only the line segmenter gets the new behavior.

Member

@sffc sffc left a comment

Please add tests.
