TUI word movement (Option/Alt+Arrow) treats entire CJK sequences as a single word #16584

@xc1427

Summary

When using Option+Left / Option+Right (or Alt+Left / Alt+Right) for word-by-word cursor movement in the TUI composer, the cursor jumps over entire runs of CJK (Chinese/Japanese/Korean) characters as if they were a single word. This makes editing East Asian text very awkward.

Root Cause

textarea.rs defines word boundaries using a fixed set of ASCII punctuation separators:

// codex-rs/tui/src/bottom_pane/textarea.rs
const WORD_SEPARATORS: &str = "`~!@#$%^&*()-=+[{]}\\|;:'\",.<>/?";

fn is_word_separator(ch: char) -> bool {
    WORD_SEPARATORS.contains(ch)
}

beginning_of_previous_word() and end_of_next_word() classify each character as either a "separator" or a "regular character", and continue moving until the classification changes or whitespace is encountered. Because CJK characters are neither whitespace nor in the separator list, a sequence like 你好世界 is treated as one continuous word — the cursor skips all four characters in a single keystroke.
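
The behavior described above can be reproduced outside the TUI with a small standalone model. This is a simplified sketch, not the actual textarea.rs code: the free function below takes the text and cursor as parameters (an assumption made so the example is self-contained) and only mirrors the separator/regular classification described in this issue.

```rust
// Simplified, standalone model of the classification described in this issue.
// NOT the real textarea.rs implementation; the signature is an assumption.
const WORD_SEPARATORS: &str = "`~!@#$%^&*()-=+[{]}\\|;:'\",.<>/?";

fn is_word_separator(ch: char) -> bool {
    WORD_SEPARATORS.contains(ch)
}

/// Returns the byte offset where the word before `cursor` starts.
/// `cursor` must lie on a char boundary of `text`.
fn beginning_of_previous_word(text: &str, cursor: usize) -> usize {
    // Walk left from the cursor, skipping whitespace first.
    let mut chars = text[..cursor]
        .char_indices()
        .rev()
        .skip_while(|&(_, ch)| ch.is_whitespace());

    let Some((mut start, first)) = chars.next() else {
        return 0;
    };
    let first_is_sep = is_word_separator(first);

    // Keep moving while the separator/regular classification stays the same.
    for (idx, ch) in chars {
        if ch.is_whitespace() || is_word_separator(ch) != first_is_sep {
            break;
        }
        start = idx;
    }
    start
}

#[test]
fn cjk_run_is_one_word_under_ascii_only_rules() {
    // ASCII words behave as expected.
    assert_eq!(beginning_of_previous_word("hello world", 11), 6);
    // All four CJK characters are skipped in one step.
    assert_eq!(beginning_of_previous_word("你好世界", 12), 0);
}
```

Because every CJK character falls into the "regular character" class, nothing inside the run ever changes the classification, so the loop only stops at the start of the text.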

The unicode-segmentation crate is already a transitive dependency (its path appears in the binary's embedded debug info) and is used for grapheme-cluster boundaries (single-character movement), but its word-boundary logic (unicode_words() / UAX #29) is not used for word movement.

Steps to Reproduce

  1. Open Codex TUI.
  2. Type a Chinese sentence, e.g. 你好世界 hello.
  3. Press Option+Left (macOS) or Alt+Left (Linux/Windows) repeatedly.

Observed: once the cursor is at the end of 你好世界, a single keystroke moves it directly to position 0; the whole CJK run is treated as one word.

Expected: cursor moves one logical unit at a time through the CJK text (ideally each character, consistent with how editors such as VS Code, Zed, and Terminal readline behave for CJK input).

Failing Tests to Add

The following tests can be added to the #[cfg(test)] block in codex-rs/tui/src/bottom_pane/textarea.rs to codify the expected behavior:

#[test]
fn word_navigation_cjk_each_char_is_boundary() {
    // Each CJK character should be treated as its own word unit.
    // Every character here is 3 bytes in UTF-8, so the char boundaries are:
    //   你(3) 好(3) 世(3) 界(3)  →  byte offsets: 0,3,6,9,12
    let text = "你好世界";
    let mut t = ta_with(text);

    // Start at end of text (after 界, byte 12)
    t.set_cursor(text.len()); // 12
    assert_eq!(t.beginning_of_previous_word(), 9, "Alt+Left from end should land at start of 界");

    t.set_cursor(9);
    assert_eq!(t.beginning_of_previous_word(), 6, "Alt+Left should land at start of 世");

    t.set_cursor(6);
    assert_eq!(t.beginning_of_previous_word(), 3, "Alt+Left should land at start of 好");

    t.set_cursor(3);
    assert_eq!(t.beginning_of_previous_word(), 0, "Alt+Left should land at start of 你");
}

#[test]
fn word_navigation_cjk_forward() {
    let text = "你好世界";
    let mut t = ta_with(text);

    t.set_cursor(0);
    assert_eq!(t.end_of_next_word(), 3, "Alt+Right from start should land after 你");

    t.set_cursor(3);
    assert_eq!(t.end_of_next_word(), 6, "Alt+Right should land after 好");

    t.set_cursor(6);
    assert_eq!(t.end_of_next_word(), 9, "Alt+Right should land after 世");

    t.set_cursor(9);
    assert_eq!(t.end_of_next_word(), 12, "Alt+Right should land after 界");
}

#[test]
fn word_navigation_mixed_ascii_cjk() {
    // Mixed text: "hello你好" — the boundary between ASCII and CJK should also be respected.
    let text = "hello你好";
    let mut t = ta_with(text);

    // Forward from start: "hello" is one word (bytes 0..5)
    t.set_cursor(0);
    assert_eq!(t.end_of_next_word(), 5, "Alt+Right should stop after 'hello'");

    // Forward from after "hello": 你 is next unit (bytes 5..8)
    t.set_cursor(5);
    assert_eq!(t.end_of_next_word(), 8, "Alt+Right should stop after 你");

    // Backward from end: 好 (bytes 8..11), so start of 好 is 8
    t.set_cursor(text.len()); // 11
    assert_eq!(t.beginning_of_previous_word(), 8, "Alt+Left should land at start of 好");

    // Backward from start of 好: 你 (bytes 5..8)
    t.set_cursor(8);
    assert_eq!(t.beginning_of_previous_word(), 5, "Alt+Left should land at start of 你");

    // Backward from start of 你: "hello" (bytes 0..5)
    t.set_cursor(5);
    assert_eq!(t.beginning_of_previous_word(), 0, "Alt+Left should land at start of 'hello'");
}
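
The byte offsets asserted in the tests above follow directly from UTF-8: each of these CJK characters occupies 3 bytes, while ASCII letters occupy 1. As a quick sanity check, the sketch below lists the char boundaries of the test strings. `char_boundaries` is a hypothetical helper introduced only for this illustration; it is not part of textarea.rs.

```rust
/// Hypothetical helper: returns every char boundary of `text` as a byte
/// offset, including the final boundary at `text.len()`.
fn char_boundaries(text: &str) -> Vec<usize> {
    text.char_indices()
        .map(|(idx, _)| idx)
        .chain(std::iter::once(text.len()))
        .collect()
}

#[test]
fn test_strings_have_expected_boundaries() {
    // Four 3-byte characters.
    assert_eq!(char_boundaries("你好世界"), vec![0, 3, 6, 9, 12]);
    // Five 1-byte characters followed by two 3-byte characters.
    assert_eq!(char_boundaries("hello你好"), vec![0, 1, 2, 3, 4, 5, 8, 11]);
}
```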

Suggested Fix

In beginning_of_previous_word() and end_of_next_word(), treat each CJK character as its own word unit. A minimal approach inside the existing loops is to break on every non-ASCII character, although this also splits accented Latin, Cyrillic, and other non-CJK scripts into single-character words:

// When iterating char-by-char, treat non-ASCII chars as individual word atoms:
if !ch.is_ascii() {
    // non-ASCII characters each form their own word boundary
    start = idx + ch.len_utf8();
    break;
}

A more principled fix would leverage the already-present unicode-segmentation crate and use UnicodeSegmentation::unicode_words() for segment enumeration, which follows UAX #29 word-break rules and handles CJK, Arabic, and other non-Latin scripts correctly.
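
As a middle ground between the two options above, here is a minimal standalone sketch that adds a dedicated CJK character class instead of breaking on every non-ASCII character. Everything in it is an assumption made for illustration: the `CharClass` enum, the `is_cjk` helper with its deliberately incomplete list of Unicode blocks (Han ideographs and kana only), and the free-function signatures taking `text` and `cursor`. It is not the actual textarea.rs code and does not implement full UAX #29 segmentation.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum CharClass {
    Whitespace,
    Separator,
    Cjk, // each character is its own word unit
    Word,
}

const WORD_SEPARATORS: &str = "`~!@#$%^&*()-=+[{]}\\|;:'\",.<>/?";

/// Hypothetical helper; the block list is illustrative and incomplete.
fn is_cjk(ch: char) -> bool {
    matches!(
        ch as u32,
        0x3040..=0x309F       // Hiragana
            | 0x30A0..=0x30FF // Katakana
            | 0x3400..=0x4DBF // CJK Unified Ideographs Extension A
            | 0x4E00..=0x9FFF // CJK Unified Ideographs
            | 0xF900..=0xFAFF // CJK Compatibility Ideographs
    )
}

fn classify(ch: char) -> CharClass {
    if ch.is_whitespace() {
        CharClass::Whitespace
    } else if WORD_SEPARATORS.contains(ch) {
        CharClass::Separator
    } else if is_cjk(ch) {
        CharClass::Cjk
    } else {
        CharClass::Word
    }
}

/// Byte offset where the previous word unit starts.
/// `cursor` must lie on a char boundary of `text`.
fn beginning_of_previous_word(text: &str, cursor: usize) -> usize {
    let mut chars = text[..cursor]
        .char_indices()
        .rev()
        .skip_while(|&(_, ch)| ch.is_whitespace());
    let Some((mut start, first)) = chars.next() else {
        return 0;
    };
    let class = classify(first);
    if class == CharClass::Cjk {
        return start; // a CJK character is a complete unit on its own
    }
    for (idx, ch) in chars {
        if classify(ch) != class {
            break;
        }
        start = idx;
    }
    start
}

/// Byte offset just past the next word unit.
/// `cursor` must lie on a char boundary of `text`.
fn end_of_next_word(text: &str, cursor: usize) -> usize {
    let mut chars = text[cursor..]
        .char_indices()
        .skip_while(|&(_, ch)| ch.is_whitespace());
    let Some((idx, first)) = chars.next() else {
        return text.len();
    };
    let class = classify(first);
    let mut end = cursor + idx + first.len_utf8();
    if class == CharClass::Cjk {
        return end;
    }
    for (idx, ch) in chars {
        if classify(ch) != class {
            break;
        }
        end = cursor + idx + ch.len_utf8();
    }
    end
}

#[test]
fn cjk_chars_are_individual_units() {
    assert_eq!(beginning_of_previous_word("你好世界", 12), 9);
    assert_eq!(end_of_next_word("你好世界", 0), 3);
    assert_eq!(end_of_next_word("hello你好", 0), 5);
    assert_eq!(beginning_of_previous_word("hello你好", 5), 0);
}
```

Compared with the `!ch.is_ascii()` check, this keeps words such as "café" or Cyrillic text intact, at the cost of maintaining a range table. Hangul is left out of `is_cjk` in this sketch, so Korean text would still move by space-delimited runs; whether that matches the desired behavior is a separate design decision.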

Environment

  • Codex version: 0.118.0
  • OS: macOS 15 (arm64)
  • Terminal: iTerm2 / Terminal.app
  • Affected keybindings: Option+Left, Option+Right, Alt+Left, Alt+Right, Meta+b, Meta+f

Affected Source File

codex-rs/tui/src/bottom_pane/textarea.rs, functions beginning_of_previous_word() (line 1211) and end_of_next_word() (line 1232).

Metadata

Labels: TUI, bug
