Summary
When using Option+Left / Option+Right (or Alt+Left / Alt+Right) for word-by-word cursor movement in the TUI composer, the cursor jumps over entire runs of CJK (Chinese/Japanese/Korean) characters as if they were a single word. This makes editing East Asian text very awkward.
Root Cause
textarea.rs defines word boundaries using a fixed set of ASCII punctuation separators:
// codex-rs/tui/src/bottom_pane/textarea.rs
const WORD_SEPARATORS: &str = "`~!@#$%^&*()-=+[{]}\\|;:'\",.<>/?";

fn is_word_separator(ch: char) -> bool {
    WORD_SEPARATORS.contains(ch)
}
beginning_of_previous_word() and end_of_next_word() classify each character as either a "separator" or a "regular character", and continue moving until the classification changes or whitespace is encountered. Because CJK characters are neither whitespace nor in the separator list, a sequence like 你好世界 is treated as one continuous word — the cursor skips all four characters in a single keystroke.
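The classification described above can be reproduced in isolation. The following is a simplified sketch, not the actual textarea.rs code: naive_prev_word_start is a hypothetical stand-in for beginning_of_previous_word(), written only to show why a run of CJK characters collapses into one word under a separator-only rule.

```rust
// Same separator set as quoted from textarea.rs above.
const WORD_SEPARATORS: &str = "`~!@#$%^&*()-=+[{]}\\|;:'\",.<>/?";

fn is_word_separator(ch: char) -> bool {
    WORD_SEPARATORS.contains(ch)
}

/// Hypothetical, simplified model of backward word movement.
/// Returns the byte offset of the start of the word before `cursor`.
fn naive_prev_word_start(text: &str, cursor: usize) -> usize {
    // Ignore whitespace sitting directly before the cursor.
    let trimmed = text[..cursor].trim_end_matches(char::is_whitespace);
    let mut chars = trimmed.char_indices().rev();
    let Some((mut start, last)) = chars.next() else {
        return 0;
    };
    let last_is_sep = is_word_separator(last);
    // Keep walking left while the separator/regular classification is unchanged.
    for (idx, ch) in chars {
        if ch.is_whitespace() || is_word_separator(ch) != last_is_sep {
            break;
        }
        start = idx;
    }
    start
}

fn main() {
    // Every CJK character is "regular", so the scan never stops early.
    println!("{}", naive_prev_word_start("你好世界", 12)); // 0
    // ASCII words separated by a space behave as expected.
    println!("{}", naive_prev_word_start("foo bar", 7)); // 4
}
```

Under this model the only stopping conditions are whitespace and a change between "separator" and "regular", so nothing inside 你好世界 can ever produce a boundary.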
The unicode-segmentation crate is already a dependency (its path appears in the binary's embedded debug info) and is used for grapheme-cluster boundaries (single-character movement). Its word-boundary logic (unicode_words() / UAX #29), however, is not used for word movement.
Steps to Reproduce
- Open Codex TUI.
- Type a Chinese sentence, e.g. "你好世界 hello".
- Press Option+Left (macOS) or Alt+Left (Linux/Windows) repeatedly.
Observed: cursor jumps from the end of 你好世界 directly to position 0 in one keystroke — the whole CJK run is treated as one word.
Expected: cursor moves one logical unit at a time through the CJK text (ideally each character, consistent with how editors such as VS Code, Zed, and Terminal readline behave for CJK input).
Failing Tests to Add
The following tests can be added to the #[cfg(test)] block in codex-rs/tui/src/bottom_pane/textarea.rs to codify the expected behavior:
#[test]
fn word_navigation_cjk_each_char_is_boundary() {
    // Each CJK character should be treated as its own word unit.
    // Each character here is 3 bytes in UTF-8:
    // 你(3) 好(3) 世(3) 界(3) → boundaries at byte offsets 0, 3, 6, 9, 12
    let text = "你好世界";
    let mut t = ta_with(text);
    // Start at end of text (after 界, byte 12)
    t.set_cursor(text.len()); // 12
    assert_eq!(t.beginning_of_previous_word(), 9, "Alt+Left from end should land at start of 界");
    t.set_cursor(9);
    assert_eq!(t.beginning_of_previous_word(), 6, "Alt+Left should land at start of 世");
    t.set_cursor(6);
    assert_eq!(t.beginning_of_previous_word(), 3, "Alt+Left should land at start of 好");
    t.set_cursor(3);
    assert_eq!(t.beginning_of_previous_word(), 0, "Alt+Left should land at start of 你");
}

#[test]
fn word_navigation_cjk_forward() {
    let text = "你好世界";
    let mut t = ta_with(text);
    t.set_cursor(0);
    assert_eq!(t.end_of_next_word(), 3, "Alt+Right from start should land after 你");
    t.set_cursor(3);
    assert_eq!(t.end_of_next_word(), 6, "Alt+Right should land after 好");
    t.set_cursor(6);
    assert_eq!(t.end_of_next_word(), 9, "Alt+Right should land after 世");
    t.set_cursor(9);
    assert_eq!(t.end_of_next_word(), 12, "Alt+Right should land after 界");
}

#[test]
fn word_navigation_mixed_ascii_cjk() {
    // Mixed text: "hello你好" — the boundary between ASCII and CJK should also be respected.
    let text = "hello你好";
    let mut t = ta_with(text);
    // Forward from start: "hello" is one word (bytes 0..5)
    t.set_cursor(0);
    assert_eq!(t.end_of_next_word(), 5, "Alt+Right should stop after 'hello'");
    // Forward from after "hello": 你 is next unit (bytes 5..8)
    t.set_cursor(5);
    assert_eq!(t.end_of_next_word(), 8, "Alt+Right should stop after 你");
    // Backward from end: 好 (bytes 8..11), so start of 好 is 8
    t.set_cursor(text.len()); // 11
    assert_eq!(t.beginning_of_previous_word(), 8, "Alt+Left should land at start of 好");
    // Backward from start of 好: 你 (bytes 5..8)
    t.set_cursor(8);
    assert_eq!(t.beginning_of_previous_word(), 5, "Alt+Left should land at start of 你");
    // Backward from start of 你: "hello" (bytes 0..5)
    t.set_cursor(5);
    assert_eq!(t.beginning_of_previous_word(), 0, "Alt+Left should land at start of 'hello'");
}
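The byte offsets asserted in these tests follow from UTF-8 encoding: each of the CJK characters used here occupies 3 bytes, while ASCII letters occupy 1. As a quick sanity check, a small standalone sketch (char_boundaries is a hypothetical helper, not part of textarea.rs) can list the valid cursor positions for a string:

```rust
/// Hypothetical helper: byte offsets of every character boundary in `text`,
/// including the position at the end of the string.
fn char_boundaries(text: &str) -> Vec<usize> {
    let mut offsets: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    offsets.push(text.len());
    offsets
}

fn main() {
    println!("{:?}", char_boundaries("你好世界")); // [0, 3, 6, 9, 12]
    println!("{:?}", char_boundaries("hello你好")); // [0, 1, 2, 3, 4, 5, 8, 11]
}
```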
Suggested Fix
In beginning_of_previous_word() and end_of_next_word(), treat any non-ASCII character as its own word unit (i.e., break on every non-ASCII character boundary). A minimal approach inside the existing loops:
// When iterating char-by-char, treat non-ASCII chars as individual word atoms:
if !ch.is_ascii() {
    // non-ASCII characters each form their own word boundary
    start = idx + ch.len_utf8();
    break;
}
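A more principled fix would leverage the already-present unicode-segmentation crate and use UnicodeSegmentation::unicode_words() for segment enumeration, which follows UAX #29 word-break rules. Unlike a blanket non-ASCII check, which would also split accented Latin words such as café character by character, the UAX #29 rules keep alphabetic words intact while still breaking between CJK ideographs, and they cover Arabic and other non-Latin scripts as well.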
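As a middle ground between the two options, the sketch below adds a third character class for CJK characters and stops after a single character of that class. It is a standalone illustration under stated assumptions, not a patch against textarea.rs: prev_word_start, next_word_end, classify, CharClass, and is_cjk are all hypothetical names, and the ranges in is_cjk are a coarse approximation (ideographs plus kana; Hangul is deliberately left out because Korean text is normally space-delimited).

```rust
const WORD_SEPARATORS: &str = "`~!@#$%^&*()-=+[{]}\\|;:'\",.<>/?";

#[derive(Clone, Copy, PartialEq)]
enum CharClass {
    Whitespace,
    Separator,
    CjkAtom, // each such character is its own word unit
    Regular,
}

/// Assumption: a coarse range check, not a complete definition of CJK.
fn is_cjk(ch: char) -> bool {
    matches!(ch,
        '\u{3040}'..='\u{30FF}'     // Hiragana, Katakana
        | '\u{3400}'..='\u{4DBF}'   // CJK Unified Ideographs Extension A
        | '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{F900}'..='\u{FAFF}')  // CJK Compatibility Ideographs
}

fn classify(ch: char) -> CharClass {
    if ch.is_whitespace() {
        CharClass::Whitespace
    } else if WORD_SEPARATORS.contains(ch) {
        CharClass::Separator
    } else if is_cjk(ch) {
        CharClass::CjkAtom
    } else {
        CharClass::Regular
    }
}

/// Byte offset of the start of the word unit before `cursor`.
fn prev_word_start(text: &str, cursor: usize) -> usize {
    let mut chars = text[..cursor].char_indices().rev().peekable();
    // Skip whitespace immediately before the cursor.
    while chars.peek().map_or(false, |&(_, ch)| ch.is_whitespace()) {
        chars.next();
    }
    let Some((mut start, first)) = chars.next() else {
        return 0;
    };
    let class = classify(first);
    if class == CharClass::CjkAtom {
        return start; // a single CJK character is a complete unit
    }
    for (idx, ch) in chars {
        if classify(ch) != class {
            break;
        }
        start = idx;
    }
    start
}

/// Byte offset of the end of the word unit after `cursor`.
fn next_word_end(text: &str, cursor: usize) -> usize {
    let mut chars = text[cursor..].char_indices().peekable();
    // Skip whitespace immediately after the cursor.
    while chars.peek().map_or(false, |&(_, ch)| ch.is_whitespace()) {
        chars.next();
    }
    let Some((idx, first)) = chars.next() else {
        return text.len();
    };
    let class = classify(first);
    let mut end = cursor + idx + first.len_utf8();
    if class == CharClass::CjkAtom {
        return end;
    }
    for (idx, ch) in chars {
        if classify(ch) != class {
            break;
        }
        end = cursor + idx + ch.len_utf8();
    }
    end
}

fn main() {
    let text = "hello你好";
    println!("{}", next_word_end(text, 0)); // 5
    println!("{}", next_word_end(text, 5)); // 8
    println!("{}", prev_word_start(text, text.len())); // 8
}
```

With these definitions the offsets expected by the tests above hold for this sketch, and a word such as café still moves as a single unit because its characters fall in the Regular class.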
Environment
- Codex version: 0.118.0
- OS: macOS 15 (arm64)
- Terminal: iTerm2 / Terminal.app
- Affected keybindings: Option+Left, Option+Right, Alt+Left, Alt+Right, Meta+b, Meta+f
Affected Source File
codex-rs/tui/src/bottom_pane/textarea.rs, functions beginning_of_previous_word() (line 1211) and end_of_next_word() (line 1232).