Chinese text (CJK?) issue with word cursor movements #434

@msparkles

Summary

Currently, when moving the cursor by a whole word in Chinese text (possibly CJK in general; untested), cosmic-text only moves by 1 character, instead of moving to the next punctuation mark or to an actual word boundary.

Moreover, when the cursor is next to a punctuation mark, cosmic-text skips over the punctuation mark plus an entire character before/after it.

Example text (the stone lion riddle):

門外有四十四隻獅子,不知是四十四隻死獅子,還是四十四隻石獅子
門外有四十四隻石獅子,旁邊有四十四隻死獅子,不知有沒有活獅子

This is observed on both cosmic-text-editor 1.0.0.beta.4-1 and Rust crate cosmic-text 0.15.0.

Background

We're speaking as someone whose native language is Mandarin and who has been using Chinese on computers for a decade.

As some may know, the problem with Chinese "word" selection/movement is that written Chinese doesn't use a space character to separate words. Historically, this has been a huge issue, but as technology developed and input methods evolved, solutions emerged that rely on dictionaries of words or clusters (especially in the input method sphere, as it is practically impossible to make a good Chinese input method without such a dictionary).

One example is Chewing, a Zhuyin input method (libchewing, rust docs, csv data), which could serve as a reference, provided that a complete and proper solution is within scope.

On Firefox, the correct behaviour is observed:

[Screenshots of Firefox omitted]

In programs without such an implementation, one particular alternative is usually picked: treat each run of characters delimited by punctuation or spaces as a single word.

On Kate, this alternative is observed:

[Screenshots of Kate omitted]

Notes

Despite these improvements in technology, the correct (dictionary-based) behaviour is still not commonly found, partly because the alternative behaviour is so widespread that it is what users are used to.

In mixed text (Chinese + other scripts), some programs stop the word at the boundary with another script, and some still count the other script as within the same word and skip to the next space or punctuation.

Example text ("squeak" in Chinese):

吱吱吱吱squeak吱吱squeak吱吱

Kate counts the entire thing (including the English part) as one word, while Firefox counts the Chinese and English parts as separate words. This difference rarely matters in practice, partly because the convention for foreign text within Chinese text is to pad it with spaces (i.e. 吱吱 squeak squeak 吱吱).
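To make the two behaviours concrete, here is a rough sketch in plain Rust (standard library only). Everything in it is an assumption made for illustration: `CharClass`, `classify`, and `word_runs` are hypothetical names, the code point ranges are a coarse approximation, and it only mimics the behaviour observed above rather than how Kate or Firefox are actually implemented.

```rust
/// Coarse character classes used only for this illustration.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum CharClass {
    Delimiter,
    Han,
    Other,
}

/// Hypothetical classifier. The ranges are approximate: ASCII punctuation,
/// whitespace, and the "CJK Symbols and Punctuation" block count as
/// delimiters; two common CJK ideograph blocks count as Han.
fn classify(c: char) -> CharClass {
    if c.is_whitespace()
        || c.is_ascii_punctuation()
        || ('\u{3000}'..='\u{303F}').contains(&c)
    {
        CharClass::Delimiter
    } else if ('\u{4E00}'..='\u{9FFF}').contains(&c)
        || ('\u{3400}'..='\u{4DBF}').contains(&c)
    {
        CharClass::Han
    } else {
        CharClass::Other
    }
}

/// Splits `text` into "words". With `split_on_script_change` set, a change
/// between Han and non-Han text also ends a word; without it, only
/// delimiters do.
fn word_runs(text: &str, split_on_script_change: bool) -> Vec<&str> {
    let mut runs = Vec::new();
    let mut start = 0;
    let mut prev: Option<CharClass> = None;
    for (i, c) in text.char_indices() {
        let mut class = classify(c);
        if !split_on_script_change && class == CharClass::Han {
            // Treat Han like any other word character.
            class = CharClass::Other;
        }
        if let Some(p) = prev {
            if p != class {
                if p != CharClass::Delimiter {
                    runs.push(&text[start..i]);
                }
                start = i;
            }
        }
        prev = Some(class);
    }
    if let Some(p) = prev {
        if p != CharClass::Delimiter {
            runs.push(&text[start..]);
        }
    }
    runs
}
```

With the unpadded example, `word_runs(text, false)` yields a single run, while `word_runs(text, true)` yields five (吱吱吱吱, squeak, 吱吱, squeak, 吱吱). With the space-padded form, both modes give the same result, which is why the difference is rarely noticed.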

There are more complicated cases like quotation marks (「」, 『』), "title marks" (quotes specifically for books/chapters) (《》, 〈〉), and parentheses (()), but most programs seem to treat them like any other punctuation, or like their English equivalents.

Proposal

This proposal is only about Chinese. It doesn't cover Japanese, Korean, or other languages that don't use spaces to separate words, as we don't speak those languages and can only speak for Chinese here. With that scope, the best approach is probably the conventional "just count everything between spaces and punctuation as a word": it's better than doing nothing, widely understood by users, and not horribly complicated to implement.
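As a rough illustration of what that rule could look like, here is a minimal sketch in plain Rust (standard library only). It is not cosmic-text's actual code or API: `is_delimiter`, `next_word_boundary`, and `prev_word_boundary` are hypothetical names, the punctuation ranges are an approximation, and `cursor` is assumed to be a byte offset on a character boundary.

```rust
/// Hypothetical helper: true for characters that end a "word" under the
/// conventional rule (whitespace or punctuation). The ranges are coarse;
/// for instance, U+3000..U+303F also holds a few letter-like marks such
/// as 々 that a real implementation may want to exclude.
fn is_delimiter(c: char) -> bool {
    c.is_whitespace()
        || c.is_ascii_punctuation()
        // CJK Symbols and Punctuation: 、。「」『』《》〈〉 and so on.
        || ('\u{3000}'..='\u{303F}').contains(&c)
        // Fullwidth and halfwidth punctuation: ，！？（）： and so on.
        || ('\u{FF01}'..='\u{FF0F}').contains(&c)
        || ('\u{FF1A}'..='\u{FF20}').contains(&c)
        || ('\u{FF3B}'..='\u{FF40}').contains(&c)
        || ('\u{FF5B}'..='\u{FF65}').contains(&c)
}

/// Byte offset reached by "move right by one word": skip any delimiters
/// after the cursor, then stop at the end of the following word.
/// Assumes `cursor` lies on a char boundary (slicing panics otherwise).
fn next_word_boundary(text: &str, cursor: usize) -> usize {
    let mut in_word = false;
    for (i, c) in text[cursor..].char_indices() {
        if is_delimiter(c) {
            if in_word {
                return cursor + i;
            }
        } else {
            in_word = true;
        }
    }
    text.len()
}

/// Byte offset reached by "move left by one word": skip any delimiters
/// before the cursor, then stop at the start of the preceding word.
/// Assumes `cursor` lies on a char boundary (slicing panics otherwise).
fn prev_word_boundary(text: &str, cursor: usize) -> usize {
    let mut boundary = cursor;
    let mut in_word = false;
    for (i, c) in text[..cursor].char_indices().rev() {
        if is_delimiter(c) {
            if in_word {
                break;
            }
        } else {
            in_word = true;
        }
        boundary = i;
    }
    boundary
}
```

On the first line of the example text, moving right from the start would land just before the first comma, and moving left from the end of a clause would land just after the preceding comma, without skipping any extra character next to the punctuation.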
