Fix incorrect results for Japanese #77
Conversation
Since the Japanese regex included only Hiragana and Katakana, every input with more than 50% Han characters was mistakenly detected as Chinese.
wooorm
left a comment
Wow this looks cool! Thanks for working on this!! ✨
Could you:
a) provide a test for this? Something that used to fail, but works now?
b) Would it be possible to generate the kanji range as well? Preferably from unicode-7.0.0 or so?
(aside: I should update that to the latest Unicode)
Excerpts from the UDHR (articles 3, 8, and 16). Other articles that used to fail: 18, 22, 24, 25, 27, and 28.
lorumic
left a comment
Hello Titus, thanks for reviewing the pull request. I will answer your questions here:
a) Sure thing. I have added tests for 3 of the 9 articles of the Japanese UDHR that, taken alone, used to fail (misdetected as Chinese) and now pass after the changes.
b) I wish I could, but as far as I know there is no Unicode script that separates Japanese kanji from the rest of Han.
Looking at the latest list of all Unicode scripts here, Japanese kanji fall into these two subsets of Han:
3400..4DB5 ; Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DB5
4E00..9FEF ; Han # Lo [20976] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FEF
Also, the second subset contains around 50 characters that are not part of the "actual" Japanese kanji range, which I found here instead. I have added the link to the Unicode Kanji Table to the build.js file as well, for reference.
Among these almost 27,000 characters, the official ones according to the Japanese Ministry of Education number slightly more than 2,000, plus another ~1,000 (regularly updated by the Ministry of Justice) used for registered names, so around 3,000 in total. Still, the Rikai kanji table was the most reliable source I could find, and it is probably a good compromise: hand-picking those 3,000 from the Han set would be painful, as they are likely to be quite sparse.
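For illustration, the two Han subsets quoted above could be folded into a Japanese expression along these lines. This is only a sketch, not franc's actual source: the variable names are mine, and the kana bounds are the Unicode block boundaries (Hiragana U+3040–U+309F, Katakana U+30A0–U+30FF).

```javascript
// Sketch (hypothetical, not franc's real code): a Japanese expression
// covering kana plus the two CJK Unified Ideographs ranges listed above.
const japanese = /[\u3040-\u309F\u30A0-\u30FF\u3400-\u4DB5\u4E00-\u9FEF]/g;

// A Han-only expression over the same ideograph ranges, for comparison:
// it matches kanji but not kana.
const han = /[\u3400-\u4DB5\u4E00-\u9FEF]/g;
```

With these, a string like 持込申請は would be fully matched by `japanese`, while `han` would miss the は.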
Now, I understand and agree that having hard-coded ranges is an extremely bad practice. I think we are left with two choices to cope with this:
-
We throw in the whole Han script with all its ~90000 characters, of which only slightly more than 3% are used in contemporary Japanese texts
-
We keep the hard-coded range that I inserted now, which is a smaller subset of Han but still contains a lot of superfluous characters (you would find only 11% of them in contemporary Japanese texts)
I am not particularly against or in favor of either of these two options, as long as the root problem (Japanese misdetected as Chinese) gets solved, so tell me which one you prefer and I will commit the change accordingly.
wooorm
left a comment
Thanks for the detailed information! Let’s do it like this!
Since the Japanese regex included only Hiragana and Katakana, every input containing kana but with more than 50% Han characters was mistakenly detected as Chinese.
This fix is necessary because many Japanese texts (especially technical ones) may have a kanji ratio higher than 50%. Since the detect method counts the occurrences of matching expressions (characters, in this case), for Japanese it would match only the kana (< 50% of total characters), hence returning Chinese as the top language. But if there is even just one kana, the text should be detected as Japanese.
A trivial example:
持込申請最悪状態完了
contains only Han characters, so I would agree with it being detected as Chinese. But if I insert the hiragana particle は as follows:
持込申請は最悪状態完了
it should be detected as Japanese, because it is a mostly correct Japanese sentence (abbreviated, if anything). Instead, franc reports it as Chinese, because 90% of the characters are Han and only 10% are kana.
To prevent that, I have added the regex for Japanese kanji, which in my opinion is cleaner than throwing in the whole Han block, so that kanji are also matched when detecting Japanese. In the examples above, the first case would still (correctly) be detected as Chinese, but the second would score 90% for Chinese and 100% for Japanese, and so be correctly detected as Japanese.
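The scoring described above can be sketched as a simple coverage ratio (matched characters divided by total characters). This is a hypothetical illustration of the idea, not franc's actual implementation; the `scripts` table and `scores` function are mine, with the Chinese entry covering only the Han ideograph ranges and the Japanese entry adding kana.

```javascript
// Sketch of ratio-based scoring (not franc's real code):
// score = characters matched by a script's expression / total characters.
const scripts = {
  // Chinese: CJK Unified Ideographs ranges only.
  cmn: /[\u3400-\u4DB5\u4E00-\u9FEF]/g,
  // Japanese: Hiragana + Katakana plus the same kanji ranges (this PR's change).
  jpn: /[\u3040-\u30FF\u3400-\u4DB5\u4E00-\u9FEF]/g
};

function scores(value) {
  const result = {};
  for (const [name, expression] of Object.entries(scripts)) {
    result[name] = (value.match(expression) || []).length / value.length;
  }
  return result;
}

// Han-only sample: both expressions cover every character.
// scores('持込申請最悪状態完了') → {cmn: 1, jpn: 1}
// With the は particle inserted, Chinese coverage drops to 10/11 (~90%)
// while Japanese stays at 100%, so Japanese wins:
// scores('持込申請は最悪状態完了') → {cmn: 0.909…, jpn: 1}
```

This mirrors the argument in the text: before the change, the Japanese expression matched only the lone kana (1/11 ≈ 9%), losing to Chinese; with kanji included, Japanese reaches full coverage whenever any kana is present.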