Fix incorrect results for Japanese #77
Conversation
Since the Japanese regex included only Hiragana and Katakana, every input with more than 50% Han characters was mistakenly detected as Chinese.
wooorm
left a comment
Wow this looks cool! Thanks for working on this!! ✨
Could you:
a) provide a test for this? Something that used to fail, but works now?
b) Would it be possible to generate the kanji range as well? Preferably from unicode-7.0.0 or so?
(aside: I should update that to the latest Unicode)
Excerpts from the UDHR (articles 3, 8, and 16). Other articles that used to fail: 18, 22, 24, 25, 27, and 28.
lorumic
left a comment
Hello Titus, thanks for reviewing the pull request. I will answer your questions here:
a) Sure thing. I have added tests for 3 of the 9 articles of the Japanese UDHR that, taken alone, used to fail (misdetected as Chinese) and now pass after the changes.
b) I wish I could, but as far as I know there is no Unicode script that separates Japanese kanji from the rest of Han.
Looking at the latest list of all Unicode scripts here, Japanese kanji fall into these two subsets of Han:
3400..4DB5 ; Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DB5
4E00..9FEF ; Han # Lo [20976] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FEF
Also, the second subset contains around 50 characters that are not part of the "actual" Japanese kanji range, which I found here instead. I have added the link to the Unicode Kanji Table to the build.js file as well, for reference.
Among these almost 27,000 characters, the official ones according to the Japanese Ministry of Education number slightly more than 2,000, plus another ~1,000 (regularly updated by the Ministry of Justice) used for registered names, so around 3,000 in total. Still, the Rikai kanji table was the most reliable source I could find, and it is probably a good compromise: hand-picking those 3,000 from the Han set would be painful, as they are likely to be quite sparse.
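For illustration, the two Han subsets quoted above could be folded into a Japanese expression along these lines. This is only a sketch, not franc's actual source: the variable names are mine, and the kana bounds are the Unicode block boundaries (Hiragana U+3040–U+309F, Katakana U+30A0–U+30FF).

```javascript
// Sketch (hypothetical, not franc's real code): a Japanese expression
// covering kana plus the two CJK Unified Ideographs ranges listed above.
const japanese = /[\u3040-\u309F\u30A0-\u30FF\u3400-\u4DB5\u4E00-\u9FEF]/g;

// A Han-only expression over the same ideograph ranges, for comparison:
// it matches kanji but not kana.
const han = /[\u3400-\u4DB5\u4E00-\u9FEF]/g;
```

With these, a string like 持込申請は would be fully matched by `japanese`, while `han` would miss the は.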
Now, I understand and agree that having hard-coded ranges is an extremely bad practice. I think we are left with two choices to cope with this:
-
We throw in the whole Han script with all its ~90000 characters, of which only slightly more than 3% are used in contemporary Japanese texts
-
We keep the hard-coded range that I inserted now, which is a smaller subset of Han but still contains a lot of superfluous characters (you would find only 11% of them in contemporary Japanese texts)
I am not particularly against or in favor of either of these two options, as long as the root problem (Japanese misdetected as Chinese) gets solved, so tell me which one you prefer and I will commit the change accordingly.
wooorm
left a comment
Thanks for the detailed information! Let’s do it like this!
Since the Japanese regex included only Hiragana and Katakana, every input containing kana but with more than 50% Han characters was mistakenly detected as Chinese.
This fix is necessary because many Japanese texts (especially technical ones) may have a kanji ratio higher than 50%. Since the detect method counts the occurrences of matching expressions (characters, in this case), for Japanese it would match only the kana (< 50% of total characters), hence returning Chinese as the top language. But if there is even just one kana, the text should be detected as Japanese.
A trivial example:
持込申請最悪状態完了
contains only Han characters, so I would agree with it being detected as Chinese. But if I insert the hiragana particle は as follows:
持込申請は最悪状態完了
it should be detected as Japanese, because it is a mostly correct Japanese sentence (abbreviated, if anything). Instead, franc reports it as Chinese, because 90% of the characters are Han and only 10% are kana.
To prevent that, I have added the regex for Japanese kanji, which in my opinion is cleaner than throwing in the whole Han block, so that kanji are also matched when detecting Japanese. In the examples above, the first case would still (correctly) be detected as Chinese, but the second would score 90% for Chinese and 100% for Japanese, and so be correctly detected as Japanese.
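The scoring described above can be sketched as a simple coverage ratio (matched characters divided by total characters). This is a hypothetical illustration of the idea, not franc's actual implementation; the `scripts` table and `scores` function are mine, with the Chinese entry covering only the Han ideograph ranges and the Japanese entry adding kana.

```javascript
// Sketch of ratio-based scoring (not franc's real code):
// score = characters matched by a script's expression / total characters.
const scripts = {
  // Chinese: CJK Unified Ideographs ranges only.
  cmn: /[\u3400-\u4DB5\u4E00-\u9FEF]/g,
  // Japanese: Hiragana + Katakana plus the same kanji ranges (this PR's change).
  jpn: /[\u3040-\u30FF\u3400-\u4DB5\u4E00-\u9FEF]/g
};

function scores(value) {
  const result = {};
  for (const [name, expression] of Object.entries(scripts)) {
    result[name] = (value.match(expression) || []).length / value.length;
  }
  return result;
}

// Han-only sample: both expressions cover every character.
// scores('持込申請最悪状態完了') → {cmn: 1, jpn: 1}
// With the は particle inserted, Chinese coverage drops to 10/11 (~90%)
// while Japanese stays at 100%, so Japanese wins:
// scores('持込申請は最悪状態完了') → {cmn: 0.909…, jpn: 1}
```

This mirrors the argument in the text: before the change, the Japanese expression matched only the lone kana (1/11 ≈ 9%), losing to Chinese; with kanji included, Japanese reaches full coverage whenever any kana is present.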