LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary#935
LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary#935johtani wants to merge 3 commits intoapache:masterfrom
Conversation
| <artifact name="mecab-naist-jdic" type=".tar.gz" url=" https://rwthaachen.dl.osdn.jp/naist-jdic/53500/mecab-naist-jdic-0.6.3b-20111013.tar.gz"/> | ||
| </dependency> | ||
| <dependency org="mecab" name="mecab-unidic" rev="${/mecab/mecab-unidic}" conf="unidic"> | ||
| <artifact name="unidic" type=".zip" url=" http://ja.osdn.net/frs/redir.php?m=iij&f=unidic%2F58338%2Funidic-mecab-2.1.2_src.zip"/> |
There was a problem hiding this comment.
If we download external files we must somewhere include a license demonstrating that those files are positively licensed for re-use, ideally with some recognized open source license. Do you know what the status is of this source for unidic? What is osdn.net?
There was a problem hiding this comment.
About licensing, UniDic is distributed under GPL/LGPL/BSD(3-clause) triple licenses: https://unidic.ninjal.ac.jp/back_number#unidic_bccwj (Note: in Japanese)
According to my knowledge, the last one is compatible with ASF's policy.
There was a problem hiding this comment.
OK, great. We should add a reference to the unidic license to lucene/NOTICE.txt then, where we have some statement about ipadic already, and possibly we should copy the license into lucene/licenses/ - I'm not sure if that's needed.
|
I understand there are no tests. Could you at least state for the record that you were able to build both UNIDIC and IPADIC-based dictionaries after your change using |
Yes it's a problem for future maintenance, I think we may need some kind of validator for binary encoded dictionaries rather than directly writing unit tests for the builder. |
|
At least, I should paste the messages build successful for Unidic and Ipadic. |
|
Here is the message for And also the message with unidic and |
…nary Fix some errors during building UniDic 2.1.2 dictionary
…nary Fix precommit error
…nary Add unidic license to NOTICE.txt
|
Hello Team, May I inquire where are we on this? TL;DRIn the meanwhile, I attempted and succeeded to build the following dictionaries. I am using the tweaks that @johtani added in his PR three years ago, plus a few minor tweaks of my own:
See the attached screenshots displaying the generated files👇🏼 I have also tested the built dictionaries, see Testing section below 👇🏼 Shall I try make a new PR under https://github.com/apache/lucene in order to get a conversation re-started on this? cc: @mocobeta 🙇🏼♀️ BuildingCaveatRE: Building the unidic-cwj-3.1.1-full:
Gradle commandsThe following has been performed on the fresh clone of https://github.com/apache/lucene: My build command leveraged the new Gradle setup and the DictionaryBuilder JavaDoc comment about how to do it. I added in My shell Gradle commands to build the dictionaries are as follows. The commands were executed under the root directory Building unidic-cwj-202302_full commandBuilding unidic-cwj-3.1.1-full commandBuilding unidic-mecab-2.1.2_src commandBuilding mecab-ipadic-2.7.0-20070801 commandTestingThe built dictionaries were tested using the following Japanese strings:
The dictionaries metadata (dictionary by dictionary) were places under the lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict and a few unit test cases were added: Existing default dictionaryBuilt unidic-cwj-202302_fullI need to increase memory before running the tests, as Built unidic-cwj-3.1.1-fullBuilt unidic-mecab-2.1.2_srcBuilt mecab-ipadic-2.7.0-20070801ScreenshotsBuilt unidic-cwj-202302_full
Built unidic-mecab-2.1.2_src
Built mecab-ipadic-2.7.0-20070801
|
|
Yes if you want to revive the discussion, please move the PR over to the lucene git repo. I'm a little unclear on the future of this though. So far it's a pretty expert feature. To use it you have to edit the gradle build script and rebuild Lucene, right? |
@msokolov hello! Thank you. Yes, I figured that PR will need to be exported eventually under the right repo, thus, I went ahead and did it: To use this feature, yes, Lucene analysis Kuromoji JAR will have to be rebuilt after rebuilding a new dictionary. For example, while under the root lucene/ directory where the
To extend this example a little further, more specifically - using this together with SolrMy aforementioned PR under Lucene adds the new changes using the latest state of code from the If you are running Solr in Docker, then the PR changes must be adapted to the Lucene v9.7.0 branch, which is what the latest Solr is running on. This is something that I have done when I was rebuilding and testing the Lucene library. I have a patch lucene_9.7.0_kuromoji_unidic_3_compatibility that adapts the PR changes to the v9.7.0 branch. Thus, doing the above steps inside the Dockerfile, you could then replace the default Solr's |



Description
UniDic has a bit different column and unk.def from ipadic.
Currently if we use UniDic, then we get some error for building dictionary.
We can build UniDic after changing build.xml. I left some comments in build.xml.
If we use UniDic, we should change stoptags.txt and also add an example for part of speech filter.
UniDic has different part of speech tags from ipadic.
Solution
Added some logic for adjusting UniDic csv and unk.def.
And also changing build.xml for downloading and building UniDic.
Tests
Unfortunately there is no tests for DictionaryBuilder.
Checklist
Please review the following and check all that apply:
masterbranch.ant precommitand the appropriate test suite.