Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [LUCENE-4056] #5128

@asfimport

Description

I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6. I think UniDic is a better dictionary than the IPA dictionary, so Kuromoji for Lucene/Solr should support the UniDic dictionary, as standalone Kuromoji does.

The following is my procedure:
I modified build.xml under the lucene/contrib/analyzers/kuromoji directory and ran 'ant build-dict', which failed with the error below.

build-dict:
[java] dictionary builder
[java]
[java] dictionary format: UNIDIC
[java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
[java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
[java] input encoding: utf-8
[java] normalize entries: false
[java]
[java] building tokeninfo dict...
[java] parse...
[java] sort...
[java] Exception in thread "main" java.lang.AssertionError
[java] encode...
[java] at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
[java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
[java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
[java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
[java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)

And the diff of build.xml:

===================================================================
--- build.xml (revision 1338023)
+++ build.xml (working copy)
@@ -28,19 +28,31 @@
 <property name="maven.dist.dir" location="../../../dist/maven" />

 <!-- default configuration: uses mecab-ipadic -->

+<!-- alternative configuration: uses UniDic -->
+<property name="ipadic.version" value="unidic-mecab1312src" />
+<property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
+<property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
+<property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
+<!--
 <property name="dict.encoding" value="euc-jp"/>
 <property name="dict.format" value="ipadic"/>
+-->
+<property name="dict.encoding" value="utf-8"/>
+<property name="dict.format" value="unidic"/>
+<property name="dict.normalize" value="false"/>
 <property name="dict.target.dir" location="./src/resources"/>

@@ -58,7 +70,8 @@

 <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
 <target name="download-dict" unless="dict.available">
-<get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
+<!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
+<copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
 <gunzip src="${build.dir}/${dict.src.file}"/>
 <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
 </target>
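
For anyone reproducing this, the second hunk only swaps the download for a copy from a local archive directory. A rough sketch of how the same switch might be made without commenting out the `<get>` is shown below. It is a standalone illustration, not part of the Lucene build: the project name, the target names (`copy-dict`, `get-dict`, `fetch-dict`) and the `dict.local` property are hypothetical, while `dict.loc.dir`, `dict.src.file`, `ipadic.version` and `dict.url` are the property names from the diff above.

```xml
<project name="unidic-fetch-sketch" default="fetch-dict" basedir=".">
  <!-- Assumed values, taken from the diff above; adjust to your checkout. -->
  <property name="build.dir" location="build"/>
  <property name="ipadic.version" value="unidic-mecab1312src"/>
  <property name="dict.src.file" value="${ipadic.version}.tar.gz"/>

  <!-- Hypothetical flag: set only when the caller passes -Ddict.loc.dir=... -->
  <condition property="dict.local">
    <isset property="dict.loc.dir"/>
  </condition>

  <!-- Runs only when a local archive directory was given. -->
  <target name="copy-dict" if="dict.local">
    <mkdir dir="${build.dir}"/>
    <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
  </target>

  <!-- Otherwise download; dict.url must be supplied by the caller. -->
  <target name="get-dict" unless="dict.local">
    <mkdir dir="${build.dir}"/>
    <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
  </target>

  <!-- Unpack the archive the same way the existing download-dict target does. -->
  <target name="fetch-dict" depends="copy-dict, get-dict">
    <gunzip src="${build.dir}/${dict.src.file}"/>
    <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
  </target>
</project>
```

With this layout, something like `ant -Ddict.loc.dir=/path/to/archive fetch-dict` would take the local-copy path, and omitting `dict.loc.dir` would fall back to the download. It only changes how the archive is obtained and does not affect the AssertionError reported above.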

Migrated from LUCENE-4056 by Kazuaki Hiraga (@hkazuakey), updated Oct 16 2019
Environment:

Solr 3.6
UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)

Attachments: LUCENE-4056.patch
Linked issues:
