@grhoten commented Jan 27, 2025

Resolves #72

Some inflection variants aren't handled well, and this change improves how they are handled. Here are some examples.

  1. The theater and theatre lemmas (L7083) need separate inflection tables.
  2. Provide a way to combine multiple languages, since languages aren't cleanly separated in practice.
    1. Handle multiple Norwegian variants.
    2. Allow combining Serbian and Croatian.
    3. Allow combining phonetic information for Korean and English to improve Korean particle usage.
    4. and so on...
  3. Speed up the dictionary parser through better data filtering: extraction of irrelevant data is skipped where it isn't needed. For English, this is about 5 seconds (about 25%) faster. It now runs in about 15 seconds for English on my machine.

static final String IGNORE_UNANNOTATED_SURFACE_FORM = "--ignore-unannotated-entries";
static final String ADD_NORMALIZED_ENTRY = "--add-normalized-entry";
static final String LOCALE_OPT = "--locale";
static final String LANGUAGE_OPT = "--language";

Please note that I changed this option. It's now a comma-separated list.
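For illustration, here is a minimal sketch of how a comma-separated option value might be split into individual entries. It is not the code from this PR: the class `LanguageOptionSketch` and the helper `parseLanguages` are hypothetical names, and trimming whitespace and dropping empty entries are assumptions about the desired behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class LanguageOptionSketch {
    // Hypothetical helper: turns a value such as "ko,en" into ["ko", "en"].
    // Whitespace around entries is trimmed and empty entries are dropped
    // (both are assumptions, not behavior confirmed by the PR).
    static List<String> parseLanguages(String value) {
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .filter(entry -> !entry.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Example: a value that might follow an option such as --language.
        System.out.println(parseLanguages("ko,en"));
    }
}
```

Keeping the result as an ordered list preserves the order in which the values were given, in case that order matters to the caller.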

.configure(JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES, true)
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
var lexParser = new ParseWikidata(parserOptions);
LexemesJsonDeserializer.setLanguage(parserOptions.locales);

@grhoten Jan 27, 2025


I couldn't figure out how to register a module in Jackson with parameters, so I pass the value through a static variable instead.

@grhoten merged commit 0d65f00 into unicode-org:main Jan 28, 2025
1 check passed
Development

Successfully merging this pull request may close these issues.

Improve multi-language handling