Merged
Changes from 3 commits
19 changes: 19 additions & 0 deletions data/de/lexicon.txt
@@ -0,0 +1,19 @@
Januar;N;MASC;NOM;SG;INAN
Member

Could we use a different format? This is the kind of format that I typically use:

Januar: singular accusative dative nominative masculine noun
Januare: plural accusative genitive nominative masculine noun
Januaren: plural dative masculine noun
Januars: singular genitive masculine noun

I omitted some details about how these entries map to each other, but that's mostly how I store this type of data. I also use an XML format that I convert to the syntax above. The XML structure is the inverse of the style listed above.
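
A minimal sketch of how the line-oriented format above could be read, assuming each line is `surface form: space-separated grammemes`. The function name `parse_lexicon_line` and the resulting dictionary layout are hypothetical and not part of this PR or of any existing tool.

```python
def parse_lexicon_line(line):
    """Split 'form: grammeme grammeme ...' into (form, set of grammemes)."""
    form, _, grammemes = line.partition(":")
    return form.strip(), set(grammemes.split())


sample = """\
Januar: singular accusative dative nominative masculine noun
Januare: plural accusative genitive nominative masculine noun
Januaren: plural dative masculine noun
Januars: singular genitive masculine noun
"""

# Map each surface form to the set of grammemes it can express.
lexicon = dict(
    parse_lexicon_line(line) for line in sample.splitlines() if line.strip()
)

# Example query: every form that can serve as a dative.
dative_forms = sorted(form for form, tags in lexicon.items() if "dative" in tags)
print(dative_forms)
```

Storing the grammemes as a set keeps filtering simple, since the order of grammemes on a line carries no meaning in this sketch.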

Some other formats have been discussed in this group too.

As you expand to languages with many grammatical cases, it becomes hard to keep track of how these shouty uppercase abbreviations map to grammemes, both when they have to be said aloud in a meeting and when a new person has to work with the data.
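
To illustrate the mapping problem, here is a small sketch that expands the abbreviations used in this PR into spelled-out grammeme names. The `TAG_NAMES` table and the `expand_entry` helper are assumptions made for illustration; the table covers only the tags that appear in these lexicon files, not the full UniMorph schema.

```python
# Hypothetical lookup table covering only the tags used in this PR.
TAG_NAMES = {
    "N": "noun",
    "MASC": "masculine",
    "FEM": "feminine",
    "NOM": "nominative",
    "SG": "singular",
    "PL": "plural",
    "INAN": "inanimate",
}


def expand_entry(entry):
    """Turn 'Januar;N;MASC;NOM;SG;INAN' into 'Januar: noun masculine ...'."""
    form, *tags = entry.strip().split(";")
    # Unknown tags are kept verbatim so nothing is silently dropped.
    names = [TAG_NAMES.get(tag, tag) for tag in tags]
    return f"{form}: {' '.join(names)}"


print(expand_entry("Januar;N;MASC;NOM;SG;INAN"))
# Januar: noun masculine nominative singular inanimate
```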

The nice thing about XML is that it's easier to filter and annotate the data as necessary. This becomes important for languages that use bound morphemes a lot (no whitespace separation) and use a different grammatical number for each morpheme. For example, "your book" in Arabic is a single word with 2 morphemes. The "your" can be singular, dual or plural. The "your" can also be masculine or feminine in pronunciation, depending on who you're addressing. That possessive pronoun binds to the noun being possessed. The word "book" has a gender too, and it can be singular, dual or plural.

Contributor Author

I was going for a simple format that's easy to prototype. Should we use XML from the start? We would need to agree on the basic structure. I'll change the structure now and then we can iterate.

Contributor Author

Januar: singular accusative dative nominative masculine noun
Januare: plural accusative genitive nominative masculine noun
Januaren: plural dative masculine noun
Januars: singular genitive masculine noun

Changed format to this for now. PTAL.

Member

Looks great to me! Thanks.

Contributor Author

Could you approve the PR?

Februar;N;MASC;NOM;SG;INAN
März;N;MASC;NOM;SG;INAN
April;N;MASC;NOM;SG;INAN
Mai;N;MASC;NOM;SG;INAN
Juni;N;MASC;NOM;SG;INAN
Juli;N;MASC;NOM;SG;INAN
August;N;MASC;NOM;SG;INAN
September;N;MASC;NOM;SG;INAN
Oktober;N;MASC;NOM;SG;INAN
November;N;MASC;NOM;SG;INAN
Dezember;N;MASC;NOM;SG;INAN
Sonntag;N;MASC;NOM;SG;INAN
Montag;N;MASC;NOM;SG;INAN
Dienstag;N;MASC;NOM;SG;INAN
Mittwoch;N;MASC;NOM;SG;INAN
Donnerstag;N;MASC;NOM;SG;INAN
Freitag;N;MASC;NOM;SG;INAN
Samstag;N;MASC;NOM;SG;INAN
19 changes: 19 additions & 0 deletions data/en/lexicon.txt
@@ -0,0 +1,19 @@
January;N;MASC;NOM;SG;INAN
February;N;MASC;NOM;SG;INAN
March;N;MASC;NOM;SG;INAN
April;N;MASC;NOM;SG;INAN
May;N;MASC;NOM;SG;INAN
June;N;MASC;NOM;SG;INAN
July;N;MASC;NOM;SG;INAN
August;N;MASC;NOM;SG;INAN
September;N;MASC;NOM;SG;INAN
October;N;MASC;NOM;SG;INAN
November;N;MASC;NOM;SG;INAN
December;N;MASC;NOM;SG;INAN
Sunday;N;MASC;NOM;SG;INAN
Monday;N;MASC;NOM;SG;INAN
Tuesday;N;MASC;NOM;SG;INAN
Wednesday;N;MASC;NOM;SG;INAN
Thursday;N;MASC;NOM;SG;INAN
Friday;N;MASC;NOM;SG;INAN
Saturday;N;MASC;NOM;SG;INAN
19 changes: 19 additions & 0 deletions data/es/lexicon.txt
@@ -0,0 +1,19 @@
enero;N;MASC;NOM;SG;INAN
febrero;N;MASC;NOM;SG;INAN
marzo;N;MASC;NOM;SG;INAN
abril;N;MASC;NOM;SG;INAN
mayo;N;MASC;NOM;SG;INAN
junio;N;MASC;NOM;SG;INAN
julio;N;MASC;NOM;SG;INAN
agosto;N;MASC;NOM;SG;INAN
septiembre;N;MASC;NOM;SG;INAN
octubre;N;MASC;NOM;SG;INAN
noviembre;N;MASC;NOM;SG;INAN
diciembre;N;MASC;NOM;SG;INAN
domingo;N;MASC;NOM;SG;INAN
lunes;N;MASC;NOM;SG;INAN

Add PL to lunes, martes, miércoles, jueves and viernes

martes;N;MASC;NOM;SG;INAN
miércoles;N;MASC;NOM;SG;INAN
jueves;N;MASC;NOM;SG;INAN
viernes;N;MASC;NOM;SG;INAN
sábado;N;MASC;NOM;SG;INAN
19 changes: 19 additions & 0 deletions data/fr/lexicon.txt
@@ -0,0 +1,19 @@
janvier;N;MASC;SG
février;N;MASC;SG
mars;N;MASC;SG
avril;N;MASC;SG
mai;N;MASC;SG
juin;N;MASC;SG
juillet;N;MASC;SG
août;N;MASC;SG
septembre;N;MASC;SG
octobre;N;MASC;SG
novembre;N;MASC;SG
décembre;N;MASC;SG
dimanche;N;MASC;SG
lundi;N;MASC;SG
mardi;N;MASC;SG
mercredi;N;MASC;SG
jeudi;N;MASC;SG
vendredi;N;MASC;SG
samedi;N;MASC;SG
19 changes: 19 additions & 0 deletions data/sr/lexicon.txt
@@ -0,0 +1,19 @@
јануар;N;MASC;NOM;SG;INAN
фебруар;N;MASC;NOM;SG;INAN
март;N;MASC;NOM;SG;INAN
април;N;MASC;NOM;SG;INAN
мај;N;MASC;NOM;SG;INAN
јун;N;MASC;NOM;SG;INAN
јул;N;MASC;NOM;SG;INAN
август;N;MASC;NOM;SG;INAN
септембар;N;MASC;NOM;SG;INAN
октобар;N;MASC;NOM;SG;INAN
новембар;N;MASC;NOM;SG;INAN
децембар;N;MASC;NOM;SG;INAN
недеља;N;FEM;NOM;SG;INAN
понедељак;N;MASC;NOM;SG;INAN
уторак;N;MASC;NOM;SG;INAN
среда;N;FEM;NOM;SG;INAN
четвртак;N;MASC;NOM;SG;INAN
петак;N;MASC;NOM;SG;INAN
субота;N;FEM;NOM;SG;INAN
85 changes: 85 additions & 0 deletions data/tools/extract_cldr_data.py
@@ -0,0 +1,85 @@
"""
Extracts data from the CLDR-JSON repository, e.g. nouns like month or day names.
The script either creates a new inflection file or appends data to an existing one.
The nomenclature is taken from https://unimorph.github.io/doc/unimorph-schema.pdf (see Appendix)

Part of Speech;Gender;Case;Number;Animacy

Run the script from the data folder.

Before running the script, clone the cldr-json repository:

    gh repo clone unicode-org/cldr-json

and install the jsonpath-ng package:

    pip install jsonpath-ng
"""

import argparse
import json
import os

from jsonpath_ng import jsonpath, parse


def load_json(filename):
    """Loads JSON data from the specified file.

    Args:
        filename: The name of the JSON file.

    Returns:
        The parsed JSON data.
    """

    try:
        with open(filename, 'r', encoding='utf-8') as file:
            return json.load(file)
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        return None


def write_to_lexicon(output_file, language, json_data):
    """Extracts specified data from cldr-json file
    and writes it to the lexicon file.

    Args:
        output_file: name of the lexicon.
        language: cldr-json file language.
        json_data: parsed cldr-json data.
    """
    MONTH_NAMES_EXPRESSION = parse('main..dates.calendars.gregorian.months.format.wide.*')
    DAY_NAMES_EXPRESSION = parse('main..dates.calendars.gregorian.days.format.wide.*')
    EXPRESSIONS = [MONTH_NAMES_EXPRESSION, DAY_NAMES_EXPRESSION]

    results = []
    for expression in EXPRESSIONS:
        match = expression.find(json_data)
        for m in match:
            results.append(m.value + ';N;MASC;NOM;SG;INAN\n')

    full_filename = os.path.join(language, output_file)
    try:
        os.makedirs(os.path.dirname(full_filename), exist_ok=True)
        with open(full_filename, 'a', encoding='utf-8') as file:
            file.writelines(results)
    except FileNotFoundError:
        print(f"Error: file '{output_file}' can't be created.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Load and process CLDR-JSON files for given languages.')
    parser.add_argument('--cldr_root', help='The path to CLDR-JSON data.', default='../../cldr-json/cldr-json/cldr-dates-full/main')
    parser.add_argument('--input_file', help='Data file to read from, e.g. ca-gregorian.json.', default='ca-gregorian.json')
    parser.add_argument('--output_file', help='Data file to create/append to, e.g. lexicon.txt.', default='lexicon.txt')
    parser.add_argument('--language_list', nargs='+', default=['sr', 'en', 'de', 'es', 'fr'])
    args = parser.parse_args()

    for language in args.language_list:
        full_filename = os.path.join(args.cldr_root, language, args.input_file)
        data = load_json(full_filename)

        if data:
            write_to_lexicon(args.output_file, language, data)
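
The script appends the same `;N;MASC;NOM;SG;INAN` suffix to every extracted name, while the lexicon files in this PR differ in places: the Serbian file marks недеља, среда and субота as `FEM`, and the French file uses `N;MASC;SG`. The sketch below shows one possible way to make the tags data-driven. The names `DEFAULT_TAGS`, `LANGUAGE_TAGS`, `WORD_TAGS` and `format_entry` are hypothetical and do not exist in the script; the override values are copied from the lexicon files shown above.

```python
DEFAULT_TAGS = "N;MASC;NOM;SG;INAN"

# Hypothetical per-language defaults, mirroring the lexicon files in this PR.
LANGUAGE_TAGS = {"fr": "N;MASC;SG"}

# Hypothetical per-word exceptions, keyed by (language, word).
WORD_TAGS = {
    ("sr", "недеља"): "N;FEM;NOM;SG;INAN",
    ("sr", "среда"): "N;FEM;NOM;SG;INAN",
    ("sr", "субота"): "N;FEM;NOM;SG;INAN",
}


def format_entry(language, word):
    """Build one lexicon line, preferring word-level over language-level tags."""
    language_default = LANGUAGE_TAGS.get(language, DEFAULT_TAGS)
    tags = WORD_TAGS.get((language, word), language_default)
    return f"{word};{tags}\n"


print(format_entry("sr", "среда"), end="")
print(format_entry("fr", "lundi"), end="")
print(format_entry("de", "Montag"), end="")
```

Under these assumptions, the `results.append(...)` call in `write_to_lexicon` could use `format_entry(language, m.value)` instead of the fixed suffix.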