19 changes: 19 additions & 0 deletions data/de/lexicon.txt
@@ -0,0 +1,19 @@
Januar: noun masculine nominative singular
Februar: noun masculine nominative singular
März: noun masculine nominative singular
April: noun masculine nominative singular
Mai: noun masculine nominative singular
Juni: noun masculine nominative singular
Juli: noun masculine nominative singular
August: noun masculine nominative singular
September: noun masculine nominative singular
Oktober: noun masculine nominative singular
November: noun masculine nominative singular
Dezember: noun masculine nominative singular
Sonntag: noun masculine nominative singular
Montag: noun masculine nominative singular
Dienstag: noun masculine nominative singular
Mittwoch: noun masculine nominative singular
Donnerstag: noun masculine nominative singular
Freitag: noun masculine nominative singular
Samstag: noun masculine nominative singular
19 changes: 19 additions & 0 deletions data/en/lexicon.txt
@@ -0,0 +1,19 @@
January: noun singular
February: noun singular
March: noun singular
April: noun singular
May: noun singular
June: noun singular
July: noun singular
August: noun singular
September: noun singular
October: noun singular
November: noun singular
December: noun singular
Sunday: noun singular
Monday: noun singular
Tuesday: noun singular
Wednesday: noun singular
Thursday: noun singular
Friday: noun singular
Saturday: noun singular
19 changes: 19 additions & 0 deletions data/es/lexicon.txt
@@ -0,0 +1,19 @@
enero: noun masculine nominative singular inanimate
febrero: noun masculine nominative singular inanimate
marzo: noun masculine nominative singular inanimate
abril: noun masculine nominative singular inanimate
mayo: noun masculine nominative singular inanimate
junio: noun masculine nominative singular inanimate
julio: noun masculine nominative singular inanimate
agosto: noun masculine nominative singular inanimate
septiembre: noun masculine nominative singular inanimate
octubre: noun masculine nominative singular inanimate
noviembre: noun masculine nominative singular inanimate
diciembre: noun masculine nominative singular inanimate
domingo: noun masculine nominative singular inanimate
lunes: noun masculine nominative singular inanimate
martes: noun masculine nominative singular inanimate
miércoles: noun masculine nominative singular inanimate
jueves: noun masculine nominative singular inanimate
viernes: noun masculine nominative singular inanimate
sábado: noun masculine nominative singular inanimate
19 changes: 19 additions & 0 deletions data/fr/lexicon.txt
@@ -0,0 +1,19 @@
janvier: noun masculine singular
février: noun masculine singular
mars: noun masculine singular
avril: noun masculine singular
mai: noun masculine singular
juin: noun masculine singular
juillet: noun masculine singular
août: noun masculine singular
septembre: noun masculine singular
octobre: noun masculine singular
novembre: noun masculine singular
décembre: noun masculine singular
dimanche: noun masculine singular
lundi: noun masculine singular
mardi: noun masculine singular
mercredi: noun masculine singular
jeudi: noun masculine singular
vendredi: noun masculine singular
samedi: noun masculine singular
19 changes: 19 additions & 0 deletions data/sr/lexicon.txt
@@ -0,0 +1,19 @@
јануар: noun masculine nominative singular inanimate
фебруар: noun masculine nominative singular inanimate
март: noun masculine nominative singular inanimate
април: noun masculine nominative singular inanimate
мај: noun masculine nominative singular inanimate
јун: noun masculine nominative singular inanimate
јул: noun masculine nominative singular inanimate
август: noun masculine nominative singular inanimate
септембар: noun masculine nominative singular inanimate
октобар: noun masculine nominative singular inanimate
новембар: noun masculine nominative singular inanimate
децембар: noun masculine nominative singular inanimate
недеља: noun feminine nominative singular inanimate
понедељак: noun masculine nominative singular inanimate
уторак: noun masculine nominative singular inanimate
среда: noun feminine nominative singular inanimate
четвртак: noun masculine nominative singular inanimate
петак: noun masculine nominative singular inanimate
субота: noun feminine nominative singular inanimate
84 changes: 84 additions & 0 deletions data/tools/extract_cldr_data.py
@@ -0,0 +1,84 @@
"""
Extracts data from the CLDR-JSON repository, e.g. nouns such as month and day names.
The script either creates a new lexicon file or appends data to an existing one.

Each entry is tagged with: Part of Speech, Gender, Case, Number, Animacy.

Run the script from the data folder.

Before running the script, clone the cldr-json repository:

    gh repo clone unicode-org/cldr-json

and install the jsonpath-ng package:

    pip install jsonpath-ng
"""

import argparse
import json
import os

from jsonpath_ng import parse


def load_json(filename):
    """Loads JSON data from the specified file.

    Args:
        filename: The name of the JSON file.

    Returns:
        The parsed JSON data.
    """

    try:
        with open(filename, 'r', encoding='utf-8') as file:
            return json.load(file)
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        return None


def write_to_lexicon(output_file, language, json_data):
    """Extracts specified data from cldr-json file
    and writes it to the lexicon file.

    Args:
        output_file: name of the lexicon.
        language: cldr-json file language.
        json_data: parsed cldr-json data.
    """
    MONTH_NAMES_EXPRESSION = parse('main..dates.calendars.gregorian.months.format.wide.*')
    DAY_NAMES_EXPRESSION = parse('main..dates.calendars.gregorian.days.format.wide.*')
    EXPRESSIONS = [MONTH_NAMES_EXPRESSION, DAY_NAMES_EXPRESSION]

    results = []
    for expression in EXPRESSIONS:
        match = expression.find(json_data)
        for m in match:
            results.append(m.value + ': noun masculine nominative singular inanimate\n')

This doesn't match what I see in the files above. Are you hand-editing them after the fact, or am I misunderstanding something?

Contributor Author

I generate files with all the grammatical forms, but then delete unnecessary information (depending on the language). So there's no bug - we can expand the tool to know more about each language in the future, and reduce manual work. I leave that to the reader :).

Ah, so you do hand-edit. I wasn't objecting; I was just worried I'd missed something. :-)


    full_filename = os.path.join(language, output_file)
    try:
        # Create the per-language folder if it does not exist yet.
        os.makedirs(os.path.dirname(full_filename), exist_ok=True)
        with open(full_filename, 'a', encoding='utf-8') as file:
            file.writelines(results)
    except OSError:
        print(f"Error: file '{full_filename}' can't be created.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Load and process CLDR-JSON files for given languages.')
    parser.add_argument('--cldr_root', help='The path to CLDR-JSON data.', default='../../cldr-json/cldr-json/cldr-dates-full/main')
    parser.add_argument('--input_file', help='Data file to read from, e.g. ca-gregorian.json.', default='ca-gregorian.json')
    parser.add_argument('--output_file', help='Data file to create/append to, e.g. lexicon.txt.', default='lexicon.txt')
    parser.add_argument('--language_list', nargs='+', default=['sr', 'en', 'de', 'es', 'fr'])
    args = parser.parse_args()

    for language in args.language_list:
        full_filename = os.path.join(args.cldr_root, language, args.input_file)
        data = load_json(full_filename)

        if data:
            write_to_lexicon(args.output_file, language, data)
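
The review thread above notes that the script writes the same tag string for every language and that the files are then trimmed by hand. One way the tool might "know more about each language" is a small lookup table, sketched below. This is not part of the PR: `DEFAULT_TAGS`, `TAG_OVERRIDES`, `FALLBACK_TAGS`, and `lexicon_line` are hypothetical names, and the tag strings are simply copied from the lexicon files in this diff.

```python
# Hypothetical per-language default tags, copied from the lexicon files in this PR.
DEFAULT_TAGS = {
    "de": "noun masculine nominative singular",
    "en": "noun singular",
    "es": "noun masculine nominative singular inanimate",
    "fr": "noun masculine singular",
    "sr": "noun masculine nominative singular inanimate",
}

# Hypothetical per-word overrides for entries whose tags differ from the
# language default (here: the feminine Serbian day names from data/sr/lexicon.txt).
TAG_OVERRIDES = {
    "sr": {
        "недеља": "noun feminine nominative singular inanimate",
        "среда": "noun feminine nominative singular inanimate",
        "субота": "noun feminine nominative singular inanimate",
    },
}

# Same string the script currently writes for every word; used for unknown languages.
FALLBACK_TAGS = "noun masculine nominative singular inanimate"


def lexicon_line(language, word):
    """Builds one 'word: tags' lexicon line for the given language."""
    tags = TAG_OVERRIDES.get(language, {}).get(word)
    if tags is None:
        tags = DEFAULT_TAGS.get(language, FALLBACK_TAGS)
    return f"{word}: {tags}\n"


if __name__ == "__main__":
    print(lexicon_line("en", "January"), end="")
    print(lexicon_line("sr", "среда"), end="")
```

Under this assumption, the loop in `write_to_lexicon` could call `lexicon_line(language, m.value)` instead of appending a fixed suffix. Words with irregular gender would still need an override entry, so some manual curation remains.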