CLDR-json data extraction script #28

nciric · 2024-04-05T00:49:13Z

Created a script that pulls data from cldr-json (month names and day names) for sr, es, fr, de, en languages.

I appended grammatical info, which is correct for sr but is probably wrong for day names in the other languages. Also, some grammatical information may not be needed for some languages; e.g., I am not sure whether animacy matters in English, or whether days and months have genders there.

We can land it as is, then file an issue for others to fix.

…al data.

JelenaMitrovic · 2024-04-06T12:02:38Z

I checked German, English, and Serbian. German looks fine to me; it's easy because all month and day names are masculine.

For English, we do not need gender I would say, but we might keep it as a placeholder, as natural gender might still be needed for some applications.

Serbian looks good. Are we using Cyrillic only, or should we also include Latin script as they are used more or less equally? I am not sure if Cyrillic only is used in localisation scenarios.

nciric · 2024-04-06T15:48:47Z

Thanks! Serbian Cyrillic is enough as we can generate Latin using ICU transliterator.
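To illustrate why Cyrillic-only data is sufficient, here is a minimal sketch of the Cyrillic-to-Latin direction. It is a hand-written lookup table, not the ICU transliterator mentioned above, and the function name `sr_cyrl_to_latn` is hypothetical. It only shows that the mapping is deterministic for Serbian; a real pipeline would presumably call ICU instead.

```python
# Hand-written Serbian Cyrillic -> Latin table. This is NOT the ICU
# transliterator; it is an illustrative stand-in with a hypothetical name.
_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}

# Uppercase Cyrillic letters map to title-cased Latin (e.g. "Љ" -> "Lj").
_TABLE = dict(_LOWER)
_TABLE.update({cyr.upper(): lat.capitalize() for cyr, lat in _LOWER.items()})


def sr_cyrl_to_latn(text: str) -> str:
    """Map each Cyrillic letter to Latin; leave other characters untouched."""
    return "".join(_TABLE.get(ch, ch) for ch in text)


print(sr_cyrl_to_latn("понедељак"))
```

One known limitation of this sketch: digraphs in all-caps words come out title-cased ("Lj" rather than "LJ"), which a proper transliterator handles.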


nciric · 2024-04-08T17:10:11Z

From Bruno Cartoni:
"Fr looks good, and Spanish seems to be ok.

However, we need to discuss a fundamental question: what kinds of grammatical features do we want to include in each language's lexicon (as you touched on a little with "gender" in English)?

IMO, we should describe each language with the features that it needs, not with all the available features. E.g. French has neither "case" nor "animacy" features, so they don't need to be listed in nominal entries."

So French has 2 attributes too many (case and animacy). We shouldn't carry attributes that don't make sense for the language. I will remove them in the PR.

While this PR is far from the final lexicon format we'll use, I do want to add some features, like exceptions (not as a part of this PR).

En: goose->geese, fish->fish...
Sr: 'ruka' (hand): ['ruka', 'ruke', 'ruci', 'ruku', 'ruko', 'rukom', 'ruci', 'ruke', 'ruka', 'rukama', 'ruke', 'ruke', 'rukama', 'rukama']

How do we encode exceptions, or the various forms of a word, so that we can look them up in the lexicon?
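One possible answer, as a rough sketch only: store every inflected form explicitly and build a reverse index from surface form to the grammatical slots it fills. The slot order below (nominative, genitive, dative, accusative, vocative, instrumental, locative; singular first, then plural) is an assumption about how the 'ruka' list above is ordered, and all names (`build_index`, `en_plural`, the naive "+s" fallback) are hypothetical.

```python
from collections import defaultdict

# ASSUMPTION: the 14 forms are listed singular first, then plural,
# in this case order. The source does not state the order explicitly.
CASES = ["nominative", "genitive", "dative", "accusative",
         "vocative", "instrumental", "locative"]
NUMBERS = ["singular", "plural"]

RUKA = ["ruka", "ruke", "ruci", "ruku", "ruko", "rukom", "ruci",
        "ruke", "ruka", "rukama", "ruke", "ruke", "rukama", "rukama"]


def build_index(lemma, forms):
    """Map each surface form to the set of (lemma, number, case) it can express."""
    slots = [(number, case) for number in NUMBERS for case in CASES]
    if len(forms) != len(slots):
        raise ValueError(f"expected {len(slots)} forms, got {len(forms)}")
    index = defaultdict(set)
    for form, (number, case) in zip(forms, slots):
        index[form].add((lemma, number, case))
    return index


# English: a small exception table consulted before a deliberately naive rule.
EN_PLURAL_EXCEPTIONS = {"goose": "geese", "fish": "fish"}


def en_plural(noun):
    """Return the listed irregular plural, else fall back to appending 's'."""
    return EN_PLURAL_EXCEPTIONS.get(noun, noun + "s")


index = build_index("ruka", RUKA)
print(sorted(index["ruci"]))
print(en_plural("goose"), en_plural("book"))
```

The reverse index makes syncretism visible: one surface form such as 'ruci' maps to more than one slot, which a one-line-per-word format cannot express.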

grhoten · 2024-04-13T06:38:45Z

data/de/lexicon.txt

@@ -0,0 +1,19 @@
+Januar;N;MASC;NOM;SG;INAN


Is there a way to use some other format? This is the kind of format that I typically use.

Januar: singular accusative dative nominative masculine noun
Januare: plural accusative genitive nominative masculine noun
Januaren: plural dative masculine noun
Januars: singular genitive masculine noun

I omitted some mapping details between these entries, but that's mostly how I store this type of data. I also use an XML format that I convert to the syntax above. The XML structure is inverted to the style that I listed above.

Some other formats have been discussed in this group too.

As you expand to languages with numerous grammatical cases, it becomes hard to keep track of how these shouting abbreviations in uppercase map to a grammeme that has to be pronounced in a meeting or for a new person to interact with the data.

The nice feature of XML is that it's easier to filter and annotate the data as necessary. This becomes important for languages that use bound morphemes a lot (no whitespace separation) and use a different grammatical number for each morpheme. For example, "your book" in Arabic is a single word with 2 morphemes. The "your" can be singular, dual or plural. The "your" can be masculine or feminine for pronunciation, depending on who you're addressing. That possessive pronoun binds to the noun being possessed. The word "book" has a gender too, and it can be turned into singular, dual or plural.

I was going for a simple format that's easy to prototype. Should we use XML from the start? We would need to agree on the basic structure. I'll change the structure now and then we can iterate.

Januar: singular accusative dative nominative masculine noun
Januare: plural accusative genitive nominative masculine noun
Januaren: plural dative masculine noun
Januars: singular genitive masculine noun

Changed format to this for now. PTAL.
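As a sketch of how the readable format above could be consumed, the snippet below parses `surface: grammeme grammeme ...` lines and filters by grammemes. The function names are hypothetical and not part of the PR, and the parser assumes the surface form itself contains no colon.

```python
def parse_lexicon_line(line):
    """Split 'surface: g1 g2 ...' into (surface, frozenset of grammemes)."""
    surface, sep, rest = line.partition(":")
    if not sep:
        raise ValueError(f"missing ':' in lexicon line: {line!r}")
    return surface.strip(), frozenset(rest.split())


def forms_matching(entries, *wanted):
    """Return surface forms whose grammeme set contains every wanted grammeme."""
    return [surface for surface, grammemes in entries if set(wanted) <= grammemes]


LINES = [
    "Januar: singular accusative dative nominative masculine noun",
    "Januare: plural accusative genitive nominative masculine noun",
    "Januaren: plural dative masculine noun",
    "Januars: singular genitive masculine noun",
]

entries = [parse_lexicon_line(line) for line in LINES]
print(forms_matching(entries, "singular", "genitive"))
print(forms_matching(entries, "plural", "dative"))
```

Because each line is a flat bag of grammemes, a line such as the first one is read as "singular in any of the listed cases"; the format cannot say that one case applies only to one number on the same line.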

Looks great to me! Thanks.

Could you approve the PR?

richgillam · 2024-04-16T00:31:40Z

data/tools/extract_cldr_data.py

+    for expression in EXPRESSIONS:
+        match = expression.find(json_data)
+        for m in match:
+            results.append(m.value + ': noun masculine nominative singular inanimate\n')


This doesn't match what I see in the files above. Are you hand-editing them after the fact, or am I misunderstanding something?

I generate files with all the grammatical forms, but then delete unnecessary information (depending on the language). So there's no bug - we can expand the tool to know more about each language in the future, and reduce manual work. I leave that to the reader :).

Ah, so you do hand-edit. I wasn't objecting; I was just worried I'd missed something. :-)
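For illustration, the hand-editing step could be partly automated with a per-language table of grammemes to drop. This is a sketch, not the script in this PR: it uses the standard `json` module rather than the path expressions in the diff, the only table entry is the French one discussed earlier in the thread, and the JSON layout is assumed to follow cldr-json's `ca-gregorian.json` files. All function and constant names are hypothetical.

```python
import json

# The blanket default the script currently appends to every name.
DEFAULT_GRAMMEMES = ["noun", "masculine", "nominative", "singular", "inanimate"]

# Grammemes to drop per language. Only French is grounded in this thread
# (no case, no animacy); other languages would need their own review.
DROPPED_BY_LANG = {"fr": {"nominative", "inanimate"}}


def wide_month_names(cldr_json, locale):
    """Read wide-format month names, assuming the cldr-json gregorian layout."""
    data = json.loads(cldr_json)
    gregorian = data["main"][locale]["dates"]["calendars"]["gregorian"]
    months = gregorian["months"]["format"]["wide"]
    return [months[key] for key in sorted(months, key=int)]


def lexicon_lines(names, lang):
    """Emit 'name: grammemes' lines, omitting grammemes the language lacks."""
    dropped = DROPPED_BY_LANG.get(lang, set())
    grammemes = [g for g in DEFAULT_GRAMMEMES if g not in dropped]
    return [f"{name}: {' '.join(grammemes)}" for name in names]


# Tiny hand-made sample in the assumed layout (two months only).
SAMPLE = json.dumps({"main": {"fr": {"dates": {"calendars": {"gregorian": {
    "months": {"format": {"wide": {"1": "janvier", "2": "février"}}}}}}}}})

for line in lexicon_lines(wide_month_names(SAMPLE, "fr"), "fr"):
    print(line)
```

The default still marks every name as masculine singular, so per-word corrections would remain manual.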

Kutzaki · 2024-04-18T14:59:53Z

In the data for Spanish, these day names are both singular & plural:
lunes;N;MASC;NOM;SG;INAN
martes;N;MASC;NOM;SG;INAN
miércoles;N;MASC;NOM;SG;INAN
jueves;N;MASC;NOM;SG;INAN
viernes;N;MASC;NOM;SG;INAN

Kutzaki · 2024-04-29T20:06:01Z

data/es/lexicon.txt

+noviembre;N;MASC;NOM;SG;INAN
+diciembre;N;MASC;NOM;SG;INAN
+domingo;N;MASC;NOM;SG;INAN
+lunes;N;MASC;NOM;SG;INAN


Add PL to lunes, martes, miércoles, jueves and viernes
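A sketch of one way to record invariant nouns such as these: collect one entry per reading (singular, plural) and merge entries that share a surface form into a single line of the readable format. The helper name `merge_entries` and the exact grammeme lists are assumptions for illustration, not the PR's data.

```python
def merge_entries(entries):
    """Union the grammemes of entries with the same surface form, keeping order."""
    merged = {}
    for surface, grammemes in entries:
        seen = merged.setdefault(surface, [])
        for grammeme in grammemes:
            if grammeme not in seen:
                seen.append(grammeme)
    return [f"{surface}: {' '.join(gs)}" for surface, gs in merged.items()]


ENTRIES = [
    ("domingo", ["noun", "masculine", "singular"]),
    ("lunes", ["noun", "masculine", "singular"]),
    ("lunes", ["noun", "masculine", "plural"]),  # same form in the plural
]

for line in merge_entries(ENTRIES):
    print(line)
```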

nciric added 2 commits April 4, 2024 17:48

CLDR-json data extraction script

9b79c17

Reading, writing CLDR data and appending somewhat incorrect grammatic…

35b405b

…al data.

nciric self-assigned this Apr 5, 2024

nciric requested review from grhoten and richgillam April 5, 2024 03:00

Removing case and animacy from French

d52c1ac

grhoten reviewed Apr 13, 2024


Make grammatical categories more readable.

de7653f

nciric requested a review from grhoten April 15, 2024 16:57

macchiati approved these changes Apr 15, 2024


richgillam reviewed Apr 16, 2024


nciric merged commit fbc1f8b into main Apr 16, 2024

nciric deleted the cira-data branch April 16, 2024 02:06

Kutzaki reviewed Apr 29, 2024

