@nciric
Contributor

@nciric nciric commented Apr 5, 2024

Created a script that pulls data from cldr-json (month names and day names) for sr, es, fr, de, en languages.

I appended grammatical info, which is correct for sr, but it's probably wrong for day names in other languages. Also, some grammatical information may not be needed for some languages, e.g. I am not sure whether animacy matters in English, or whether days and months have gender there.

We can land it as is, then file an issue to fix by others.
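
A minimal sketch of the extraction step described above, using only the standard library. The inline sample imitates the layout of cldr-json's `ca-gregorian.json`, trimmed to two entries per list; the sample and the function name `wide_names` are illustrative assumptions, not the script in this PR.

```python
import json

# Trimmed sample in the assumed shape of cldr-json's ca-gregorian.json.
SAMPLE = json.loads("""
{"main": {"de": {"dates": {"calendars": {"gregorian": {
  "months": {"format": {"wide": {"1": "Januar", "2": "Februar"}}},
  "days": {"format": {"wide": {"sun": "Sonntag", "mon": "Montag"}}}
}}}}}}
""")


def wide_names(data, locale):
    """Return the wide month names followed by the wide day names."""
    gregorian = data["main"][locale]["dates"]["calendars"]["gregorian"]
    names = []
    for kind in ("months", "days"):
        # dicts keep insertion order, so names come out in file order.
        names.extend(gregorian[kind]["format"]["wide"].values())
    return names


print(wide_names(SAMPLE, "de"))
```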

@nciric nciric self-assigned this Apr 5, 2024
@nciric nciric requested review from grhoten and richgillam April 5, 2024 03:00
@JelenaMitrovic

I checked German, English, and Serbian. German looks fine to me. It's easy because all month and day names are masculine.

For English, I would say we do not need gender, but we might keep it as a placeholder, since natural gender might still be needed for some applications.

Serbian looks good. Are we using Cyrillic only, or should we also include Latin script as they are used more or less equally? I am not sure if Cyrillic only is used in localisation scenarios.

@nciric
Contributor Author

nciric commented Apr 6, 2024 via email

@nciric
Contributor Author

nciric commented Apr 8, 2024

From Bruno Cartoni:
"Fr looks good, and Spanish seems to be ok.

However, we need to discuss a fundamental question: which grammatical features do we want to include in each language's lexicon (as you touch on a little with "gender" in English)?

IMO, we should describe each language with the features that it needs, not with all the available features. E.g. French has neither "case" nor "animacy" features, so they don't need to be listed in nominal entries."

So French has 2 attributes too many (case and animacy). We shouldn't carry attributes that don't make sense for the language. I will remove them in the PR.

While this PR is far from the final lexicon format we'll use, I do want to add some features, like exceptions (not as a part of this PR).

En: goose->geese, fish->fish...
Sr: 'ruka' (hand): ['ruka', 'ruke', 'ruci', 'ruku', 'ruko', 'rukom', 'ruci', 'ruke', 'ruka', 'rukama', 'ruke', 'ruke', 'rukama', 'rukama']

How do we encode exceptions, or the various forms of a word, so that we can look them up in the lexicon?
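
One possible encoding, sketched only to make the question concrete: store every inflected form explicitly and key the lookup on number and case. The case order used here (nominative, genitive, dative, accusative, vocative, instrumental, locative, singular before plural) is an assumption about how the 'ruka' list above is ordered, and `build_paradigm`, `inflect` and `LEXICON` are hypothetical names, not part of this PR.

```python
# Assumed ordering of the 14 forms: 7 cases in the singular, then 7 in the plural.
SR_CASES = ["nominative", "genitive", "dative", "accusative",
            "vocative", "instrumental", "locative"]
NUMBERS = ["singular", "plural"]

RUKA_FORMS = ['ruka', 'ruke', 'ruci', 'ruku', 'ruko', 'rukom', 'ruci',
              'ruke', 'ruka', 'rukama', 'ruke', 'ruke', 'rukama', 'rukama']


def build_paradigm(forms, cases=SR_CASES, numbers=NUMBERS):
    """Map (number, case) to a surface form, given forms in the assumed order."""
    if len(forms) != len(cases) * len(numbers):
        raise ValueError("unexpected number of forms")
    paradigm = {}
    for i, form in enumerate(forms):
        number = numbers[i // len(cases)]
        case = cases[i % len(cases)]
        paradigm[(number, case)] = form
    return paradigm


LEXICON = {
    ("sr", "ruka"): build_paradigm(RUKA_FORMS),
    # English irregular plurals from the examples above; no case dimension.
    ("en", "goose"): {("singular", None): "goose", ("plural", None): "geese"},
    ("en", "fish"): {("singular", None): "fish", ("plural", None): "fish"},
}


def inflect(language, lemma, number, case=None):
    """Look up one stored form; raises KeyError for unknown entries."""
    return LEXICON[(language, lemma)][(number, case)]


print(inflect("sr", "ruka", "plural", "dative"))
print(inflect("en", "goose", "plural"))
```

Listing every form costs space but treats regular and irregular words the same way, which may be convenient while the lexicon format is still in flux.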

@@ -0,0 +1,19 @@
Januar;N;MASC;NOM;SG;INAN
Member

Is there a way to use some other format? This is the kind of format that I typically use.

Januar: singular accusative dative nominative masculine noun
Januare: plural accusative genitive nominative masculine noun
Januaren: plural dative masculine noun
Januars: singular genitive masculine noun

I omitted some mapping details between these entries, but that's mostly how I store this type of data. I also use an XML format that I convert to the syntax above. The XML structure is inverted relative to the style that I listed above.
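
A short sketch of how lines in this style could be read back into structured data. The grammeme-to-category table covers only the words that appear in the example lines plus the other two German genders; `parse_entry` and `CATEGORY_OF` are hypothetical names, and the mapping details mentioned as omitted are not modeled.

```python
# Grammeme vocabulary taken from the example lines; a real lexicon needs more.
CATEGORY_OF = {
    "singular": "number", "plural": "number",
    "nominative": "case", "accusative": "case",
    "dative": "case", "genitive": "case",
    "masculine": "gender", "feminine": "gender", "neuter": "gender",
    "noun": "part_of_speech",
}


def parse_entry(line):
    """Split 'form: grammeme grammeme ...' into (form, {category: set of values})."""
    form, _, rest = line.partition(":")
    features = {}
    for grammeme in rest.split():
        category = CATEGORY_OF[grammeme]  # KeyError flags an unknown grammeme
        features.setdefault(category, set()).add(grammeme)
    return form.strip(), features


form, features = parse_entry(
    "Januar: singular accusative dative nominative masculine noun")
print(form, sorted(features["case"]))
```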

Some other formats have been discussed in this group too.

As you expand to languages with numerous grammatical cases, it becomes hard to keep track of how these shouting uppercase abbreviations map to grammemes, both when they have to be pronounced in a meeting and when a new person has to work with the data.

The nice feature of XML is that it's easier to filter and annotate the data as necessary. This becomes important for languages that rely heavily on bound morphemes (no whitespace separation) and use a different grammatical number for each morpheme. For example, "your book" in Arabic is a single word with 2 morphemes. The "your" can be singular, dual or plural. The "your" can also be masculine or feminine for pronunciation, depending on who you're addressing. That possessive pronoun binds to the noun being possessed. The word "book" has a gender too, and it can be singular, dual or plural.
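
To illustrate the filtering point, here is a sketch that stores the German forms listed above in XML and selects them by grammeme with the standard library. The element and attribute names are invented for this example; they are not the reviewer's schema, whose structure is not shown in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: one <form> per surface form, grammemes as attributes.
XML = """
<lexicon language="de">
  <entry lemma="Januar" pos="noun" gender="masculine">
    <form number="singular" case="nominative accusative dative">Januar</form>
    <form number="singular" case="genitive">Januars</form>
    <form number="plural" case="nominative accusative genitive">Januare</form>
    <form number="plural" case="dative">Januaren</form>
  </entry>
</lexicon>
"""


def find_forms(root, number, case):
    """Return surface forms whose attributes include the given number and case."""
    return [
        form.text
        for form in root.iter("form")
        if form.get("number") == number and case in form.get("case", "").split()
    ]


root = ET.fromstring(XML.strip())
print(find_forms(root, "plural", "dative"))
```

Per-morpheme annotations, as in the Arabic example, could be expressed as nested elements in a layout like this, though that is left out here because the thread gives no concrete data for it.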

Contributor Author

I was going for a simple format that's easy to prototype. Should we use XML from the start? We would need to agree on the basic structure. I'll change the structure now and then we can iterate.

Contributor Author

Januar: singular accusative dative nominative masculine noun
Januare: plural accusative genitive nominative masculine noun
Januaren: plural dative masculine noun
Januars: singular genitive masculine noun

Changed format to this for now. PTAL.

Member

Looks great to me! Thanks.

Contributor Author

Could you approve the PR?

@nciric nciric requested a review from grhoten April 15, 2024 16:57
for expression in EXPRESSIONS:
    match = expression.find(json_data)
    for m in match:
        results.append(m.value + ': noun masculine nominative singular inanimate\n')

This doesn't match what I see in the files above. Are you hand-editing them after the fact, or am I misunderstanding something?

Contributor Author

I generate files with all the grammatical forms, but then delete unnecessary information (depending on the language). So there's no bug - we can expand the tool to know more about each language in the future, and reduce manual work. I leave that to the reader :).

Ah, so you do hand-edit. I wasn't objecting; I was just worried I'd missed something. :-)
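
The manual clean-up described in this thread could be automated along these lines. This is a sketch under stated assumptions: the only per-language rule encoded is the one given earlier in the conversation (French has no case or animacy), and `prune`, `DROPPED` and `CATEGORY_OF` are hypothetical names rather than part of the tool.

```python
# Category of each grammeme that the generator emits for every entry.
CATEGORY_OF = {
    "noun": "part_of_speech",
    "masculine": "gender",
    "nominative": "case",
    "singular": "number",
    "inanimate": "animacy",
}

# Categories to drop per language. Only French is stated in this thread;
# languages not listed keep everything until someone reviews them.
DROPPED = {
    "fr": {"case", "animacy"},
}


def prune(line, language):
    """Remove grammemes whose category does not apply to the language."""
    form, _, rest = line.partition(":")
    dropped = DROPPED.get(language, set())
    kept = [g for g in rest.split() if CATEGORY_OF.get(g) not in dropped]
    return f"{form.strip()}: {' '.join(kept)}"


print(prune("janvier: noun masculine nominative singular inanimate", "fr"))
print(prune("Januar: noun masculine nominative singular inanimate", "de"))
```

Keeping the rules in a table means a language expert can correct one entry without touching the generator itself.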

@nciric nciric merged commit fbc1f8b into main Apr 16, 2024
@nciric nciric deleted the cira-data branch April 16, 2024 02:06
@Kutzaki

Kutzaki commented Apr 18, 2024

In the data for Spanish, these day names are both singular & plural:
lunes;N;MASC;NOM;SG;INAN
martes;N;MASC;NOM;SG;INAN
miércoles;N;MASC;NOM;SG;INAN
jueves;N;MASC;NOM;SG;INAN
viernes;N;MASC;NOM;SG;INAN

noviembre;N;MASC;NOM;SG;INAN
diciembre;N;MASC;NOM;SG;INAN
domingo;N;MASC;NOM;SG;INAN
lunes;N;MASC;NOM;SG;INAN

Add PL to lunes, martes, miércoles, jueves and viernes
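
A sketch of how this fix could be applied mechanically to the semicolon-separated lines quoted above. Inserting `PL` as an extra field right after `SG` is only one possible encoding and is an assumption here, since the thread does not say how a form that is both singular and plural should be tagged; `add_plural_tag` is a hypothetical helper.

```python
# Spanish weekday names named in the review comment as both singular and plural.
INVARIANT_NUMBER = {"lunes", "martes", "miércoles", "jueves", "viernes"}


def add_plural_tag(line):
    """Add PL after SG for the invariant weekday names; leave other lines alone."""
    fields = line.strip().split(";")
    word = fields[0]
    if word in INVARIANT_NUMBER and "SG" in fields and "PL" not in fields:
        fields.insert(fields.index("SG") + 1, "PL")
    return ";".join(fields)


for entry in ("lunes;N;MASC;NOM;SG;INAN", "domingo;N;MASC;NOM;SG;INAN"):
    print(add_plural_tag(entry))
```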

7 participants