CLDR-json data extraction script #28
I checked German, English, and Serbian. German looks fine to me; it is easy because all month and day names are masculine. For English, I would say we do not need gender, but we might keep it as a placeholder, as natural gender might still be needed for some applications. Serbian looks good. Are we using Cyrillic only, or should we also include Latin script, since the two are used more or less equally? I am not sure whether Cyrillic alone is used in localisation scenarios.
Thanks! Serbian Cyrillic is enough, as we can generate Latin using the ICU transliterator.
(Reply sent by email to Jelena Mitrović's comment of Sat, Apr 6, 2024, 05:03.)
From Bruno Cartoni: "However, we need to discuss a fundamental question: what kind of grammatical features do we want to include in every language's lexicon (as you touch on a little with "gender" in English)? IMO, we should describe each language with the features that are needed, not with all the available features. E.g. French doesn't have "case", nor "animacy" features, so they don't need to be listed in nominal entries."
So French has two attributes too many (case and animacy). We shouldn't carry attributes that don't make sense for the language. I will remove them in the PR.
While this PR is far from the final lexicon format we'll use, I do want to add some features later, such as exceptions (not as part of this PR), e.g. English goose -> geese, fish -> fish. How do we encode exceptions, or the various forms of a word, so that we can look them up in the lexicon?
data/de/lexicon.txt
Outdated
@@ -0,0 +1,19 @@
Januar;N;MASC;NOM;SG;INAN
Is there a way to use some other format? This is the kind of format that I typically use.
Januar: singular accusative dative nominative masculine noun
Januare: plural accusative genitive nominative masculine noun
Januaren: plural dative masculine noun
Januars: singular genitive masculine noun
I omitted some mapping details between these entries, but that's mostly how I store this type of data. I also use an XML format that I convert to the syntax above. The XML structure is inverted relative to the style listed above.
Some other formats have been discussed in this group too.
As you expand to languages with numerous grammatical cases, it becomes hard to keep track of how these shouting uppercase abbreviations map to grammemes, both when they have to be pronounced in a meeting and when a new person has to work with the data.
The nice feature of XML is that it's easier to filter and annotate the data as necessary. This becomes important for languages that use bound morphemes a lot (no whitespace separation) and use a different grammatical number value for each morpheme. For example, "your book" in Arabic is a single word with 2 morphemes. The "your" can be singular, dual or plural. The "your" can be masculine or feminine for pronunciation, depending on who you're addressing. That possessive pronoun binds to the noun being possessed. The word "book" has a gender too, and it can be turned into singular, dual or plural.
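For illustration, here is a minimal Python sketch of how entries in the "form: grammemes" syntax above could be parsed and queried. The `CATEGORIES` table and the `parse_entry` and `lookup` helpers are hypothetical names introduced for this example; the thread does not define a fixed grammeme inventory or any such API.

```python
# Hypothetical grammeme inventory; only values that appear in the
# German example above (plus a few obvious siblings) are listed.
CATEGORIES = {
    "number": {"singular", "dual", "plural"},
    "case": {"nominative", "accusative", "dative", "genitive"},
    "gender": {"masculine", "feminine", "neuter"},
    "pos": {"noun"},
}


def parse_entry(line):
    """Split 'form: grammeme grammeme ...' into (form, {category: values})."""
    form, _, tail = line.partition(":")
    features = {}
    for grammeme in tail.split():
        for category, values in CATEGORIES.items():
            if grammeme in values:
                features.setdefault(category, set()).add(grammeme)
                break
        else:
            # No category claimed this word, so the entry is malformed.
            raise ValueError(f"unknown grammeme: {grammeme!r}")
    return form.strip(), features


def lookup(entries, **wanted):
    """Return forms whose features include every requested category=value."""
    return [
        form
        for form, features in entries
        if all(value in features.get(cat, set()) for cat, value in wanted.items())
    ]


LEXICON = """\
Januar: singular accusative dative nominative masculine noun
Januare: plural accusative genitive nominative masculine noun
Januaren: plural dative masculine noun
Januars: singular genitive masculine noun
"""

entries = [parse_entry(line) for line in LEXICON.splitlines() if line.strip()]
print(lookup(entries, number="singular", case="genitive"))  # ['Januars']
print(lookup(entries, number="plural", case="dative"))      # ['Januaren']
```

Spelling the grammemes out in full keeps each line readable aloud, which is the concern raised above about uppercase abbreviations.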
I was going for a simple format that's easy to prototype. Should we use XML from the start? We would need to agree on the basic structure. I'll change the structure now and then we can iterate.
Januar: singular accusative dative nominative masculine noun
Januare: plural accusative genitive nominative masculine noun
Januaren: plural dative masculine noun
Januars: singular genitive masculine noun
Changed format to this for now. PTAL.
Looks great to me! Thanks.
Could you approve the PR?
for expression in EXPRESSIONS:
    match = expression.find(json_data)
    for m in match:
        results.append(m.value + ': noun masculine nominative singular inanimate\n')
This doesn't match what I see in the files above. Are you hand-editing them after the fact, or am I misunderstanding something?
I generate files with all the grammatical forms, but then delete unnecessary information (depending on the language). So there's no bug - we can expand the tool to know more about each language in the future, and reduce manual work. I leave that to the reader :).
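As a rough sketch of how the tool could "know more about each language", the manual deletion step might be replaced by a per-language table of grammatical categories to keep. Everything here is an assumption made for illustration: the names `GRAMMEME_CATEGORY`, `KEPT_CATEGORIES` and `filter_entry` are hypothetical, the French row follows the comment that French needs neither case nor animacy, and the Serbian row follows the PR description's statement that the full annotation is correct for sr.

```python
# Maps each grammeme the script currently emits to its category.
GRAMMEME_CATEGORY = {
    "noun": "pos",
    "masculine": "gender",
    "nominative": "case",
    "singular": "number",
    "inanimate": "animacy",
}

# Hypothetical per-language configuration (not part of the PR).
KEPT_CATEGORIES = {
    "fr": {"pos", "gender", "number"},                     # no case, no animacy
    "sr": {"pos", "gender", "case", "number", "animacy"},  # keep everything
}


def filter_entry(line, language):
    """Drop grammemes whose category the given language does not use."""
    form, _, tail = line.partition(":")
    kept = KEPT_CATEGORIES[language]
    grammemes = [g for g in tail.split() if GRAMMEME_CATEGORY[g] in kept]
    return f"{form.strip()}: {' '.join(grammemes)}"


raw = "janvier: noun masculine nominative singular inanimate"
print(filter_entry(raw, "fr"))  # janvier: noun masculine singular
```

A table like this would also make the generated files reproducible from the script alone, which addresses the reviewer's question about hand-editing.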
Ah, so you do hand-edit. I wasn't objecting; I was just worried I'd missed something. :-)
In the Spanish data, these day names have the same form in the singular and the plural:
data/es/lexicon.txt
Outdated
noviembre;N;MASC;NOM;SG;INAN
diciembre;N;MASC;NOM;SG;INAN
domingo;N;MASC;NOM;SG;INAN
lunes;N;MASC;NOM;SG;INAN
Add PL to lunes, martes, miércoles, jueves and viernes
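One possible way to apply this in the script, sketched under assumptions: the set below contains exactly the five day names listed in this comment, while the helper names and the output layout (form, colon, space-separated grammemes) are hypothetical and not taken from the PR.

```python
# Spanish day names named in the review comment as identical in singular and plural.
INVARIANT_PLURAL = {"lunes", "martes", "miércoles", "jueves", "viernes"}


def number_grammemes(word):
    """Return the number values to attach to a Spanish day-name entry."""
    return ["singular", "plural"] if word in INVARIANT_PLURAL else ["singular"]


def describe(word):
    # Output layout is an assumption made for this sketch.
    return f"{word}: noun masculine {' '.join(number_grammemes(word))}"


print(describe("lunes"))    # lunes: noun masculine singular plural
print(describe("domingo"))  # domingo: noun masculine singular
```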
Created a script that pulls month names and day names from cldr-json for the sr, es, fr, de and en languages.
I appended grammatical info, which is correct for sr but is probably wrong for day names in the other languages. Also, some grammatical information may not be needed for some languages, e.g. I am not sure whether animacy matters in English, or whether days and months have gender.
We can land it as is, then file an issue for others to fix.