Bbaw egyptian #2290
Hi @phiwi, Thanks for contributing this nice dataset. If you have any blocking problem or question, do not hesitate to ask here. We are pleased to help you. Could you please first synchronize with our master branch? From your branch
albertvillanova
left a comment
You should also remove the file datasets/dummy/0.0.0/dummy_data.zip, because you have already attached the dummy data in datasets/bbaw_egyptian/dummy/0.0.0/dummy_data.zip.
lhoestq
left a comment
Thanks for adding this one :)
I left a few comments
Also, could you please remove the file at datasets/dummy/0.0.0/dummy_data.zip?
datasets/bbaw_egyptian/README.md
Outdated
annotations_creators:
- specialized egyptologists
language_creators:
- found
languages:
- de, en, eg
licenses:
- cc-by-4.0
multilinguality:
- multilingual
size_categories:
- 100K<n<1000K
source_datasets:
- extended|wikipedia
task_categories:
- translation
specialized egyptologists is not a valid annotations_creators tag. You can use this instead:
annotations_creators:
- expert-generated
There is a tool to create those tags here.
For the languages, there should be one language per line:
- de
- en
- eg
Finally the task_ids tags are missing:
task_categories:
- conditional-text-generation
task_ids:
- machine-translation
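The three points in this comment (a valid annotations_creators value, one language per line, and a task_ids field) can be checked mechanically. The following is only a minimal sketch, not the project's metadata validator: `check_tags` is a hypothetical helper, and the list of accepted annotations_creators values is an assumption that may not match the validator's actual list.

```python
# Hypothetical helper, not the real metadata validator.
# Assumption: this set approximates the accepted annotations_creators values.
ALLOWED_ANNOTATIONS_CREATORS = {
    "crowdsourced",
    "expert-generated",
    "found",
    "machine-generated",
    "no-annotation",
    "other",
}
REQUIRED = ("annotations_creators", "languages", "task_categories", "task_ids")


def check_tags(tags):
    """Return a list of human-readable problems found in a tags mapping."""
    problems = []
    # Every required field must be present.
    for key in REQUIRED:
        if key not in tags:
            problems.append(f"missing field: {key}")
    # annotations_creators must use one of the accepted values.
    for value in tags.get("annotations_creators", []):
        if value not in ALLOWED_ANNOTATIONS_CREATORS:
            problems.append(f"invalid annotations_creators tag: {value!r}")
    # Each languages entry must hold exactly one language code.
    for value in tags.get("languages", []):
        if "," in value:
            problems.append(f"languages entry holds several codes: {value!r}")
    return problems


# Tags as originally submitted in the README.
original = {
    "annotations_creators": ["specialized egyptologists"],
    "languages": ["de, en, eg"],
    "task_categories": ["translation"],
}
# Tags after applying the suggestions from this comment.
fixed = {
    "annotations_creators": ["expert-generated"],
    "languages": ["de", "en", "eg"],
    "task_categories": ["conditional-text-generation"],
    "task_ids": ["machine-translation"],
}

for problem in check_tags(original):
    print(problem)
```

Run on the originally submitted tags, the sketch reports the missing task_ids field, the invalid annotations_creators value, and the combined languages entry; the corrected tags produce no problems.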
| } | ||
| ``` | ||
|
|
||
| ### Contributions |
| ### Contributions | |
| ### Contributions | |
| Thanks to [@phiwi](https://github.com/phiwi) for adding this dataset. |
| def _split_generators(self, dl_manager): | ||
| """Returns SplitGenerators.""" | ||
| my_urls = self._URLS | ||
| data_dir = dl_manager.download_and_extract(my_urls) |
There's no extraction
| data_dir = dl_manager.download_and_extract(my_urls) | |
| data_dir = dl_manager.download(my_urls) |
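To illustrate the suggestion, here is a rough sketch of the download step using `download` only. `StubDownloadManager`, `split_generator_kwargs`, the example URL, and the `filepath` key are all hypothetical stand-ins so that the snippet runs without the `datasets` library; in the real script, `dl_manager` is supplied by the library.

```python
class StubDownloadManager:
    """Stand-in for the library's download manager.

    Assumption: only the `download` method discussed in this review is
    modelled, and it merely fakes a local cache path.
    """

    def __init__(self):
        self.calls = []

    def download(self, url):
        # Record the call and return a fake local path for the raw file.
        self.calls.append(("download", url))
        return "/cache/" + url.rsplit("/", 1)[-1]


def split_generator_kwargs(dl_manager, url):
    """Build the kwargs passed on to example generation (hypothetical helper)."""
    # The file is not an archive, so it is downloaded but never extracted.
    data_file = dl_manager.download(url)
    return {"filepath": data_file}


manager = StubDownloadManager()
# Hypothetical URL; the real one lives in the dataset script's _URLS.
kwargs = split_generator_kwargs(manager, "https://example.org/data.json")
print(kwargs)
```

Since the data file is not an archive, calling `download` alone avoids an extraction step that has nothing to unpack.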

Thanks! Can you check that you have

Reformatted with black.

Hi @phiwi, there are still some minor problems with the tags you used in the dataset card (README.md). Here you can find the output of the metadata validator:

@albertvillanova corrected :-)

Thanks, @phiwi. Now all tests should pass green. However, I think there is still an issue with the language code:
I am not sure what to do in this case... Maybe @lhoestq has an idea? Maybe add the code to the list? https://github.com/huggingface/datasets/blob/master/src/datasets/utils/resources/languages.json

I have just checked that the list of valid codes already contains ISO 639-2 codes. Therefore, I would suggest that you add it to the list: and change it in the dataset card.

Done.

I hope everything is okay now.
albertvillanova
left a comment
It looks good to me. Let's see if @lhoestq has any other suggestions before merging it to master.
lhoestq
left a comment
Looks all good now thanks !
This is the "hieroglyph corpus" that I could unfortunately not contribute during the marathon. I have now re-extracted it so that it is in the state used in my paper (see documentation). I hope it satisfies your requirements, and I wish every scientist out there loads of fun deciphering a 5,000-year-old language :-)