Bbaw egyptian #2290

phiwi · 2021-04-29T15:27:58Z

This is the "hieroglyph corpus" that I unfortunately could not contribute during the marathon. I have now re-extracted it so that it is in the state used in my paper (see documentation). I hope it satisfies your requirements, and I wish every scientist out there loads of fun deciphering a 5,000-year-old language :-)

albertvillanova · 2021-05-03T08:05:31Z

Hi @phiwi,

Thanks for contributing this nice dataset. If you have any blocking problem or question, do not hesitate to ask here. We are pleased to help you.

Could you please first synchronize with our master branch? From your branch bbaw_egyptian, type:

git fetch upstream master
git merge upstream/master

albertvillanova

You should also remove the file datasets/dummy/0.0.0/dummy_data.zip, because you have already attached the dummy data in datasets/bbaw_egyptian/dummy/0.0.0/dummy_data.zip.

datasets/bbaw_egyptian/bbaw_egyptian.py

lhoestq

Thanks for adding this one :)

I left a few comments
Also could you remove the file at datasets/dummy/0.0.0/dummy_data.zip please ?

lhoestq · 2021-05-03T08:29:17Z

datasets/bbaw_egyptian/README.md

+annotations_creators:
+- specialized egyptologists
+language_creators:
+- found
+languages:
+- de, en, eg
+licenses:
+- cc-by-4.0
+multilinguality:
+- multilingual
+size_categories:
+- 100K<n<1000K
+source_datasets:
+- extended|wikipedia
+task_categories:
+- translation


specialized egyptologists is not a valid annotations_creators tag. You can use this instead:

annotations_creators:
- expert-generated

There is a tool to create those tags here.

For the languages, there should be one language per line:

- de
- en
- eg

Finally the task_ids tags are missing:

task_categories:
- conditional-text-generation
task_ids:
- machine-translation
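
To illustrate the language-tag point: an entry written as `- de, en, eg` is read as one string rather than three codes. A minimal Python sketch of that pitfall, assuming the tags have already been loaded into a list of strings; `split_language_tags` is a hypothetical helper for illustration and not part of the `datasets` library:

```python
def split_language_tags(tags):
    """Split comma-separated entries so that each list item holds exactly one tag.

    Hypothetical helper for illustration; not part of the `datasets` library.
    """
    result = []
    for entry in tags:
        for code in entry.split(","):
            code = code.strip()
            if code:  # skip empty pieces such as a trailing comma
                result.append(code)
    return result


# A single list item "de, en, eg" is one string, not three language codes.
as_written = ["de, en, eg"]
print(split_language_tags(as_written))
```

Writing one code per line in the card avoids the need for any such post-processing.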

lhoestq · 2021-05-03T08:31:20Z

datasets/bbaw_egyptian/README.md

+}
+```
+
+### Contributions


Suggested change

### Contributions

### Contributions

Thanks to [@phiwi](https://github.com/phiwi) for adding this dataset.

lhoestq · 2021-05-03T08:33:16Z

datasets/bbaw_egyptian/bbaw_egyptian.py

+    def _split_generators(self, dl_manager):
+        """Returns SplitGenerators."""
+        my_urls = self._URLS
+        data_dir = dl_manager.download_and_extract(my_urls)


There's no extraction

Suggested change

data_dir = dl_manager.download_and_extract(my_urls)

data_dir = dl_manager.download(my_urls)
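
For context, a minimal sketch of the suggested call, assuming the URLs point at plain files rather than archives. `StubDownloadManager`, the example URL, and the returned local paths are hypothetical stand-ins so that the snippet runs without the `datasets` library; in the real script the library passes in its own download manager.

```python
import os


class StubDownloadManager:
    """Hypothetical stand-in for the library's download manager (illustration only)."""

    def download(self, urls):
        # Mimics the shape of the result: the same dict keys, mapped to local file paths.
        return {
            name: os.path.join("cache", os.path.basename(url))
            for name, url in urls.items()
        }


# Hypothetical URL; the real values live in the dataset script.
_URLS = {"train": "https://example.org/data/train.jsonl"}


def split_generators(dl_manager, urls=_URLS):
    # The files are not archives, so download() is enough; nothing needs extracting.
    data_files = dl_manager.download(urls)
    return data_files


print(split_generators(StubDownloadManager()))
```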

lhoestq · 2021-05-03T17:43:30Z

Thanks ! Can you check that you have black==21.4b0 and run make style again ? This should fix the "check_code_quality" CI issue

phiwi · 2021-05-03T17:56:23Z

Reformatted with black.

albertvillanova · 2021-05-05T07:18:58Z

Hi @phiwi, there are still some minor problems with the tags you used in the dataset card (README.md).

Here you can find the output of the metadata validator:

WARNING:root:❌ Failed to validate 'datasets/bbaw_egyptian/README.md':
Could not validate the metada, found the following errors:
* field 'size_categories':
	['100K<n<1000K'] are not registered tags for 'size_categories', reference at https://github.com/huggingface/datasets/tree/master/src/datasets/utils/resources/size_categories.json
* field 'task_ids':
	['machine translation'] are not registered tags for 'task_ids', reference at https://github.com/huggingface/datasets/tree/master/src/datasets/utils/resources/tasks.json
* field 'languages':
	['eg'] are not registered tags for 'languages', reference at https://github.com/huggingface/datasets/tree/master/src/datasets/utils/resources/languages.json
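
The check behind this output can be pictured as a membership test of each tag against a registered list. A minimal Python sketch under that assumption; `find_unregistered_tags` and the tiny registries below are hypothetical and far smaller than the JSON resources referenced above, and this is not the validator's actual code:

```python
def find_unregistered_tags(metadata, registered):
    """Return {field: [tags missing from the registry]} for fields that have a registry.

    Hypothetical helper; the real validator in the repository may work differently.
    """
    errors = {}
    for field, tags in metadata.items():
        allowed = registered.get(field)
        if allowed is None:
            continue  # this sketch has no registry for the field
        unknown = [tag for tag in tags if tag not in allowed]
        if unknown:
            errors[field] = unknown
    return errors


# Illustrative registries only; the real lists are much longer.
registered = {
    "task_ids": {"machine-translation"},
    "languages": {"de", "en"},
}
metadata = {
    "task_ids": ["machine translation"],  # space instead of hyphen
    "languages": ["de", "en", "eg"],
}
print(find_unregistered_tags(metadata, registered))
```

An exact string match like this explains why `machine translation` (with a space) is rejected even though `machine-translation` is a registered tag.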

phiwi · 2021-05-05T08:14:21Z

@albertvillanova corrected :-)

albertvillanova · 2021-05-05T08:19:17Z

Thanks, @phiwi. Now all tests should pass green.

However, I think there is still an issue with the language code:

- the code for Ancient Egyptian is not ar-EG
- there is no ISO 639-1 code for Ancient Egyptian
- there is an ISO 639-2 code, egy, but it will not pass the validation test because it is not in the list of valid codes

I am not sure what to do in this case... Maybe @lhoestq has an idea? Maybe adding the code to the list? https://github.com/huggingface/datasets/blob/master/src/datasets/utils/resources/languages.json

albertvillanova · 2021-05-05T08:25:59Z

I have just checked that the list of valid codes already contains ISO 639-2 codes. Therefore, I suggest adding it to the list:

"egy": "Egyptian (Ancient)",

and change it in the dataset card.
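
A minimal sketch of that registry change, assuming languages.json is a flat JSON object mapping codes to names, as the quoted entry suggests. `add_language` is a hypothetical helper, the two pre-existing entries are illustrative rather than copied from the file, and keeping the keys sorted is an assumption about the file's convention:

```python
import json


def add_language(registry, code, name):
    """Return a copy of the registry with the new code added and keys sorted."""
    updated = dict(registry)
    updated[code] = name
    return dict(sorted(updated.items()))


# Illustrative subset; the real file contains many more entries.
languages = {"de": "German", "en": "English"}
languages = add_language(languages, "egy", "Egyptian (Ancient)")
print(json.dumps(languages, indent=4, ensure_ascii=False))
```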

phiwi · 2021-05-05T12:27:48Z

Done.

phiwi · 2021-05-06T09:33:01Z

I hope everything is okay now.

albertvillanova

It looks good to me. Let's see if @lhoestq has any other suggestions before merging it to master.

lhoestq

Looks all good now thanks !

Philipp and others added 17 commits April 30, 2021 14:46

adding bbaw egyptian dataset

ffe2750

adding data

120aed2

Update README.md

fa5cc41

Update README.md

45da487

Update README.md

7b6cc02

readme update

268e4d3

update README

4301acc

update

8cb7221

adding bbaw egyptian dataset

a0ddf28

adding data

fd47497

Update README.md

634ace7

Update README.md

922ef7b

Update README.md

b1970bd

readme update

f12187a

update README

ed32fd7

update

3813111

Merge branch 'bbaw_egyptian' of github.com:phiwi/datasets into bbaw_e…

51ca65a

…gyptian

albertvillanova reviewed May 3, 2021

datasets/bbaw_egyptian/bbaw_egyptian.py (outdated)

lhoestq reviewed May 3, 2021

Philipp added 3 commits May 3, 2021 17:00

Merge remote-tracking branch 'upstream/master' into bbaw_egyptian

3fb9066

minor code formatting and deletions

22aa06f

minor formatting fix

ee88d60

reformatting with black

66fa6ae

adjusting size, task_ids and languages to correct values

a35ec72

adding Ancient Egyptian (egy) to languages and changed dataset card

a17c618

albertvillanova approved these changes May 6, 2021

lhoestq approved these changes May 6, 2021

lhoestq merged commit 056b432 into huggingface:master May 6, 2021
