@jungwhank
Contributor

This PR adds the recently released KLUE (Korean Language Understanding Evaluation) dataset, based on its paper, GitHub repository, and webpage.
Please let me know if there's anything missing in the code or README.
Thanks!

@jungwhank
Contributor Author

jungwhank commented May 30, 2021

I'm not sure why I get the error below when I auto-generate the dummy data for "mrc":

datasets.keyhash.DuplicatedKeysError: FAILURE TO GENERATE DATASET !
Found duplicate Key: 0
Keys should be unique and deterministic in nature

@bzantium
Contributor

bzantium commented May 31, 2021

> I'm not sure why I get the error below when I auto-generate the dummy data for "mrc":
>
> datasets.keyhash.DuplicatedKeysError: FAILURE TO GENERATE DATASET !
> Found duplicate Key: 0
> Keys should be unique and deterministic in nature

Please check out the suggestion below. I think it might be the cause.

@jungwhank
Contributor Author

jungwhank commented May 31, 2021

> > I'm not sure why I get the error below when I auto-generate the dummy data for "mrc":
> >
> > datasets.keyhash.DuplicatedKeysError: FAILURE TO GENERATE DATASET !
> > Found duplicate Key: 0
> > Keys should be unique and deterministic in nature
>
> Please check out the suggestion below. I think it might be the cause.

The problem was that the id_ yielded in mrc was not unique (I used the index from enumerate(paragraphs) by mistake).
I fixed it and updated everything.
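
For readers hitting the same DuplicatedKeysError, here is a minimal sketch of the mistake and the fix described above. It assumes a SQuAD-style nesting (articles, then paragraphs, then questions); the field names, the sample data, and the helper `find_duplicate_keys` are hypothetical and are not taken from the actual KLUE script.

```python
# Hypothetical SQuAD-style input. The field names are assumptions,
# not the real KLUE MRC schema.
SAMPLE = [
    {
        "title": "A",
        "paragraphs": [
            {"context": "c1", "qas": [{"question": "q1"}, {"question": "q2"}]},
        ],
    },
    {
        "title": "B",
        "paragraphs": [
            {"context": "c2", "qas": [{"question": "q3"}]},
        ],
    },
]


def generate_examples_buggy(articles):
    """Yield (key, example) pairs, reusing the paragraph index as the key.

    The index restarts at 0 for every article and is shared by all
    questions of a paragraph, so keys repeat.
    """
    for article in articles:
        for idx, paragraph in enumerate(article["paragraphs"]):
            for qa in paragraph["qas"]:
                yield idx, {
                    "title": article["title"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                }


def generate_examples_fixed(articles):
    """Yield (key, example) pairs using one running counter as the key."""
    id_ = 0
    for article in articles:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield id_, {
                    "title": article["title"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                }
                id_ += 1


def find_duplicate_keys(examples):
    """Return every key that was already seen earlier in the stream."""
    seen, duplicates = set(), []
    for key, _ in examples:
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates
```

Running `find_duplicate_keys` over the buggy generator reports repeated keys, while the fixed generator yields none, which is the property the key check in the error message asks for.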

Co-authored-by: Minho Ryu <[email protected]>
Member

@lhoestq lhoestq left a comment

Awesome, thank you! You did a really great job adding the dataset.
The dataset card and the Python scripts are really good.

I just added a few comments.
After the changes regarding features, you will probably need to regenerate the dataset_infos.json file.

I also noticed that some of the dummy data files are bigger than 20KB. Could you try to reduce their sizes, please? For example, the mrc dummy data file is 200KB, which I think is because it contains several tens of examples for each split. In the dummy data we expect fewer than 5 examples so that they can be loaded quickly.
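
A quick way to spot oversized dummy files before pushing is sketched below. The function name `oversized_files`, the example path in the usage note, and reading 20KB as 20 * 1024 bytes are assumptions for illustration; this is not part of the datasets tooling.

```python
import os


def oversized_files(root, limit_bytes=20 * 1024):
    """Return (relative_path, size_in_bytes) for files under `root`
    that are larger than `limit_bytes`, largest first.

    The 20 * 1024 default is an assumed reading of the "20KB" guideline.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                found.append((os.path.relpath(path, root), size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

For example, calling `oversized_files("datasets/klue/dummy")` (an example path, adjust it to your checkout) would list any dummy archive that still needs trimming.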

@lhoestq
Member

lhoestq commented Jun 4, 2021

To fix the CI you can just merge master into your branch and it should be all green hopefully :)

@jungwhank
Contributor Author

jungwhank commented Jun 4, 2021

@lhoestq
Thanks for reviewing!

Adding a dataset card was harder than I thought. 😅
I went through your suggestions and updated the script, README details, and dummy data.

The dummy data is a little larger than expected because the ner dummy data needs about 80 lines and the dp dummy data about 25 lines to avoid ending up with 0 examples.

I'm not sure why some CI checks keep failing. Could you take a look?

@lhoestq
Member

lhoestq commented Jun 4, 2021

Thanks ! That makes sense for ner and dp

For mrc on the other hand there are still too many examples, maybe you can generate the dummy data for 5 examples for all tasks except ner and dp ?

@jungwhank
Contributor Author

jungwhank commented Jun 4, 2021

> Thanks ! That makes sense for ner and dp
>
> For mrc on the other hand there are still too many examples, maybe you can generate the dummy data for 5 examples for all tasks except ner and dp ?

Yes, I generated the dummy data with the default number of lines in datasets-cli for every dataset except "dp" and "ner".
I fixed the mrc dummy data, so I hope it's fine now :)

The CI failed because I forgot to merge master into my branch 😅

Member

@lhoestq lhoestq left a comment

Thanks a LOT ! This looks all good to me now :)

@lhoestq lhoestq merged commit ede1bbd into huggingface:master Jun 4, 2021
@jungwhank jungwhank deleted the klue branch June 9, 2021 15:00
@ingyuseong ingyuseong mentioned this pull request Jul 3, 2023