Added the HLGD dataset #2325
Looks really cool! I've suggested a few changes that should help you pass the tests. Also, please add dummy_data and make sure the dataset passes both the real-data and dummy-data tests. You can find instructions for this here.
datasets/hlgd/README.md
@@ -0,0 +1,192 @@
---
YAML tags:
Can you please remove the "YAML tags:" line? Removing it should make the check_code_quality test pass.
Alright, it should be gone!
I've also added the dummy_data and run tests both with real and dummy data!
datasets/hlgd/hlgd.py
# This method handles input defined in _split_generators to yield (key, example) tuples from the dataset.
# The `key` is here for legacy reason (tfds) and is not important in itself.

with open(filepath, "r") as f:
Please also specify the encoding when you open the JSON file. Leaving it out causes certain tests to fail.
with open(filepath, encoding="utf-8") as f:
Got it, I've added the encoding!
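For readers following along, the encoding suggestion can be sketched roughly as below. This is a minimal, standalone illustration and not the actual HLGD loading script: the function name `generate_examples`, the assumption that the file holds a JSON list of records, and the `headline` field are all hypothetical.

```python
import json
import os
import tempfile


def generate_examples(filepath):
    """Yield (key, example) tuples from a JSON file holding a list of records.

    Hypothetical stand-in for a dataset script's example generator.
    """
    # Passing encoding explicitly avoids relying on the platform default,
    # which can differ between operating systems.
    with open(filepath, encoding="utf-8") as f:
        records = json.load(f)
    for key, record in enumerate(records):
        yield key, record


# Usage with a throwaway file containing one made-up record.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump([{"headline": "Café reopens"}], f, ensure_ascii=False)
    examples = list(generate_examples(sample_path))
```

Writing and reading with the same explicit encoding keeps non-ASCII text intact regardless of the machine the tests run on.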
datasets/hlgd/README.md
extended:
- original
I think you should remove this part here
extended:
- original
Alright, I've removed it, but I had built this YAML using this tool: https://huggingface.co/datasets/tagging/
Is it a problem of different versions of the YAML format?
In any case, it seems to have solved the problem, so thank you for the help figuring it out!
I use this for dataset tagging
Is there anything else needed from my end?
Excellent, thank you!
Good job with the dataset script and the dataset card, they are really good.
I just left three comments:
lhoestq left a comment
Thanks!
Merging, since the CI error is unrelated to this PR and is fixed on master.
Thanks Bhavitvya and Quentin, this was very streamlined!
Added the HLGD dataset (huggingface#2325)
Added the Headline Grouping Dataset (HLGD), from the NAACL2021 paper: News Headline Grouping as a Challenging NLU Task
Dataset Link: https://github.com/tingofurro/headline_grouping
Paper link: https://people.eecs.berkeley.edu/~phillab/pdfs/NAACL2021_HLG.pdf