Skip to content

Conversation

@tingofurro
Copy link
Contributor

Added the Headline Grouping Dataset (HLGD), from the NAACL2021 paper: News Headline Grouping as a Challenging NLU Task
Dataset Link: https://github.com/tingofurro/headline_grouping
Paper link: https://people.eecs.berkeley.edu/~phillab/pdfs/NAACL2021_HLG.pdf

Copy link
Contributor

@bhavitvyamalik bhavitvyamalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really cool! I've commented a few changes that might help you pass the tests. Also, please add dummy_data and make sure this dataset passes real data and dummy data tests. You can find instructions for the same here.

@@ -0,0 +1,192 @@
---
YAML tags:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please remove YAML tags:. Removing this will pass your check_code_quality test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright it should be gone!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also added the dummy_data and run tests both with real and dummy data!

# This method handles input defined in _split_generators to yield (key, example) tuples from the dataset.
# The `key` is here for legacy reason (tfds) and is not important in itself.

with open(filepath, "r") as f:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add encoding as well when you read the json file. This also causes certain tests to fail.
with open(filepath, encoding="utf-8") as f:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it I've added the encoding!

Comment on lines 4 to 5
extended:
- original
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should remove this part here

Suggested change
extended:
- original
extended:
- original

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright I've removed it but I had build this YAML using this tool: https://huggingface.co/datasets/tagging/
Is it a problem of different versions of the YAML formats?

In any case, it seems to have solved the problem, so thank you for the help figuring it out!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use this for dataset tagging

@tingofurro
Copy link
Contributor Author

Is there anything else needed from my end?

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent thank you !
Good job with the dataset script and the dataset card, they are really good.

I just left three comments:

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks !

Merging since the CI error is unrelated to this PR and fixed on master

@lhoestq lhoestq merged commit 6c5742c into huggingface:master May 12, 2021
@tingofurro
Copy link
Contributor Author

Thanks Bhavitvya and Quentin, this was very streamlined!

JayantGoel001 added a commit to JayantGoel001/datasets-1 that referenced this pull request May 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants