Added the HLGD dataset #2325
Merged
f19111f
Adding README and loading script for HLGD dataset
tingofurro 78bedd1
Update README.md
tingofurro 991d2e0
Ran `make style` and flake8
tingofurro 0aa5fd4
Merge remote-tracking branch 'upstream/master' into hlgd
tingofurro ad5806d
Added example data instance and data fields
tingofurro 5358c23
[HLGD] Fixing bugs in dataset class, fixing typo in README and adding…
tingofurro be66eb6
Update README.md
tingofurro 1151f6d
[HLGD] Changed label names, README cleanup
tingofurro
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,189 @@ | ||
| --- | ||
| annotations_creators: | ||
| - crowdsourced | ||
| language_creators: | ||
| - expert-generated | ||
| languages: | ||
| - en | ||
| licenses: | ||
| - apache-2.0 | ||
| multilinguality: | ||
| - monolingual | ||
| source_datasets: | ||
| - original | ||
| task_categories: | ||
| - text-classification | ||
| task_ids: | ||
| - text-classification-other-headline-grouping | ||
| size_categories: | ||
| - 10K<n<100K | ||
| --- | ||
|
|
||
| # Dataset Card for HLGD | ||
|
||
|
|
||
| ## Table of Contents | ||
| - [Dataset Card for HLGD](#dataset-card-for-hlgd) | ||
| - [Table of Contents](#table-of-contents) | ||
| - [Dataset Description](#dataset-description) | ||
| - [Dataset Summary](#dataset-summary) | ||
| - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) | ||
| - [Languages](#languages) | ||
| - [Dataset Structure](#dataset-structure) | ||
| - [Data Instances](#data-instances) | ||
| - [Data Fields](#data-fields) | ||
| - [Data Splits](#data-splits) | ||
| - [Dataset Creation](#dataset-creation) | ||
| - [Curation Rationale](#curation-rationale) | ||
| - [Source Data](#source-data) | ||
| - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization) | ||
| - [Who are the source language producers?](#who-are-the-source-language-producers) | ||
| - [Annotations](#annotations) | ||
| - [Annotation process](#annotation-process) | ||
| - [Who are the annotators?](#who-are-the-annotators) | ||
| - [Personal and Sensitive Information](#personal-and-sensitive-information) | ||
| - [Considerations for Using the Data](#considerations-for-using-the-data) | ||
| - [Social Impact of Dataset](#social-impact-of-dataset) | ||
| - [Discussion of Biases](#discussion-of-biases) | ||
| - [Other Known Limitations](#other-known-limitations) | ||
| - [Additional Information](#additional-information) | ||
| - [Dataset Curators](#dataset-curators) | ||
| - [Licensing Information](#licensing-information) | ||
| - [Citation Information](#citation-information) | ||
| - [Contributions](#contributions) | ||
|
|
||
| ## Dataset Description | ||
|
|
||
| - **Homepage:** [https://github.com/tingofurro/headline_grouping](https://github.com/tingofurro/headline_grouping) | ||
| - **Repository:** [https://github.com/tingofurro/headline_grouping](https://github.com/tingofurro/headline_grouping) | ||
| - **Paper:** [https://people.eecs.berkeley.edu/~phillab/pdfs/NAACL2021_HLG.pdf](https://people.eecs.berkeley.edu/~phillab/pdfs/NAACL2021_HLG.pdf) | ||
| - **Leaderboard:** N/A | ||
| - **Point of Contact:** phillab (at) berkeley (dot) edu | ||
|
|
||
| ### Dataset Summary | ||
|
|
||
| HLGD is a binary classification dataset consisting of 20,056 labeled news headline pairs indicating whether the two headlines describe the same underlying world event or not. The dataset comes with an existing split between `train`, `validation` and `test` (60-20-20 by timeline: six timelines for training, two for validation and two for test). | ||
|
|
||
| ### Supported Tasks and Leaderboards | ||
|
|
||
| The paper (NAACL2021) introducing HLGD proposes three challenges making use of various amounts of data: | ||
| - Challenge 1: Headline-only. Models must make predictions using only the text of both headlines. | ||
| - Challenge 2: Headline + Time. Models must make predictions using the text and publication date of the two headlines. | ||
| - Challenge 3: Headline + Time + Other. Models can make predictions using the headline, the publication date, and any other relevant meta-data that can be obtained through the URL attached to the headline (full article content, authors, news source, etc.). | ||
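The three challenges differ only in which fields of an example a model is allowed to read. A minimal sketch of that restriction follows, assuming the field names from the Data Fields section; `CHALLENGE_FIELDS` and `select_inputs` are hypothetical names used for illustration and are not part of the dataset or the paper's code. For Challenge 3 the URLs are exposed as the entry point to any further meta-data.

```python
from typing import Dict

# Hypothetical mapping from challenge number to the example fields a model may read.
CHALLENGE_FIELDS = {
    1: ("headline_a", "headline_b"),
    2: ("headline_a", "headline_b", "date_a", "date_b"),
    3: ("headline_a", "headline_b", "date_a", "date_b", "url_a", "url_b"),
}


def select_inputs(example: Dict[str, object], challenge: int) -> Dict[str, object]:
    """Return only the fields permitted for the given challenge (the label is never included)."""
    if challenge not in CHALLENGE_FIELDS:
        raise ValueError(f"Unknown challenge: {challenge}")
    return {name: example[name] for name in CHALLENGE_FIELDS[challenge]}


# Example taken from the Data Instances section of the dataset card.
EXAMPLE = {
    "timeline_id": 4,
    "headline_a": "France fines Google nearly $57 million for first major violation of new European privacy regime",
    "headline_b": "France hits Google with record EUR50mn fine over 'forced consent' data collection",
    "date_a": "2019-01-21",
    "date_b": "2019-01-21",
    "url_a": "https://www.chicagotribune.com/business/ct-biz-france-fines-google-privacy-20190121-story.html",
    "url_b": "https://www.rt.com/news/449369-france-hits-google-with-record-fine/",
    "label": 1,
}

headline_only = select_inputs(EXAMPLE, challenge=1)
```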
|
|
||
| ### Languages | ||
|
|
||
| The dataset is in English. | ||
|
|
||
| ## Dataset Structure | ||
|
|
||
| ### Data Instances | ||
|
|
||
| A typical data instance consists of a `timeline_id` and two headlines (A/B), each associated with a URL and a publication date. Finally, a label indicates whether the two headlines describe the same underlying event (1) or not (0). Below is an example from the training set: | ||
| ``` | ||
| {'timeline_id': 4, | ||
| 'headline_a': 'France fines Google nearly $57 million for first major violation of new European privacy regime', | ||
| 'headline_b': "France hits Google with record EUR50mn fine over 'forced consent' data collection", | ||
| 'date_a': '2019-01-21', | ||
| 'date_b': '2019-01-21', | ||
| 'url_a': 'https://www.chicagotribune.com/business/ct-biz-france-fines-google-privacy-20190121-story.html', | ||
| 'url_b': 'https://www.rt.com/news/449369-france-hits-google-with-record-fine/', | ||
| 'label': 1} | ||
| ``` | ||
|
|
||
| ### Data Fields | ||
|
|
||
| - `timeline_id`: Represents the id of the timeline that the headline pair belongs to (values 0 to 9). The dev set is composed of timelines 0 and 5, and the test set of timelines 7 and 8. | ||
| - `headline_a`, `headline_b`: Raw text for the headline pair being compared | ||
| - `date_a`, `date_b`: Publication date of the respective headlines, in the `YYYY-MM-DD` format | ||
| - `url_a`, `url_b`: Original URL of the respective headlines. Can be used to retrieve additional meta-data on the headline. | ||
| - `label`: 1 if the two headlines are part of the same headline group and describe the same underlying event, 0 otherwise. | ||
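The field descriptions above can be turned into a small sanity check for a single record. This is only a sketch under the constraints stated in the list (timeline ids 0 to 9, `YYYY-MM-DD` dates, binary label); `EXPECTED_FIELDS` and `check_instance` are hypothetical helpers, not part of the loading script.

```python
from datetime import datetime

# Field names as listed in the dataset card.
EXPECTED_FIELDS = {
    "timeline_id", "headline_a", "headline_b",
    "date_a", "date_b", "url_a", "url_b", "label",
}


def check_instance(row: dict) -> None:
    """Raise ValueError if the row does not match the documented fields."""
    missing = EXPECTED_FIELDS - row.keys()
    if missing:
        raise ValueError(f"Missing fields: {sorted(missing)}")
    if row["timeline_id"] not in range(10):
        raise ValueError("timeline_id must be between 0 and 9")
    if row["label"] not in (0, 1):
        raise ValueError("label must be 0 or 1")
    for key in ("date_a", "date_b"):
        # strptime raises ValueError when the string is not a year-month-day date.
        datetime.strptime(row[key], "%Y-%m-%d")


# Example taken from the Data Instances section of the dataset card.
sample = {
    "timeline_id": 4,
    "headline_a": "France fines Google nearly $57 million for first major violation of new European privacy regime",
    "headline_b": "France hits Google with record EUR50mn fine over 'forced consent' data collection",
    "date_a": "2019-01-21",
    "date_b": "2019-01-21",
    "url_a": "https://www.chicagotribune.com/business/ct-biz-france-fines-google-privacy-20190121-story.html",
    "url_b": "https://www.rt.com/news/449369-france-hits-google-with-record-fine/",
    "label": 1,
}
check_instance(sample)
```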
|
|
||
| ### Data Splits | ||
|
|
||
| | | Train | Dev | Test | | ||
| | --------------------------- | ------- | ------ | ----- | | ||
| | Number of examples | 15,492 | 2,069 | 2,495 | | ||
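Because the split is made by timeline rather than by example, the per-split share of examples differs from the share of timelines. The short calculation below, using only the counts from the table, shows the totals and the approximate percentage of examples in each split.

```python
# Example counts copied from the Data Splits table.
SPLIT_SIZES = {"train": 15492, "dev": 2069, "test": 2495}

total = sum(SPLIT_SIZES.values())
# Share of examples per split, in percent, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in SPLIT_SIZES.items()}

print(total)   # matches the 20,056 pairs quoted in the Dataset Summary
print(shares)  # roughly 77% train, 10% dev, 12% test
```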
|
|
||
| ## Dataset Creation | ||
|
|
||
| ### Curation Rationale | ||
|
|
||
| The task of grouping headlines from diverse news sources that discuss the same underlying event is important for building interfaces that present the diversity of coverage of unfolding news events. Many news aggregators (such as Google News or Yahoo News) present several sources for a given event, with the objective of highlighting coverage diversity. | ||
| Automatic grouping of news headlines and articles remains challenging as headlines are short, heavily stylized texts. | ||
| The HeadLine Grouping Dataset introduces the first benchmark to evaluate NLU models' ability to group headlines according to the underlying event they describe. | ||
|
|
||
|
|
||
| ### Source Data | ||
|
|
||
| #### Initial Data Collection and Normalization | ||
|
|
||
| The data was obtained by collecting 10 news timelines from the NewsLens project. The timelines were selected to be diverse in topic, and each contains between 80 and 300 news articles. | ||
|
|
||
| #### Who are the source language producers? | ||
|
|
||
| The source language producers are journalists or members of the newsroom of 34 news organizations listed in the paper. | ||
|
|
||
| ### Annotations | ||
|
|
||
| #### Annotation process | ||
|
|
||
| Each timeline was annotated for group IDs by 5 independent annotators. The 5 annotations were merged into a single annotation named the global groups. | ||
| The global group IDs are then used to generate all pairs of headlines within timelines with binary labels: 1 if two headlines are part of the same global group, and 0 otherwise. A heuristic is used to remove negative examples to obtain a final dataset that has class imbalance of 1 positive example to 5 negative examples. | ||
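The pair-generation step described above can be sketched as follows. This is an illustration, not the authors' code: the heuristic used to remove negative examples is not specified here, so uniform random subsampling is used as a labeled stand-in, and `make_pairs` and `downsample_negatives` are hypothetical names.

```python
import random
from itertools import combinations


def make_pairs(headlines, group_ids):
    """Build all headline pairs within one timeline; label 1 if both share a global group."""
    pairs = []
    for i, j in combinations(range(len(headlines)), 2):
        label = 1 if group_ids[i] == group_ids[j] else 0
        pairs.append((headlines[i], headlines[j], label))
    return pairs


def downsample_negatives(pairs, ratio=5, seed=0):
    """Keep all positives and at most `ratio` negatives per positive.

    ASSUMPTION: random subsampling stands in for the unspecified heuristic.
    """
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    rng = random.Random(seed)
    keep = min(len(negatives), ratio * len(positives))
    return positives + rng.sample(negatives, keep)


# Toy timeline with four placeholder headlines in two global groups.
toy_pairs = make_pairs(["h0", "h1", "h2", "h3"], [0, 0, 1, 1])
balanced = downsample_negatives(toy_pairs, ratio=1)
```

With the default `ratio=5`, the result approaches the 1-to-5 class imbalance mentioned above whenever enough negatives are available.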
|
|
||
| #### Who are the annotators? | ||
|
|
||
| Annotators were authors of the paper and 8 crowd-workers on the Upwork platform. The crowd-workers were native English speakers with experience either in proof-reading or data-entry. | ||
|
|
||
| ### Personal and Sensitive Information | ||
|
|
||
| Annotators' identities have been anonymized. Due to the public nature of news headlines, the headlines are not expected to contain personal or sensitive information. | ||
|
|
||
| ## Considerations for Using the Data | ||
|
|
||
| ### Social Impact of Dataset | ||
|
|
||
| The purpose of this dataset is to facilitate applications that present diverse news coverage. | ||
|
|
||
| By simplifying the process of developing models that can group headlines that describe a common event, we hope the community can build applications that show news readers diverse sources covering similar events. | ||
|
|
||
| We note, however, that the annotations were performed mostly by crowd-workers and that even though inter-annotator agreement was high, it was not perfect. Annotator bias therefore remains in the dataset. | ||
|
|
||
| ### Discussion of Biases | ||
|
|
||
| There are several sources of bias in the dataset: | ||
| - Annotator bias: 10 annotators participated in the creation of the dataset. Their opinions and perspectives influenced the creation of the dataset. | ||
| - Subject matter bias: HLGD consists of headlines from 10 news timelines from diverse topics (space, tech, politics, etc.). This choice has an impact on the types of positive and negative examples that appear in the dataset. | ||
| - Source selection bias: 33 English-language news sources are represented in the dataset. This selection of news sources has an effect on the content in the timeline, and the overall dataset. | ||
| - Time-range of the timelines: the timelines selected range from 2010 to 2020, which has an influence on the language and style of news headlines. | ||
|
|
||
| ### Other Known Limitations | ||
|
|
||
| For the task of Headline Grouping, inter-annotator agreement is high (0.814) but not perfect. Some decisions for headline grouping are subjective and depend on the reader's interpretation. | ||
|
|
||
| ## Additional Information | ||
|
|
||
| ### Dataset Curators | ||
|
|
||
| The dataset was initially created by Philippe Laban, Lucas Bandarkar and Marti Hearst at UC Berkeley. | ||
|
|
||
| ### Licensing Information | ||
|
|
||
| The licensing status of the dataset depends on the legal status of news headlines. It is commonly held that news headlines fall under "fair use" ([American Bar blog post](https://www.americanbar.org/groups/gpsolo/publications/gp_solo/2011/september/fair_use_news_reviews/)). | ||
| The dataset only distributes headlines, a URL and a publication date. Users of the dataset can then retrieve additional information (such as the body content, author, etc.) directly by querying the URL. | ||
|
|
||
| ### Citation Information | ||
|
|
||
| ``` | ||
| @inproceedings{Laban2021NewsHG, | ||
| title={News Headline Grouping as a Challenging NLU Task}, | ||
| author={Laban, Philippe and Bandarkar, Lucas and Hearst, Marti A}, | ||
| booktitle={NAACL 2021}, | ||
| publisher = {Association for Computational Linguistics}, | ||
| year={2021} | ||
| } | ||
| ``` | ||
|
|
||
| ### Contributions | ||
|
|
||
| Thanks to [@tingofurro](https://github.com/tingofurro) for adding this dataset. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| {"default": {"description": "HLGD is a binary classification dataset consisting of 20,056 labeled news headlines pairs indicating\nwhether the two headlines describe the same underlying world event or not.\n", "citation": "@inproceedings{Laban2021NewsHG,\n title={News Headline Grouping as a Challenging NLU Task},\n author={Philippe Laban and Lucas Bandarkar},\n booktitle={NAACL 2021},\n publisher = {Association for Computational Linguistics},\n year={2021}\n}\n", "homepage": "https://github.com/tingofurro/headline_grouping", "license": "Apache-2.0 License", "features": {"timeline_id": {"num_classes": 10, "names": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], "names_file": null, "id": null, "_type": "ClassLabel"}, "headline_a": {"dtype": "string", "id": null, "_type": "Value"}, "headline_b": {"dtype": "string", "id": null, "_type": "Value"}, "date_a": {"dtype": "string", "id": null, "_type": "Value"}, "date_b": {"dtype": "string", "id": null, "_type": "Value"}, "url_a": {"dtype": "string", "id": null, "_type": "Value"}, "url_b": {"dtype": "string", "id": null, "_type": "Value"}, "label": {"num_classes": 2, "names": [0, 1], "names_file": null, "id": null, "_type": "ClassLabel"}}, "post_processed": null, "supervised_keys": null, "builder_name": "hlgd", "config_name": "default", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 6447212, "num_examples": 15492, "dataset_name": "hlgd"}, "test": {"name": "test", "num_bytes": 941145, "num_examples": 2495, "dataset_name": "hlgd"}, "validation": {"name": "validation", "num_bytes": 798302, "num_examples": 2069, "dataset_name": "hlgd"}}, "download_checksums": {"https://github.com/tingofurro/headline_grouping/releases/download/0.1/hlgd_classification_0.1.zip": {"num_bytes": 1858948, "checksum": "8192c72e28766debf548f0ba1f0b5c3d592cf7097af26a5d67b172c908614601"}}, "download_size": 1858948, "post_processing_size": null, "dataset_size": 8186659, 
"size_in_bytes": 10045607}} |
Binary file not shown.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,126 @@ | ||
| # coding=utf-8 | ||
| # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| """ | ||
| HLGD is a binary classification dataset consisting of 20,056 labeled news headlines pairs indicating | ||
| whether the two headlines describe the same underlying world event or not. | ||
| """ | ||
|
|
||
| import json | ||
| import os | ||
|
|
||
| import datasets | ||
|
|
||
|
|
||
| _CITATION = """\ | ||
| @inproceedings{Laban2021NewsHG, | ||
| title={News Headline Grouping as a Challenging NLU Task}, | ||
| author={Laban, Philippe and Bandarkar, Lucas and Hearst, Marti A}, | ||
| booktitle={NAACL 2021}, | ||
| publisher = {Association for Computational Linguistics}, | ||
| year={2021} | ||
| } | ||
| """ | ||
|
|
||
| _DESCRIPTION = """\ | ||
| HLGD is a binary classification dataset consisting of 20,056 labeled news headlines pairs indicating | ||
| whether the two headlines describe the same underlying world event or not. | ||
| """ | ||
|
|
||
| _HOMEPAGE = "https://github.com/tingofurro/headline_grouping" | ||
| _LICENSE = "Apache-2.0 License" | ||
| _DOWNLOAD_URL = "https://github.com/tingofurro/headline_grouping/releases/download/0.1/hlgd_classification_0.1.zip" | ||
|
|
||
|
|
||
| class HLGD(datasets.GeneratorBasedBuilder): | ||
| """Headline Grouping Dataset.""" | ||
|
|
||
| VERSION = datasets.Version("1.1.0") | ||
|
|
||
| def _info(self): | ||
| features = datasets.Features( | ||
| { | ||
| "timeline_id": datasets.features.ClassLabel(names=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), | ||
| "headline_a": datasets.Value("string"), | ||
| "headline_b": datasets.Value("string"), | ||
| "date_a": datasets.Value("string"), | ||
| "date_b": datasets.Value("string"), | ||
| "url_a": datasets.Value("string"), | ||
| "url_b": datasets.Value("string"), | ||
| "label": datasets.features.ClassLabel(names=[0, 1]), | ||
|
||
| } | ||
| ) | ||
|
|
||
| return datasets.DatasetInfo( | ||
| description=_DESCRIPTION, | ||
| features=features, | ||
| supervised_keys=None, | ||
| homepage=_HOMEPAGE, | ||
| license=_LICENSE, | ||
| citation=_CITATION, | ||
| ) | ||
|
|
||
| def _split_generators(self, dl_manager): | ||
| """Returns SplitGenerators.""" | ||
| # TODO: This method is tasked with downloading/extracting the data and defining the splits depending on the configuration | ||
| # If several configurations are possible (listed in BUILDER_CONFIGS), the configuration selected by the user is in self.config.name | ||
|
|
||
| # dl_manager is a datasets.download.DownloadManager that can be used to download and extract URLs | ||
| # It can accept any type or nested list/dict and will give back the same structure with the url replaced with path to local files. | ||
| # By default the archives will be extracted and a path to a cached folder where they are extracted is returned instead of the archive | ||
|
|
||
| data_dir = dl_manager.download_and_extract(_DOWNLOAD_URL) | ||
|
|
||
| return [ | ||
| datasets.SplitGenerator( | ||
| name=datasets.Split.TRAIN, | ||
| gen_kwargs={ | ||
| "filepath": os.path.join(data_dir, "train.json"), | ||
| "split": "train", | ||
| }, | ||
| ), | ||
| datasets.SplitGenerator( | ||
| name=datasets.Split.TEST, | ||
| gen_kwargs={"filepath": os.path.join(data_dir, "test.json"), "split": "test"}, | ||
| ), | ||
| datasets.SplitGenerator( | ||
| name=datasets.Split.VALIDATION, | ||
| gen_kwargs={ | ||
| "filepath": os.path.join(data_dir, "dev.json"), | ||
| "split": "dev", | ||
| }, | ||
| ), | ||
| ] | ||
|
|
||
| def _generate_examples( | ||
| self, filepath, split # method parameters are unpacked from `gen_kwargs` as given in `_split_generators` | ||
| ): | ||
| """Yields examples as (key, example) tuples.""" | ||
| # This method handles input defined in _split_generators to yield (key, example) tuples from the dataset. | ||
| # The `key` is here for legacy reason (tfds) and is not important in itself. | ||
|
|
||
| with open(filepath, encoding="utf-8") as f: | ||
| dataset_split = json.load(f) | ||
|
|
||
| for id_, row in enumerate(dataset_split): | ||
| yield id_, { | ||
| "timeline_id": row["timeline_id"], | ||
| "headline_a": row["headline_a"], | ||
| "headline_b": row["headline_b"], | ||
| "date_a": row["date_a"], | ||
| "date_b": row["date_b"], | ||
| "url_a": row["url_a"], | ||
| "url_b": row["url_b"], | ||
| "label": row["label"], | ||
| } | ||
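The parsing loop in `_generate_examples` can be exercised without the `datasets` library. The sketch below assumes, as the loader does, that each split file is a JSON list of row objects; `FIELDS`, `iter_examples` and `demo` are hypothetical names, and the row written to the temporary file is placeholder data, not a real dataset entry.

```python
import json
import os
import tempfile

FIELDS = (
    "timeline_id", "headline_a", "headline_b",
    "date_a", "date_b", "url_a", "url_b", "label",
)


def iter_examples(filepath):
    """Yield (key, example) tuples from one split file, mirroring the loader's loop."""
    with open(filepath, encoding="utf-8") as f:
        rows = json.load(f)
    for id_, row in enumerate(rows):
        yield id_, {name: row[name] for name in FIELDS}


def demo():
    """Write one placeholder row to a temporary train.json and read it back."""
    row = {
        "timeline_id": 0,
        "headline_a": "placeholder headline A",
        "headline_b": "placeholder headline B",
        "date_a": "2019-01-21",
        "date_b": "2019-01-21",
        "url_a": "https://example.com/a",
        "url_b": "https://example.com/b",
        "label": 1,
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump([row], f)
        return list(iter_examples(path))


examples = demo()
```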