huggingface · lhoestq · Jul 29, 2021 · Jun 10, 2021 · Jun 12, 2021 · Jun 17, 2021
diff --git a/datasets/disfl_qa/README.md b/datasets/disfl_qa/README.md
@@ -0,0 +1,179 @@
+---
+annotations_creators:
+- expert-generated
+language_creators:
+- found
+languages:
+- en-US
+licenses:
+- cc-by-4.0
+multilinguality:
+- monolingual
+pretty_name: 'DISFL-QA: A Benchmark Dataset for Understanding Disfluencies in Question
+  Answering'
+size_categories:
+- 10K<n<100K
+source_datasets:
+- original
+task_categories:
+- question-answering
+task_ids:
+- extractive-qa
+- open-domain-qa
+---
+
+# Dataset Card for DISFL-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering
+
+## Table of Contents
+- [Table of Contents](#table-of-contents)
+- [Dataset Description](#dataset-description)
+  - [Dataset Summary](#dataset-summary)
+  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
+  - [Languages](#languages)
+- [Dataset Structure](#dataset-structure)
+  - [Data Instances](#data-instances)
+  - [Data Fields](#data-fields)
+  - [Data Splits](#data-splits)
+- [Dataset Creation](#dataset-creation)
+  - [Curation Rationale](#curation-rationale)
+  - [Source Data](#source-data)
+  - [Annotations](#annotations)
+  - [Personal and Sensitive Information](#personal-and-sensitive-information)
+- [Considerations for Using the Data](#considerations-for-using-the-data)
+  - [Social Impact of Dataset](#social-impact-of-dataset)
+  - [Discussion of Biases](#discussion-of-biases)
+  - [Other Known Limitations](#other-known-limitations)
+- [Additional Information](#additional-information)
+  - [Dataset Curators](#dataset-curators)
+  - [Licensing Information](#licensing-information)
+  - [Citation Information](#citation-information)
+  - [Contributions](#contributions)
+
+## Dataset Description
+
+- **Homepage:** [Disfl-QA](https://github.com/google-research-datasets/disfl-qa)
+- **Paper:** [Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering](https://arxiv.org/pdf/2106.04016.pdf)
+- **Point of Contact:** [disfl-qa team]([email protected])
+
+### Dataset Summary
+
+Disfl-QA is a targeted dataset for contextual disfluencies in an information-seeking setting, namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 ([Rajpurkar et al., 2018](https://www.aclweb.org/anthology/P18-2124/)) dataset, where each question in the dev set is annotated to add a contextual disfluency using the paragraph as a source of distractors.
+
+The final dataset consists of ~12k (disfluent question, answer) pairs. Over 90% of the disfluencies are corrections or restarts, making it a much harder test set for disfluency correction. Disfl-QA aims to fill a major gap between the speech and NLP research communities. The authors hope the dataset can serve as a benchmark for testing the robustness of models against disfluent inputs.
+
+The experiments reveal that state-of-the-art models are brittle when subjected to disfluent inputs from Disfl-QA. Detailed experiments and analyses can be found in the [paper](https://arxiv.org/pdf/2106.04016.pdf).
+
+### Supported Tasks and Leaderboards
+
+[More Information Needed]
+
+### Languages
+
+The dataset is in English only.
+
+## Dataset Structure
+
+### Data Instances
+
+This example was too long and was cropped:
+```
+{
+    "answers": {
+        "answer_start": [94, 87, 94, 94],
+        "text": ["10th and 11th centuries", "in the 10th and 11th centuries", "10th and 11th centuries", "10th and 11th centuries"]
+    },
+    "context": "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave thei...",
+    "squad_v2_id": "56ddde6b9a695914005b9629",
+    "original question": "When were the Normans in Normandy?",
+    "disfluent question": "From which countries no tell me when were the Normans in Normandy?",
+    "title": "Normans"
+}
+```
+
+### Data Fields
+
+- `squad_v2_id`: the ID of the corresponding SQuAD-v2 question (a `string` feature).
+- `title`: a `string` feature.
+- `context`: a `string` feature.
+- `original question`: the original question from SQuAD-v2 (a `string` feature).
+- `disfluent question`: the disfluent question from Disfl-QA (a `string` feature).
+- `answers`: a dictionary feature containing:
+  - `text`: a `string` feature.
+  - `answer_start`: an `int32` feature.
+
+### Data Splits
+
+Disfl-QA consists of ~12k disfluent questions with the following train/dev/test splits:
+
+| File       | Questions |
+|------------|-----------|
+| train.json | 7182      |
+| dev.json   | 1000      |
+| test.json  | 3643      |
+
+## Dataset Creation
+
+### Curation Rationale
+
+Research in the NLP and speech communities has been impeded by the lack of curated datasets containing such disfluencies. The datasets available today are mostly conversational in nature and span a limited number of very specific domains (e.g., telephone conversations, court proceedings). Furthermore, only a small fraction of the utterances in these datasets contain disfluencies, with a limited and skewed distribution of disfluency types. In the most popular dataset in the literature, the SWITCHBOARD corpus (Godfrey et al., 1992), only 5.9% of the words are disfluencies (Charniak and Johnson, 2001), of which > 50% are repetitions (Shriberg, 1996), which have been shown to be the relatively simpler form of disfluency (Zayats et al., 2014; Jamshid Lou et al., 2018; Zayats et al., 2019). To fill this gap, the authors present DISFL-QA, the first dataset containing contextual disfluencies in an information-seeking setting, namely question answering over Wikipedia passages.
+
+### Source Data
+
+#### Initial Data Collection and Normalization
+
+DISFL-QA is constructed by asking human raters to insert disfluencies into questions from SQuAD-v2, a popular question answering dataset, using the passage and the remaining questions as context. These contextual disfluencies lend naturalness to DISFL-QA and challenge models that rely on shallow matching between question and context to predict an answer.
+
+#### Who are the source language producers?
+
+[More Information Needed]
+
+### Annotations
+
+#### Annotation process
+
+Each question associated with the paragraph is sent for a human annotation task to add a contextual disfluency using the paragraph as a source of distractors. Finally, to ensure the quality of the dataset, a subsequent round of human evaluation with an option to re-annotate is conducted.
+
+#### Who are the annotators?
+
+[More Information Needed]
+
+### Personal and Sensitive Information
+
+[More Information Needed]
+
+## Considerations for Using the Data
+
+### Social Impact of Dataset
+
+[More Information Needed]
+
+### Discussion of Biases
+
+[More Information Needed]
+
+### Other Known Limitations
+
+[More Information Needed]
+
+## Additional Information
+
+### Dataset Curators
+
+[More Information Needed]
+
+### Licensing Information
+
+Disfl-QA dataset is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
+
+### Citation Information
+
+```
+@inproceedings{gupta-etal-2021-disflqa,
+    title = "{Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering}",
+    author = "Gupta, Aditya and Xu, Jiacheng and Upadhyay, Shyam and Yang, Diyi and Faruqui, Manaal",
+    booktitle = "Findings of ACL",
+    year = "2021"
+}
+```
+
+### Contributions
+
+Thanks to [@bhavitvyamalik](https://github.com/bhavitvyamalik) for adding this dataset.
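
A quick way to check a record against the schema in the dataset card is to verify that each `answer_start` offset points at its `text` inside `context`. The snippet below is a minimal sketch in plain Python. The helper name `answer_spans_match` is hypothetical, and the record is adapted from the Normans example in the card, with the context cut right after the answer span.

```python
def answer_spans_match(record):
    """Return one boolean per answer: True if the offset points at the answer text."""
    context = record["context"]
    answers = record["answers"]
    return [
        context[start : start + len(text)] == text
        for start, text in zip(answers["answer_start"], answers["text"])
    ]


# Shortened stand-in for one dataset record (illustrative only).
record = {
    "context": (
        "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) "
        "were the people who in the 10th and 11th centuries"
    ),
    "answers": {
        "answer_start": [94, 87, 94, 94],
        "text": [
            "10th and 11th centuries",
            "in the 10th and 11th centuries",
            "10th and 11th centuries",
            "10th and 11th centuries",
        ],
    },
}

checks = answer_spans_match(record)
print(checks)
```

The same check could be mapped over a loaded split to catch offset drift, for instance if contexts were stripped or re-encoded after the offsets were computed.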
diff --git a/datasets/disfl_qa/dataset_infos.json b/datasets/disfl_qa/dataset_infos.json
@@ -0,0 +1 @@
+{"default": {"description": "Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting,\nnamely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 (Rajpurkar et al., 2018)\ndataset, where each question in the dev set is annotated to add a contextual disfluency using the paragraph as\na source of distractors.\n\nThe final dataset consists of ~12k (disfluent question, answer) pairs. Over 90% of the disfluencies are\ncorrections or restarts, making it a much harder test set for disfluency correction. Disfl-QA aims to fill a\nmajor gap between speech and NLP research community. We hope the dataset can serve as a benchmark dataset for\ntesting robustness of models against disfluent inputs.\n\nOur expriments reveal that the state-of-the-art models are brittle when subjected to disfluent inputs from\nDisfl-QA. Detailed experiments and analyses can be found in our paper.\n", "citation": "@inproceedings{gupta-etal-2021-disflqa,\n    title = \"{Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering}\",\n    author = \"Gupta, Aditya and Xu, Jiacheng and Upadhyay, Shyam and Yang, Diyi and Faruqui, Manaal\",\n    booktitle = \"Findings of ACL\",\n    year = \"2021\"\n}\n\n", "homepage": "https://github.com/google-research-datasets/disfl-qa", "license": "Disfl-QA dataset is licensed under CC BY 4.0", "features": {"squad_v2_id": {"dtype": "string", "id": null, "_type": "Value"}, "original question": {"dtype": "string", "id": null, "_type": "Value"}, "disfluent question": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "context": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "answer_start": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, 
"task_templates": [{"task": "question-answering-extractive", "question_column": "disfluent question", "context_column": "context", "answers_column": "answers"}], "builder_name": "disfl_qa", "config_name": "default", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 7712523, "num_examples": 7182, "dataset_name": "disfl_qa"}, "test": {"name": "test", "num_bytes": 3865097, "num_examples": 3643, "dataset_name": "disfl_qa"}, "validation": {"name": "validation", "num_bytes": 1072731, "num_examples": 1000, "dataset_name": "disfl_qa"}}, "download_checksums": {"https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json": {"num_bytes": 42123633, "checksum": "68dcfbb971bd3e96d5b46c7177b16c1a4e7d4bdef19fb204502738552dede002"}, "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json": {"num_bytes": 4370528, "checksum": "80a5225e94905956a6446d296ca1093975c4d3b3260f1d6c8f68bc2ab77182d8"}, "https://raw.githubusercontent.com/google-research-datasets/Disfl-QA/main/train.json": {"num_bytes": 1467771, "checksum": "5407199d0c039de5b50cfc16d1ed4a3299c9127cb549da4e4a650b30f4e000eb"}, "https://raw.githubusercontent.com/google-research-datasets/Disfl-QA/main/test.json": {"num_bytes": 771364, "checksum": "404801de916ebcb2caa82661dfd189c0520e2766db6838f6ff548088650e565e"}, "https://raw.githubusercontent.com/google-research-datasets/Disfl-QA/main/dev.json": {"num_bytes": 201742, "checksum": "b60e075b810b27a5130fd0aa2cfbc85753b71a0b30dcd2585f540f0a6afe6492"}}, "download_size": 48935038, "post_processing_size": null, "dataset_size": 12650351, "size_in_bytes": 61585389}}
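
The size fields in `dataset_infos.json` can be cross-checked by hand. The sketch below copies the numbers from the JSON above into plain Python and recomputes the totals; the variable names are illustrative, and nothing here calls the `datasets` library.

```python
# Per-split figures copied from the "splits" entry above.
splits = {
    "train": {"num_examples": 7182, "num_bytes": 7712523},
    "test": {"num_examples": 3643, "num_bytes": 3865097},
    "validation": {"num_examples": 1000, "num_bytes": 1072731},
}

# num_bytes of the five files listed under "download_checksums".
download_bytes = [42123633, 4370528, 1467771, 771364, 201742]

total_examples = sum(s["num_examples"] for s in splits.values())
dataset_size = sum(s["num_bytes"] for s in splits.values())
download_size = sum(download_bytes)
size_in_bytes = download_size + dataset_size

print(total_examples, dataset_size, download_size, size_in_bytes)
```

The recomputed values agree with the `dataset_size`, `download_size`, and `size_in_bytes` fields, and the example count is consistent with the "~12k" figure quoted in the description.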
diff --git a/datasets/disfl_qa/disfl_qa.py b/datasets/disfl_qa/disfl_qa.py
@@ -0,0 +1,198 @@
+# coding=utf-8
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""A Benchmark Dataset for Understanding Disfluencies in Question Answering"""
+
+
+import json
+
+import datasets
+from datasets.tasks import QuestionAnsweringExtractive
+
+
+_CITATION = """\
+@inproceedings{gupta-etal-2021-disflqa,
+    title = "{Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering}",
+    author = "Gupta, Aditya and Xu, Jiacheng and Upadhyay, Shyam and Yang, Diyi and Faruqui, Manaal",
+    booktitle = "Findings of ACL",
+    year = "2021"
+}
+
+"""
+
+_DESCRIPTION = """\
+Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting,
+namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 (Rajpurkar et al., 2018)
+dataset, where each question in the dev set is annotated to add a contextual disfluency using the paragraph as
+a source of distractors.
+
+The final dataset consists of ~12k (disfluent question, answer) pairs. Over 90% of the disfluencies are
+corrections or restarts, making it a much harder test set for disfluency correction. Disfl-QA aims to fill a
+major gap between the speech and NLP research communities. We hope the dataset can serve as a benchmark dataset for
+testing robustness of models against disfluent inputs.
+
+Our experiments reveal that state-of-the-art models are brittle when subjected to disfluent inputs from
+Disfl-QA. Detailed experiments and analyses can be found in our paper.
+"""
+
+_HOMEPAGE = "https://github.com/google-research-datasets/disfl-qa"
+
+_LICENSE = "Disfl-QA dataset is licensed under CC BY 4.0"
+
+_URL = "https://raw.githubusercontent.com/google-research-datasets/Disfl-QA/main/"
+
+_URLS_squad_v2 = {
+    "train": "https://rajpurkar.github.io/SQuAD-explorer/dataset/" + "train-v2.0.json",
+    "dev": "https://rajpurkar.github.io/SQuAD-explorer/dataset/" + "dev-v2.0.json",
+}
+
+
+class DisflQA(datasets.GeneratorBasedBuilder):
+    """A Benchmark Dataset for Understanding Disfluencies in Question Answering"""
+
+    VERSION = datasets.Version("1.1.0")
+
+    def _info(self):
+        features = datasets.Features(
+            {
+                "squad_v2_id": datasets.Value("string"),
+                "original question": datasets.Value("string"),
+                "disfluent question": datasets.Value("string"),
+                "title": datasets.Value("string"),
+                "context": datasets.Value("string"),
+                "answers": datasets.features.Sequence(
+                    {
+                        "text": datasets.Value("string"),
+                        "answer_start": datasets.Value("int32"),
+                    }
+                ),
+            }
+        )
+        return datasets.DatasetInfo(
+            # This is the description that will appear on the datasets page.
+            description=_DESCRIPTION,
+            # This defines the different columns of the dataset and their types
+            features=features,  # defined above for readability
+            # If there's a common (input, target) tuple from the features,
+            # specify them here. They'll be used if as_supervised=True in
+            # builder.as_dataset.
+            supervised_keys=None,
+            # Homepage of the dataset for documentation
+            homepage=_HOMEPAGE,
+            # License for the dataset if available
+            license=_LICENSE,
+            # Citation for the dataset
+            citation=_CITATION,
+            task_templates=[
+                QuestionAnsweringExtractive(
+                    question_column="disfluent question", context_column="context", answers_column="answers"
+                )
+            ],
+        )
+
+    def _split_generators(self, dl_manager):
+        """Returns SplitGenerators."""
+
+        squad_v2_downloaded_files = dl_manager.download_and_extract(_URLS_squad_v2)
+        # Index every SQuAD-v2 question by id. Each file is indexed separately because the
+        # raw JSON files share the top-level "data" key, so merging them with dict.update
+        # would keep only the file loaded last.
+        squad_v2_dict = {}
+
+        for path in squad_v2_downloaded_files.values():
+            with open(path, encoding="utf-8") as f:
+                squad_v2_dict.update(_helper_dict(json.load(f)))
+
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                # These kwargs will be passed to _generate_examples
+                gen_kwargs={
+                    "filepath": dl_manager.download_and_extract(_URL + "train.json"),
+                    "split": "train",
+                    "squad_v2_dict": squad_v2_dict,
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.TEST,
+                # These kwargs will be passed to _generate_examples
+                gen_kwargs={
+                    "filepath": dl_manager.download_and_extract(_URL + "test.json"),
+                    "split": "test",
+                    "squad_v2_dict": squad_v2_dict,
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                # These kwargs will be passed to _generate_examples
+                gen_kwargs={
+                    "filepath": dl_manager.download_and_extract(_URL + "dev.json"),
+                    "split": "dev",
+                    "squad_v2_dict": squad_v2_dict,
+                },
+            ),
+        ]
+
+    def _generate_examples(
+        self,
+        filepath,
+        split,
+        squad_v2_dict,  # method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
+    ):
+        """Yields examples as (key, example) tuples."""
+
+        with open(filepath, encoding="utf-8") as f:
+            glob_id = 0
+            # Each line is parsed as a JSON object mapping SQuAD-v2 question ids to Disfl-QA entries.
+            for row in f:
+                data = json.loads(row)
+                for squad_v2_id in data:
+                    squad_v2_example = squad_v2_dict[squad_v2_id]
+                    yield glob_id, {
+                        "squad_v2_id": squad_v2_id,
+                        "disfluent question": data[squad_v2_id]["disfluent"],
+                        "title": squad_v2_example["title"],
+                        "context": squad_v2_example["context"],
+                        "original question": squad_v2_example["question"],
+                        "answers": {
+                            "answer_start": squad_v2_example["answers"]["answer_start"],
+                            "text": squad_v2_example["answers"]["text"],
+                        },
+                    }
+                    glob_id += 1
+
+
+def _helper_dict(row_squad_v2: dict):
+    """Creates a dict of SQuAD-v2 examples keyed by question id."""
+
+    squad_v2_dict = {}
+
+    for example in row_squad_v2["data"]:
+        title = example.get("title", "").strip()
+        for paragraph in example["paragraphs"]:
+            context = paragraph["context"].strip()
+            for qa in paragraph["qas"]:
+                question = qa["question"].strip()
+                id_ = qa["id"]
+
+                answer_starts = [answer["answer_start"] for answer in qa["answers"]]
+                answers = [answer["text"].strip() for answer in qa["answers"]]
+
+                squad_v2_dict[id_] = {
+                    "title": title,
+                    "context": context,
+                    "question": question,
+                    "id": id_,
+                    "answers": {
+                        "answer_start": answer_starts,
+                        "text": answers,
+                    },
+                }
+    return squad_v2_dict
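
The id-keyed join that the script performs between the Disfl-QA files and SQuAD-v2 can be sketched in isolation. The snippet below is a minimal, self-contained illustration using tiny in-memory stand-ins for the downloaded JSON. The helper names `index_squad` and `join_disfl` and the toy records are hypothetical and not part of the script; only the `disfluent` key and the SQuAD-style nesting are taken from the code above.

```python
def index_squad(squad_json):
    """Flatten a SQuAD-style dict into {question_id: record}."""
    index = {}
    for article in squad_json["data"]:
        title = article.get("title", "").strip()
        for paragraph in article["paragraphs"]:
            context = paragraph["context"].strip()
            for qa in paragraph["qas"]:
                index[qa["id"]] = {
                    "title": title,
                    "context": context,
                    "question": qa["question"].strip(),
                    "answers": {
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                        "text": [a["text"].strip() for a in qa["answers"]],
                    },
                }
    return index


def join_disfl(disfl_json, squad_index):
    """Yield (key, example) pairs by looking up each Disfl-QA id in the SQuAD index."""
    for key, (squad_id, entry) in enumerate(disfl_json.items()):
        squad = squad_index[squad_id]
        yield key, {
            "squad_v2_id": squad_id,
            "original question": squad["question"],
            "disfluent question": entry["disfluent"],
            "title": squad["title"],
            "context": squad["context"],
            "answers": squad["answers"],
        }


# Toy stand-ins for the downloaded files (illustrative data only).
context = "The Normans settled in Normandy in the 10th century."
toy_squad = {
    "version": "v2.0",
    "data": [
        {
            "title": "Normans",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "q1",
                            "question": "When did the Normans settle in Normandy?",
                            "answers": [
                                {
                                    "text": "10th century",
                                    "answer_start": context.index("10th century"),
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}
toy_disfl = {"q1": {"disfluent": "Where no when did the Normans settle in Normandy?"}}

examples = list(join_disfl(toy_disfl, index_squad(toy_squad)))
print(examples[0])
```

Because the lookup is a plain dictionary access, a Disfl-QA id that is missing from the index raises `KeyError` rather than silently producing an incomplete example.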
diff --git a/datasets/disfl_qa/dummy/1.1.0/dummy_data.zip b/datasets/disfl_qa/dummy/1.1.0/dummy_data.zip