Commit ed5d172

Add qasper dataset
1 parent bd3654a commit ed5d172

4 files changed

Lines changed: 336 additions & 0 deletions

datasets/qasper/README.md

Lines changed: 205 additions & 0 deletions
@@ -0,0 +1,205 @@
---
annotations_creators:
- expert-generated
source_datasets:
- original
language_creators:
- expert-generated
languages:
- en-US
licenses:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- open-domain-qa
---

# Dataset Card for Qasper

## Table of Contents
- [Dataset Card for Qasper](#dataset-card-for-qasper)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://allenai.org/data/qasper](https://allenai.org/data/qasper)
- **Demo:** [https://qasper-demo.apps.allenai.org/](https://qasper-demo.apps.allenai.org/)
- **Paper:** [https://arxiv.org/abs/2105.03011](https://arxiv.org/abs/2105.03011)
- **Blogpost:** [https://medium.com/ai2-blog/question-answering-on-scientific-research-papers-f6d6da9fd55c](https://medium.com/ai2-blog/question-answering-on-scientific-research-papers-f6d6da9fd55c)
- **Leaderboards:** [https://paperswithcode.com/dataset/qasper](https://paperswithcode.com/dataset/qasper)

### Dataset Summary

QASPER is a dataset for question answering on scientific research papers. It consists of 5,049 questions over 1,585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners, who also provide supporting evidence for their answers.

### Supported Tasks and Leaderboards

- `question-answering`: The dataset can be used to train a model for Question Answering. Success on this task is typically measured by achieving a *high* [F1 score](https://huggingface.co/metrics/f1). The [official baseline model](https://github.com/allenai/qasper-led-baseline), which uses [Longformer](https://huggingface.co/transformers/model_doc/longformer.html), currently achieves a Token F1 score of 33.63. This task has an active leaderboard, which can be found [here](https://paperswithcode.com/sota/question-answering-on-qasper).

- `evidence-selection`: The dataset can be used to train a model for Evidence Selection. Success on this task is typically measured by achieving a *high* [F1 score](https://huggingface.co/metrics/f1). The [official baseline model](https://github.com/allenai/qasper-led-baseline), which uses [Longformer](https://huggingface.co/transformers/model_doc/longformer.html), currently achieves an F1 score of 39.37. This task has an active leaderboard, which can be found [here](https://paperswithcode.com/sota/evidence-selection-on-qasper).
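
Both tasks above are scored with F1. As a rough illustration of what a token-level F1 measures, here is a minimal sketch. It assumes lowercased whitespace tokens and a single reference; the official evaluator may normalize text differently and may aggregate over several reference answers, so `token_f1` is a hypothetical helper and not the benchmark's scorer.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (lowercased whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a match; one empty string counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each token is matched at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of three tokens that contains both tokens of a two-token reference has precision 2/3 and recall 1.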


### Languages

English, as it is used in research papers.

## Dataset Structure

### Data Instances

A typical instance in the dataset:

```
{
  'id': "Paper ID (string)",
  'title': "Paper Title",
  'abstract': "paper abstract ...",
  'full_text': {
      'paragraphs':[["section1_paragraph1_text","section1_paragraph2_text",...],["section2_paragraph1_text","section2_paragraph2_text",...]],
      'section_name':["section1_title","section2_title",...]},
  'qas': [explained below],
}
```
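
In the instance layout above, `full_text` holds two parallel lists: one entry per section in `section_name`, and one list of paragraphs per section in `paragraphs`. The sketch below pairs them back up. The helper name `iter_paragraphs` and the toy instance are hypothetical, made up purely for illustration.

```python
def iter_paragraphs(instance):
    """Yield (section_name, paragraph) pairs from one instance."""
    full_text = instance["full_text"]
    # section_name[i] is the title of the section whose text is paragraphs[i].
    for name, paragraphs in zip(full_text["section_name"], full_text["paragraphs"]):
        for paragraph in paragraphs:
            yield name, paragraph


# Hypothetical toy instance following the layout shown above.
toy_instance = {
    "id": "toy-paper",
    "title": "Paper Title",
    "abstract": "paper abstract ...",
    "full_text": {
        "section_name": ["Introduction", "Method"],
        "paragraphs": [["p1", "p2"], ["p3"]],
    },
    "qas": [],
}

pairs = list(iter_paragraphs(toy_instance))
```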

### Data Fields

The following is an excerpt from the dataset README:

Within "qas", some fields should be obvious. Here is some explanation about the others:

#### Fields specific to questions

- "nlp_background" shows the experience the question writer had. The values can be "zero" (no experience), "two" (0 - 2 years of experience), "five" (2 - 5 years of experience), and "infinity" (> 5 years of experience). The field may be empty as well, indicating the writer has chosen not to share this information.

- "topic_background" shows how familiar the question writer was with the topic of the paper. The values are "unfamiliar", "familiar", "research" (meaning that the topic is the research area of the writer), or null.

- "paper_read", when specified, shows whether the question writer has read the paper.

- "search_query", if not empty, is the query the question writer used to find the abstract of the paper from a large pool of abstracts we made available to them.

#### Fields specific to answers

Unanswerable answers have "unanswerable" set to true. The remaining answers have exactly one of the following fields being non-empty.

- "extractive_spans" are spans in the paper which serve as the answer.
- "free_form_answer" is a written out answer.
- "yes_no" is true iff the answer is Yes, and false iff the answer is No.

"evidence" is the set of paragraphs, figures or tables used to arrive at the answer. Tables or figures start with the string "FLOAT SELECTED".

"highlighted_evidence" is the set of sentences the answer providers selected as evidence if they chose textual evidence. The text in the "evidence" field is a mapping from these sentences to the paragraph level. That is, if you see textual evidence in the "evidence" field, it is guaranteed to be entire paragraphs, while that is not the case with "highlighted_evidence".

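The answer-field rules above can be sketched as a small classifier. This is an illustration only: `answer_type` and `split_evidence` are hypothetical helpers, and the sketch assumes unused fields are empty (`[]`, `""` or null). Because a loader may also materialize an unused "yes_no" as `False`, the sketch checks that field last.

```python
def answer_type(answer: dict) -> str:
    """Classify one answer dict by which answer field is populated."""
    if answer.get("unanswerable"):
        return "unanswerable"
    if answer.get("extractive_spans"):
        return "extractive"
    if answer.get("free_form_answer"):
        return "free_form"
    # False is a legitimate "No" answer, so compare against None, not truthiness.
    if answer.get("yes_no") is not None:
        return "yes_no"
    return "unknown"


def split_evidence(evidence):
    """Separate paragraph evidence from table/figure evidence."""
    floats = [e for e in evidence if e.startswith("FLOAT SELECTED")]
    text = [e for e in evidence if not e.startswith("FLOAT SELECTED")]
    return text, floats
```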

### Data Splits

|                     | Train | Valid |
| ------------------- | ----- | ----- |
| Number of papers    | 888   | 281   |
| Number of questions | 2593  | 1005  |
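
The two splits listed here cover fewer papers and questions than the totals quoted in the Dataset Summary (1,585 papers, 5,049 questions). The arithmetic below makes the gap explicit. The card does not say where the remainder lives; treating it as data held out from this release is an assumption.

```python
# Totals from the Dataset Summary and per-split counts from the table above.
total = {"papers": 1585, "questions": 5049}
splits = {
    "train": {"papers": 888, "questions": 2593},
    "valid": {"papers": 281, "questions": 1005},
}

# Papers and questions not accounted for by the two listed splits.
remaining = {
    key: total[key] - sum(split[key] for split in splits.values())
    for key in total
}

# Average number of questions per paper in each listed split.
questions_per_paper = {
    name: round(split["questions"] / split["papers"], 2)
    for name, split in splits.items()
}

print(remaining)
print(questions_per_paper)
```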

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

NLP papers: The full text of the papers is extracted from S2ORC (Lo et al., 2020).

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

"The annotators are NLP practitioners, not
expert researchers, and it is likely that an expert
would score higher"

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Crowdsourced NLP practitioners

### Licensing Information

[CC BY 4.0](https://creativecommons.org/licenses/by/4.0)

### Citation Information

```
@inproceedings{Dasigi2021ADO,
  title={A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers},
  author={Pradeep Dasigi and Kyle Lo and Iz Beltagy and Arman Cohan and Noah A. Smith and Matt Gardner},
  year={2021}
}
```

### Contributions

Thanks to [@cceyda](https://github.com/cceyda) for adding this dataset.

datasets/qasper/dataset_infos.json

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
{"plain_text": {"description": "Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.\n", "citation": "@article{2016arXiv160605250R,\n author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},\n Konstantin and {Liang}, Percy},\n title = \"{SQuAD: 100,000+ Questions for Machine Comprehension of Text}\",\n journal = {arXiv e-prints},\n year = 2016,\n eid = {arXiv:1606.05250},\n pages = {arXiv:1606.05250},\narchivePrefix = {arXiv},\n eprint = {1606.05250},\n}\n", "homepage": "https://rajpurkar.github.io/SQuAD-explorer/", "license": "", "features": {"id": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "context": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "answer_start": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "builder_name": "squad", "config_name": "plain_text", "version": {"version_str": "1.0.0", "description": "", "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 79426386, "num_examples": 87599, "dataset_name": "squad"}, "validation": {"name": "validation", "num_bytes": 10491883, "num_examples": 10570, "dataset_name": "squad"}}, "download_checksums": {"https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json": {"num_bytes": 30288272, "checksum": "3527663986b8295af4f7fcdff1ba1ff3f72d07d61a20f487cb238a6ef92fd955"}, "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json": {"num_bytes": 4854279, "checksum": 
"95aa6a52d5d6a735563366753ca50492a658031da74f301ac5238b03966972c9"}}, "download_size": 35142551, "post_processing_size": null, "dataset_size": 89918269, "size_in_bytes": 125060820}, "qasper": {"description": "A dataset containing 1585 papers with 5049 information-seeking questions asked by regular readers of NLP papers, and answered by a separate set of NLP practitioners.\n", "citation": "@inproceedings{Dasigi2021ADO,\n title={A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers},\n author={Pradeep Dasigi and Kyle Lo and Iz Beltagy and Arman Cohan and Noah A. Smith and Matt Gardner},\n year={2021}\n}\n", "homepage": "https://allenai.org/data/qasper", "license": "CC BY 4.0", "features": {"id": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "abstract": {"dtype": "string", "id": null, "_type": "Value"}, "full_text": {"feature": {"section_name": {"dtype": "string", "id": null, "_type": "Value"}, "paragraphs": [{"dtype": "string", "id": null, "_type": "Value"}]}, "length": -1, "id": null, "_type": "Sequence"}, "qas": {"feature": {"question": {"dtype": "string", "id": null, "_type": "Value"}, "question_id": {"dtype": "string", "id": null, "_type": "Value"}, "nlp_background": {"dtype": "string", "id": null, "_type": "Value"}, "topic_background": {"dtype": "string", "id": null, "_type": "Value"}, "paper_read": {"dtype": "string", "id": null, "_type": "Value"}, "search_query": {"dtype": "string", "id": null, "_type": "Value"}, "question_writer": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"answer": {"unanswerable": {"dtype": "bool", "id": null, "_type": "Value"}, "extractive_spans": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "yes_no": {"dtype": "bool", "id": null, "_type": "Value"}, "free_form_answer": {"dtype": "string", "id": null, "_type": "Value"}, "evidence": {"feature": 
{"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "highlighted_evidence": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}}, "annotation_id": {"dtype": "string", "id": null, "_type": "Value"}, "worker_id": {"dtype": "string", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "builder_name": "qasper", "config_name": "qasper", "version": {"version_str": "0.1.0", "description": null, "major": 0, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 27277970, "num_examples": 888, "dataset_name": "qasper"}, "validation": {"name": "validation", "num_bytes": 9535330, "num_examples": 281, "dataset_name": "qasper"}}, "download_checksums": {"https://qasper-dataset.s3-us-west-2.amazonaws.com/qasper-train-dev-v0.1.tgz": {"num_bytes": 10359737, "checksum": "cd0cb8911342966fcc3eb91947af149cb7cf80b4f253ff9a6f0333f4752080dd"}}, "download_size": 10359737, "post_processing_size": null, "dataset_size": 36813300, "size_in_bytes": 47173037}}
14.7 KB
Binary file not shown.

datasets/qasper/qasper.py

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
# coding=utf-8
# Copyright 2020 The TensorFlow Datasets Authors and the HuggingFace Datasets Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Lint as: python3
"""Qasper: A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers."""


import json
import os

import datasets


logger = datasets.logging.get_logger(__name__)


_CITATION = """\
@inproceedings{Dasigi2021ADO,
  title={A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers},
  author={Pradeep Dasigi and Kyle Lo and Iz Beltagy and Arman Cohan and Noah A. Smith and Matt Gardner},
  year={2021}
}
"""
_LICENSE = "CC BY 4.0"
_DESCRIPTION = """\
A dataset containing 1585 papers with 5049 information-seeking questions asked by regular readers of NLP papers, and answered by a separate set of NLP practitioners.
"""

_HOMEPAGE = "https://allenai.org/data/qasper"
_DOWNLOAD_URLS = {"data": "https://qasper-dataset.s3-us-west-2.amazonaws.com/qasper-train-dev-v0.1.tgz"}
data_files = {"train": "qasper-train-v0.1.json", "dev": "qasper-dev-v0.1.json"}

_VERSION = "0.1.0"


class Qasper(datasets.GeneratorBasedBuilder):
    """Qasper: A Dataset of Information-Seeking Q&A Anchored in Research Papers."""

    BUILDER_CONFIGS = [
        datasets.BuilderConfig(
            name="qasper",
            version=datasets.Version(_VERSION),
            description=_DESCRIPTION,
        )
    ]

    def _info(self):

        features = datasets.Features(
            {
                "id": datasets.Value("string"),
                "title": datasets.Value("string"),
                "abstract": datasets.Value("string"),
                "full_text": datasets.features.Sequence(
                    {
                        "section_name": datasets.Value("string"),
                        "paragraphs": [datasets.Value("string")],
                    }
                ),
                "qas": datasets.features.Sequence(
                    {
                        "question": datasets.Value("string"),
                        "question_id": datasets.Value("string"),
                        "nlp_background": datasets.Value("string"),
                        "topic_background": datasets.Value("string"),
                        "paper_read": datasets.Value("string"),
                        "search_query": datasets.Value("string"),
                        "question_writer": datasets.Value("string"),
                        "answers": datasets.features.Sequence(
                            {
                                "answer": {
                                    "unanswerable": datasets.Value("bool"),
                                    "extractive_spans": datasets.features.Sequence(datasets.Value("string")),
                                    "yes_no": datasets.Value("bool"),
                                    "free_form_answer": datasets.Value("string"),
                                    "evidence": datasets.features.Sequence(datasets.Value("string")),
                                    "highlighted_evidence": datasets.features.Sequence(datasets.Value("string")),
                                },
                                "annotation_id": datasets.Value("string"),
                                "worker_id": datasets.Value("string"),
                            }
                        ),
                    }
                ),
            }
        )

        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            supervised_keys=None,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        downloaded_files = dl_manager.download_and_extract(_DOWNLOAD_URLS)

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": os.path.join(downloaded_files["data"], data_files["train"])},
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={"filepath": os.path.join(downloaded_files["data"], data_files["dev"])},
            ),
        ]

    def _generate_examples(self, filepath):
        """This function returns the examples in the raw (text) form."""
        logger.info("generating examples from = %s", filepath)
        with open(filepath, encoding="utf-8") as f:
            qasper = json.load(f)
            for id_ in qasper:
                qasper[id_]["id"] = id_
                yield id_, qasper[id_]
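
`_generate_examples` relies on the data file being a single JSON object keyed by paper ID, and copies each key into the record as its `id`. The standalone sketch below reproduces that pattern without the `datasets` dependency so it can be tried on a small file. The function name `generate_examples` and the toy file contents are hypothetical; unlike the loader, the sketch builds a new dict per record instead of mutating the loaded JSON.

```python
import json
import os
import tempfile


def generate_examples(filepath):
    """Yield (paper_id, record) pairs from a JSON object keyed by paper ID."""
    with open(filepath, encoding="utf-8") as f:
        papers = json.load(f)
    for paper_id, paper in papers.items():
        # Copy the key into the record so each example carries its own ID.
        yield paper_id, {**paper, "id": paper_id}


# Hypothetical toy file with the same top-level shape as the real data files.
toy = {"toy-paper-1": {"title": "T", "abstract": "A", "full_text": [], "qas": []}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(toy, f)
    examples = list(generate_examples(path))
```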
