60 changes: 49 additions & 11 deletions datasets/fever/README.md
@@ -2,6 +2,24 @@
languages:
- en
paperswithcode_id: fever
annotations_creators:
- crowdsourced
language_creators:
- found
licenses:
- cc-by-sa-3.0
- gpl-3.0
multilinguality:
- monolingual
pretty_name: FEVER
size_categories:
- 100K<n<1M
source_datasets:
- extended|wikipedia
task_categories:
- text-classification
task_ids:
- text-classification-other-knowledge-verification
---

# Dataset Card for "fever"
@@ -46,15 +64,17 @@ With billions of individual pages on the web providing information on almost eve

The FEVER workshops are a venue for work in verifiable knowledge extraction and to stimulate progress in this direction.

FEVER v1.0 consists of claims generated by altering sentences extracted from Wikipedia, which were subsequently verified without knowledge of the sentence they were derived from. Annotators classify each claim as SUPPORTED, REFUTED, or NOTENOUGHINFO.

### Supported Tasks and Leaderboards

The task is verification of textual claims against textual sources.

Compared to textual entailment (TE)/natural language inference, the key difference is that in those tasks the passage used to verify each claim is given and, in recent years, typically consists of a single sentence, while in verification systems the evidence must be retrieved from a large set of documents.

### Languages

The dataset is in English.
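The splits can be loaded with the Hugging Face `datasets` library. A minimal sketch — the config name `"v1.0"` follows the configurations documented below, and the hard-coded record mirrors the Data Instances example; treat both as assumptions:

```python
def load_fever_train():
    # Requires the `datasets` package and a network connection on first use;
    # downloads are cached afterwards.
    from datasets import load_dataset
    return load_dataset("fever", "v1.0", split="train")

# A record has the shape shown in the Data Instances section below:
example = {
    "id": 75397,
    "label": "SUPPORTS",
    "claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.",
    "evidence_wiki_url": "Nikolaj_Coster-Waldau",
}
print(example["label"])  # SUPPORTS
```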

## Dataset Structure

@@ -70,7 +90,13 @@ We show detailed information for up to 5 configurations of the dataset.

An example of 'train' looks as follows.
```
{'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.',
'evidence_wiki_url': 'Nikolaj_Coster-Waldau',
'label': 'SUPPORTS',
'id': 75397,
'evidence_id': 104971,
'evidence_sentence_id': 7,
'evidence_annotation_id': 92206}
```

#### v2.0
@@ -81,7 +107,13 @@ An example of 'train' looks as follows.

An example of 'validation' looks as follows.
```
{'claim': "There is a convicted statutory rapist called Chinatown's writer.",
'evidence_wiki_url': '',
'label': 'NOT ENOUGH INFO',
'id': 500000,
'evidence_id': -1,
'evidence_sentence_id': -1,
'evidence_annotation_id': 269158}
```
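As the example above shows, claims labelled `NOT ENOUGH INFO` carry sentinel evidence fields (`-1` and an empty string). A small helper to separate verifiable claims from NEI ones — a sketch, not part of the dataset's API:

```python
def has_evidence(record: dict) -> bool:
    # Records with no retrievable evidence use -1 as a sentinel evidence_id.
    return record["evidence_id"] != -1

sample = {
    "claim": "There is a convicted statutory rapist called Chinatown's writer.",
    "evidence_wiki_url": "",
    "label": "NOT ENOUGH INFO",
    "id": 500000,
    "evidence_id": -1,
    "evidence_sentence_id": -1,
    "evidence_annotation_id": 269158,
}
print(has_evidence(sample))  # False
```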

#### wiki_pages
@@ -92,14 +124,17 @@ An example of 'validation' looks as follows.

An example of 'wikipedia_pages' looks as follows.
```
{'text': 'The following are the football -LRB- soccer -RRB- events of the year 1928 throughout the world . ',
'lines': '0\tThe following are the football -LRB- soccer -RRB- events of the year 1928 throughout the world .\n1\t',
'id': '1928_in_association_football'}
```
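The `text` and `lines` fields keep Penn Treebank-style bracket tokens (`-LRB-`, `-RRB-`) from the processed Wikipedia dump. A minimal, approximate detokenizing helper — the mapping below is an assumption covering the common bracket tokens, and surrounding spaces are left untouched:

```python
PTB_BRACKETS = {
    "-LRB-": "(",  # left round bracket
    "-RRB-": ")",  # right round bracket
    "-LSB-": "[",
    "-RSB-": "]",
    "-LCB-": "{",
    "-RCB-": "}",
}

def restore_brackets(text: str) -> str:
    """Replace PTB bracket tokens with the literal characters."""
    for token, char in PTB_BRACKETS.items():
        text = text.replace(token, char)
    return text

print(restore_brackets("the football -LRB- soccer -RRB- events"))
# the football ( soccer ) events
```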

### Data Fields

The data fields are the same among all splits.

#### v1.0

- `id`: an `int32` feature.
- `label`: a `string` feature.
- `claim`: a `string` feature.
@@ -109,6 +144,7 @@ The data fields are the same among all splits.
- `evidence_sentence_id`: an `int32` feature.

#### v2.0

- `id`: an `int32` feature.
- `label`: a `string` feature.
- `claim`: a `string` feature.
@@ -118,6 +154,7 @@ The data fields are the same among all splits.
- `evidence_sentence_id`: an `int32` feature.

#### wiki_pages

- `id`: a `string` feature.
- `text`: a `string` feature.
- `lines`: a `string` feature.
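The `lines` field packs a page's sentences as newline-separated `index<TAB>text` entries, with empty trailing entries like `'1\t'` (see the wiki_pages example above). A sketch of a parser — note that real dumps may append further tab-separated link anchors after the sentence text, which this sketch keeps verbatim:

```python
def parse_lines(lines: str) -> dict:
    """Split a wiki_pages `lines` field into {sentence_id: sentence_text}."""
    sentences = {}
    for entry in lines.split("\n"):
        idx, _, text = entry.partition("\t")
        if idx and text:  # skip empty trailing entries like '1\t'
            sentences[int(idx)] = text
    return sentences

print(parse_lines("0\tThe following are the football events .\n1\t"))
# {0: 'The following are the football events .'}
```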
@@ -194,20 +231,21 @@ The data fields are the same among all splits.

### Licensing Information


FEVER license:

```
These data annotations incorporate material from Wikipedia, which is licensed pursuant to the Wikipedia Copyright Policy. These annotations are made available under the license terms described on the applicable Wikipedia article pages, or, where Wikipedia license terms are unavailable, under the Creative Commons Attribution-ShareAlike License (version 3.0), available at http://creativecommons.org/licenses/by-sa/3.0/ (collectively, the “License Terms”). You may not use these files except in compliance with the applicable License Terms.
```

### Citation Information

```bibtex
@inproceedings{Thorne18Fever,
author = {Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit},
title = {{FEVER}: a Large-scale Dataset for Fact Extraction and VERification},
booktitle = {NAACL-HLT},
year = {2018}
}

```


6 changes: 3 additions & 3 deletions datasets/fever/fever.py
```diff
@@ -212,8 +212,8 @@ def _generate_examples(self, filepath):
                     "evidence_sentence_id": -1,
                 }
         elif self.config.name == "wiki_pages":
-            for file in filepath:
+            for file_id, file in enumerate(filepath):
                 with open(file, encoding="utf-8") as f:
-                    for id_, row in enumerate(f):
+                    for row_id, row in enumerate(f):
                         data = json.loads(row)
-                        yield id_, data
+                        yield f"{file_id}_{row_id}", data
```
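The change above matters because `enumerate(f)` restarts at 0 for every file, so rows from different wiki dump files would collide on the same example key; prefixing the file index keeps keys unique. A self-contained sketch, with in-memory lists standing in for the real JSONL files (the contents are hypothetical):

```python
files = [["rowA", "rowB"], ["rowC"]]  # stand-ins for two wiki dump files

keys = []
for file_id, rows in enumerate(files):
    for row_id, _row in enumerate(rows):
        keys.append(f"{file_id}_{row_id}")

print(keys)  # ['0_0', '0_1', '1_0'] — unique across files
assert len(keys) == len(set(keys))
```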