diff --git a/datasets/subjqa/README.md b/datasets/subjqa/README.md
new file mode 100644
index 00000000000..5c896afb47a
--- /dev/null
+++ b/datasets/subjqa/README.md
@@ -0,0 +1,288 @@
+---
+annotations_creators:
+- expert-generated
+language_creators:
+- found
+languages:
+- en
+licenses:
+- unknown
+multilinguality:
+- monolingual
+size_categories:
+- 1K<n<10K
+source_datasets:
+- original
+- extended|yelp_review_full
+- extended|other-amazon_reviews_ucsd
+- extended|other-tripadvisor_reviews
+task_categories:
+- question-answering
+task_ids:
+- extractive-qa
+---
+
+# Dataset Card for subjqa
+
+## Table of Contents
+- [Dataset Card for subjqa](#dataset-card-for-subjqa)
+  - [Table of Contents](#table-of-contents)
+  - [Dataset Description](#dataset-description)
+    - [Dataset Summary](#dataset-summary)
+    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
+    - [Languages](#languages)
+  - [Dataset Structure](#dataset-structure)
+    - [Data Instances](#data-instances)
+    - [Data Fields](#data-fields)
+    - [Data Splits](#data-splits)
+  - [Dataset Creation](#dataset-creation)
+    - [Curation Rationale](#curation-rationale)
+    - [Source Data](#source-data)
+      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
+      - [Who are the source language producers?](#who-are-the-source-language-producers)
+    - [Annotations](#annotations)
+      - [Annotation process](#annotation-process)
+      - [Who are the annotators?](#who-are-the-annotators)
+    - [Personal and Sensitive Information](#personal-and-sensitive-information)
+  - [Considerations for Using the Data](#considerations-for-using-the-data)
+    - [Social Impact of Dataset](#social-impact-of-dataset)
+    - [Discussion of Biases](#discussion-of-biases)
+    - [Other Known Limitations](#other-known-limitations)
+  - [Additional Information](#additional-information)
+    - [Dataset Curators](#dataset-curators)
+    - [Licensing Information](#licensing-information)
+    - [Citation Information](#citation-information)
+    - [Contributions](#contributions)
+
+## Dataset Description
+
+- **Repository:** https://github.com/lewtun/SubjQA
+- **Paper:** https://arxiv.org/abs/2004.14283
+- **Point of Contact:** [Lewis Tunstall](mailto:lewis.c.tunstall@gmail.com)
+
+### Dataset Summary
+
+SubjQA is a question answering dataset that focuses on subjective (as opposed to factual) questions and answers. The dataset consists of roughly **10,000** questions over reviews from 6 different domains: books, movies, grocery, electronics, TripAdvisor (i.e. hotels), and restaurants. Each question is paired with a review, and a span is highlighted as the answer to the question (with some questions having no answer). Moreover, both questions and answer spans are assigned a _subjectivity_ label by annotators. A question such as _"How much does this product weigh?"_ is factual (i.e., low subjectivity), while _"Is this easy to use?"_ is subjective (i.e., high subjectivity).
+
+In short, SubjQA provides a setting to study how well extractive QA systems perform at finding answers that are less factual, and to what extent modeling subjectivity can improve the performance of QA systems.
+
+_Note:_ Much of the information provided on this dataset card is taken from the README provided by the authors in their GitHub repository ([link](https://github.com/megagonlabs/SubjQA)).
+
+To load a domain with `datasets` you can run the following:
+
+```python
+from datasets import load_dataset
+
+# other options include: electronics, grocery, movies, restaurants, tripadvisor
+dataset = load_dataset("subjqa", "books")
+```
+
+### Supported Tasks and Leaderboards
+
+* `question-answering`: The dataset can be used to train a model for extractive question answering, which involves questions whose answer can be identified as a span of text in a review. Success on this task is typically measured by achieving a high Exact Match or F1 score. A BERT model that is first fine-tuned on SQuAD 2.0 and then further fine-tuned on SubjQA achieves the scores shown in the figure below.
+
+![scores](https://user-images.githubusercontent.com/26859204/117199763-e02e1100-adea-11eb-9198-f3190329a588.png)
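+
+As a rough illustration of the two metrics named above, here is a simplified sketch of Exact Match and token-level F1 for a single prediction. It only lowercases and splits on whitespace; the official SQuAD evaluation script additionally strips punctuation and articles, so scores from this sketch are not directly comparable. The function names are illustrative and not part of the dataset or the `datasets` library.
+
+```python
+from collections import Counter
+
+
+def exact_match(prediction: str, reference: str) -> float:
+    """Return 1.0 if both strings match after lowercasing and whitespace cleanup."""
+    return float(prediction.lower().split() == reference.lower().split())
+
+
+def token_f1(prediction: str, reference: str) -> float:
+    """Token-overlap F1 between a predicted span and a reference span."""
+    pred_tokens = prediction.lower().split()
+    ref_tokens = reference.lower().split()
+    if not pred_tokens or not ref_tokens:
+        # Two empty spans count as a correct "no answer" prediction.
+        return float(pred_tokens == ref_tokens)
+    # Multiset intersection counts each shared token at most as often as it appears in both.
+    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
+    if overlap == 0:
+        return 0.0
+    precision = overlap / len(pred_tokens)
+    recall = overlap / len(ref_tokens)
+    return 2 * precision * recall / (precision + recall)
+```
+
+Because some questions have no answer, the empty-span branch matters: an empty prediction against an empty reference scores 1.0, and any other combination with one empty side scores 0.0.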
+
+
+### Languages
+
+The text in the dataset is in English and the associated BCP-47 code is `en`.
+
+## Dataset Structure
+
+### Data Instances
+
+An example from the `books` domain is shown below:
+
+```json
+{
+    "answers": {
+        "ans_subj_score": [1.0],
+        "answer_start": [324],
+        "answer_subj_level": [2],
+        "is_ans_subjective": [true],
+        "text": ["This is a wonderfully written book"]
+    },
+    "context": "While I would not recommend this book to a young reader due to a couple pretty explicate scenes I would recommend it to any adult who just loves a good book.  Once I started reading it I could not put it down.  I hesitated reading it because I didn't think that the subject matter would be interesting, but I was so wrong.  This is a wonderfully written book.",
+    "domain": "books",
+    "id": "0255768496a256c5ed7caed9d4e47e4c",
+    "is_ques_subjective": false,
+    "nn_asp": "matter",
+    "nn_mod": "interesting",
+    "q_reviews_id": "a907837bafe847039c8da374a144bff9",
+    "query_asp": "part",
+    "query_mod": "fascinating",
+    "ques_subj_score": 0.0,
+    "question": "What are the parts like?",
+    "question_subj_level": 2,
+    "review_id": "a7f1a2503eac2580a0ebbc1d24fffca1",
+    "title": "0002007770"
+}
+```
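+
+Since `answer_start` is a character offset into `context`, a quick sanity check is to slice the context and compare it with the answer text. The sketch below assumes the nested `answers` layout shown above and that unanswerable questions carry empty lists; the toy record is hypothetical and not taken from the dataset.
+
+```python
+def answer_spans_match(example: dict) -> bool:
+    """Check that every answer text equals the context slice at its start index."""
+    context = example["context"]
+    answers = example["answers"]
+    return all(
+        context[start:start + len(text)] == text
+        for text, start in zip(answers["text"], answers["answer_start"])
+    )
+
+
+# Hypothetical record that mirrors the field layout above.
+toy_example = {
+    "context": "Slow start. This is a wonderfully written book.",
+    "answers": {
+        "text": ["This is a wonderfully written book"],
+        "answer_start": [12],
+    },
+}
+print(answer_spans_match(toy_example))
+```
+
+An example with no answers passes trivially, because there are no spans to compare.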
+
+### Data Fields
+
+Each domain and split consists of the following columns:
+
+* ```title```: The id of the item/business discussed in the review.
+* ```question```: The question (written based on a query opinion).
+* ```id```: A unique id assigned to the question-review pair.
+* ```q_reviews_id```: A unique id assigned to all question-review pairs with a shared question.
+* ```question_subj_level```: The subjectivity level of the question (on a 1 to 5 scale with 1 being the most subjective).
+* ```ques_subj_score```: The subjectivity score of the question computed using the [TextBlob](https://textblob.readthedocs.io/en/dev/) package.
+* ```context```: The review (that mentions the neighboring opinion).
+* ```review_id```: A unique id associated with the review.
+* ```answers.text```: The span labeled by annotators as the answer.
+* ```answers.answer_start```: The (character-level) start index of the answer span highlighted by annotators.
+* ```is_ques_subjective```: A boolean subjectivity label derived from ```question_subj_level``` (i.e., scores below 4 are considered subjective).
+* ```answers.answer_subj_level```: The subjectivity level of the answer span (on a 1 to 5 scale with 1 being the most subjective).
+* ```answers.ans_subj_score```: The subjectivity score of the answer span computed using the [TextBlob](https://textblob.readthedocs.io/en/dev/) package.
+* ```answers.is_ans_subjective```: A boolean subjectivity label derived from ```answer_subj_level``` (i.e., scores below 4 are considered subjective).
+* ```domain```: The category/domain of the review (e.g., hotels, books, ...).
+* ```nn_mod```: The modifier of the neighboring opinion (which appears in the review).
+* ```nn_asp```: The aspect of the neighboring opinion (which appears in the review).
+* ```query_mod```: The modifier of the query opinion (around which a question is manually written).
+* ```query_asp```: The aspect of the query opinion (around which a question is manually written).
+
+### Data Splits
+
+The question-review pairs from each domain are split into training, development, and test sets. The table below shows the size of the dataset per domain and split.
+
+| Domain      | Train | Dev | Test | Total |
+|-------------|-------|-----|------|-------|
+| TripAdvisor | 1165  | 230 | 512  | 1907  |
+| Restaurants | 1400  | 267 | 266  | 1933  |
+| Movies      | 1369  | 261 | 291  | 1921  |
+| Books       | 1314  | 256 | 345  | 1915  |
+| Electronics | 1295  | 255 | 358  | 1908  |
+| Grocery     | 1124  | 218 | 591  | 1933  |
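+
+As a small sketch, the per-domain totals can be recomputed from the split counts. The numbers below are copied from the Train, Dev, and Test columns of the table; the dictionary and function names are illustrative only.
+
+```python
+# Per-split example counts, copied from the table above.
+split_sizes = {
+    "tripadvisor": {"train": 1165, "dev": 230, "test": 512},
+    "restaurants": {"train": 1400, "dev": 267, "test": 266},
+    "movies": {"train": 1369, "dev": 261, "test": 291},
+    "books": {"train": 1314, "dev": 256, "test": 345},
+    "electronics": {"train": 1295, "dev": 255, "test": 358},
+    "grocery": {"train": 1124, "dev": 218, "test": 591},
+}
+
+
+def total_examples(sizes: dict) -> dict:
+    """Sum the split counts of each domain."""
+    return {domain: sum(splits.values()) for domain, splits in sizes.items()}
+
+
+totals = total_examples(split_sizes)
+for domain, total in totals.items():
+    print(f"{domain}: {total}")
+print(f"all domains: {sum(totals.values())}")
+```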
+
+Based on the subjectivity labels provided by annotators, one observes that 73% of the questions and 74% of the answers in the dataset are subjective. This provides a substantial number of subjective QA pairs as well as a reasonable number of factual questions to compare and contrast the performance of QA systems on each type of QA pair.
+
+Finally, the next table summarizes the average length of the review, the question, and the highlighted answer span for each domain, along with the percentage of answerable questions.
+
+| Domain      | Review Len | Question Len | Answer Len | % answerable |
+|-------------|------------|--------------|------------|--------------|
+| TripAdvisor | 187.25     | 5.66         | 6.71       | 78.17        |
+| Restaurants | 185.40     | 5.44         | 6.67       | 60.72        |
+| Movies      | 331.56     | 5.59         | 7.32       | 55.69        |
+| Books       | 285.47     | 5.78         | 7.78       | 52.99        |
+| Electronics | 249.44     | 5.56         | 6.98       | 58.89        |
+| Grocery     | 164.75     | 5.44         | 7.25       | 64.69        |
+
+## Dataset Creation
+
+### Curation Rationale
+
+Most question-answering datasets like SQuAD and Natural Questions focus on answering questions over factual data such as Wikipedia and news articles. However, in domains like e-commerce the questions and answers are often _subjective_, that is, they depend on the personal experience of the users. For example, a customer on Amazon may ask "Is the sound quality any good?", which is more difficult to answer than a factoid question like "What is the capital of Australia?" These considerations motivate the creation of SubjQA as a tool to investigate the relationship between subjectivity and question-answering.
+
+### Source Data
+
+#### Initial Data Collection and Normalization
+
+The SubjQA dataset is constructed based on publicly available review datasets. Specifically, the _movies_, _books_, _electronics_, and _grocery_ categories are constructed using reviews from the [Amazon Review dataset](http://jmcauley.ucsd.edu/data/amazon/links.html). The _TripAdvisor_ category, as the name suggests, is constructed using reviews from TripAdvisor which can be found [here](http://times.cs.uiuc.edu/~wang296/Data/). Finally, the _restaurants_ category is constructed using the [Yelp Dataset](https://www.yelp.com/dataset) which is also publicly available.
+
+The process of constructing SubjQA is discussed in detail in the [paper](https://arxiv.org/abs/2004.14283). In a nutshell, the dataset construction consists of the following steps:
+
+1. First, all _opinions_ expressed in reviews are extracted. In the pipeline, each opinion is modeled as a (_modifier_, _aspect_) pair, which is a pair of spans where the former describes the latter. (good, hotel) and (terrible, acting) are two examples of extracted opinions.
+2. Using Matrix Factorization techniques, implication relationships between different expressed opinions are mined. For instance, the system mines that "responsive keys" implies "good keyboard". In our pipeline, we refer to the conclusion of an implication (i.e., "good keyboard" in this example) as the _query_ opinion, and we refer to the premise (i.e., "responsive keys") as its _neighboring_ opinion.
+3. Annotators are then asked to write a question based on _query_ opinions. For instance, given "good keyboard" as the query opinion, they might write "Is this keyboard any good?"
+4. Each question written based on a _query_ opinion is then paired with a review that mentions its _neighboring_ opinion. In our example, that would be a review that mentions "responsive keys".
+5. The question and review pairs are presented to annotators to select the correct answer span, and rate the subjectivity level of the question as well as the subjectivity level of the highlighted answer span.
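+
+To make step 4 concrete, here is a toy sketch of how a question written for a _query_ opinion could be paired with reviews that mention its _neighboring_ opinion. It does not reproduce the authors' pipeline: opinion extraction and the matrix factorization step are replaced by hand-written inputs, and every name in it (`Opinion`, `pair_questions_with_reviews`) is hypothetical.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class Opinion:
+    """A (modifier, aspect) pair, e.g. ("responsive", "keys")."""
+    modifier: str
+    aspect: str
+
+
+def pair_questions_with_reviews(implications, questions, reviews):
+    """Pair each question with the reviews that mention the neighboring opinion.
+
+    implications: dict mapping a query Opinion to its neighboring Opinion (step 2)
+    questions: dict mapping a query Opinion to a question string (step 3)
+    reviews: list of (review_text, set of Opinions found in it) tuples (step 1)
+    """
+    pairs = []
+    for query, question in questions.items():
+        neighbor = implications.get(query)
+        if neighbor is None:
+            continue  # no mined implication for this query opinion
+        for text, opinions in reviews:
+            if neighbor in opinions:
+                pairs.append((question, text))
+    return pairs
+
+
+# Hand-written inputs based on the keyboard example above.
+good_keyboard = Opinion("good", "keyboard")
+responsive_keys = Opinion("responsive", "keys")
+pairs = pair_questions_with_reviews(
+    implications={good_keyboard: responsive_keys},
+    questions={good_keyboard: "Is this keyboard any good?"},
+    reviews=[
+        ("The responsive keys make typing a joy.", {responsive_keys}),
+        ("Battery life is short.", {Opinion("short", "battery life")}),
+    ],
+)
+print(pairs)
+```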
+
+A visualisation of the data collection pipeline is shown in the image below.
+
+![preview](https://user-images.githubusercontent.com/26859204/117258393-3764cd80-ae4d-11eb-955d-aa971dbb282e.jpg)
+
+#### Who are the source language producers?
+
+As described above, the source data for SubjQA is customer reviews of products and services on e-commerce websites like Amazon and TripAdvisor.
+
+### Annotations
+
+#### Annotation process
+
+The questions and answer span labels were obtained through the [Appen](https://appen.com/) platform. From the SubjQA paper:
+
+> The platform provides quality control by showing the workers 5 questions at a time, out of which one is labeled by the experts. A worker who fails to maintain 70% accuracy is kicked out by the platform and his judgements are ignored ... To ensure good quality labels, we paid each worker 5 cents per annotation.
+
+The instructions for generating a question are shown in the following figure:
+
+<img width="874" alt="ques_gen" src="https://user-images.githubusercontent.com/26859204/117259092-03d67300-ae4e-11eb-81f2-9077fee1085f.png">
+
+Similarly, the interface for the answer span and subjectivity labelling tasks is shown below:
+
+![span_collection](https://user-images.githubusercontent.com/26859204/117259223-1fda1480-ae4e-11eb-9305-658ee6e3971d.png)
+
+As described in the SubjQA paper, the workers assign subjectivity scores (1-5) to each question and the selected answer span. They can also indicate if a question cannot be answered from the given review. 
+
+
+#### Who are the annotators?
+
+Workers on the Appen platform.
+
+### Personal and Sensitive Information
+
+[Needs More Information]
+
+## Considerations for Using the Data
+
+### Social Impact of Dataset
+
+The SubjQA dataset can be used to develop question-answering systems that can provide better on-demand answers to e-commerce customers who are interested in subjective questions about products and services.
+
+### Discussion of Biases
+
+[Needs More Information]
+
+### Other Known Limitations
+
+[Needs More Information]
+
+## Additional Information
+
+### Dataset Curators
+
+The people involved in creating the SubjQA dataset are the authors of the accompanying paper:
+
+* Johannes Bjerva, Department of Computer Science, University of Copenhagen, and Department of Computer Science, Aalborg University
+* Nikita Bhutani, Megagon Labs, Mountain View
+* Behzad Golshan, Megagon Labs, Mountain View
+* Wang-Chiew Tan, Megagon Labs, Mountain View
+* Isabelle Augenstein, Department of Computer Science, University of Copenhagen
+
+### Licensing Information
+
+The SubjQA dataset is provided "as-is", and its creators make no representation as to its accuracy.
+
+The SubjQA dataset is constructed based on the following datasets and thus contains subsets of their data:
+* [Amazon Review Dataset](http://jmcauley.ucsd.edu/data/amazon/links.html) from UCSD
+    * Used for _books_, _movies_, _grocery_, and _electronics_ domains
+* [The TripAdvisor Dataset](http://times.cs.uiuc.edu/~wang296/Data/) from UIUC's Database and Information Systems Laboratory
+    * Used for the _TripAdvisor_ domain
+* [The Yelp Dataset](https://www.yelp.com/dataset)
+    * Used for the _restaurants_ domain
+
+Consequently, the data within each domain of the SubjQA dataset should be considered under the same license as the dataset it was built upon.
+
+### Citation Information
+
+If you are using the dataset, please cite the following in your work:
+```
+@inproceedings{bjerva20subjqa,
+    title = "SubjQA: A Dataset for Subjectivity and Review Comprehension",
+    author = "Bjerva, Johannes  and
+      Bhutani, Nikita  and
+      Golshan, Behzad  and
+      Tan, Wang-Chiew  and
+      Augenstein, Isabelle",
+    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
+    month = nov,
+    year = "2020",
+    publisher = "Association for Computational Linguistics",
+}
+```
+
+### Contributions
+
+Thanks to [@lewtun](https://github.com/lewtun) for adding this dataset.
diff --git a/datasets/subjqa/dataset_infos.json b/datasets/subjqa/dataset_infos.json
new file mode 100644
index 00000000000..8ef13ca17bb
--- /dev/null
+++ b/datasets/subjqa/dataset_infos.json
@@ -0,0 +1 @@
+{"books": {"description": "SubjQA is a question answering dataset that focuses on subjective questions and answers.\nThe dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery,\nelectronics, TripAdvisor (i.e. hotels), and restaurants.", "citation": "@inproceedings{bjerva20subjqa,\n    title = \"SubjQA: A Dataset for Subjectivity and Review Comprehension\",\n    author = \"Bjerva, Johannes  and\n      Bhutani, Nikita  and\n      Golahn, Behzad  and\n      Tan, Wang-Chiew  and\n      Augenstein, Isabelle\",\n    booktitle = \"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing\",\n    month = November,\n    year = \"2020\",\n    publisher = \"Association for Computational Linguistics\",\n}\n", "homepage": "", "license": "", "features": {"domain": {"dtype": "string", "id": null, "_type": "Value"}, "nn_mod": {"dtype": "string", "id": null, "_type": "Value"}, "nn_asp": {"dtype": "string", "id": null, "_type": "Value"}, "query_mod": {"dtype": "string", "id": null, "_type": "Value"}, "query_asp": {"dtype": "string", "id": null, "_type": "Value"}, "q_reviews_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_subj_level": {"dtype": "int64", "id": null, "_type": "Value"}, "ques_subj_score": {"dtype": "float32", "id": null, "_type": "Value"}, "is_ques_subjective": {"dtype": "bool", "id": null, "_type": "Value"}, "review_id": {"dtype": "string", "id": null, "_type": "Value"}, "id": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "context": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "answer_start": {"dtype": "int32", "id": null, "_type": "Value"}, "answer_subj_level": {"dtype": "int64", "id": null, "_type": "Value"}, "ans_subj_score": {"dtype": "float32", "id": 
null, "_type": "Value"}, "is_ans_subjective": {"dtype": "bool", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "builder_name": "subjqa", "config_name": "books", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 2473128, "num_examples": 1314, "dataset_name": "subjqa"}, "test": {"name": "test", "num_bytes": 649413, "num_examples": 345, "dataset_name": "subjqa"}, "validation": {"name": "validation", "num_bytes": 460214, "num_examples": 256, "dataset_name": "subjqa"}}, "download_checksums": {"https://github.com/lewtun/SubjQA/archive/refs/heads/master.zip": {"num_bytes": 11384657, "checksum": "f3d58fd04c698fccb326b7ea4ea93098cc2186a3925f4bbad9b538ed7acc72db"}}, "download_size": 11384657, "post_processing_size": null, "dataset_size": 3582755, "size_in_bytes": 14967412}, "electronics": {"description": "SubjQA is a question answering dataset that focuses on subjective questions and answers.\nThe dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery,\nelectronics, TripAdvisor (i.e. 
hotels), and restaurants.", "citation": "@inproceedings{bjerva20subjqa,\n    title = \"SubjQA: A Dataset for Subjectivity and Review Comprehension\",\n    author = \"Bjerva, Johannes  and\n      Bhutani, Nikita  and\n      Golahn, Behzad  and\n      Tan, Wang-Chiew  and\n      Augenstein, Isabelle\",\n    booktitle = \"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing\",\n    month = November,\n    year = \"2020\",\n    publisher = \"Association for Computational Linguistics\",\n}\n", "homepage": "", "license": "", "features": {"domain": {"dtype": "string", "id": null, "_type": "Value"}, "nn_mod": {"dtype": "string", "id": null, "_type": "Value"}, "nn_asp": {"dtype": "string", "id": null, "_type": "Value"}, "query_mod": {"dtype": "string", "id": null, "_type": "Value"}, "query_asp": {"dtype": "string", "id": null, "_type": "Value"}, "q_reviews_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_subj_level": {"dtype": "int64", "id": null, "_type": "Value"}, "ques_subj_score": {"dtype": "float32", "id": null, "_type": "Value"}, "is_ques_subjective": {"dtype": "bool", "id": null, "_type": "Value"}, "review_id": {"dtype": "string", "id": null, "_type": "Value"}, "id": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "context": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "answer_start": {"dtype": "int32", "id": null, "_type": "Value"}, "answer_subj_level": {"dtype": "int64", "id": null, "_type": "Value"}, "ans_subj_score": {"dtype": "float32", "id": null, "_type": "Value"}, "is_ans_subjective": {"dtype": "bool", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "builder_name": "subjqa", "config_name": "electronics", "version": 
{"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 2123648, "num_examples": 1295, "dataset_name": "subjqa"}, "test": {"name": "test", "num_bytes": 608899, "num_examples": 358, "dataset_name": "subjqa"}, "validation": {"name": "validation", "num_bytes": 419042, "num_examples": 255, "dataset_name": "subjqa"}}, "download_checksums": {"https://github.com/lewtun/SubjQA/archive/refs/heads/master.zip": {"num_bytes": 11384657, "checksum": "f3d58fd04c698fccb326b7ea4ea93098cc2186a3925f4bbad9b538ed7acc72db"}}, "download_size": 11384657, "post_processing_size": null, "dataset_size": 3151589, "size_in_bytes": 14536246}, "grocery": {"description": "SubjQA is a question answering dataset that focuses on subjective questions and answers.\nThe dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery,\nelectronics, TripAdvisor (i.e. hotels), and restaurants.", "citation": "@inproceedings{bjerva20subjqa,\n    title = \"SubjQA: A Dataset for Subjectivity and Review Comprehension\",\n    author = \"Bjerva, Johannes  and\n      Bhutani, Nikita  and\n      Golahn, Behzad  and\n      Tan, Wang-Chiew  and\n      Augenstein, Isabelle\",\n    booktitle = \"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing\",\n    month = November,\n    year = \"2020\",\n    publisher = \"Association for Computational Linguistics\",\n}\n", "homepage": "", "license": "", "features": {"domain": {"dtype": "string", "id": null, "_type": "Value"}, "nn_mod": {"dtype": "string", "id": null, "_type": "Value"}, "nn_asp": {"dtype": "string", "id": null, "_type": "Value"}, "query_mod": {"dtype": "string", "id": null, "_type": "Value"}, "query_asp": {"dtype": "string", "id": null, "_type": "Value"}, "q_reviews_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_subj_level": {"dtype": "int64", "id": null, "_type": "Value"}, 
"ques_subj_score": {"dtype": "float32", "id": null, "_type": "Value"}, "is_ques_subjective": {"dtype": "bool", "id": null, "_type": "Value"}, "review_id": {"dtype": "string", "id": null, "_type": "Value"}, "id": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "context": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "answer_start": {"dtype": "int32", "id": null, "_type": "Value"}, "answer_subj_level": {"dtype": "int64", "id": null, "_type": "Value"}, "ans_subj_score": {"dtype": "float32", "id": null, "_type": "Value"}, "is_ans_subjective": {"dtype": "bool", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "builder_name": "subjqa", "config_name": "grocery", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 1317488, "num_examples": 1124, "dataset_name": "subjqa"}, "test": {"name": "test", "num_bytes": 721827, "num_examples": 591, "dataset_name": "subjqa"}, "validation": {"name": "validation", "num_bytes": 254432, "num_examples": 218, "dataset_name": "subjqa"}}, "download_checksums": {"https://github.com/lewtun/SubjQA/archive/refs/heads/master.zip": {"num_bytes": 11384657, "checksum": "f3d58fd04c698fccb326b7ea4ea93098cc2186a3925f4bbad9b538ed7acc72db"}}, "download_size": 11384657, "post_processing_size": null, "dataset_size": 2293747, "size_in_bytes": 13678404}, "movies": {"description": "SubjQA is a question answering dataset that focuses on subjective questions and answers.\nThe dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery,\nelectronics, TripAdvisor (i.e. 
hotels), and restaurants.", "citation": "@inproceedings{bjerva20subjqa,\n    title = \"SubjQA: A Dataset for Subjectivity and Review Comprehension\",\n    author = \"Bjerva, Johannes  and\n      Bhutani, Nikita  and\n      Golahn, Behzad  and\n      Tan, Wang-Chiew  and\n      Augenstein, Isabelle\",\n    booktitle = \"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing\",\n    month = November,\n    year = \"2020\",\n    publisher = \"Association for Computational Linguistics\",\n}\n", "homepage": "", "license": "", "features": {"domain": {"dtype": "string", "id": null, "_type": "Value"}, "nn_mod": {"dtype": "string", "id": null, "_type": "Value"}, "nn_asp": {"dtype": "string", "id": null, "_type": "Value"}, "query_mod": {"dtype": "string", "id": null, "_type": "Value"}, "query_asp": {"dtype": "string", "id": null, "_type": "Value"}, "q_reviews_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_subj_level": {"dtype": "int64", "id": null, "_type": "Value"}, "ques_subj_score": {"dtype": "float32", "id": null, "_type": "Value"}, "is_ques_subjective": {"dtype": "bool", "id": null, "_type": "Value"}, "review_id": {"dtype": "string", "id": null, "_type": "Value"}, "id": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "context": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "answer_start": {"dtype": "int32", "id": null, "_type": "Value"}, "answer_subj_level": {"dtype": "int64", "id": null, "_type": "Value"}, "ans_subj_score": {"dtype": "float32", "id": null, "_type": "Value"}, "is_ans_subjective": {"dtype": "bool", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "builder_name": "subjqa", "config_name": "movies", "version": 
{"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 2986348, "num_examples": 1369, "dataset_name": "subjqa"}, "test": {"name": "test", "num_bytes": 620513, "num_examples": 291, "dataset_name": "subjqa"}, "validation": {"name": "validation", "num_bytes": 589663, "num_examples": 261, "dataset_name": "subjqa"}}, "download_checksums": {"https://github.com/lewtun/SubjQA/archive/refs/heads/master.zip": {"num_bytes": 11384657, "checksum": "f3d58fd04c698fccb326b7ea4ea93098cc2186a3925f4bbad9b538ed7acc72db"}}, "download_size": 11384657, "post_processing_size": null, "dataset_size": 4196524, "size_in_bytes": 15581181}, "restaurants": {"description": "SubjQA is a question answering dataset that focuses on subjective questions and answers.\nThe dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery,\nelectronics, TripAdvisor (i.e. hotels), and restaurants.", "citation": "@inproceedings{bjerva20subjqa,\n    title = \"SubjQA: A Dataset for Subjectivity and Review Comprehension\",\n    author = \"Bjerva, Johannes  and\n      Bhutani, Nikita  and\n      Golahn, Behzad  and\n      Tan, Wang-Chiew  and\n      Augenstein, Isabelle\",\n    booktitle = \"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing\",\n    month = November,\n    year = \"2020\",\n    publisher = \"Association for Computational Linguistics\",\n}\n", "homepage": "", "license": "", "features": {"domain": {"dtype": "string", "id": null, "_type": "Value"}, "nn_mod": {"dtype": "string", "id": null, "_type": "Value"}, "nn_asp": {"dtype": "string", "id": null, "_type": "Value"}, "query_mod": {"dtype": "string", "id": null, "_type": "Value"}, "query_asp": {"dtype": "string", "id": null, "_type": "Value"}, "q_reviews_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_subj_level": {"dtype": "int64", "id": null, "_type": "Value"}, 
"ques_subj_score": {"dtype": "float32", "id": null, "_type": "Value"}, "is_ques_subjective": {"dtype": "bool", "id": null, "_type": "Value"}, "review_id": {"dtype": "string", "id": null, "_type": "Value"}, "id": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "context": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "answer_start": {"dtype": "int32", "id": null, "_type": "Value"}, "answer_subj_level": {"dtype": "int64", "id": null, "_type": "Value"}, "ans_subj_score": {"dtype": "float32", "id": null, "_type": "Value"}, "is_ans_subjective": {"dtype": "bool", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "builder_name": "subjqa", "config_name": "restaurants", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 1823331, "num_examples": 1400, "dataset_name": "subjqa"}, "test": {"name": "test", "num_bytes": 335453, "num_examples": 266, "dataset_name": "subjqa"}, "validation": {"name": "validation", "num_bytes": 349354, "num_examples": 267, "dataset_name": "subjqa"}}, "download_checksums": {"https://github.com/lewtun/SubjQA/archive/refs/heads/master.zip": {"num_bytes": 11384657, "checksum": "f3d58fd04c698fccb326b7ea4ea93098cc2186a3925f4bbad9b538ed7acc72db"}}, "download_size": 11384657, "post_processing_size": null, "dataset_size": 2508138, "size_in_bytes": 13892795}, "tripadvisor": {"description": "SubjQA is a question answering dataset that focuses on subjective questions and answers.\nThe dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery,\nelectronics, TripAdvisor (i.e. 
hotels), and restaurants.", "citation": "@inproceedings{bjerva20subjqa,\n    title = \"SubjQA: A Dataset for Subjectivity and Review Comprehension\",\n    author = \"Bjerva, Johannes  and\n      Bhutani, Nikita  and\n      Golahn, Behzad  and\n      Tan, Wang-Chiew  and\n      Augenstein, Isabelle\",\n    booktitle = \"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing\",\n    month = November,\n    year = \"2020\",\n    publisher = \"Association for Computational Linguistics\",\n}\n", "homepage": "", "license": "", "features": {"domain": {"dtype": "string", "id": null, "_type": "Value"}, "nn_mod": {"dtype": "string", "id": null, "_type": "Value"}, "nn_asp": {"dtype": "string", "id": null, "_type": "Value"}, "query_mod": {"dtype": "string", "id": null, "_type": "Value"}, "query_asp": {"dtype": "string", "id": null, "_type": "Value"}, "q_reviews_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_subj_level": {"dtype": "int64", "id": null, "_type": "Value"}, "ques_subj_score": {"dtype": "float32", "id": null, "_type": "Value"}, "is_ques_subjective": {"dtype": "bool", "id": null, "_type": "Value"}, "review_id": {"dtype": "string", "id": null, "_type": "Value"}, "id": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "context": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "answers": {"feature": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "answer_start": {"dtype": "int32", "id": null, "_type": "Value"}, "answer_subj_level": {"dtype": "int64", "id": null, "_type": "Value"}, "ans_subj_score": {"dtype": "float32", "id": null, "_type": "Value"}, "is_ans_subjective": {"dtype": "bool", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "builder_name": "subjqa", "config_name": "tripadvisor", "version": 
{"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 1575021, "num_examples": 1165, "dataset_name": "subjqa"}, "test": {"name": "test", "num_bytes": 689508, "num_examples": 512, "dataset_name": "subjqa"}, "validation": {"name": "validation", "num_bytes": 312645, "num_examples": 230, "dataset_name": "subjqa"}}, "download_checksums": {"https://github.com/lewtun/SubjQA/archive/refs/heads/master.zip": {"num_bytes": 11384657, "checksum": "f3d58fd04c698fccb326b7ea4ea93098cc2186a3925f4bbad9b538ed7acc72db"}}, "download_size": 11384657, "post_processing_size": null, "dataset_size": 2577174, "size_in_bytes": 13961831}}
\ No newline at end of file
diff --git a/datasets/subjqa/dummy/books/1.1.0/dummy_data.zip b/datasets/subjqa/dummy/books/1.1.0/dummy_data.zip
new file mode 100644
index 00000000000..6135c4d1f69
Binary files /dev/null and b/datasets/subjqa/dummy/books/1.1.0/dummy_data.zip differ
diff --git a/datasets/subjqa/dummy/electronics/1.1.0/dummy_data.zip b/datasets/subjqa/dummy/electronics/1.1.0/dummy_data.zip
new file mode 100644
index 00000000000..10364ccb87f
Binary files /dev/null and b/datasets/subjqa/dummy/electronics/1.1.0/dummy_data.zip differ
diff --git a/datasets/subjqa/dummy/grocery/1.1.0/dummy_data.zip b/datasets/subjqa/dummy/grocery/1.1.0/dummy_data.zip
new file mode 100644
index 00000000000..77e2a73d123
Binary files /dev/null and b/datasets/subjqa/dummy/grocery/1.1.0/dummy_data.zip differ
diff --git a/datasets/subjqa/dummy/movies/1.1.0/dummy_data.zip b/datasets/subjqa/dummy/movies/1.1.0/dummy_data.zip
new file mode 100644
index 00000000000..d6f6346f4d9
Binary files /dev/null and b/datasets/subjqa/dummy/movies/1.1.0/dummy_data.zip differ
diff --git a/datasets/subjqa/dummy/restaurants/1.1.0/dummy_data.zip b/datasets/subjqa/dummy/restaurants/1.1.0/dummy_data.zip
new file mode 100644
index 00000000000..48a0d552275
Binary files /dev/null and b/datasets/subjqa/dummy/restaurants/1.1.0/dummy_data.zip differ
diff --git a/datasets/subjqa/dummy/tripadvisor/1.1.0/dummy_data.zip b/datasets/subjqa/dummy/tripadvisor/1.1.0/dummy_data.zip
new file mode 100644
index 00000000000..3faea760c3e
Binary files /dev/null and b/datasets/subjqa/dummy/tripadvisor/1.1.0/dummy_data.zip differ
diff --git a/datasets/subjqa/subjqa.py b/datasets/subjqa/subjqa.py
new file mode 100644
index 00000000000..642759c4046
--- /dev/null
+++ b/datasets/subjqa/subjqa.py
@@ -0,0 +1,211 @@
+# coding=utf-8
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""SubjQA is a question answering dataset that focuses on subjective questions and answers.
+The dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery,
+electronics, TripAdvisor (i.e. hotels), and restaurants."""
+
+
+import ast
+import os
+
+import pandas as pd
+
+import datasets
+
+
+_CITATION = """\
+@inproceedings{bjerva20subjqa,
+    title = "SubjQA: A Dataset for Subjectivity and Review Comprehension",
+    author = "Bjerva, Johannes  and
+      Bhutani, Nikita  and
+      Golshan, Behzad  and
+      Tan, Wang-Chiew  and
+      Augenstein, Isabelle",
+    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
+    month = November,
+    year = "2020",
+    publisher = "Association for Computational Linguistics",
+}
+"""
+
+_DESCRIPTION = """SubjQA is a question answering dataset that focuses on subjective questions and answers.
+The dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery,
+electronics, TripAdvisor (i.e. hotels), and restaurants."""
+
+_HOMEPAGE = ""
+
+_LICENSE = ""
+
+_URLs = {"default": "https://github.com/lewtun/SubjQA/archive/refs/heads/master.zip"}
+
+
+class Subjqa(datasets.GeneratorBasedBuilder):
+    """SubjQA is a question answering dataset that focuses on subjective questions and answers."""
+
+    VERSION = datasets.Version("1.1.0")
+
+    BUILDER_CONFIGS = [
+        datasets.BuilderConfig(name="books", version=VERSION, description="Amazon book reviews"),
+        datasets.BuilderConfig(name="electronics", version=VERSION, description="Amazon electronics reviews"),
+        datasets.BuilderConfig(name="grocery", version=VERSION, description="Amazon grocery reviews"),
+        datasets.BuilderConfig(name="movies", version=VERSION, description="Amazon movie reviews"),
+        datasets.BuilderConfig(name="restaurants", version=VERSION, description="Yelp restaurant reviews"),
+        datasets.BuilderConfig(name="tripadvisor", version=VERSION, description="TripAdvisor hotel reviews"),
+    ]
+
+    def _info(self):
+        features = datasets.Features(
+            {
+                "domain": datasets.Value("string"),
+                "nn_mod": datasets.Value("string"),
+                "nn_asp": datasets.Value("string"),
+                "query_mod": datasets.Value("string"),
+                "query_asp": datasets.Value("string"),
+                "q_reviews_id": datasets.Value("string"),
+                "question_subj_level": datasets.Value("int64"),
+                "ques_subj_score": datasets.Value("float"),
+                "is_ques_subjective": datasets.Value("bool"),
+                "review_id": datasets.Value("string"),
+                "id": datasets.Value("string"),
+                "title": datasets.Value("string"),
+                "context": datasets.Value("string"),
+                "question": datasets.Value("string"),
+                "answers": datasets.features.Sequence(
+                    {
+                        "text": datasets.Value("string"),
+                        "answer_start": datasets.Value("int32"),
+                        "answer_subj_level": datasets.Value("int64"),
+                        "ans_subj_score": datasets.Value("float"),
+                        "is_ans_subjective": datasets.Value("bool"),
+                    }
+                ),
+            }
+        )
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=features,
+            supervised_keys=None,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+        )
+
+    def _split_generators(self, dl_manager):
+        data_dir = dl_manager.download_and_extract(_URLs["default"])
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, f"SubjQA-master/SubjQA/{self.config.name}/splits/train.csv")
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.TEST,
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, f"SubjQA-master/SubjQA/{self.config.name}/splits/test.csv")
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, f"SubjQA-master/SubjQA/{self.config.name}/splits/dev.csv")
+                },
+            ),
+        ]
+
+    def _generate_examples(self, filepath):
+        df = pd.read_csv(filepath)
+        squad_format = self._convert_to_squad(df)
+        for example in squad_format["data"]:
+            title = example.get("title", "").strip()
+            for paragraph in example["paragraphs"]:
+                context = paragraph["context"].strip()
+                for qa in paragraph["qas"]:
+                    question = qa["question"].strip()
+                    question_meta = {k: v for k, v in qa.items() if k in self.question_meta_columns}
+                    id_ = qa["id"]
+                    answer_starts = [answer["answer_start"] for answer in qa["answers"]]
+                    answers = [answer["text"].strip() for answer in qa["answers"]]
+                    answer_meta = pd.DataFrame(qa["answers"], columns=self.answer_meta_columns).to_dict("list")
+                    yield id_, {
+                        **{
+                            "title": title,
+                            "context": context,
+                            "question": question,
+                            "id": id_,
+                            "answers": {
+                                **{
+                                    "answer_start": answer_starts,
+                                    "text": answers,
+                                },
+                                **answer_meta,
+                            },
+                        },
+                        **question_meta,
+                    }
+
+    def _create_paragraphs(self, df):
+        """Convert a pandas.DataFrame of (question, context, answer) rows to SQuAD paragraphs."""
+        self.question_meta_columns = [
+            "domain",
+            "nn_mod",
+            "nn_asp",
+            "query_mod",
+            "query_asp",
+            "q_reviews_id",
+            "question_subj_level",
+            "ques_subj_score",
+            "is_ques_subjective",
+            "review_id",
+        ]
+        self.answer_meta_columns = ["answer_subj_level", "ans_subj_score", "is_ans_subjective"]
+        id2review = dict(zip(df["review_id"], df["review"]))
+        pars = []
+        for review_id, review in id2review.items():
+            qas = []
+            review_df = df.query(f"review_id == '{review_id}'")
+            id2question = dict(zip(review_df["q_review_id"], review_df["question"]))
+
+            for k, v in id2question.items():
+                d = df.query(f"q_review_id == '{k}'").to_dict(orient="list")
+                answer_starts = [ast.literal_eval(a)[0] for a in d["human_ans_indices"]]
+                answer_meta = {col: vals[0] for col, vals in d.items() if col in self.answer_meta_columns}
+                question_meta = {col: vals[0] for col, vals in d.items() if col in self.question_meta_columns}
+                # Only populate answers for answerable questions; unanswerable ones get an empty list
+                if pd.unique(d["human_ans_spans"])[0] != "ANSWERNOTFOUND":
+                    answers = [
+                        {**{"text": text, "answer_start": answer_start}, **answer_meta}
+                        for text, answer_start in zip(d["human_ans_spans"], answer_starts)
+                        if text != "ANSWERNOTFOUND"
+                    ]
+                else:
+                    answers = []
+                qas.append({**{"question": v, "id": k, "answers": answers}, **question_meta})
+            # Slice the trailing " ANSWERNOTFOUND" marker off the review text to obtain the context
+            pars.append({"qas": qas, "context": review[: -len(" ANSWERNOTFOUND")]})
+        return pars
+
+    def _convert_to_squad(self, df):
+        """Convert a pandas.DataFrame of a product-based QA dataset into SQuAD format."""
+        groups = (
+            df.groupby("item_id")
+            .apply(self._create_paragraphs)
+            .to_frame(name="paragraphs")
+            .reset_index()
+            .rename(columns={"item_id": "title"})
+        )
+        squad_data = {}
+        squad_data["data"] = groups.to_dict(orient="records")
+        return squad_data