@bhavitvyamalik
Contributor

The GooAQ dataset was recently updated with train/val/test splits. This PR brings in the updated GooAQ with those splits, along with an updated README.

@bhavitvyamalik
Contributor Author

bhavitvyamalik commented Aug 12, 2021

@albertvillanova my tests are failing here:

dataset_name = 'gooaq'

    def test_load_dataset(self, dataset_name):
        configs = self.dataset_tester.load_all_configs(dataset_name, is_local=True)[:1]
>       self.dataset_tester.check_load_dataset(dataset_name, configs, is_local=True, use_local_dummy_data=True)

tests/test_dataset_common.py:234: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_dataset_common.py:187: in check_load_dataset
    self.parent.assertTrue(len(dataset[split]) > 0)
E   AssertionError: False is not true

When I load the dataset on my local machine, it works fine. Any suggestions on how I can avoid this error?

Member

@albertvillanova albertvillanova left a comment

Hi @bhavitvyamalik, thanks a lot for the addition of this dataset! ^^

The error you get is due to the dummy data you generated:

  • The IDs present in the dummy gooaq.jsonl are: 1, 2, 3, 4 and 5
  • However, the IDs present in the dummy split.json are: {"dev": [3880119, 1038845, 2069835, 1960624, 2938642], "test": [2022145, 6465663, 2063013, 1139244, 1996513], "train": [[2302335, 0.5], [6028813, 1.0], [2560106, 1.0], [2208050, 0.5], [3073548, 0.6666666666666666]]}

Because of this mismatch, the test generates a dataset which is empty for all the splits:

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'short_answer', 'answer', 'answer_type'],
        num_rows: 0
    })
    validation: Dataset({
        features: ['id', 'question', 'short_answer', 'answer', 'answer_type'],
        num_rows: 0
    })
    test: Dataset({
        features: ['id', 'question', 'short_answer', 'answer', 'answer_type'],
        num_rows: 0
    })
})

The test then fails because it checks that every split is non-empty: len(dataset[split]) > 0

You should modify one of the files so that the IDs in both files match.

For example, by setting split.json to:

{"dev": [1], "test": [2, 3], "train": [[4, 0.5], [5, 1.0]]}
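As a rough way to confirm the mismatch before rerunning the tests, the sketch below counts, for each split, how many of the listed IDs actually appear in the dummy gooaq.jsonl. It assumes that rows are assigned to a split by matching their ID against split.json, and that train entries are [id, number] pairs as in the file quoted above. The helper names `split_ids` and `rows_per_split` are hypothetical and are not part of the dataset script.

```python
def split_ids(split_spec):
    """Map each split name to the set of IDs it lists.

    Entries are either bare IDs (dev/test) or [id, number] pairs (train),
    matching the shape of the split.json quoted above.
    """
    return {
        name: {entry[0] if isinstance(entry, list) else entry for entry in entries}
        for name, entries in split_spec.items()
    }


def rows_per_split(jsonl_ids, split_spec):
    """Count, per split, how many listed IDs exist among the JSONL IDs."""
    available = set(jsonl_ids)
    return {name: len(ids & available) for name, ids in split_ids(split_spec).items()}


# IDs present in the dummy gooaq.jsonl
jsonl_ids = [1, 2, 3, 4, 5]

# The dummy split.json as originally generated
original = {
    "dev": [3880119, 1038845, 2069835, 1960624, 2938642],
    "test": [2022145, 6465663, 2063013, 1139244, 1996513],
    "train": [[2302335, 0.5], [6028813, 1.0], [2560106, 1.0],
              [2208050, 0.5], [3073548, 0.6666666666666666]],
}

# The split.json suggested above
fixed = {"dev": [1], "test": [2, 3], "train": [[4, 0.5], [5, 1.0]]}

original_counts = rows_per_split(jsonl_ids, original)
fixed_counts = rows_per_split(jsonl_ids, fixed)

print(original_counts)
print(fixed_counts)
```

Under that assumption, a count of zero for any split means the len(dataset[split]) > 0 check would fail for that split.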

@bhavitvyamalik
Contributor Author

Thanks for the help, @albertvillanova! All tests are passing now.

Member

@lhoestq lhoestq left a comment

Nice, thank you! Before we merge, could you just update the version of the Gooaq builder class?

Member

@lhoestq lhoestq left a comment

Thanks!

@lhoestq lhoestq merged commit e34e5cd into huggingface:master Aug 27, 2021
@lhoestq lhoestq changed the title from "Update GooAQ" to "Update: GooAQ - add train/val/test splits" Aug 27, 2021

3 participants