@lewtun
Member

@lewtun lewtun commented Jul 9, 2021

This PR starts building up the SUPERB benchmark by including the ASR task as described in the SUPERB paper and s3prl instructions.

Usage:

from datasets import load_dataset 

asr = load_dataset("superb", "asr")
# DatasetDict({
#     train: Dataset({
#         features: ['file', 'text', 'speaker_id', 'chapter_id', 'id'],
#         num_rows: 28539
#     })
#     validation: Dataset({
#         features: ['file', 'text', 'speaker_id', 'chapter_id', 'id'],
#         num_rows: 2703
#     })
#     test: Dataset({
#         features: ['file', 'text', 'speaker_id', 'chapter_id', 'id'],
#         num_rows: 2620
#     })
# })

I've used the GLUE benchmark as a guide for filling out the README.

To move fast during the evaluation PoC I propose to merge one task at a time, so we can continue building the training / evaluation framework in parallel.

Note: codewise this PR is ready for review - I'll add the missing YAML tags once #2620 is merged :)

@lewtun lewtun changed the title from "Add ASR task" to "Add ASR task for SUPERB" Jul 9, 2021
@lewtun
Member Author

lewtun commented Jul 12, 2021

Wait until #2620 is merged before pushing the README tags in this PR

@lewtun lewtun marked this pull request as ready for review July 12, 2021 15:33
Member

@albertvillanova albertvillanova left a comment

#2620 is already merged into master. You can merge master into this branch.

Member

@albertvillanova albertvillanova left a comment

Thanks!

One question: aren't you adding task_templates to the _info method (and to the dataset_infos.json)?

@lewtun
Member Author

lewtun commented Jul 13, 2021

> Thanks!
>
> One question: aren't you adding task_templates to the _info method (and to the dataset_infos.json)?

great catch! i've now added the asr task template (along with a mapping from superb task -> template) and updated the dataset_infos.json :)
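
For readers following along, the "superb task -> template" mapping mentioned above could look roughly like the sketch below. This is only an illustration under stated assumptions: `SpeechRecognitionTemplate`, `TASK_TEMPLATES` and `templates_for` are hypothetical names made up here, not the classes or helpers used in the PR or in `datasets`. The column names `file` and `text` are taken from the features shown in the usage output.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SpeechRecognitionTemplate:
    """Hypothetical stand-in for an ASR task template (not the `datasets` class)."""

    audio_file_path_column: str
    transcription_column: str
    task: str = "automatic-speech-recognition"


# Hypothetical mapping from SUPERB config name to its task templates.
# Only "asr" is filled in, since it is the only task this PR adds.
TASK_TEMPLATES: Dict[str, List[SpeechRecognitionTemplate]] = {
    "asr": [
        SpeechRecognitionTemplate(
            audio_file_path_column="file",
            transcription_column="text",
        )
    ],
}


def templates_for(config_name: str) -> List[SpeechRecognitionTemplate]:
    """Return the templates registered for a config, or an empty list if none."""
    return TASK_TEMPLATES.get(config_name, [])


if __name__ == "__main__":
    print(templates_for("asr"))
```

Keeping the mapping in a single dict would mean each new SUPERB task only needs one extra entry, which fits the plan of merging one task at a time.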

Member

@albertvillanova albertvillanova left a comment

Good!

I have a suggested refactoring... Tell me what you think! :)

Co-authored-by: Albert Villanova del Moral <[email protected]>
@lewtun
Member Author

lewtun commented Jul 13, 2021

> Good!
>
> I have a suggested refactoring... Tell me what you think! :)

your approach is much more elegant - i've included your suggestions 🙏

Member

@albertvillanova albertvillanova left a comment

Thank you.

@albertvillanova albertvillanova added this to the 1.10 milestone Jul 13, 2021
@albertvillanova albertvillanova merged commit ddb5d80 into huggingface:master Jul 13, 2021
@lewtun lewtun deleted the add-superb branch July 13, 2021 13:01
This was referenced Jul 15, 2021
@anton-l anton-l mentioned this pull request Aug 10, 2021