
Conversation


@anton-l anton-l commented Aug 10, 2021

Add the KS (keyword spotting) task as described in the SUPERB paper.

Some notable quirks:

  • The dataset originally ships as a single archive (train+val+test all in one), but the test set also has a "canonical" distribution in a separate archive, which is used here as well (see `_split_ks_files()`).
  • The `_background_noise_`/`_silence_` audio files are much longer than the others, so they require some sort of slicing for downstream training. I decided to leave that implementation up to the users, since TFDS and s3prl take different approaches (either slicing the wavs deterministically, or subsampling them randomly at runtime).
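
For illustration, the deterministic variant mentioned above could look roughly like this sketch (`slice_wav` is a hypothetical helper, not part of this PR; it assumes the audio is already loaded as a flat sample array):

```python
def slice_wav(samples, sample_rate, clip_seconds=1):
    """Split a long waveform into fixed-length, non-overlapping clips.

    Trailing samples that do not fill a whole clip are dropped, so every
    returned clip has exactly clip_seconds * sample_rate samples.
    """
    clip_len = clip_seconds * sample_rate
    return [
        samples[start : start + clip_len]
        for start in range(0, len(samples) - clip_len + 1, clip_len)
    ]
```

Slicing once at preprocessing time gives reproducible epochs, at the cost of never seeing noise offsets that fall between the fixed clip boundaries.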

Related to #2619.

@patrickvonplaten patrickvonplaten self-requested a review August 11, 2021 14:38

lewtun commented Aug 11, 2021

thanks a lot for implementing this @anton-l !!

i won't have time to review this while i'm away, so happy for @albertvillanova and @patrickvonplaten to decide when to merge :)


@patrickvonplaten patrickvonplaten left a comment


Looks very clean to me!


@albertvillanova albertvillanova left a comment


Hi @anton-l, thanks a lot for the addition of this SUPERB task. You did an awesome job! ^^

Just some comments and suggested changes before we merge it into master.


anton-l commented Aug 11, 2021

@albertvillanova thanks! Everything should be ready now :)


@albertvillanova albertvillanova left a comment


Just one minor missing "id" and that's all! :)


@albertvillanova albertvillanova left a comment


Thank you!!

@albertvillanova albertvillanova merged commit 522e608 into huggingface:master Aug 11, 2021
@albertvillanova

> The background_noise/silence audio files are much longer than others, so they require some sort of slicing for downstream training. I decided to leave the implementation of that up to the users, since TFDS and s3prl take different approaches (either slicing wavs deterministically, or subsampling randomly at runtime)

@anton-l I was thinking that maybe we could give some hints in the dataset card (in a Usage section); something similar to what we have for diarization: https://github.com/huggingface/datasets/blob/master/datasets/superb/README.md#example-of-usage
Note that for diarization it is not yet finished: we still have to test it and then provide an end-to-end example: https://github.com/huggingface/datasets/pull/2661/files#r680224909


anton-l commented Aug 12, 2021

@albertvillanova yeah, I'm not sure how best to implement it in pure `datasets` yet. It's something like this, where `sample_noise()` needs to be called from a PyTorch batch collator or another framework-specific equivalent:

```python
import soundfile as sf
from random import randint


def map_to_array(example):
    # Decode the wav file into a float array alongside its sample rate.
    speech_array, sample_rate = sf.read(example["file"])
    example["speech"] = speech_array
    example["sample_rate"] = sample_rate
    return example


def sample_noise(example):
    # Call a version of this function in a stateless way (e.g. from a batch
    # collator) so that each epoch sees a different random 1-second slice
    # of the background noise. The _silence_ audios are longer than 1 sec.
    if example["label"] == "_silence_":
        sample_rate = example["sample_rate"]
        random_offset = randint(0, len(example["speech"]) - sample_rate)
        example["speech"] = example["speech"][random_offset : random_offset + sample_rate]

    return example
```
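
A minimal sketch of the collator side (framework-agnostic here; in practice this would be a PyTorch `DataLoader` `collate_fn` that also pads and stacks tensors — `collate_fn` below is a hypothetical illustration, not part of this PR):

```python
from random import randint


def collate_fn(batch, sample_rate=16000):
    """Re-slice _silence_ clips to 1 second at batch time, so every epoch
    sees a fresh random slice, then gather fields into lists."""
    out = {"speech": [], "labels": []}
    for example in batch:
        speech = example["speech"]
        if example["label"] == "_silence_":
            # Pick a random 1-second window from the long noise clip.
            offset = randint(0, len(speech) - sample_rate)
            speech = speech[offset : offset + sample_rate]
        out["speech"].append(speech)
        out["labels"].append(example["label"])
    return out
```

Because the slicing happens inside the collator rather than in a cached `.map()` call, the randomness is re-drawn on every pass over the data.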

@albertvillanova

I see... Yes, not trivial indeed. Maybe for the moment you could add those functions above to the README (as is currently the case for diarization)? What do you think?
