
Conversation


@anton-l anton-l commented Aug 10, 2021

Add the KS (keyword spotting) task as described in the SUPERB paper.

Some notable quirks:

  • The dataset originally ships as a single archive (train+val+test all in one), but the test set also has a "canonical" distribution in a separate archive, which is used here as well (see `_split_ks_files()`).
  • The `_background_noise_`/`_silence_` audio files are much longer than the others, so they require some sort of slicing for downstream training. I decided to leave that implementation up to the users, since TFDS and s3prl take different approaches (either slicing the wavs deterministically, or subsampling them randomly at runtime).
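
For illustration, the deterministic variant mentioned above could look roughly like this sketch (`slice_wav` is a hypothetical helper, not part of this PR; it assumes the audio is already loaded as a flat sample array):

```python
def slice_wav(samples, sample_rate, clip_seconds=1):
    """Split a long waveform into fixed-length, non-overlapping clips.

    Trailing samples that do not fill a whole clip are dropped, so every
    returned clip has exactly clip_seconds * sample_rate samples.
    """
    clip_len = clip_seconds * sample_rate
    return [
        samples[start : start + clip_len]
        for start in range(0, len(samples) - clip_len + 1, clip_len)
    ]
```

Slicing once at preprocessing time gives reproducible epochs, at the cost of never seeing noise offsets that fall between the fixed clip boundaries.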

Related to #2619.

@patrickvonplaten patrickvonplaten self-requested a review August 11, 2021 14:38

lewtun commented Aug 11, 2021

thanks a lot for implementing this @anton-l !!

i won't have time to review this while i'm away, so happy for @albertvillanova and @patrickvonplaten to decide when to merge :)


@patrickvonplaten patrickvonplaten left a comment


Looks very clean to me!


@albertvillanova albertvillanova left a comment


Hi @anton-l, thanks a lot for the addition of this SUPERB task. You did an awesome job! ^^

Just some comments and suggested changes before we merge it into master.


anton-l commented Aug 11, 2021

@albertvillanova thanks! Everything should be ready now :)


@albertvillanova albertvillanova left a comment


Just one minor missing "id" and that's all! :)


@albertvillanova albertvillanova left a comment


Thank you!!

@albertvillanova albertvillanova merged commit 522e608 into huggingface:master Aug 11, 2021
@albertvillanova

> The background_noise/silence audio files are much longer than others, so they require some sort of slicing for downstream training. I decided to leave the implementation of that up to the users, since TFDS and s3prl take different approaches (either slicing wavs deterministically, or subsampling randomly at runtime)

@anton-l I was thinking that maybe we could give some hints in the dataset card (in a Usage section); something similar to what we have for diarization: https://github.com/huggingface/datasets/blob/master/datasets/superb/README.md#example-of-usage
Note that for diarization it is not yet finished: we still have to test it and then provide an end-to-end example: https://github.com/huggingface/datasets/pull/2661/files#r680224909


anton-l commented Aug 12, 2021

@albertvillanova yeah, I'm not sure how best to implement it in pure `datasets` yet. It's something like this, where `sample_noise()` needs to be called from a PyTorch batch collator or another framework-specific equivalent:

```python
import soundfile as sf
from random import randint


def map_to_array(example):
    # Decode the wav file into a float array alongside its sample rate.
    speech_array, sample_rate = sf.read(example["file"])
    example["speech"] = speech_array
    example["sample_rate"] = sample_rate
    return example


def sample_noise(example):
    # Call a version of this function in a stateless way (e.g. from a batch
    # collator) so that each epoch sees a different random 1-second slice
    # of the background noise. The _silence_ audios are longer than 1 sec.
    if example["label"] == "_silence_":
        sample_rate = example["sample_rate"]
        random_offset = randint(0, len(example["speech"]) - sample_rate)
        example["speech"] = example["speech"][random_offset : random_offset + sample_rate]

    return example
```
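
A minimal sketch of the collator side (framework-agnostic here; in practice this would be a PyTorch `DataLoader` `collate_fn` that also pads and stacks tensors — `collate_fn` below is a hypothetical illustration, not part of this PR):

```python
from random import randint


def collate_fn(batch, sample_rate=16000):
    """Re-slice _silence_ clips to 1 second at batch time, so every epoch
    sees a fresh random slice, then gather fields into lists."""
    out = {"speech": [], "labels": []}
    for example in batch:
        speech = example["speech"]
        if example["label"] == "_silence_":
            # Pick a random 1-second window from the long noise clip.
            offset = randint(0, len(speech) - sample_rate)
            speech = speech[offset : offset + sample_rate]
        out["speech"].append(speech)
        out["labels"].append(example["label"])
    return out
```

Because the slicing happens inside the collator rather than in a cached `.map()` call, the randomness is re-drawn on every pass over the data.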

@albertvillanova

I see... Yes, not trivial indeed. Maybe for the moment you could add those functions above to the README (as is currently the case for diarization)? What do you think?
