Conversation

@albertvillanova (Member) commented Jul 16, 2021

Include the SD (Speaker Diarization) task as described in the SUPERB paper and s3prl instructions.

TODO:

  • Generate the LibriMix corpus
  • Prepare the corpus for diarization
  • Upload these files to the superb-data repo
  • Port the corresponding s3prl processing of these files to our superb loading script
  • README: tags + description sections
  • Add DER metric (left for a follow-up PR)

Related to #2619.

Closes #2653.

cc: @lewtun

@albertvillanova marked this pull request as ready for review July 28, 2021 13:38
@lewtun (Member) left a comment

LGTM! thanks a lot for including this complex dataset 😍

    We focus on the two-speaker scenario as the first step. The time-coded speaker labels were generated using
    alignments from a Kaldi LibriSpeech ASR model. The evaluation metric is diarization error rate (DER)."""
        ),
        features=datasets.Features(
@lewtun (Member) commented:

do you happen to know if this schema allows one to easily compute the diarization error rate or is some additional preprocessing required?

if i understand correctly, for each frame, the fine-tuned models will predict logits per speaker so i'm wondering how we can connect this to the current schema?

the reason i'm asking is that ultimately i'd like to do something like the following during evaluation:

evaluation_dset = load_dataset("superb", "sd", split="test")
# dataset of predictions
submission_dset = load_dataset("json", data_files=["output-from-bulk-job.jsonl"])

metric = load_metric("der")
metric.compute(predictions=submission_dset["preds"], references=evaluation_dset[???])

but maybe we can deal with this once we start looking at the metric question :)
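For reference, the conventional definition of DER (the standard one, not something decided in this thread) is

$$\mathrm{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed speech}} + T_{\text{speaker confusion}}}{T_{\text{total reference speech}}}$$

so frame-level reference labels of shape (num_frames, num_speakers), as discussed below, carry enough information to compute it.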

@albertvillanova (Member, Author) replied:

@lewtun, as we agreed in our Slack discussion with @Narsil and in our Google meeting, an additional step is required to transform the current schema into the 2D array labels.

Below, I'm writing a summary of our discussion and the points we agreed on.

@albertvillanova (Member, Author) commented Jul 29, 2021

Here is a summary of our discussion with @lewtun and @Narsil on the agreed schema for this dataset and the additional steps required to generate the 2D array labels:

  • The labels for this dataset are a 2D array:
    Given an example:
    {"record_id": record_id, "file": file, "start": start, "end": end, "speakers": [...]}
    the labels are a 2D array of shape (num_frames, num_speakers) where num_frames = end - start and num_speakers = 2.
  • In order to avoid a too-large dataset (too much disk space), datasets does not store the 2D label array. Instead, we store a compact form:
    "speakers": [
      {"speaker_id": speaker_0_id, "start": start_0_speaker_0, "end": end_0_speaker_0},
      {"speaker_id": speaker_0_id, "start": start_1_speaker_0, "end": end_1_speaker_0},
      {"speaker_id": speaker_1_id, "start": start_0_speaker_1, "end": end_0_speaker_1},
    ],
    
    • Once the dataset is loaded, an additional step is required to generate the 2D array label from this compact form (see the sketch after this list)
    • This additional step should be a modified version of the s3prl method _get_labeled_speech:
      • The original s3prl _get_labeled_speech includes 2 functionalities: reading the audio file and transforming it into an array, and generating the 2D label array; I think we should separate these 2 functionalities
      • The original s3prl _get_labeled_speech performs 2 steps to generate the labels:
        • Transform start/end seconds (float) into frame numbers (int): I have already done this step to generate the dataset
        • Generate the 2D label array from the frame numbers
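A minimal sketch of that generation step (the function name `generate_label`, the `"label"` column, and the assumption that segment start/end use the same frame coordinates as the example's start/end are mine, not s3prl's):

```python
import numpy as np

def generate_label(example, num_speakers=2):
    # Hypothetical sketch: build the (num_frames, num_speakers) label array
    # from the compact "speakers" form.
    num_frames = example["end"] - example["start"]
    label = np.zeros((num_frames, num_speakers), dtype=np.int64)
    # Give each speaker_id a fixed column in the label array
    speaker_ids = sorted({s["speaker_id"] for s in example["speakers"]})
    for segment in example["speakers"]:
        col = speaker_ids.index(segment["speaker_id"])
        # Clip each segment to the example's [start, end) window
        onset = max(segment["start"] - example["start"], 0)
        offset = min(segment["end"] - example["start"], num_frames)
        if onset < offset:
            label[onset:offset, col] = 1
    example["label"] = label
    return example
```

Once the dataset is loaded, this could be applied with dset.map(generate_label).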

I also ping @osanseviero and @lhoestq to include them in the loop.

@albertvillanova (Member, Author) commented Jul 29, 2021

Here I would like to discuss (and agree on) one of the decisions I made, as I'm not completely satisfied with it: transforming the seconds (float) into frame numbers (int) when generating this dataset.

  • A priori, the most natural and general choice would be to preserve the seconds (float), because:
    • this is the form the raw data comes in
    • the transformation into frame numbers depends on the sample rate, frame_shift and subsampling

However, I finally decided to transform seconds into frame numbers (see the conversion sketch after this list) because:

  • for SUPERB, sampling rate, frame_shift and subsampling are fixed (rate = 16_000, frame_shift = 160, subsampling = 1)
  • it makes the post-processing easier, as labels are generated from frame numbers: labels are a 2D array of shape (num_frames, num_speakers)
  • the number of examples depends on the number of frames:
    • if an example has more than 2_000 frames, it is split into 2 examples. This is the case for record_id = "7859-102521-0017_3983-5371-0014", which has 2_452 frames and is split into 2 examples:
      {"record_id": "7859-102521-0017_3983-5371-0014", "start": 0, "end": 2_000, ...},
      {"record_id": "7859-102521-0017_3983-5371-0014", "start": 2_000, "end": 2_452, ...},
      

As I said, I'm not totally convinced by this decision, and I would really appreciate your opinion.

cc: @lewtun @Narsil @osanseviero @lhoestq

@lhoestq (Member) commented Jul 29, 2021

It makes total sense to prepare the data in a format that can actually be used for model training and evaluation. That's one of the roles of this lib :)

So for me it's ok to use frames as the unit instead of seconds. Just pinging @patrickvonplaten in case he has ever played with such audio tasks and has some advice. For context: the task is to classify which speaker is speaking; let us know if you are aware of any convenient/standard format for this.

Also, I'm not sure why you have to split an example if it's longer than 2,000 frames?

@albertvillanova (Member, Author) commented:

> Also, I'm not sure why you have to split an example if it's longer than 2,000 frames?

It is a convention in the SUPERB benchmark.

@albertvillanova (Member, Author) commented:

Note that if we agree to leave the dataset as it is now, 2 additional custom functions must be used:

  • one to generate the 2D array labels
  • one to load the audio file into an array, taking start/end into account to cut the audio

Is there a way we can provide these functions ready to use? Or should we leave this entirely to the end user? This is not trivial...
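For illustration, a minimal sketch of the second function using soundfile (the name `load_audio_file` and the `"speech"` column are hypothetical; it assumes start/end are frame numbers that scale back to sample offsets via frame_shift = 160):

```python
import soundfile as sf

FRAME_SHIFT = 160  # samples per frame, fixed for SUPERB

def load_audio_file(example):
    # Read only the [start, end) slice of the audio file,
    # converting frame numbers back to sample offsets first
    array, sampling_rate = sf.read(
        example["file"],
        start=example["start"] * FRAME_SHIFT,
        stop=example["end"] * FRAME_SHIFT,
    )
    example["speech"] = array
    return example
```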

@lhoestq (Member) commented Jul 29, 2021

You could add an example of usage in the dataset card, as is done for other audio datasets.

@Narsil (Contributor) commented Jul 29, 2021

@albertvillanova this simple function can easily be edited to add the start/stop cuts:

https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/automatic_speech_recognition.py#L29
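Such an edit could look roughly like this (a sketch only, not the actual transformers helper; the start/stop parameters and the use of ffmpeg's -ss/-to output options are my assumptions):

```python
import subprocess
import numpy as np

def ffmpeg_read(bpayload: bytes, sampling_rate: int,
                start: float = 0.0, stop: float = None) -> np.ndarray:
    # Decode audio bytes to a mono float32 array, cutting at start/stop (seconds)
    ffmpeg_command = ["ffmpeg", "-i", "pipe:0", "-ac", "1", "-f", "f32le",
                      "-ar", str(sampling_rate), "-hide_banner", "-loglevel", "quiet"]
    if start:
        ffmpeg_command += ["-ss", str(start)]  # assumed: seek to start (output option)
    if stop is not None:
        ffmpeg_command += ["-to", str(stop)]   # assumed: stop decoding at this time
    ffmpeg_command += ["pipe:1"]
    process = subprocess.Popen(ffmpeg_command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    output_stream = process.communicate(bpayload)
    return np.frombuffer(output_stream[0], np.float32)
```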

@lhoestq (Member) commented Jul 29, 2021

Does this function work on Windows?

@Narsil (Contributor) commented Jul 29, 2021

Windows? What is it? (Not sure, I'm not able to test. It calls the ffmpeg binary directly, so depending on the setup it could work, but I can't say for sure without testing.)

@lhoestq (Member) commented Jul 29, 2021

It's one of the OSes we're supposed to support :P (for better and for worse)

@lewtun (Member) commented Jul 30, 2021

> Note that if we agree to leave the dataset as it is now, 2 additional custom functions must be used:
>
> • one to generate the 2D array labels
> • one to load the audio file into an array, taking start/end into account to cut the audio
>
> Is there a way we can provide these functions ready to use? Or should we leave this entirely to the end user? This is not trivial...

+1 on providing the necessary functions on the dataset card. aside from that, the current implementation looks great from my perspective!


> ##### Example of usage
>
> Use these auxiliary functions to:
@lewtun (Member) commented Jul 30, 2021:

thanks for adding this! what do you think about showing an end-to-end example like the following?

dset = load_dataset("superb", "sd", split="train")
# not sure about this step ...
dset = dset.map(load_audio_file)
# same here ...
dset = dset.map(generate_label)

@albertvillanova (Member, Author) replied:

As discussed, for now we will leave this as it is. We will eventually add an end-to-end example in a future pull request, once it has been tested/validated with the Inference API + evaluation.

@lewtun (Member) replied:

sounds good to me! merge away when you're ready :)
