Conversation

@albertvillanova (Member) commented Jul 16, 2021

Include the SD (Speaker Diarization) task as described in the SUPERB paper and s3prl instructions.

TODO:

  • Generate the LibriMix corpus
  • Prepare the corpus for diarization
  • Upload these files to the superb-data repo
  • Port the corresponding s3prl processing of these files to our superb loading script
  • README: tags + description sections
  • Add DER metric (left for a follow-up PR)

Related to #2619.

Closes #2653.

cc: @lewtun

@albertvillanova marked this pull request as ready for review July 28, 2021 13:38
@lewtun (Member) left a comment

LGTM! thanks a lot for including this complex dataset 😍

    We focus on the two-speaker scenario as the first step. The time-coded speaker labels were generated using
    alignments from a Kaldi LibriSpeech ASR model. The evaluation metric is diarization error rate (DER)."""
        ),
        features=datasets.Features(
@lewtun (Member) commented:

do you happen to know if this schema allows one to easily compute the diarization error rate or is some additional preprocessing required?

if i understand correctly, for each frame, the fine-tuned models will predict logits per speaker so i'm wondering how we can connect this to the current schema?

the reason i'm asking is that ultimately i'd like to do something like the following during evaluation:

evaluation_dset = load_dataset("superb", "sd", split="test")
# dataset of predictions
submission_dset = load_dataset("json", data_files=["output-from-bulk-job.jsonl"])

metric = load_metric("der")
metric.compute(predictions=submission_dset["preds"], references=evaluation_dset[???])

but maybe we can deal with this once we start looking at the metric question :)
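For reference, the conventional definition of DER (the standard one, not something decided in this thread) is

$$\mathrm{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed speech}} + T_{\text{speaker confusion}}}{T_{\text{total reference speech}}}$$

so frame-level reference labels of shape (num_frames, num_speakers), as discussed below, carry enough information to compute it.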

@albertvillanova (Member, Author) replied:

@lewtun, as we agreed in our Slack discussion with @Narsil and in our Google meeting, an additional step is required to transform the current schema into the 2D array labels.

Below, I'm writing a summary of our discussion and the points we agreed on.

@albertvillanova (Member, Author) commented Jul 29, 2021

Here is a summary of our discussion with @lewtun and @Narsil on the agreed schema for this dataset and the additional steps required to generate the 2D array labels:

  • The labels for this dataset are a 2D array:
    Given an example:
    {"record_id": record_id, "file": file, "start": start, "end": end, "speakers": [...]}
    the labels are a 2D array of shape (num_frames, num_speakers) where num_frames = end - start and num_speakers = 2.
  • In order to avoid a too-large dataset (too much disk space), datasets does not store the 2D label array. Instead, we store a compact form:
    "speakers": [
      {"speaker_id": speaker_0_id, "start": start_0_speaker_0, "end": end_0_speaker_0},
      {"speaker_id": speaker_0_id, "start": start_1_speaker_0, "end": end_1_speaker_0},
      {"speaker_id": speaker_1_id, "start": start_0_speaker_1, "end": end_0_speaker_1},
    ],
    
    • Once the dataset is loaded, an additional step is required to generate the 2D array label from this compact form (see the sketch after this list)
    • This additional step should be a modified version of the s3prl method _get_labeled_speech:
      • The original s3prl _get_labeled_speech includes 2 functionalities: reading the audio file and transforming it into an array, and generating the 2D label array; I think we should separate these 2 functionalities
      • The original s3prl _get_labeled_speech performs 2 steps to generate the labels:
        • Transform start/end seconds (float) into frame numbers (int): I have already done this step to generate the dataset
        • Generate the 2D label array from the frame numbers
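A minimal sketch of that generation step (the function name `generate_label`, the `"label"` column, and the assumption that segment start/end use the same frame coordinates as the example's start/end are mine, not s3prl's):

```python
import numpy as np

def generate_label(example, num_speakers=2):
    # Hypothetical sketch: build the (num_frames, num_speakers) label array
    # from the compact "speakers" form.
    num_frames = example["end"] - example["start"]
    label = np.zeros((num_frames, num_speakers), dtype=np.int64)
    # Give each speaker_id a fixed column in the label array
    speaker_ids = sorted({s["speaker_id"] for s in example["speakers"]})
    for segment in example["speakers"]:
        col = speaker_ids.index(segment["speaker_id"])
        # Clip each segment to the example's [start, end) window
        onset = max(segment["start"] - example["start"], 0)
        offset = min(segment["end"] - example["start"], num_frames)
        if onset < offset:
            label[onset:offset, col] = 1
    example["label"] = label
    return example
```

Once the dataset is loaded, this could be applied with dset.map(generate_label).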

I also ping @osanseviero and @lhoestq to include them in the loop.

@albertvillanova (Member, Author) commented Jul 29, 2021

Here I would like to discuss (and agree on) one of the decisions I made, as I'm not completely satisfied with it: transforming the seconds (float) into frame numbers (int) when generating this dataset.

  • A priori, the most natural and general choice would be to preserve the seconds (float), because:
    • this is the form the raw data comes in
    • the transformation into frame numbers depends on the sample rate, frame_shift and subsampling

However, I finally decided to transform seconds into frame numbers (see the conversion sketch after this list) because:

  • for SUPERB, sampling rate, frame_shift and subsampling are fixed (rate = 16_000, frame_shift = 160, subsampling = 1)
  • it makes the post-processing easier, as labels are generated from frame numbers: labels are a 2D array of shape (num_frames, num_speakers)
  • the number of examples depends on the number of frames:
    • if an example has more than 2_000 frames, it is split into 2 examples. This is the case for record_id = "7859-102521-0017_3983-5371-0014", which has 2_452 frames and is split into 2 examples:
      {"record_id": "7859-102521-0017_3983-5371-0014", "start": 0, "end": 2_000, ...},
      {"record_id": "7859-102521-0017_3983-5371-0014", "start": 2_000, "end": 2_452, ...},
      

As I said, I'm not totally convinced by this decision, and I would really appreciate your opinion.

cc: @lewtun @Narsil @osanseviero @lhoestq

@lhoestq (Member) commented Jul 29, 2021

It makes total sense to prepare the data in a format that can actually be used for model training and evaluation. That's one of the roles of this lib :)

So for me it's ok to use frames as the unit instead of seconds. Just pinging @patrickvonplaten in case he has ever played with such audio tasks and has some advice. For context: the task is to classify which speaker is speaking; let us know if you are aware of any convenient/standard format for this.

Also, I'm not sure why you have to split an example if it's longer than 2,000 frames?

@albertvillanova (Member, Author) commented:

> Also, I'm not sure why you have to split an example if it's longer than 2,000 frames?

It is a convention in the SUPERB benchmark.

@albertvillanova (Member, Author) commented:

Note that if we agree to leave the dataset as it is now, 2 additional custom functions must be used:

  • one to generate the 2D array labels
  • one to load the audio file into an array, taking start/end into account to cut the audio

Is there a way we can provide these functions ready to use? Or should we leave this entirely to the end user? This is not trivial...
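For illustration, a minimal sketch of the second function using soundfile (the name `load_audio_file` and the `"speech"` column are hypothetical; it assumes start/end are frame numbers that scale back to sample offsets via frame_shift = 160):

```python
import soundfile as sf

FRAME_SHIFT = 160  # samples per frame, fixed for SUPERB

def load_audio_file(example):
    # Read only the [start, end) slice of the audio file,
    # converting frame numbers back to sample offsets first
    array, sampling_rate = sf.read(
        example["file"],
        start=example["start"] * FRAME_SHIFT,
        stop=example["end"] * FRAME_SHIFT,
    )
    example["speech"] = array
    return example
```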

@lhoestq (Member) commented Jul 29, 2021

You could add an example of usage in the dataset card, as is done for other audio datasets.

@Narsil (Contributor) commented Jul 29, 2021

@albertvillanova this simple function can easily be edited to add the start/stop cuts:

https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/automatic_speech_recognition.py#L29
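Such an edit could look roughly like this (a sketch only, not the actual transformers helper; the start/stop parameters and the use of ffmpeg's -ss/-to output options are my assumptions):

```python
import subprocess
import numpy as np

def ffmpeg_read(bpayload: bytes, sampling_rate: int,
                start: float = 0.0, stop: float = None) -> np.ndarray:
    # Decode audio bytes to a mono float32 array, cutting at start/stop (seconds)
    ffmpeg_command = ["ffmpeg", "-i", "pipe:0", "-ac", "1", "-f", "f32le",
                      "-ar", str(sampling_rate), "-hide_banner", "-loglevel", "quiet"]
    if start:
        ffmpeg_command += ["-ss", str(start)]  # assumed: seek to start (output option)
    if stop is not None:
        ffmpeg_command += ["-to", str(stop)]   # assumed: stop decoding at this time
    ffmpeg_command += ["pipe:1"]
    process = subprocess.Popen(ffmpeg_command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    output_stream = process.communicate(bpayload)
    return np.frombuffer(output_stream[0], np.float32)
```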

@lhoestq (Member) commented Jul 29, 2021

Does this function work on Windows?

@Narsil (Contributor) commented Jul 29, 2021

Windows? What is it? (Not sure, I'm not able to test. It calls the ffmpeg binary directly, so depending on the setup it could work, but I can't say for sure without testing.)

@lhoestq (Member) commented Jul 29, 2021

It's one of the OSes we're supposed to support :P (for better and for worse)

@lewtun (Member) commented Jul 30, 2021

> Note that if we agree to leave the dataset as it is now, 2 additional custom functions must be used:
>
> • one to generate the 2D array labels
> • one to load the audio file into an array, taking start/end into account to cut the audio
>
> Is there a way we can provide these functions ready to use? Or should we leave this entirely to the end user? This is not trivial...

+1 on providing the necessary functions on the dataset card. aside from that, the current implementation looks great from my perspective!


> ##### Example of usage
>
> Use these auxiliary functions to:
@lewtun (Member) commented Jul 30, 2021:

thanks for adding this! what do you think about showing an end-to-end example like the following?

dset = load_dataset("superb", "sd", split="train")
# not sure about this step ...
dset = dset.map(load_audio_file)
# same here ...
dset = dset.map(generate_label)

@albertvillanova (Member, Author) replied:

As discussed, for now we will leave this as it is. We will eventually add an end-to-end example in a future pull request, once it has been tested/validated with the Inference API + evaluation.

@lewtun (Member) replied:

sounds good to me! merge away when you're ready :)
