Add SD task for SUPERB #2661
Conversation
lewtun left a comment:
LGTM! Thanks a lot for including this complex dataset 😍
We focus on the two-speaker scenario as the first step. The time-coded speaker labels were generated using
alignments from Kaldi LibriSpeech ASR model. The evaluation metric is diarization error rate (DER)."""
),
features=datasets.Features(
Do you happen to know if this schema allows one to easily compute the diarization error rate, or is some additional preprocessing required?
If I understand correctly, for each frame the fine-tuned models will predict logits per speaker, so I'm wondering how we can connect this to the current schema?
The reason I'm asking is that ultimately I'd like to do something like the following during evaluation:
from datasets import load_dataset, load_metric

# reference dataset
evaluation_dset = load_dataset("superb", "sd", split="test")
# dataset of predictions
submission_dset = load_dataset("json", data_files=["output-from-bulk-job.jsonl"])
metric = load_metric("der")
metric.compute(predictions=submission_dset["preds"], references=evaluation_dset[???])
But maybe we can deal with this once we start looking at the metric question :)
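For concreteness, one hypothetical bridge (no DER metric exists yet, and the function names below are made up) would be to binarize the per-frame logits into the same (frames × speakers) 0/1 shape as the reference labels and compare cell by cell. This is only a toy frame-level error, not a real DER, which is computed over speech time with an optimal speaker mapping:

```python
import numpy as np

def logits_to_labels(logits, threshold=0.0):
    # Binarize per-frame, per-speaker logits into the 0/1 label schema (hypothetical).
    return (np.asarray(logits) > threshold).astype(np.int8)

def frame_error_rate(preds, refs):
    # Toy stand-in for DER: fraction of (frame, speaker) cells that disagree.
    preds, refs = np.asarray(preds), np.asarray(refs)
    return float((preds != refs).mean())
```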
Here is a summary of our discussion with @lewtun and @Narsil on the agreed schema for this dataset and the additional steps required to generate the 2D array labels.
I'm also pinging @osanseviero and @lhoestq to keep them in the loop.
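For illustration, here is a minimal sketch of how such a 2D (frames × speakers) label array could be built from frame-coded per-speaker segments. The helper itself and the exact field names (`speakers`, `speaker_id`, `start`, `end`) are assumptions for the example, not the final implementation:

```python
import numpy as np

def generate_label(record, rec_start, rec_end):
    # Hypothetical helper: assumes record["speakers"] is a list of dicts with
    # "speaker_id", "start" and "end", where start/end are frame numbers.
    speaker_ids = sorted({s["speaker_id"] for s in record["speakers"]})
    num_frames = rec_end - rec_start
    label = np.zeros((num_frames, len(speaker_ids)), dtype=np.int8)
    for segment in record["speakers"]:
        col = speaker_ids.index(segment["speaker_id"])
        # Clip the segment to the record window and mark the active frames.
        start = max(segment["start"], rec_start) - rec_start
        end = min(segment["end"], rec_end) - rec_start
        label[start:end, col] = 1
    return label
```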
Here I would like to discuss (and agree on) one of the decisions I made, as I'm not completely satisfied with it: transforming the seconds (float) into frame numbers (int) to generate this dataset.
However, I finally decided to transform seconds into frame numbers because:
As I told you, I'm not totally convinced of this decision, and I would really appreciate your opinion.
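For concreteness, the conversion itself is just a scaling plus a rounding step; the 16 kHz sample rate and 160-sample frame shift below are illustrative assumptions, not necessarily the values used in the script:

```python
SAMPLE_RATE = 16_000  # samples per second (assumed for illustration)
FRAME_SHIFT = 160     # samples per frame, i.e. 10 ms at 16 kHz (assumed)

def seconds_to_frame(seconds: float) -> int:
    # Rounding to an integer frame index discards sub-frame precision.
    return int(round(seconds * SAMPLE_RATE / FRAME_SHIFT))
```

The rounding step is where sub-frame precision is lost, which is the lossy part of the decision discussed above.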
It makes total sense to prepare the data in a format that can actually be used for model training and evaluation; that's one of the roles of this lib :) So for me it's OK to use frames as a unit instead of seconds. Just pinging @patrickvonplaten in case he has ever played with such audio tasks and has some advice. For context: the task is to classify which speaker is speaking; let us know if you are aware of any convenient/standard format for this. Also, I'm not sure why you have to split an example if it's longer than 2,000 frames?
It is a convention in the SUPERB benchmark.
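For illustration, the split itself can be as simple as the following sketch; only the 2,000-frame limit comes from that convention, while the helper name and signature are made up:

```python
MAX_FRAMES = 2_000  # SUPERB convention for the maximum record length

def split_into_chunks(start: int, end: int, max_frames: int = MAX_FRAMES):
    # Yield (chunk_start, chunk_end) frame windows no longer than max_frames.
    for chunk_start in range(start, end, max_frames):
        yield chunk_start, min(chunk_start + max_frames, end)
```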
Note that if we agree to leave the dataset as it is now, 2 additional custom functions must be used: one to load (and cut) the audio files, and another to generate the 2D array labels.
Is there a way we can provide these functions ready to use? Or should we leave this entirely to the end user? This is not trivial...
You could add an example of usage in the dataset card, as is done for other audio datasets.
@albertvillanova this simple function can easily be edited to add the start/stop cuts.
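A minimal sketch of what that could look like with the cuts added, assuming the ffmpeg binary is available on the PATH; the name `load_audio_file` and the defaults are illustrative:

```python
import subprocess
import numpy as np

def load_audio_file(path, start=None, end=None, sampling_rate=16_000):
    # Decode an audio file to a float32 mono array by calling the ffmpeg binary.
    # start/end are in seconds; requires ffmpeg on the PATH (illustrative sketch).
    cmd = ["ffmpeg"]
    if start is not None:
        cmd += ["-ss", str(start)]       # seek to the start of the cut
    cmd += ["-i", path]
    if start is not None and end is not None:
        cmd += ["-t", str(end - start)]  # keep only the cut's duration
    cmd += ["-ac", "1", "-ar", str(sampling_rate), "-f", "f32le", "-"]
    out = subprocess.run(cmd, capture_output=True, check=True).stdout
    return np.frombuffer(out, dtype=np.float32)
```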
Does this function work on Windows?
Windows? What is it? (Not sure, I'm not able to test; it's directly calling the ffmpeg binary, so depending on the setup it could work, but I can't say for sure without testing.)
It's one of the OSes we're supposed to support :P (for better and for worse)
+1 on providing the necessary functions in the dataset card. Aside from that, the current implementation looks great from my perspective!
##### Example of usage
Use these auxiliary functions to load the audio files and to generate the 2D array labels.
Thanks for adding this! What do you think about showing an end-to-end example like the following?
from datasets import load_dataset

dset = load_dataset("superb", "sd", split="train")
# not sure about this step ...
dset = dset.map(load_audio_file)
# same here ...
dset = dset.map(generate_label)
As discussed, for now we will leave this as it is. We will eventually add an end-to-end example in a future Pull Request, once it is tested/validated with the Inference API + evaluation.
Sounds good to me! Merge away when you're ready :)
Include the SD (Speaker Diarization) task as described in the SUPERB paper and s3prl instructions.

TODO:
- Add DER metric (we leave the DER metric for a follow-up PR)

Related to #2619.
Close #2653.

cc: @lewtun