Merged

Commits (32, all by albertvillanova):
- e6c6872 (Jul 16, 2021): Add condition on config.name in _generate_examples
- f19d795 (Jul 16, 2021): Add sd config
- 88b728f (Jul 16, 2021): Add sd case in _split_generators
- 467e530 (Jul 19, 2021): Merge remote-tracking branch 'upstream/master' into gh-2653
- 2a1bb6d (Jul 19, 2021): Fix and download only instead of download_and_extract
- e0313f3 (Jul 19, 2021): Fix README
- a35fa65 (Jul 21, 2021): Move features and supervised_keys to SuperbConfig
- 1d5e7c7 (Jul 21, 2021): Add wav.zip to archive_path
- af50db7 (Jul 21, 2021): Remove wav.scp from archive_path
- 7bb6e5b (Jul 21, 2021): Add customized KaldiData from s3prl as SdData
- 5aa6577 (Jul 22, 2021): Add Sd args class
- c6ee8f8 (Jul 22, 2021): Use Sd data and args in _generate_examples
- 229c107 (Jul 22, 2021): Add _generate_chunk_indices
- fc6630e (Jul 22, 2021): Use _generate_chunk_indices in _generate_examples
- d819d11 (Jul 26, 2021): Add _get_speakers
- 883c363 (Jul 26, 2021): Use _get_speakers in _generate_examples
- 6367d58 (Jul 26, 2021): Fix style
- 7df49b3 (Jul 26, 2021): Add _gen_chunk_indices
- 0b5e8d5 (Jul 26, 2021): Refactor _generate_chunk_indices for test split
- f9f3c9d (Jul 26, 2021): Refactor _generate_examples for test split
- 819c73d (Jul 26, 2021): Fix style
- 7cd6528 (Jul 26, 2021): Minor refactor _get_speakers
- 2356cf5 (Jul 26, 2021): Add sd Features
- 4a5686c (Jul 26, 2021): Pass encoding to open and use context manager
- 4cd99bc (Jul 28, 2021): Do not download unused files
- f393dbc (Jul 28, 2021): Fix wav_dir
- edffe77 (Jul 28, 2021): Fix speaker start/end as number of frames like record start/end
- 637f3b3 (Jul 28, 2021): Add sd dummy data
- 495a583 (Jul 28, 2021): Add sd JSON metadata
- 52c1bf5 (Jul 28, 2021): Add sd tags and sections in dataset card
- a35eeed (Jul 30, 2021): Remove commented code
- bb7822d (Jul 30, 2021): Add auxiliary functions to load audio and generate labels

80 changes: 71 additions & 9 deletions datasets/superb/README.md
@@ -15,6 +15,7 @@ size_categories:
source_datasets:
- original
- extended|librispeech_asr
- extended|other-librimix
task_categories:
- speech-processing
task_ids:
@@ -106,7 +107,46 @@ Automatic Speaker Verification (ASV) verifies whether the speakers of a pair of

#### sd

Speaker Diarization (SD) predicts *who is speaking when* for each timestamp, and multiple speakers can speak simultaneously. The model has to encode rich speaker characteristics for each frame and should be able to represent mixtures of signals. [LibriMix](https://github.com/s3prl/s3prl/tree/master/downstream#sd-speaker-diarization) is adopted: LibriSpeech train-clean-100/dev-clean/test-clean are used to generate mixtures for training/validation/testing. As a first step, we focus on the two-speaker scenario. The time-coded speaker labels were generated using alignments from a Kaldi LibriSpeech ASR model. The evaluation metric is diarization error rate (DER).
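
DER counts missed speech, falsely detected speech, and speaker confusion against the total duration of reference speech. As an illustration only (not the official scoring tool), here is a minimal frame-level sketch of that computation; it assumes the hypothesis speaker columns have already been permuted to best match the reference:

```python
import numpy as np


def frame_level_der(reference, hypothesis):
    """Approximate DER from two binary (num_frames, num_speakers) activity matrices."""
    n_ref = reference.sum(axis=1)  # active reference speakers per frame
    n_hyp = hypothesis.sum(axis=1)  # active hypothesis speakers per frame
    # Speakers counted as correct in a frame: active in both reference and hypothesis.
    n_correct = np.minimum(reference, hypothesis).sum(axis=1)

    missed = np.maximum(n_ref - n_hyp, 0).sum()  # speech the system failed to detect
    false_alarm = np.maximum(n_hyp - n_ref, 0).sum()  # detections with no reference speech
    confusion = (np.minimum(n_ref, n_hyp) - n_correct).sum()  # speech assigned to the wrong speaker
    return (missed + false_alarm + confusion) / n_ref.sum()
```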

##### Example of usage

Use these auxiliary functions to:

- load the audio file into an audio data array
- generate the label array

> **@lewtun** (Member, Jul 30, 2021): thanks for adding this! what do you think about showing an end-to-end example like the following?
>
> ```python
> dset = load_dataset("superb", "sd", split="train")
> # not sure about this step ...
> dset = dset.map(load_audio_file)
> # same here ...
> dset = dset.map(generate_label)
> ```
>
> **@albertvillanova** (Member, Author): As discussed, for now we will leave this as it is. We will eventually add an end-to-end example in a future Pull Request, once it has been tested/validated with the inference API + evaluation.
>
> **@lewtun** (Member): sounds good to me! merge away when you're ready :)

```python
def load_audio_file(example, frame_shift=160):
    import soundfile as sf

    # Read only the slice of the mixture between the record's start and end,
    # converting frame indices to sample offsets (frame_shift samples per frame).
    example["array"], example["sample_rate"] = sf.read(
        example["file"], start=example["start"] * frame_shift, stop=example["end"] * frame_shift
    )
    return example


def generate_label(example, frame_shift=160, num_speakers=2, rate=16000):
    import numpy as np

    start = example["start"]
    end = example["end"]
    frame_num = end - start
    speakers = sorted({speaker["speaker_id"] for speaker in example["speakers"]})
    # Binary speaker-activity matrix: one row per frame, one column per speaker.
    label = np.zeros((frame_num, num_speakers), dtype=np.int32)
    for speaker in example["speakers"]:
        speaker_index = speakers.index(speaker["speaker_id"])
        # Map the speaker's start/end to frame indices at the given rate and frame shift.
        start_frame = np.rint(speaker["start"] * rate / frame_shift).astype(int)
        end_frame = np.rint(speaker["end"] * rate / frame_shift).astype(int)
        # Mark the overlap when the segment starts and/or ends inside the record
        # window (None leaves that side of the slice open-ended).
        rel_start = rel_end = None
        if start <= start_frame < end:
            rel_start = start_frame - start
        if start < end_frame <= end:
            rel_end = end_frame - start
        if rel_start is not None or rel_end is not None:
            label[rel_start:rel_end, speaker_index] = 1
    example["label"] = label
    return example
```
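
As a minimal usage sketch (untested here; per the review discussion above, a validated end-to-end example is deferred to a future Pull Request), the functions can be applied to a single example, assuming the audio paths resolve locally:

```python
from datasets import load_dataset

dset = load_dataset("superb", "sd", split="test")
example = generate_label(load_audio_file(dset[0]))
print(example["array"].shape, example["label"].shape)
```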

#### er

@@ -130,7 +170,7 @@ The language data in SUPERB is in English (BCP-47 `en`)

An example from each split looks like:

```python
{'chapter_id': 1240,
'file': 'path/to/file.flac',
'id': '103-1240-0000',
@@ -173,7 +213,19 @@ An example from each split looks like:

#### sd

An example from each split looks like:

```python
{
'record_id': '1578-6379-0038_6415-111615-0009',
'file': 'path/to/file.wav',
'start': 0,
'end': 1590,
'speakers': [
{'speaker_id': '1578', 'start': 28, 'end': 657},
{'speaker_id': '6415', 'start': 28, 'end': 1576}
]
}
```
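
With the default `frame_shift` of 160 samples at a 16 kHz sampling rate used in the functions above, each frame spans 10 ms, so this record covers 1590 × 0.01 ≈ 15.9 seconds.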


#### er
@@ -194,9 +246,9 @@ An example from each split looks like:

- `file`: a `string` feature.
- `text`: a `string` feature.
- `speaker_id`: a `int64` feature.
- `chapter_id`: a `int64` feature.
- `id`: a `string` feature.

#### ks

@@ -230,8 +282,15 @@ An example from each split looks like:

#### sd

The data fields in all splits are:
- `record_id` (`string`): ID of the record.
- `file` (`string`): Path to the WAV audio file.
- `start` (`integer`): Start frame of the audio.
- `end` (`integer`): End frame of the audio.
- `speakers` (`list` of `dict`): List of speakers in the audio. Each item contains the fields:
- `speaker_id` (`string`): ID of the speaker.
- `start` (`integer`): Frame when the speaker starts speaking.
- `end` (`integer`): Frame when the speaker stops speaking.
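
For instance, a segment's duration in seconds follows from its frame fields. A small sketch, assuming the 160-sample frame shift at 16 kHz used in the functions above (the helper name is ours):

```python
FRAME_SHIFT = 160  # samples per frame
RATE = 16000  # sampling rate in Hz


def frames_to_seconds(num_frames):
    return num_frames * FRAME_SHIFT / RATE


speaker = {"speaker_id": "1578", "start": 28, "end": 657}  # from the example above
print(frames_to_seconds(speaker["end"] - speaker["start"]))  # 6.29
```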

#### er

@@ -282,8 +341,11 @@ An example from each split looks like:

#### sd

The data is split into "train", "dev" and "test" sets, each containing the following number of examples:

| | train | dev | test |
|----|------:|-----:|-----:|
| sd | 13901 | 3014 | 3002 |

#### er
