Describe the bug
An error occurs during training (fine-tuning) when using a manifest in the format specified in the training README.
Steps/Code to reproduce bug
After creating the train and val dataset manifests in the format
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "label": "0 0 0 1 1 1 1 0 0"}
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/vad_multilingual_frame_marblenet
when loading this model and fine-tuning it, an error occurs during data loading.
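For illustration, a minimal sketch of writing manifest lines in the per-frame label format above (the helper name and file path are hypothetical, not part of NeMo):

```python
import json

def make_manifest_line(audio_filepath, duration, frame_labels, offset=0):
    # One manifest entry per line: JSON with a space-separated
    # per-frame label string, matching the format shown above.
    return json.dumps({
        "audio_filepath": audio_filepath,
        "offset": offset,
        "duration": duration,
        "label": " ".join(str(l) for l in frame_labels),
    })

line = make_manifest_line("/path/to/audio_file2", 10000, [0, 0, 0, 1, 1, 1, 1, 0, 0])
print(line)
```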
Expected behavior
The error message is as follows:
~/nemo/collections/asr/data/audio_to_label.py", line 347, in __getitem__
    t = torch.tensor(self.label2id[sample.label]).long()
KeyError: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
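The failure mode can be reproduced in isolation: AudioToSpeechLabelDataset builds label2id from utterance-level class names, so looking up an entire per-frame label string as a key fails. A minimal sketch (the label2id contents here are illustrative, not taken from NeMo):

```python
# Utterance-level classification maps each class name to an index.
label2id = {"0": 0, "1": 1}

# A frame-level VAD manifest stores the whole per-frame sequence as one
# string, which is not a key in label2id, hence the KeyError in __getitem__.
frame_label = "0 0 0 1 1 1 1 0 0"
try:
    t = label2id[frame_label]
except KeyError as e:
    print("KeyError:", e)
```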
Comparing the previous framevad_add branch with the latest branch, EncDecClassificationModel in classification_models.py changed from inheriting the existing _EncDecBaseModel class to inheriting the EncDecSpeakerLabelModel class.
As a result, instead of using audio_to_label_dataset.get_audio_multi_label_dataset to load the manifest, the current branch uses audio_to_label_dataset.get_speech_label_dataset, so audio_to_label.AudioToSpeechLabelDataset is used instead of audio_to_label.AudioToMultiLabelDataset. That class inherits from _AudioLabelDataset and uses its __getitem__, which cannot read the per-frame VAD ground truth recorded in the manifest.
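By contrast, a multi-label dataset splits the frame-label string and maps each frame through label2id individually, producing one target per frame. A rough sketch of that parsing (not the exact NeMo implementation; the real dataset additionally wraps the result in torch.tensor(...).long()):

```python
label2id = {"0": 0, "1": 1}
frame_label = "0 0 0 1 1 1 1 0 0"

# Split the per-frame label string and look up each frame individually,
# yielding a sequence of targets rather than a single class index.
targets = [label2id[tok] for tok in frame_label.split()]
print(targets)
```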
If you copy the EncDecClassificationModel class from the previous branch, the training step runs, but learning does not proceed because of a variable-name mismatch (e.g., val_acc) in the eval step. Please review this error.