Feature/unsup multichan waveform dataset #532
kkarrancsu wants to merge 5 commits into lhotse-speech:master
Conversation
pzelasko left a comment
Thanks! I think it's a good start, will take a bit more effort to handle multi-channel properly here, but we'll get there.
```python
# TODO: how to ensure that each track is synced across batches? i.e. dim=1 is the track index
# and should correspond to the same mic across batches

cuts = maybe_pad(cuts)
```
This should not be needed as you're manually zero-padding later.
```python
# and should correspond to the same mic across batches

cuts = maybe_pad(cuts)
cuts = remove_pad_tracks(cuts)
```
I think there is a pitfall here, what if a MixedCut looks like:
|-------cut1-------||---padding---||----cut2----|
or any variation of the situation where padding sits in between two cuts. I don't think Lhotse would handle these situations well with your current code. Maybe you should try only removing the padding at the end (and at the beginning, but for that one you have to be careful to adjust the offsets on the remaining tracks). Rather than manually removing PaddingCuts, I suggest using `.truncate()` with carefully computed offset and duration arguments; that method handles a lot of pitfalls and edge cases.
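A minimal sketch of the offset/duration computation this suggests. The helper name and the `(start, duration, is_padding)` track representation are illustrative, not Lhotse API; in Lhotse the equivalent information lives on the `MixedCut` tracks, and the computed bounds would be passed to `.truncate()`.

```python
# Hypothetical helper: compute the offset/duration that trims only leading and
# trailing padding from a mixed cut, leaving any interior padding alone.
def trim_bounds(tracks):
    """tracks: iterable of (start, duration, is_padding).
    Returns (offset, duration) covering all non-padding tracks."""
    real = [(start, start + dur) for start, dur, is_pad in tracks if not is_pad]
    if not real:
        raise ValueError("cut contains only padding tracks")
    offset = min(begin for begin, _ in real)
    end = max(finish for _, finish in real)
    return offset, end - offset

# With a real Lhotse cut this would then be roughly:
#   offset, duration = trim_bounds(...)
#   cut = cut.truncate(offset=offset, duration=duration)
# so that .truncate() handles shifting the remaining tracks' offsets.
```

Note that for the `|--cut1--||--padding--||--cut2--|` case above, the helper keeps the interior padding (the bounds span from cut1's start to cut2's end), which is exactly the behavior the comment asks for.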
```python
for idx, cut in enumerate(cuts):
    ntrack = len(cut.tracks)
    nsamp = cut.num_samples
    audio[idx, 0:ntrack, 0:nsamp] = torch.from_numpy(cut.load_audio(mixed=False))
```
Note that if you did `cut.mix(musan_cut)` here, it would also add an extra track; as is, the code would not work with additive noise data augmentation.
```python
cuts = remove_pad_tracks(cuts)

# NOTE: what to do when the # of tracks is not the same across cuts, right now
# this is zero-padding but that seems bad ...
```
I think you won't escape zero-padding of examples with fewer channels if you need to collate the data. However, I suggest you modify this function to return a 3-tuple: (audio, audio_lens, channel_indexes), where audio is the collated data with shape (B, C, T), audio_lens holds the length of each multi-channel example with shape (B,), and channel_indexes is a list of lists saying which C-dim indexes hold meaningful channels for each example (it could also be a channel_lens tensor of shape (B,), assuming the first c channels are always meaningful, if that can be guaranteed).
But in the end your models will have to somehow deal with the non-meaningful channels anyway. As long as you're working on same-number-of-channels data, no need to overthink this.
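A sketch of the suggested 3-tuple collate, using the simpler `channel_lens` variant. This is illustrative, not Lhotse code: it takes per-example arrays of shape `(C_i, T_i)` and zero-pads both the channel and time dimensions (NumPy is used here in place of torch to keep the sketch self-contained).

```python
import numpy as np

def collate_multi_channel(arrays):
    """arrays: list of (C_i, T_i) float arrays.
    Returns (audio, audio_lens, channel_lens) as suggested above."""
    B = len(arrays)
    C = max(a.shape[0] for a in arrays)
    T = max(a.shape[1] for a in arrays)
    audio = np.zeros((B, C, T), dtype=np.float32)   # collated data, shape (B, C, T)
    audio_lens = np.zeros(B, dtype=np.int64)        # samples per example, shape (B,)
    channel_lens = np.zeros(B, dtype=np.int64)      # meaningful channels per example
    for i, a in enumerate(arrays):
        c, t = a.shape
        audio[i, :c, :t] = a                        # zero-pad channels and time
        audio_lens[i] = t
        channel_lens[i] = c
    return audio, audio_lens, channel_lens
```

A downstream model can then mask out channels `>= channel_lens[i]` for each example instead of treating the zero-padding as real data.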
```python
assert all(isinstance(cut, MixedCut) for cut in cuts)

# TODO: how to ensure that each track is synced across batches? i.e. dim=1 is the track index
# and should correspond to the same mic across batches
```
You can ensure the tracks are sorted by some property; I imagine this is something very corpus specific and should be done by the user, not by the library.
```python
assert all(cut.has_recording for cut in cuts)


class UnsupervisedMultiChanWaveformDataset(UnsupervisedDataset):
```
```diff
-class UnsupervisedMultiChanWaveformDataset(UnsupervisedDataset):
+class MultiChannelWaveformDataset(UnsupervisedDataset):
```
somehow reads better to me
```python
        "audio_lens": audio_lens,
    }
else:
    return {"cuts": cuts, "audio": [c.load_audio(mixed=False) for c in cuts]}
```
This line would again have the extra padding channels problem. This suggests that maybe the solution should not live (entirely) in the collate function, but inside `load_audio`, e.g. controlled by an extra argument?
lhotse/dataset/speech_recognition.py
Outdated
```diff
     of max_frames and max_cuts.
     """
-    validate_for_asr(cuts)
+    #validate_for_asr(cuts)
```
This should be uncommented
For anybody interested in this, here's some context from our earlier discussion with @kkarrancsu. I expect you to run into issues related to padding and MUSAN data augmentation with it. Basically, padding and augmentation create extra tracks in `MixedCut`, and neither `MixedCut` nor `collate_multi_channel_audio` knows which tracks are the data and which tracks are the padding / noise. So, for 4-channel audio, you might end up with a 6-channel output.
@kkarrancsu I have a different idea -- we could add a new attribute to … We would need to extend the mixing code (Lines 984 to 990 in b41e4f8) so that instead of simply vstacking the right channels, it vstacks only the "separate" channels, downmixes the remaining channels to mono, and adds them to each of the "separate" channels. The analogous operation is needed for … Of course we'd need to add more unit tests to make sure this doesn't break anything and works as expected.
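The proposed mixing rule can be sketched as follows. This is an assumption-laden illustration, not Lhotse code: the function name and the plain-array interface are made up; it keeps the "separate" channels as-is, downmixes every other channel (padding, noise) to mono, and adds that mono signal to each separate channel.

```python
import numpy as np

def mix_with_mono_downmix(audio, separate_idx):
    """audio: (C_total, T) array; separate_idx: channels to keep separate.
    Returns (C_sep, T): separate channels with the mono downmix of the
    remaining channels added onto each of them."""
    separate = audio[separate_idx]                     # (C_sep, T), kept as-is
    rest_idx = [c for c in range(audio.shape[0]) if c not in separate_idx]
    if not rest_idx:
        return separate
    mono = audio[rest_idx].sum(axis=0)                 # downmix the rest to mono
    return separate + mono[None, :]                    # add it to every separate channel
```

For 4-channel data augmented with noise and padding tracks, this would give back exactly 4 output channels, each with the noise mixed in, instead of the 6-channel output described above.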
It seems to me that the more "correct" way to do this would be, when adding noise to multi-channel audio, to add multiple channels of noise. I assume this would require some nontrivial simulation, possibly with multiple sources.
Good point... I am not sure if implementing that on top of … One such tool is e.g. https://github.com/asteroid-team/torch-audiomentations, but I just noticed that they are doing exactly the same simplified mono downmix I was thinking about. Another option is using https://github.com/LCAV/pyroomacoustics as a transform inside your PyTorch Dataset class, I think. In any case, we would still need to be able to handle the padding. I think the solution I suggested with …
OK, sure, it was just a thought.
Added a Dataset which supports multi-channel audio samples. Updated the collator to drop pad tracks.