Speech-to-text model trained from scratch generates empty outputs #164480
If you're seeing empty string outputs and flat hidden states when training SpeechT5 from scratch, it's a strong sign that the model isn't learning meaningful representations; this is typically a case of training collapse. One common cause is incorrect audio preprocessing: the ASR model expects 16 kHz audio run through its feature extractor (the speech encoder consumes the normalized raw waveform, while log-Mel filterbanks are used only as targets on the TTS side). Training SpeechT5 from scratch is especially hard because it is a large model, so starting from pretrained weights is strongly recommended. To debug further, visualize attention maps and check that the loss decreases meaningfully in the early epochs.
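As a minimal sketch of what correct preprocessing and a pretrained warm start can look like, assuming the official `microsoft/speecht5_asr` checkpoint (the waveform and transcript below are placeholders, not your data):

```python
import numpy as np
import torch
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

# Placeholder 1-second clip; in practice resample Librispeech/Multidialog audio
# to 16 kHz (e.g. datasets.Audio(sampling_rate=16000)) before this step.
waveform = np.random.randn(16000).astype(np.float32)

batch = processor(
    audio=waveform,
    sampling_rate=16000,        # must be 16 kHz; a mismatch silently degrades training
    text_target="hello world",  # placeholder transcript, tokenized into decoder labels
    return_tensors="pt",
)

with torch.no_grad():
    out = model(input_values=batch["input_values"], labels=batch["labels"])
print(out.loss)  # should drop noticeably within the first few epochs of fine-tuning
```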
I've been exploring recent audio modeling advances from OpenAI and Meta and attempted to train the SpeechT5ForSpeechToText model from scratch on the Librispeech and Multidialog datasets. Despite training for several dozen epochs, the model consistently outputs empty strings. Deeper analysis revealed that the hidden states become uniformly flat along the time axis at each feature dimension, which suggests the model is collapsing and failing to learn meaningful representations.
I am fascinated by recent audio modeling achievements from OpenAI and Meta. I experimented with the SpeechT5ForSpeechToText (ASR) model from HuggingFace, training it from scratch on the Librispeech and Multidialog datasets over the past month, following the research paper. None of the runs worked: the outputs are empty strings. Visualizing the hidden states at the higher encoder layers after up to several dozen epochs of training, I found that each feature dimension has uniform intensity along the time axis, although intensities differ between feature dimensions. The decoder outputs, including the logits, show the same behavior. I am looking for enlightenment.
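A check along these lines quantifies the flatness described above; this is a sketch that assumes the `model`, `processor`, and `batch` objects from the snippet in the reply above:

```python
import torch

with torch.no_grad():
    out = model(
        input_values=batch["input_values"],
        labels=batch["labels"],
        output_hidden_states=True,
    )

# Per-feature variance across the time axis; values near zero at the top
# encoder layers indicate collapsed, time-invariant representations.
for layer_idx, h in enumerate(out.encoder_hidden_states):
    time_var = h.float().var(dim=1)  # h: (batch, time, features)
    print(f"encoder layer {layer_idx}: mean time variance = {time_var.mean().item():.6f}")

# Greedy-decode the logits; a collapsed decoder typically emits only special
# tokens, which decode to empty strings.
pred_ids = out.logits.argmax(dim=-1)
print(processor.batch_decode(pred_ids, skip_special_tokens=True))
```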