Speech-to-text model trained from scratch generates empty outputs #164480
If you're seeing empty string outputs and flat hidden states when training SpeechT5 from scratch, it's a strong sign that the model isn't learning meaningful representations; this is typically a case of training collapse. One common cause is incorrect audio preprocessing: the ASR model expects 16 kHz audio run through its feature extractor (the speech encoder consumes the normalized raw waveform, while log-Mel filterbanks are used only as targets on the TTS side). Training SpeechT5 from scratch is especially hard because it is a large model, so starting from pretrained weights is strongly recommended. To debug further, visualize attention maps and check that the loss decreases meaningfully in the early epochs.
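As a minimal sketch of what correct preprocessing and a pretrained warm start can look like, assuming the official `microsoft/speecht5_asr` checkpoint (the waveform and transcript below are placeholders, not your data):

```python
import numpy as np
import torch
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

# Placeholder 1-second clip; in practice resample Librispeech/Multidialog audio
# to 16 kHz (e.g. datasets.Audio(sampling_rate=16000)) before this step.
waveform = np.random.randn(16000).astype(np.float32)

batch = processor(
    audio=waveform,
    sampling_rate=16000,        # must be 16 kHz; a mismatch silently degrades training
    text_target="hello world",  # placeholder transcript, tokenized into decoder labels
    return_tensors="pt",
)

with torch.no_grad():
    out = model(input_values=batch["input_values"], labels=batch["labels"])
print(out.loss)  # should drop noticeably within the first few epochs of fine-tuning
```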
I've been exploring recent audio modeling advances from OpenAI and Meta and attempted to train the SpeechT5ForSpeechToText model from scratch on the Librispeech and Multidialog datasets. Despite training for several dozen epochs, the model consistently outputs empty strings. Deeper analysis revealed that the hidden states become uniformly flat along the time axis at each feature dimension, which suggests the model is collapsing and failing to learn meaningful representations.
I am fascinated by recent audio modeling achievements from OpenAI and Meta. I experimented with the SpeechT5ForSpeechToText (ASR) model from HuggingFace, training it from scratch on the Librispeech and Multidialog datasets over the past month, following the research paper. None of the runs worked: the outputs are empty strings. Visualizing the hidden states at the higher encoder layers after up to several dozen epochs of training, I found that each feature dimension has uniform intensity along the time axis, although intensities differ between feature dimensions. The decoder outputs, including the logits, show the same behavior. I am looking for enlightenment.
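A check along these lines quantifies the flatness described above; this is a sketch that assumes the `model`, `processor`, and `batch` objects from the snippet in the reply above:

```python
import torch

with torch.no_grad():
    out = model(
        input_values=batch["input_values"],
        labels=batch["labels"],
        output_hidden_states=True,
    )

# Per-feature variance across the time axis; values near zero at the top
# encoder layers indicate collapsed, time-invariant representations.
for layer_idx, h in enumerate(out.encoder_hidden_states):
    time_var = h.float().var(dim=1)  # h: (batch, time, features)
    print(f"encoder layer {layer_idx}: mean time variance = {time_var.mean().item():.6f}")

# Greedy-decode the logits; a collapsed decoder typically emits only special
# tokens, which decode to empty strings.
pred_ids = out.logits.argmax(dim=-1)
print(processor.batch_decode(pred_ids, skip_special_tokens=True))
```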