The SpeechLM2 collection is still in active development and the code is likely to keep changing.
SpeechLM2 refers to a collection that augments pre-trained Large Language Models (LLMs) with speech understanding and generation capabilities.
This collection is designed to be compact and efficient, and to support easy swapping of different LLMs backed by HuggingFace AutoModel.
It has first-class support for dynamic batch sizes via Lhotse and various model parallelism techniques (e.g., FSDP2, Tensor Parallel, Sequence Parallel) via the PyTorch DTensor API.
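The idea behind dynamic batch sizes can be illustrated with a small, self-contained sketch (plain Python, not the actual Lhotse API): utterances are greedily packed into mini-batches under a total-duration budget, so the batch size varies with utterance length instead of being fixed.

```python
# Toy illustration of duration-based dynamic batching, the concept that
# Lhotse's dynamic samplers implement. All names here are illustrative only.
def dynamic_batches(durations, max_batch_duration):
    """Greedily pack utterance durations into batches under a duration budget."""
    batches, current, total = [], [], 0.0
    for dur in durations:
        if current and total + dur > max_batch_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(dur)
        total += dur
    if current:
        batches.append(current)
    return batches

# Short utterances yield large batches; long utterances yield small ones.
batches = dynamic_batches([2.0, 2.5, 3.0, 14.0, 15.0, 1.0, 1.5],
                          max_batch_duration=16.0)
print([len(b) for b in batches])  # → [3, 1, 2, 1]
```

Each batch stays under the 16-second budget, which keeps GPU memory usage roughly constant regardless of utterance length.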
We currently support six main model types:
* **SALM** (Speech-Augmented Language Model) - a simple but effective approach to augmenting pre-trained LLMs with speech understanding capabilities.
* **DuplexS2SModel** - a full-duplex speech-to-speech model with an ASR encoder, directly predicting discrete audio codes.
* **DuplexS2SSpeechDecoderModel** - a variant of DuplexS2SModel with a separate transformer decoder for speech generation.
* **DuplexEARTTS** - a ready-to-use duplex text-to-speech model that supports user interruption via a special text interruption token.
* **DuplexSTTModel** - a decoder model to generate agent text in duplex, in response to both user speech and text inputs.
* **NemotronVoiceChat** - an *inference-only* pipeline that seamlessly merges `DuplexSTTModel` and `DuplexEARTTS` to deliver an end-to-end, full-duplex conversational agent with high-fidelity speech generation.
Using Pretrained Models
-----------------------
You can run inference using the loaded pretrained DuplexSTTModel:

.. code-block:: python

    transcription = results["text"][0]
    print(f"Transcription: {transcription}")

DuplexEARTTS
************
Because `DuplexEARTTS` relies on precise token padding and EOS placement to handle potential user interruptions, inference and evaluation are handled via the `duplex_eartts_eval.py` script following the MagpieTTS dataset format recipe.
The evaluation script processes a `JSONL` file where each line is a dictionary containing the text, the reference audio for the speaker, and the desired output audio filename.
**JSONL Format Examples:**
Single-Turn format (evaluates a continuous string):
.. code-block:: json

    {"text": "Like really quickly and then they run off.", "context_audio_filepath": "speaker_1.wav", "audio_filepath": "audio_1.wav"}

Multi-Turn format (evaluates sequential conversational turns, padded incrementally):
.. code-block:: json

    {"text": ["Yes.", "Sure.", "Right.", "I get what you're saying."], "context_audio_filepath": "speaker_2.wav", "audio_filepath": "audio_2.wav"}

The script will decode the text, apply the target speaker conditioning, generate the resulting audio waveforms into `out_dir`, and compute ASR intelligibility metrics (CER/WER) on the generated speech.
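For reference, such a manifest can be produced programmatically; a minimal sketch (the file names and texts below are illustrative, not from an actual dataset):

```python
import json

# Hypothetical entries following the single-turn and multi-turn formats above.
entries = [
    {"text": "Like really quickly and then they run off.",
     "context_audio_filepath": "speaker_1.wav",
     "audio_filepath": "audio_1.wav"},
    {"text": ["Yes.", "Sure.", "Right.", "I get what you're saying."],
     "context_audio_filepath": "speaker_2.wav",
     "audio_filepath": "audio_2.wav"},
]

# JSONL: one JSON dictionary per line, no enclosing list.
with open("eval_manifest.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Note that the `text` field is a plain string for single-turn evaluation and a list of strings for multi-turn evaluation.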
NemotronVoiceChat
*****************
You can evaluate and run full-duplex inference using the `NemotronVoiceChat` pipeline. This model natively chains the `DuplexSTTModel` with the `DuplexEARTTS` speech decoder for an end-to-end response:
.. code-block:: python
    import torch
    import torchaudio
    import nemo.collections.speechlm2 as slm

    model = slm.models.NemotronVoiceChat.from_pretrained("path/to/pretrained_checkpoint").eval()

    # ...
        # speaker_audio=speaker_audio,  # Pass speaker reference if available
        # speaker_audio_lens=speaker_len,
    )

    # Decode the predicted text and generated speech waveform
    generated_text = results["text"][0]
    generated_speech = results["audio"][0]

    print(f"Agent response: {generated_text}")
    # generated_speech can now be saved or played (sampled at model.target_sample_rate)

Training a Model
----------------
This example demonstrates how to train a SALM model.
.. note::

   **NemotronVoiceChat is an inference-only class.** It does not implement a `training_step` and cannot be trained using the pipeline below. To update its underlying capabilities, you must train the `DuplexSTTModel` and `DuplexEARTTS` models independently.

.. code-block:: python
    # ...

Alternatively, you can train a model using the provided training scripts in the `examples/speechlm2` directory:

.. code-block:: bash

    # ...
        --config-path=examples/speechlm2/conf \
        --config-name=salm

    # For SALM inference/evaluation
    python examples/speechlm2/salm_eval.py \
        pretrained_name=/path/to/checkpoint \
        inputs=/path/to/test_manifest \

Collection Structure
--------------------

The speechlm2 collection is organized into the following key components:
- **Models**: Contains implementations of DuplexS2SModel, DuplexS2SSpeechDecoderModel, DuplexSTTModel, SALM, DuplexEARTTS, and the inference-only NemotronVoiceChat.
- **Modules**: Contains audio perception and speech generation modules.
- **Data**: Includes dataset classes and data loading utilities.

The following sections are from `docs/source/speechlm2/models.rst`:

This model is particularly useful for:

* Duplex systems where text responses are needed instead of speech
* Applications requiring transcript generation from spoken dialogue
NemotronVoiceChat
^^^^^^^^^^^^^^^^^
NemotronVoiceChat is an **inference-only**, end-to-end Duplex Speech-to-Speech pipeline. It achieves full-duplex conversational capabilities by seamlessly merging the `DuplexSTTModel` with the `DuplexEARTTS` model.
Because it is designed exclusively for evaluation, offline inference, and validation workflows (no training step is implemented), it is highly optimized for executing the full perception-generation-synthesis loop.
Key components:
* **DuplexSTTModel**: Handles the streaming audio perception and text response generation.
* **DuplexEARTTS**: Serves as the autoregressive speech decoder, generating high-fidelity audio from the STT model's text tokens in a streamable fashion.
This model is particularly useful for:
* End-to-end evaluation of the complete speech-to-speech pipeline.
* Offline speech-to-speech inference workflows.
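As a rough mental model of the STT-to-TTS hand-off (a toy stand-in, not the real `NemotronVoiceChat` interface), text chunks emitted by the STT side can be consumed incrementally by the TTS side via generator composition:

```python
# Toy sketch of a streaming perception -> text -> speech hand-off.
# All names and shapes here are illustrative, not the actual NeMo API.
def stt_stream(user_audio_frames):
    """Pretend STT: emit one text token per incoming audio frame."""
    for i, _frame in enumerate(user_audio_frames):
        yield f"tok{i}"

def tts_stream(text_tokens, samples_per_token=4):
    """Pretend autoregressive TTS: emit a waveform chunk per text token."""
    for _tok in text_tokens:
        yield [0.0] * samples_per_token  # placeholder audio samples

frames = [b"frame"] * 3
audio_chunks = list(tts_stream(stt_stream(frames)))
print(len(audio_chunks))  # → 3
```

Because `tts_stream` pulls tokens lazily from `stt_stream`, speech synthesis can begin before the full text response is available, which is the essence of the full-duplex pipeline.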
Model Components
----------------
All models in the speechlm2 collection can be instantiated from pretrained checkpoints.

An example evaluation configuration for `NemotronVoiceChat`:

.. code-block:: yaml

    _target_: lightning.pytorch.strategies.DDPStrategy  # Distributed Data Parallel strategy for multi-GPU inference
    gradient_as_bucket_view: true  # Memory optimization for DDP
    find_unused_parameters: true  # Required if parts of the model (like text-only branches) don't receive gradients

    data:
      frame_length: 0.08  # Duration of a single audio frame in seconds (80 ms)
      source_sample_rate: 16000  # Sample rate of the input/user audio prompts (16 kHz)
      target_sample_rate: 22050  # Sample rate of the generated output speech (22.05 kHz)
      input_roles: ["user", "User"]  # Conversation roles mapped to the input prompt
      output_roles: ["agent", "Assistant", "assistant", "Agent"]  # Conversation roles the model is tasked with generating

    validation_ds:
      datasets:
        evaluation_set:
          shar_path: /lustre/fsw/portfolios/llmservice/users/kevinhu/duplex/ultrachat_v2/shar_duplex/manifest_000020  # Path to the Lhotse WebDataset tar shards manifest
      sample_rate: ${data.target_sample_rate}  # Audio will be resampled to this rate if necessary
      batch_size: 4  # Number of samples processed per GPU during evaluation
      seed: 42  # Random seed for reproducibility
      shard_seed: "randomized"  # Ensures distributed workers get different data shards

    exp_manager:
      explicit_log_dir: nemotron_voicechat_log_dir/  # Root directory where evaluation metrics, JSON logs, and generated audio will be saved
      name: nemotron-voicechat-eval  # Name of the experiment
      create_tensorboard_logger: false  # Toggle for TensorBoard logging
      create_checkpoint_callback: false  # Checkpoint callback (disabled for evaluation)
      use_datetime_version: true  # Appends a timestamp to the log directory name
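The `${data.target_sample_rate}` entry above is an OmegaConf-style interpolation: the value is looked up elsewhere in the config at resolution time. As a sketch of how this works (a plain-Python stand-in on a hand-written dict, not the actual OmegaConf implementation):

```python
import re

# Minimal stand-in for OmegaConf-style "${a.b}" interpolation resolution.
def resolve(cfg, value):
    """Replace ${dotted.path} references in value with lookups into cfg."""
    def lookup(match):
        node = cfg
        for key in match.group(1).split("."):
            node = node[key]  # walk the nested dict along the dotted path
        return str(node)
    return re.sub(r"\$\{([^}]+)\}", lookup, value)

# Hypothetical fragment mirroring the YAML config above.
cfg = {
    "data": {"target_sample_rate": 22050},
    "validation_ds": {"sample_rate": "${data.target_sample_rate}"},
}
print(resolve(cfg, cfg["validation_ds"]["sample_rate"]))  # → 22050
```

Keeping a single source of truth this way ensures the evaluation dataloader always resamples audio to whatever rate the speech decoder generates.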