
Commit dbb022c

Merge branch 'magpietts_eou_quality' of github.com:rfejgin/NeMo into magpietts_eou_quality
2 parents: 0fda552 + 84917c6


45 files changed (+4801, -1370 lines)

docs/source/speechlm2/intro.rst

Lines changed: 105 additions & 10 deletions
@@ -4,16 +4,21 @@ SpeechLM2

 .. note::

    The SpeechLM2 collection is still in active development and the code is likely to keep changing.

 SpeechLM2 refers to a collection that augments pre-trained Large Language Models (LLMs) with speech understanding and generation capabilities.

 This collection is designed to be compact, efficient, and to support easy swapping of different LLMs backed by HuggingFace AutoModel.
 It has first-class support for dynamic batch sizes via Lhotse and for model parallelism techniques (e.g., FSDP2, Tensor Parallel, Sequence Parallel) via the PyTorch DTensor API.

-We currently support four main model types:
-
-* SALM (Speech-Augmented Language Model) - a simple but effective approach to augmenting pre-trained LLMs with speech understanding capabilities.
-* DuplexS2SModel - a full-duplex speech-to-speech model with an ASR encoder, directly predicting discrete audio codes.
-* DuplexS2SSpeechDecoderModel - a variant of DuplexS2SModel with a separate transformer decoder for speech generation.
-* DuplexSTTModel - a decoder model that generates agent text in duplex, in response to both user speech and text inputs.
+We currently support six main model types:
+
+* **SALM** (Speech-Augmented Language Model) - a simple but effective approach to augmenting pre-trained LLMs with speech understanding capabilities.
+* **DuplexS2SModel** - a full-duplex speech-to-speech model with an ASR encoder, directly predicting discrete audio codes.
+* **DuplexS2SSpeechDecoderModel** - a variant of DuplexS2SModel with a separate transformer decoder for speech generation.
+* **DuplexEARTTS** - a ready-to-use duplex text-to-speech model that supports user interruption via a special text interruption token.
+* **DuplexSTTModel** - a decoder model that generates agent text in duplex, in response to both user speech and text inputs.
+* **NemotronVoiceChat** - an *inference-only* pipeline that chains `DuplexSTTModel` and `DuplexEARTTS` to deliver an end-to-end, full-duplex conversational agent with high-fidelity speech generation.

 Using Pretrained Models
 -----------------------
@@ -148,10 +153,100 @@ You can run inference using the loaded pretrained DuplexSTTModel:
     transcription = results["text"][0]
     print(f"Transcription: {transcription}")

+DuplexEARTTS
+************
+
+Because `DuplexEARTTS` relies on precise token padding and EOS placement to handle potential user interruptions, inference and evaluation are handled via the `duplex_eartts_eval.py` script, following the MagpieTTS dataset format recipe.
+
+The evaluation script processes a `JSONL` file where each line is a dictionary containing the text, the reference audio for the speaker, and the desired output audio filename.
+
+**JSONL Format Examples:**
+
+Single-turn format (evaluates a continuous string):
+
+.. code-block:: json
+
+   {"text": "Like really quickly and then they run off.", "context_audio_filepath": "speaker_1.wav", "audio_filepath": "audio_1.wav"}
+
+Multi-turn format (evaluates sequential conversational turns, padded incrementally):
+
+.. code-block:: json
+
+   {"text": ["Yes.", "Sure.", "Right.", "I get what you're saying."], "context_audio_filepath": "speaker_2.wav", "audio_filepath": "audio_2.wav"}
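A manifest in either format can be sanity-checked before running the script. The helper below is an illustrative sketch (not part of NeMo) that validates the required fields from the examples above and normalizes single-turn entries to the multi-turn shape:

```python
import json

def check_manifest_line(line: str) -> dict:
    """Parse one JSONL manifest line and normalize `text` to a list of turns."""
    entry = json.loads(line)
    for key in ("text", "context_audio_filepath", "audio_filepath"):
        if key not in entry:
            raise ValueError(f"missing required field: {key}")
    # Single-turn entries use a plain string; multi-turn entries use a list.
    if isinstance(entry["text"], str):
        entry["text"] = [entry["text"]]
    return entry

single = '{"text": "Hi there.", "context_audio_filepath": "spk.wav", "audio_filepath": "out.wav"}'
multi = '{"text": ["Yes.", "Sure."], "context_audio_filepath": "spk.wav", "audio_filepath": "out.wav"}'
print(check_manifest_line(single)["text"])  # ['Hi there.']
print(check_manifest_line(multi)["text"])   # ['Yes.', 'Sure.']
```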
+**Running the Evaluation/Inference Script:**
+
+.. code-block:: bash
+
+   python examples/speechlm2/duplex_eartts_eval.py \
+       --config-path=conf/ \
+       --config-name=duplex_eartts.yaml \
+       ++checkpoint_path=/path/to/duplex_eartts/model.ckpt \
+       ++datasets_json_path=/path/to/evalset_config.jsonl \
+       ++out_dir=/path/to/output/audio_samples/ \
+       ++user_custom_speaker_reference=/path/to/optional_override_speaker.wav
+
+The script will decode the text, apply the target speaker conditioning, generate the resulting audio waveforms into `out_dir`, and compute ASR intelligibility metrics (CER/WER) on the generated speech.
+
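For reference, the CER reported here is the character-level Levenshtein edit distance between the ASR transcript of the generated audio and the target text, normalized by the reference length. A minimal self-contained sketch of that metric (not the NeMo implementation):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edits normalized by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(cer("hello there", "hallo there"))  # 1 substitution over 11 chars, ~0.0909
```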
+NemotronVoiceChat
+*****************
+
+You can evaluate and run full-duplex inference using the `NemotronVoiceChat` pipeline. This model natively chains the `DuplexSTTModel` with the `DuplexEARTTS` speech decoder for an end-to-end response:
+
+.. code-block:: python
+
+   import torch
+   import torchaudio
+   import nemo.collections.speechlm2 as slm
+
+   model = slm.models.NemotronVoiceChat.from_pretrained("path/to/pretrained_checkpoint").eval()
+
+   # Load the user audio prompt
+   audio_path = "path/to/user_audio.wav"
+   audio_signal, sample_rate = torchaudio.load(audio_path)
+
+   # Resample to the source sample rate (usually 16 kHz for STT perception)
+   if sample_rate != 16000:
+       audio_signal = torchaudio.functional.resample(audio_signal, sample_rate, 16000)
+       sample_rate = 16000
+
+   # Prepare the audio for the model
+   audio_signal = audio_signal.to(model.device)
+   audio_len = torch.tensor([audio_signal.shape[1]], device=model.device)
+
+   # (Optional) Load an explicit speaker reference audio to condition the agent's voice
+   # speaker_audio, _ = torchaudio.load("path/to/speaker_reference.wav")
+   # speaker_audio = speaker_audio.to(model.device)
+   # speaker_len = torch.tensor([speaker_audio.shape[1]], device=model.device)
+
+   # Note: if an explicit audio reference is not passed into `offline_inference`,
+   # the model relies on the internal config parameters:
+   #   1. model.cfg.inference_speaker_name (highest-priority preset, e.g., 'Megan')
+   #   2. model.cfg.inference_speaker_reference (fallback audio file path)
+
+   # Run full offline inference
+   results = model.offline_inference(
+       input_signal=audio_signal,
+       input_signal_lens=audio_len,
+       # speaker_audio=speaker_audio,    # pass the speaker reference if available
+       # speaker_audio_lens=speaker_len,
+   )
+
+   # Decode the predicted text and the generated speech waveform
+   generated_text = results["text"][0]
+   generated_speech = results["audio"][0]
+
+   print(f"Agent response: {generated_text}")
+   # generated_speech can now be saved or played (sampled at model.target_sample_rate)
+
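The speaker-conditioning priority documented above reduces to a simple fallback chain: explicit audio passed to inference, then the preset name, then the reference file path. The resolver below is a hypothetical illustration of that documented order, not actual NeMo code:

```python
def resolve_speaker(explicit_audio=None, speaker_name=None, speaker_reference=None):
    """Illustrative sketch of the documented priority order:
    explicit audio passed to inference > inference_speaker_name preset
    > inference_speaker_reference file path."""
    if explicit_audio is not None:
        return ("audio_tensor", explicit_audio)
    if speaker_name is not None:
        return ("preset", speaker_name)
    if speaker_reference is not None:
        return ("reference_file", speaker_reference)
    raise ValueError("no speaker conditioning available")

print(resolve_speaker(speaker_name="Megan", speaker_reference="ref.wav"))
# ('preset', 'Megan')
```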
 Training a Model
 ----------------

-This example demonstrates how to train a SALM model. The remaining models can be trained in a similar manner.
+This example demonstrates how to train a SALM model.
+
+.. note::
+
+   **NemotronVoiceChat is an inference-only class.** It does not implement a `training_step` and cannot be trained using the pipeline below. To update its underlying capabilities, train the `DuplexSTTModel` and `DuplexEARTTS` models independently.

 .. code-block:: python
@@ -207,7 +302,7 @@ Alternatively, you can train a model using the provided training scripts in the
         --config-path=examples/speechlm2/conf \
         --config-name=salm

-    # For inference/evaluation
+    # For SALM inference/evaluation
     python examples/speechlm2/salm_eval.py \
         pretrained_name=/path/to/checkpoint \
         inputs=/path/to/test_manifest \
@@ -222,9 +317,9 @@ Collection Structure

 The speechlm2 collection is organized into the following key components:

-- **Models**: Contains implementations of DuplexS2SModel, DuplexS2SSpeechDecoderModel, DuplexSTTModel, and SALM
-- **Modules**: Contains audio perception and speech generation modules
-- **Data**: Includes dataset classes and data loading utilities
+- **Models**: Contains implementations of DuplexS2SModel, DuplexS2SSpeechDecoderModel, DuplexSTTModel, SALM, DuplexEARTTS, and the inference-only NemotronVoiceChat.
+- **Modules**: Contains audio perception and speech generation modules.
+- **Data**: Includes dataset classes and data loading utilities.

 SpeechLM2 Documentation
 -----------------------

docs/source/speechlm2/models.rst

Lines changed: 21 additions & 0 deletions
@@ -99,6 +99,24 @@ This model is particularly useful for:
 * Duplex systems where text responses are needed instead of speech
 * Applications requiring transcript generation from spoken dialogue

+NemotronVoiceChat
+^^^^^^^^^^^^^^^^^
+
+NemotronVoiceChat is an **inference-only**, end-to-end duplex speech-to-speech pipeline. It achieves full-duplex conversational capabilities by chaining the `DuplexSTTModel` with the `DuplexEARTTS` model.
+
+Because it is designed exclusively for evaluation, offline inference, and validation workflows (no training step is implemented), it is optimized for executing the full perception-generation-synthesis loop.
+
+Key components:
+
+* **DuplexSTTModel**: Handles streaming audio perception and text response generation.
+* **DuplexEARTTS**: Serves as the autoregressive speech decoder, generating high-fidelity audio from the STT model's text tokens in a streamable fashion.
+
+This model is particularly useful for:
+
+* End-to-end evaluation of the complete speech-to-speech pipeline.
+* Offline speech-to-speech inference workflows.
+
 Model Components
 ----------------
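The perception-generation-synthesis loop that NemotronVoiceChat executes can be pictured as a thin wrapper chaining the two models. The sketch below uses hypothetical stand-in callables (`VoiceChatSketch` and the stub functions are illustrative, not the real NeMo classes):

```python
class VoiceChatSketch:
    """Illustrative only: chains an STT-role model with a TTS-role decoder."""

    def __init__(self, stt_fn, tts_fn):
        self.stt_fn = stt_fn  # audio -> agent text (DuplexSTTModel role)
        self.tts_fn = tts_fn  # agent text -> waveform (DuplexEARTTS role)

    def offline_inference(self, user_audio):
        text = self.stt_fn(user_audio)  # perception + text generation
        audio = self.tts_fn(text)       # speech synthesis
        return {"text": [text], "audio": [audio]}

# Stub usage with toy functions standing in for the two models
pipe = VoiceChatSketch(stt_fn=lambda a: "hello!", tts_fn=lambda t: [0.0] * len(t))
out = pipe.offline_inference(user_audio=[0.1, 0.2])
print(out["text"][0])  # hello!
```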

@@ -247,6 +265,9 @@ All models in the speechlm2 collection can be instantiated from pretrained checkpoints:

     # Load DuplexEARTTS
     ear_tts_model = slm.models.DuplexEARTTS.from_pretrained("path/to/checkpoint")

+    # Load NemotronVoiceChat (inference only)
+    voicechat_model = slm.models.NemotronVoiceChat.from_pretrained("path/to/checkpoint")
+
 Model Configuration
 -------------------

docs/source/tts/magpietts-longform.rst

Lines changed: 8 additions & 8 deletions
@@ -68,7 +68,7 @@ The input text is split into individual sentences using punctuation markers (``.

 Step 2: State Initialization
 ----------------------------

-A ``LongformChunkState`` object is created to track information across sentence chunks:
+A ``ChunkState`` object is created to track information across sentence chunks:

 - **History text tokens**: Text from previous chunks for context
 - **History encoder context**: Encoder outputs that provide continuity
@@ -112,7 +112,7 @@ Key Components

 1. **Sentence Splitting** (``split_by_sentence``): Intelligently splits text on sentence boundaries while handling abbreviations (e.g., "Dr.", "Mr.").

-2. **Chunk State** (``LongformChunkState``): Maintains context across chunks:
+2. **Chunk State** (``ChunkState``): Maintains context across chunks:

    - ``history_text``: Text tokens from previous chunks
    - ``history_context_tensor``: Encoder outputs for continuity
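The abbreviation handling described for ``split_by_sentence`` can be approximated in a few lines. This is a simplified illustration with an assumed abbreviation set, not the actual NeMo implementation:

```python
import re

ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms."}  # assumed set for illustration

def split_sentences(text: str) -> list[str]:
    """Split on ., !, ? followed by whitespace, rejoining known abbreviations."""
    rough = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for piece in rough:
        # If the previous piece ended with a known abbreviation, glue them back.
        if sentences and sentences[-1].split()[-1] in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

print(split_sentences("Dr. Smith arrived. He sat down."))
# ['Dr. Smith arrived.', 'He sat down.']
```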
@@ -211,24 +211,24 @@ Configuration Dataclasses
 #########################

-``LongformConfig``
-------------------
+``ChunkedInferenceConfig``
+--------------------------

 Immutable tuning parameters (set in model):

 .. literalinclude:: ../../../nemo/collections/tts/models/magpietts.py
    :language: python
-   :pyobject: LongformConfig
+   :pyobject: ChunkedInferenceConfig

-``LongformChunkState``
-----------------------
+``ChunkState``
+--------------

 Mutable state passed between chunk iterations:

 .. literalinclude:: ../../../nemo/collections/tts/models/magpietts.py
    :language: python
-   :pyobject: LongformChunkState
+   :pyobject: ChunkState
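For intuition, the mutable chunk state can be pictured as a small dataclass with the fields documented above. The real ``ChunkState`` in ``magpietts.py`` holds tensors, so this is only a sketch:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkStateSketch:
    """Illustrative mutable state carried between chunk iterations."""
    history_text: list = field(default_factory=list)  # text tokens from previous chunks
    history_context_tensor: object = None             # encoder outputs for continuity

    def update(self, new_tokens, new_context):
        """Accumulate text history and replace the encoder context."""
        self.history_text.extend(new_tokens)
        self.history_context_tensor = new_context

state = ChunkStateSketch()
state.update(["Hello", "world."], new_context="enc_out_0")
print(state.history_text)  # ['Hello', 'world.']
```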

Best Practices

docs/source/tts/magpietts-po.rst

Lines changed: 2 additions & 2 deletions
@@ -96,8 +96,8 @@ The final step is fine-tuning the base model on the preference pairs using the DPO
     max_epochs=10 \
     exp_manager.exp_dir=/path/to/dpo_experiment \
     exp_manager.checkpoint_callback_params.always_save_nemo=false \
-    model.train_ds.dataset._target_="nemo.collections.tts.data.text_to_speech_dataset.MagpieTTSDatasetDPO" \
-    model.validation_ds.dataset._target_="nemo.collections.tts.data.text_to_speech_dataset.MagpieTTSDatasetDPO" \
+    model.train_ds.datasets._target_="nemo.collections.tts.data.text_to_speech_dataset.MagpieTTSDatasetDPO" \
+    model.validation_ds.datasets._target_="nemo.collections.tts.data.text_to_speech_dataset.MagpieTTSDatasetDPO" \
     +train_ds_meta.dpopreftrain.manifest_path="/path/to/manifests/" \
     +train_ds_meta.dpopreftrain.audio_dir="/" \
     +train_ds_meta.dpopreftrain.feature_dir="/" \
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+checkpoint_path: null  # Path to the pre-trained NemotronVoiceChat checkpoint for evaluation
+model:
+  scoring_asr: stt_en_fastconformer_transducer_large  # ASR model used to transcribe generated audio for ASR-BLEU computation
+  inference_speaker_reference: null  # Path to an audio file used to clone/condition the TTS voice. Set to null when using a preset name below.
+  inference_speaker_name: Megan  # Preset speaker identifier. If provided, this overrides `inference_speaker_reference`.
+
+  stt:
+    model:
+      # evaluation params
+      eval_text_turn_taking: true  # Enables evaluation of turn-taking and text prediction accuracy in the Duplex STT model
+
+  speech_generation:
+    model:
+      # inference params for the Duplex EAR-TTS module
+      inference_guidance_scale: 0.2  # Classifier-Free Guidance (CFG) scale for conditioning the audio generation
+      inference_noise_scale: 0.001  # Sampling temperature/noise for the MoG
+      inference_top_p_or_k: 0.95  # Nucleus sampling (top-p) or top-k threshold for token selection
+      inference_guidance_enabled: true  # Toggle to enable/disable Classifier-Free Guidance
+      inference_force_speech_silence_on_eos: true  # Forces the model to output silence tokens once the End-Of-Sequence (EOS) token is generated
+
+trainer:
+  devices: -1  # Number of GPUs to use (-1 uses all available)
+  accelerator: gpu  # Hardware accelerator type
+  num_nodes: 1  # Number of compute nodes
+  precision: 32  # Numerical precision for inference (32-bit full precision)
+  logger: false  # Disabled here because NeMo's `exp_manager` handles logging
+  limit_val_batches: 1.0  # Fraction of the validation dataset to use (1.0 = the entire dataset)
+  log_every_n_steps: 20  # Frequency of logging metrics to the console/wandb
+  use_distributed_sampler: false  # Disable the distributed sampler
+  strategy:
+    _target_: lightning.pytorch.strategies.DDPStrategy  # Distributed Data Parallel strategy for multi-GPU inference
+    gradient_as_bucket_view: true  # Memory optimization for DDP
+    find_unused_parameters: true  # Required if parts of the model (like text-only branches) don't receive gradients
+
+data:
+  frame_length: 0.08  # Duration of a single audio frame in seconds (80 ms)
+  source_sample_rate: 16000  # Sample rate of the input/user audio prompts (16 kHz)
+  target_sample_rate: 22050  # Sample rate of the generated output speech (22.05 kHz)
+  input_roles: ["user", "User"]  # Conversation roles mapped to the input prompt
+  output_roles: ["agent", "Assistant", "assistant", "Agent"]  # Conversation roles the model is tasked with generating
+
+validation_ds:
+  datasets:
+    evaluation_set:
+      shar_path: /lustre/fsw/portfolios/llmservice/users/kevinhu/duplex/ultrachat_v2/shar_duplex/manifest_000020  # Path to the Lhotse shar tar-shard manifest
+
+  sample_rate: ${data.target_sample_rate}  # Audio will be resampled to this rate if necessary
+  batch_size: 4  # Number of samples processed per GPU during evaluation
+  seed: 42  # Random seed for reproducibility
+  shard_seed: "randomized"  # Ensures distributed workers get different data shards
+
+exp_manager:
+  explicit_log_dir: nemotron_voicechat_log_dir/  # Root directory for evaluation metrics, JSON logs, and generated audio
+  name: nemotron-voicechat-eval  # Name of the experiment
+  create_tensorboard_logger: false  # Toggle for TensorBoard logging
+  create_checkpoint_callback: false  # Toggle for the checkpoint callback (not needed for evaluation)
+  use_datetime_version: true  # Appends a timestamp to the log directory name
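The `${data.target_sample_rate}` entry uses OmegaConf-style interpolation, so the validation dataset inherits whatever output sample rate is configured. A minimal pure-Python sketch of that lookup (real NeMo configs resolve this via OmegaConf):

```python
import re

def resolve(cfg: dict, value):
    """Resolve a single '${a.b.c}' reference against a nested dict; pass
    through values that are not interpolation strings."""
    match = re.fullmatch(r"\$\{([\w.]+)\}", str(value))
    if not match:
        return value
    node = cfg
    for key in match.group(1).split("."):
        node = node[key]
    return node

cfg = {"data": {"target_sample_rate": 22050},
       "validation_ds": {"sample_rate": "${data.target_sample_rate}"}}
print(resolve(cfg, cfg["validation_ds"]["sample_rate"]))  # 22050
```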
