docs/source/tts/intro.rst (3 additions, 0 deletions)

@@ -16,5 +16,8 @@ We will illustrate details in the following sections.

   checkpoints
   configs
   g2p
   magpietts
   magpietts-po
   magpietts-longform

.. include:: resources.rst
docs/source/tts/magpietts-longform.rst (270 additions, 0 deletions)

@@ -0,0 +1,270 @@
.. _magpie-tts-longform:

==============================
Magpie-TTS Longform Inference
==============================

This document describes how longform (multi-sentence) text-to-speech inference works in Magpie-TTS.


Overview
########

Magpie-TTS supports generating speech for long text inputs by processing them in smaller, sentence-level chunks while maintaining prosodic continuity across the entire utterance. This approach overcomes the context window limitations of the underlying transformer architecture.


When Longform is Used
#####################

Longform inference is triggered automatically when the input exceeds a per-language word-count threshold, each corresponding to roughly 20 seconds of audio:

.. list-table:: Language Word Thresholds
   :header-rows: 1
   :widths: 30 30

   * - Language
     - Word Threshold
   * - English
     - 45 words
   * - Spanish
     - 73 words
   * - French
     - 69 words
   * - German
     - 50 words
   * - Italian
     - 53 words
   * - Vietnamese
     - 50 words

.. note::

   Longform is best supported for English. Mandarin currently falls back to standard inference.
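The auto-detection heuristic can be pictured as a simple word-count check mirroring the thresholds in the table above. This is an illustrative sketch, not the actual NeMo API; the function name and fallback default are assumptions:

.. code-block:: python

   # Illustrative word-count check; the real detection logic lives inside the
   # model and may differ in detail.
   WORD_THRESHOLDS = {"en": 45, "es": 73, "fr": 69, "de": 50, "it": 53, "vi": 50}

   def needs_longform(text: str, language: str) -> bool:
       if language == "zh":  # Mandarin falls back to standard inference
           return False
       threshold = WORD_THRESHOLDS.get(language, 45)  # assumed fallback
       return len(text.split()) > threshold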


Algorithm
#########

The longform inference algorithm processes long text through the following steps:


Step 1: Sentence Splitting
--------------------------

The input text is split into individual sentences using punctuation markers (``.``, ``?``, ``!``, ``...``). The splitting handles abbreviations such as "Dr.", "Mr.", and "a.m." by checking whether a period is followed by a space before treating it as a sentence boundary.

**Example:**

::

   Input: "Dr. Smith arrived early. How are you today?"
   Output: ["Dr. Smith arrived early.", "How are you today?"]
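A minimal sketch of this behavior follows. It is simplified: the abbreviation list here is illustrative, and the actual ``split_by_sentence`` also handles ``...`` and other edge cases:

.. code-block:: python

   # Simplified abbreviation-aware splitter; not the actual implementation.
   _ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "a.m.", "p.m."}

   def split_sentences(text: str) -> list:
       """Split on ., ?, ! while keeping common abbreviations intact."""
       sentences, current = [], []
       for token in text.split():
           current.append(token)
           if token.endswith((".", "?", "!")) and token.lower() not in _ABBREVIATIONS:
               sentences.append(" ".join(current))
               current = []
       if current:  # flush any trailing text without terminal punctuation
           sentences.append(" ".join(current))
       return sentences

   split_sentences("Dr. Smith arrived early. How are you today?")
   # → ['Dr. Smith arrived early.', 'How are you today?']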


Step 2: State Initialization
----------------------------

A ``LongformChunkState`` object is created to track information across sentence chunks:

- **History text tokens**: Text from previous chunks for context
- **History encoder context**: Encoder outputs that provide continuity
- **Attention tracking**: Monitors which positions have been attended to


Step 3: Iterative Chunk Processing
----------------------------------

For each sentence chunk, the following sub-steps are performed:

1. **Context Preparation**: Prepend history text and encoder context from previous chunks to maintain prosodic continuity.

2. **Attention Prior Application**: Apply a learned attention prior that guides the model to attend to the correct text positions, preventing repetition or skipping.

3. **Autoregressive Generation**: Generate audio codes token-by-token using the transformer decoder with temperature sampling.

4. **State Update**: Update the chunk state with:

   - New history text (last N tokens)
   - New encoder context
   - Updated attention tracking

5. **Code Collection**: Store the generated audio codes for this chunk.
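The sub-steps above can be sketched as a single loop. This is a toy stand-in: ``ToyChunkState`` and ``generate_codes`` are illustrative placeholders for the real decoder and ``LongformChunkState``:

.. code-block:: python

   from dataclasses import dataclass, field
   from typing import List

   @dataclass
   class ToyChunkState:
       history_text: List[str] = field(default_factory=list)
       history_len: int = 20  # max history tokens retained

   def generate_codes(sentence: str, state: ToyChunkState) -> List[int]:
       # Stand-in for sub-steps 1-3: the real model conditions on the history
       # and emits audio codec tokens autoregressively; here, one fake code
       # per word.
       return list(range(len(sentence.split())))

   def longform_infer(sentences: List[str]) -> List[int]:
       state = ToyChunkState()
       all_codes: List[int] = []
       for sentence in sentences:
           codes = generate_codes(sentence, state)
           tokens = state.history_text + sentence.split()
           state.history_text = tokens[-state.history_len:]  # step 4: keep last N
           all_codes.extend(codes)                           # step 5: collect
       return all_codes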


Step 4: Code Concatenation
--------------------------

After all chunks are processed, concatenate the audio codes from each chunk along the time dimension into a single sequence.


Step 5: Audio Decoding
----------------------

Pass the concatenated codes through the neural audio codec decoder to produce the final waveform.
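Steps 4 and 5 amount to joining the per-chunk codes along the time axis, then decoding once. In the real pipeline the chunks are torch tensors joined with ``torch.cat(..., dim=-1)``; the toy illustration below uses plain lists, and ``codec`` is a hypothetical stand-in for the codec decoder:

.. code-block:: python

   chunk_codes = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]      # per-chunk code sequences
   codes = [c for chunk in chunk_codes for c in chunk]  # step 4: join along time
   # waveform = codec.decode(codes)                     # step 5 (illustrative)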


Key Components
--------------

1. **Sentence Splitting** (``split_by_sentence``): Intelligently splits text on sentence boundaries while handling abbreviations (e.g., "Dr.", "Mr.").

2. **Chunk State** (``LongformChunkState``): Maintains context across chunks:

   - ``history_text``: Text tokens from previous chunks
   - ``history_context_tensor``: Encoder outputs for continuity
   - ``last_attended_timesteps``: Attention tracking for smooth transitions

3. **Attention Prior**: Guides the model's attention to maintain proper alignment and prevent repetition/skipping.


Usage
#####


Method 1: Using ``do_tts`` (Recommended for Simple Use Cases)
-------------------------------------------------------------

The ``do_tts`` method automatically detects whether longform inference is needed:

.. code-block:: python

   import torch
   from nemo.collections.tts.models import MagpieTTSModel

   # Load model
   model = MagpieTTSModel.restore_from("path/to/magpietts.nemo")
   model.eval()
   model.cuda()

   # Short text - uses standard inference automatically
   short_audio, short_len = model.do_tts(
       transcript="Hello, how are you?",
       language="en",
   )

   # Long text - automatically switches to longform inference
   long_text = """
   The quick brown fox jumps over the lazy dog. This sentence contains every
   letter of the alphabet. Sphinx of black quartz, judge my vow. Pack my box
   with five dozen liquor jugs. How vexingly quick daft zebras jump. The five
   boxing wizards jump quickly. Jackdaws love my big sphinx of quartz.
   """

   long_audio, long_len = model.do_tts(
       transcript=long_text,
       language="en",
       apply_TN=True,  # Apply text normalization
       temperature=0.7,
       topk=80,
       use_cfg=True,
       cfg_scale=2.5,
   )

   # Save audio
   import soundfile as sf
   sf.write("output.wav", long_audio[0].cpu().numpy(), 22050)


Method 2: Using CLI (``magpietts_inference.py``)
------------------------------------------------

For batch inference from manifests:

.. code-block:: bash

   # Auto-detect longform based on text length (default)
   python examples/tts/magpietts_inference.py \
       --nemo_files /path/to/magpietts.nemo \
       --datasets_json_path /path/to/evalset_config.json \
       --out_dir /path/to/output \
       --codecmodel_path /path/to/codec.nemo \
       --longform_mode auto

   # Force longform inference for all inputs
   python examples/tts/magpietts_inference.py \
       --nemo_files /path/to/magpietts.nemo \
       --datasets_json_path /path/to/evalset_config.json \
       --out_dir /path/to/output \
       --codecmodel_path /path/to/codec.nemo \
       --longform_mode always \
       --longform_max_decoder_steps 50000

**Longform CLI Options:**

.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Option
     - Default
     - Description
   * - ``--longform_mode``
     - ``auto``
     - ``auto``: detect from text, ``always``: force longform, ``never``: disable


Configuration Dataclasses
#########################


``LongformConfig``
------------------

Immutable tuning parameters (set in model):

.. code-block:: python

   @dataclass
   class LongformConfig:
       history_len_heuristic: int = 20   # Max history tokens retained
       prior_weights_init: Tuple = (0.5, 1.0, 0.8, 0.2, 0.2)  # Initial attention weights
       prior_weights: Tuple = (0.2, 1.0, 0.6, 0.4, 0.2, 0.2)  # Generation weights
       finished_limit_with_eot: int = 5  # Steps after text end before EOS
       short_sentence_threshold: int = 35  # Skip prior for short sentences
       attention_sink_threshold: int = 10  # Attention sink detection
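Because ``LongformConfig`` is a plain dataclass, individual fields can be overridden with ``dataclasses.replace``. A self-contained sketch (only two fields are reproduced here, and the ``frozen=True`` flag is an assumption reflecting the "immutable" description above):

.. code-block:: python

   from dataclasses import dataclass, replace

   @dataclass(frozen=True)
   class LongformConfig:  # abbreviated copy for illustration
       history_len_heuristic: int = 20
       short_sentence_threshold: int = 35

   # Override one field; the others keep their defaults.
   cfg = replace(LongformConfig(), history_len_heuristic=30)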


``LongformChunkState``
----------------------

Mutable state passed between chunk iterations:

.. code-block:: python

   @dataclass
   class LongformChunkState:
       batch_size: int
       history_text: Optional[torch.Tensor] = None  # (B, T)
       history_text_lens: Optional[torch.Tensor] = None  # (B,)
       history_context_tensor: Optional[torch.Tensor] = None  # (B, T, E)
       end_indices: Dict[int, int] = field(default_factory=dict)
       overall_idx: int = 0
       left_offset: List[int] = field(default_factory=list)
       last_attended_timesteps: List[List[int]] = field(default_factory=list)


Best Practices
##############

1. **Use ``apply_TN=True``** for raw text to ensure proper normalization before synthesis.

2. **Increase ``max_decoder_steps``** if generation is cut off on very long texts; the default of 50000 is sufficient in most cases.

3. **Use ``longform_mode="auto"``** (default) to let the system decide based on text length.

4. **For non-English languages**, be aware that longform performance may vary. English is best supported.


Limitations
###########

- **Mandarin (zh)**: Currently falls back to standard inference due to character-based tokenization complexities.
- **Prosodic boundaries**: While the algorithm maintains continuity, natural paragraph breaks may not always be perfectly preserved in non-English languages.


See Also
########

- :doc:`magpietts`: Main Magpie-TTS documentation
- :doc:`magpietts-po`: Preference Optimization Guide
