NVIDIA-NeMo · blisc · Jan 26, 2026 · Jan 16, 2026 · Jan 22, 2026 · Jan 26, 2026
diff --git a/docs/source/tts/intro.rst b/docs/source/tts/intro.rst
@@ -16,5 +16,8 @@ We will illustrate details in the following sections.
     checkpoints
     configs
     g2p
+    magpietts
+    magpietts-po
+    magpietts-longform
 
 .. include:: resources.rst
diff --git a/docs/source/tts/magpietts-longform.rst b/docs/source/tts/magpietts-longform.rst
@@ -0,0 +1,270 @@
+.. _magpie-tts-longform:
+
+==============================
+Magpie-TTS Longform Inference
+==============================
+
+This document describes how longform (multi-sentence) text-to-speech inference works in Magpie-TTS.
+
+
+Overview
+########
+
+Magpie-TTS supports generating speech for long text inputs by processing them in smaller, sentence-level chunks while maintaining prosodic continuity across the entire utterance. This approach overcomes the context window limitations of the underlying transformer architecture.
+
+
+When Longform is Used
+#####################
+
+Longform inference is triggered automatically based on a per-language word-count threshold. Each threshold corresponds to approximately 20 seconds of audio:
+
+.. list-table:: Language Word Thresholds
+   :header-rows: 1
+   :widths: 30 30
+
+   * - Language
+     - Word Threshold
+   * - English
+     - 45 words
+   * - Spanish
+     - 73 words
+   * - French
+     - 69 words
+   * - German
+     - 50 words
+   * - Italian
+     - 53 words
+   * - Vietnamese
+     - 50 words
+
+.. note::
+
+   Longform is best supported for English. Mandarin currently falls back to standard inference.
+
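The threshold check can be pictured as a simple lookup. The following is a minimal sketch, not the NeMo implementation: ``WORD_THRESHOLDS`` and ``needs_longform`` are hypothetical names, the language codes other than ``en`` and ``zh`` are assumed, words are counted by whitespace splitting, and treating a count equal to the threshold as longform is also an assumption.

```python
from typing import Dict

# Word thresholds copied from the table above. The two-letter codes other
# than "en" are assumptions made for this sketch.
WORD_THRESHOLDS: Dict[str, int] = {
    "en": 45,
    "es": 73,
    "fr": 69,
    "de": 50,
    "it": 53,
    "vi": 50,
}


def needs_longform(text: str, language: str) -> bool:
    """Return True when the text is long enough to warrant longform inference."""
    threshold = WORD_THRESHOLDS.get(language)
    if threshold is None:
        # Languages without a threshold (for example "zh") use standard inference.
        return False
    # Assumption: whitespace word count, and "at or above" the threshold triggers longform.
    return len(text.split()) >= threshold
```

With these assumptions, a 44-word English input stays on the standard path, while a 45-word input switches to longform.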
+
+Algorithm
+#########
+
+The longform inference algorithm processes long text through the following steps:
+
+
+Step 1: Sentence Splitting
+--------------------------
+
+The input text is split into individual sentences at punctuation markers (``.``, ``?``, ``!``, ``...``). To avoid breaking on abbreviations such as "Dr.", "Mr.", and "a.m.", the splitter checks whether each period is followed by a space.
+
+**Example:**
+
+::
+
+    Input:  "Dr. Smith arrived early. How are you today?"
+    Output: ["Dr. Smith arrived early.", "How are you today?"]
+
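To make the behaviour in the example concrete, here is a small illustrative splitter. It is a sketch under stated assumptions and is not ``split_by_sentence`` from NeMo: it uses an explicit, hypothetical abbreviation list (``ABBREVIATIONS``) and only considers punctuation at the end of whitespace-delimited tokens, so a period that is not followed by a space never causes a split.

```python
from typing import List

# Hypothetical abbreviation list for this sketch; not taken from NeMo.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "a.m.", "p.m."}


def split_sentences_sketch(text: str) -> List[str]:
    """Split text into sentences at '.', '?' or '!' that end a token."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "?", "!"))
        # Known abbreviations end with a period but do not close the sentence.
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that has no closing punctuation.
        sentences.append(" ".join(current))
    return sentences
```

A list-based approach like this cannot tell an abbreviation that also ends a sentence (for example a sentence ending in "a.m.") from one in the middle of a sentence, so it merges the two sentences in that case.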
+
+Step 2: State Initialization
+----------------------------
+
+A ``LongformChunkState`` object is created to track information across sentence chunks:
+
+- **History text tokens**: Text from previous chunks for context
+- **History encoder context**: Encoder outputs that provide continuity
+- **Attention tracking**: Monitors which positions have been attended to
+
+
+Step 3: Iterative Chunk Processing
+----------------------------------
+
+For each sentence chunk, the following sub-steps are performed:
+
+1. **Context Preparation**: Prepend history text and encoder context from previous chunks to maintain prosodic continuity.
+
+2. **Attention Prior Application**: Apply a learned attention prior that guides the model to attend to the correct text positions, preventing repetition or skipping.
+
+3. **Autoregressive Generation**: Generate audio codes token-by-token using the transformer decoder with temperature sampling.
+
+4. **State Update**: Update the chunk state with:
+
+   - New history text (last N tokens)
+   - New encoder context
+   - Updated attention tracking
+
+5. **Code Collection**: Store the generated audio codes for this chunk.
+
+
+Step 4: Code Concatenation
+--------------------------
+
+After all chunks are processed, concatenate the audio codes from each chunk along the time dimension into a single sequence.
+
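The control flow of Steps 2 to 4 can be sketched with plain Python lists. This is a toy illustration, not the Magpie-TTS code: ``ChunkStateSketch``, ``run_longform_sketch``, and ``fake_generate`` are hypothetical names, the state tracks history text tokens only, the attention prior and encoder context are omitted, and "last N tokens" is assumed to mean the tail of the history plus the current chunk.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChunkStateSketch:
    """Toy stand-in for the chunk state: tracks history text tokens only."""

    history_tokens: List[int] = field(default_factory=list)


def run_longform_sketch(
    chunks: List[List[int]],
    generate_chunk: Callable[[List[int], int], List[int]],
    history_len: int = 20,
) -> List[int]:
    """Process sentence chunks in order and concatenate their audio codes."""
    state = ChunkStateSketch()
    per_chunk_codes: List[List[int]] = []
    for tokens in chunks:
        # Context preparation: prepend history from previous chunks.
        num_history = len(state.history_tokens)
        context = state.history_tokens + tokens
        # Generation: the callable stands in for the autoregressive decoder.
        codes = generate_chunk(context, num_history)
        # State update: keep only the last `history_len` tokens as new history.
        state.history_tokens = context[-history_len:] if history_len > 0 else []
        # Code collection.
        per_chunk_codes.append(codes)
    # Code concatenation along the time dimension.
    return [code for codes in per_chunk_codes for code in codes]


def fake_generate(context: List[int], num_history: int) -> List[int]:
    """Stand-in decoder: emits one fake code per non-history token."""
    return [token * 10 for token in context[num_history:]]
```

For example, ``run_longform_sketch([[1, 2, 3], [4, 5]], fake_generate, history_len=2)`` processes the second chunk with the context ``[2, 3, 4, 5]`` and returns one concatenated code sequence.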
+
+Step 5: Audio Decoding
+----------------------
+
+Pass the concatenated codes through the neural audio codec decoder to produce the final waveform.
+
+
+Key Components
+--------------
+
+1. **Sentence Splitting** (``split_by_sentence``): Intelligently splits text on sentence boundaries while handling abbreviations (e.g., "Dr.", "Mr.").
+
+2. **Chunk State** (``LongformChunkState``): Maintains context across chunks:
+
+   - ``history_text``: Text tokens from previous chunks
+   - ``history_context_tensor``: Encoder outputs for continuity
+   - ``last_attended_timesteps``: Attention tracking for smooth transitions
+
+3. **Attention Prior**: Guides the model's attention to maintain proper alignment and prevent repetition/skipping.
+
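One way to picture an attention prior is as a window of weights placed around the last attended text position. The sketch below is an assumption-heavy illustration and not the Magpie-TTS implementation: ``build_prior_sketch`` is a hypothetical helper, and the mapping of weights to positions (the first weight one position before the last attended position, the rest moving forward) is assumed for the example.

```python
from typing import List, Sequence


def build_prior_sketch(
    text_len: int,
    last_attended: int,
    weights: Sequence[float],
    epsilon: float = 0.0,
) -> List[float]:
    """Return per-position prior weights over `text_len` text positions."""
    # Positions outside the window get a small floor value.
    prior = [epsilon] * text_len
    # Assumption: weights[0] applies one position before the last attended
    # position, weights[1] to that position, and the rest look ahead.
    for offset, weight in enumerate(weights, start=-1):
        position = last_attended + offset
        if 0 <= position < text_len:
            prior[position] = weight
    return prior
```

A prior of this shape favours the current position and its immediate successors, which is the intuition behind discouraging both repetition (jumping backward) and skipping (jumping far ahead).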
+
+Usage
+#####
+
+
+Method 1: Using ``do_tts`` (Recommended for Simple Use Cases)
+-------------------------------------------------------------
+
+The ``do_tts`` method automatically detects whether longform inference is needed:
+
+.. code-block:: python
+
+    import soundfile as sf
+    from nemo.collections.tts.models import MagpieTTSModel
+
+    # Load the model, switch to inference mode, and move it to the GPU
+    model = MagpieTTSModel.restore_from("path/to/magpietts.nemo")
+    model.eval()
+    model.cuda()
+
+    # Short text: standard inference is used automatically
+    short_audio, short_len = model.do_tts(
+        transcript="Hello, how are you?",
+        language="en",
+    )
+
+    # Long text: do_tts switches to longform inference automatically
+    long_text = """
+    The quick brown fox jumps over the lazy dog. This sentence contains every
+    letter of the alphabet. Sphinx of black quartz, judge my vow. Pack my box
+    with five dozen liquor jugs. How vexingly quick daft zebras jump. The five
+    boxing wizards jump quickly. Jackdaws love my big sphinx of quartz.
+    """
+
+    long_audio, long_len = model.do_tts(
+        transcript=long_text,
+        language="en",
+        apply_TN=True,  # Apply text normalization
+        temperature=0.7,
+        topk=80,
+        use_cfg=True,
+        cfg_scale=2.5,
+    )
+
+    # Save the audio at a 22050 Hz sample rate
+    sf.write("output.wav", long_audio[0].cpu().numpy(), 22050)
+
+
+Method 2: Using CLI (``magpietts_inference.py``)
+------------------------------------------------
+
+For batch inference from manifests:
+
+.. code-block:: bash
+
+    # Auto-detect longform based on text length (default)
+    python examples/tts/magpietts_inference.py \
+        --nemo_files /path/to/magpietts.nemo \
+        --datasets_json_path /path/to/evalset_config.json \
+        --out_dir /path/to/output \
+        --codecmodel_path /path/to/codec.nemo \
+        --longform_mode auto
+
+    # Force longform inference for all inputs
+    python examples/tts/magpietts_inference.py \
+        --nemo_files /path/to/magpietts.nemo \
+        --datasets_json_path /path/to/evalset_config.json \
+        --out_dir /path/to/output \
+        --codecmodel_path /path/to/codec.nemo \
+        --longform_mode always \
+        --longform_max_decoder_steps 50000
+
+**Longform CLI Options:**
+
+.. list-table::
+   :header-rows: 1
+   :widths: 25 15 60
+
+   * - Option
+     - Default
+     - Description
+   * - ``--longform_mode``
+     - ``auto``
+     - ``auto``: detect from text, ``always``: force longform, ``never``: disable
+
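The three modes can be read as a small decision rule. The sketch below is an illustration only: ``build_parser`` and ``should_use_longform`` are hypothetical and are not taken from ``magpietts_inference.py``. Only the flag name, its three values, and its default come from the table above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the longform mode flag."""
    parser = argparse.ArgumentParser(description="Longform mode sketch")
    parser.add_argument(
        "--longform_mode",
        choices=["auto", "always", "never"],
        default="auto",
        help="auto: detect from text, always: force longform, never: disable",
    )
    return parser


def should_use_longform(mode: str, exceeds_threshold: bool) -> bool:
    """Resolve the mode into a yes/no decision for one input."""
    if mode == "always":
        return True
    if mode == "never":
        return False
    # "auto": defer to the text-length check.
    return exceeds_threshold
```

For example, ``build_parser().parse_args(["--longform_mode", "never"])`` yields a mode for which ``should_use_longform`` returns ``False`` regardless of text length.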
+
+Configuration Dataclasses
+#########################
+
+
+``LongformConfig``
+------------------
+
+Immutable tuning parameters (set in model):
+
+.. code-block:: python
+
+    from dataclasses import dataclass
+    from typing import Tuple
+
+    @dataclass
+    class LongformConfig:
+        history_len_heuristic: int = 20      # Max history tokens retained
+        prior_weights_init: Tuple = (0.5, 1.0, 0.8, 0.2, 0.2)  # Initial attention weights
+        prior_weights: Tuple = (0.2, 1.0, 0.6, 0.4, 0.2, 0.2)  # Generation weights
+        finished_limit_with_eot: int = 5     # Steps after text end before EOS
+        short_sentence_threshold: int = 35   # Skip prior for short sentences
+        attention_sink_threshold: int = 10   # Attention sink detection
+
+
+``LongformChunkState``
+----------------------
+
+Mutable state passed between chunk iterations:
+
+.. code-block:: python
+
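+    from dataclasses import dataclass, field
+    from typing import Dict, List, Optional
+
+    import torch
+
+    @dataclass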
+    class LongformChunkState:
+        batch_size: int
+        history_text: Optional[torch.Tensor] = None       # (B, T)
+        history_text_lens: Optional[torch.Tensor] = None  # (B,)
+        history_context_tensor: Optional[torch.Tensor] = None  # (B, T, E)
+        end_indices: Dict[int, int] = field(default_factory=dict)
+        overall_idx: int = 0
+        left_offset: List[int] = field(default_factory=list)
+        last_attended_timesteps: List[List[int]] = field(default_factory=list)
+
+
+Best Practices
+##############
+
+1. **Use ``apply_TN=True``** for raw text to ensure proper normalization before synthesis.
+
+2. **Increase ``max_decoder_steps``** (``--longform_max_decoder_steps`` on the CLI) for very long texts. The default of 50000 is usually sufficient.
+
+3. **Use ``--longform_mode auto``** (the default) to let the system choose between standard and longform inference based on text length.
+
+4. **For non-English languages**, be aware that longform performance may vary. English is best supported.
+
+
+Limitations
+###########
+
+- **Mandarin (zh)**: Currently falls back to standard inference due to character-based tokenization complexities.
+- **Prosodic boundaries**: While the algorithm maintains continuity, natural paragraph breaks may not always be perfectly preserved in non-English languages.
+
+
+See Also
+########
+
+- :doc:`magpietts`: Main Magpie-TTS documentation
+- :doc:`magpietts-po`: Preference Optimization Guide
+