
[MagpieTTS] Magpietts longform unify #15477

Merged

subhankar-ghosh merged 34 commits into main from magpietts_longform_unify on Mar 11, 2026

Conversation

@subhankar-ghosh (Collaborator) commented Mar 9, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Add a one-line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

subhankar-ghosh and others added 29 commits February 2, 2026 09:51
For this PR it is important to merge main into this branch, as it is out of date.
Added regex to remove spaces in Japanese transcripts as a workaround for a bug in the Ja normalizer.

@github-actions (Contributor)

[🤖]: Hi @subhankar-ghosh 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

@subhankar-ghosh subhankar-ghosh enabled auto-merge (squash) March 10, 2026 16:50
@rfejgin (Collaborator) left a comment

Overall looks good but please see comments.
Also, I earlier confirmed that CERs for the frame stacking model look good after this change (same as pre-unification), so it looks like this fixes the issue we observed.


predicted_codes = torch.cat(state.all_predictions, dim=-1) # (B, C, F*T_steps)
num_steps = len(state.all_predictions)
default_frame_len = num_steps * self.frame_stacking_factor
rfejgin (Collaborator):

Could you add back the comment that was here originally, I think it got lost:
# Concatenate the list of predictions along the time dimension. Note that when frame stacking is on, this also undoes the stacking.

subhankar-ghosh (Author):

Added
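The frame-stacking behavior discussed above can be sketched as follows. This is a hypothetical illustration, not the PR's actual tensors: it assumes each decoding step emits `frame_stacking_factor` frames as a `(B, C, frame_stacking_factor)` prediction, so concatenating along the last dimension both joins the steps and undoes the stacking in one call.

```python
import torch

# Illustrative sizes (assumed, not from the PR).
B, C, frame_stacking_factor, num_steps = 2, 4, 2, 3

# One prediction per decoding step, shape (B, C, frame_stacking_factor).
all_predictions = [
    torch.randint(0, 1024, (B, C, frame_stacking_factor)) for _ in range(num_steps)
]

# Concatenate the list of predictions along the time dimension. When frame
# stacking is on, this also undoes the stacking: the result covers
# num_steps * frame_stacking_factor frames.
predicted_codes = torch.cat(all_predictions, dim=-1)
default_frame_len = num_steps * frame_stacking_factor
```

With `frame_stacking_factor = 1` the same call reduces to plain step-by-step concatenation, which is why a single code path can serve both modes.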

finished_texts_counter={},
attn_prior=initial_attn_prior,
)
chunk_end_frame_lens: Dict[int, int] = {}
rfejgin (Collaborator):

Could you add a comment saying what this tracks and whether it persists across chunked calls to generate_speech()? It appears that this one keeps state locally, unlike chunk_state, which is persistent between calls; that distinction isn't clear from the naming.

Side note, maybe we could find a better name than chunk_state since that structure appears not to be associated with a particular chunk but rather tracks overall inference state (I think). E.g. could call it inference_state or chunked_inference_state (the latter is admittedly kind of verbose).

subhankar-ghosh (Author):

Added. It does not maintain state across generate_speech calls. It is local only.
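The local-vs-persistent split being discussed could be sketched like this. All names here are illustrative, not the PR's exact API: the sketch assumes a `ChunkedInferenceState` object that persists across `generate_speech()` calls, while `chunk_end_frame_lens` is rebuilt locally on every call.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ChunkedInferenceState:
    """Persists across generate_speech() calls (hypothetical stand-in for chunk_state)."""
    finished_texts_counter: Dict[int, int] = field(default_factory=dict)
    history: List[int] = field(default_factory=list)

def generate_speech(state: ChunkedInferenceState, num_chunks: int) -> Dict[int, int]:
    # Local only: re-created on every call, never stored on `state`.
    chunk_end_frame_lens: Dict[int, int] = {}
    for chunk_idx in range(num_chunks):
        frame_len = (chunk_idx + 1) * 10  # placeholder for real decoding
        chunk_end_frame_lens[chunk_idx] = frame_len
        state.history.append(frame_len)   # persistent accumulation across calls
    return chunk_end_frame_lens
```

Under this sketch, calling `generate_speech` twice on the same state object starts `chunk_end_frame_lens` fresh each time while `state.history` keeps growing, which matches the "local only" answer above.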

Args:
chunk_state: Mutable state object tracking history across chunks.
audio_codes_next: Sampled audio codes. Shape: (B, num_codebooks).
audio_codes_next: Sampled audio codes. Shape: (B, num_codebooks) or (B, num_codebooks, frame_stacking_factor).
rfejgin (Collaborator):

Isn't it always 3-dimensional, with frame_stacking_factor being 1 if there's no frame stacking?
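The reviewer's point can be checked with a small shape experiment. This is an assumed illustration, not the PR's code: it treats the sampled codes as always 3-dimensional, with `frame_stacking_factor == 1` covering the no-stacking case.

```python
import torch

B, num_codebooks = 2, 8

# Whether or not frame stacking is active, the tensor stays 3-D; only the
# size of the last dimension changes.
for frame_stacking_factor in (1, 4):
    audio_codes_next = torch.zeros(B, num_codebooks, frame_stacking_factor, dtype=torch.long)
    assert audio_codes_next.dim() == 3
```

If that holds, the docstring could document the single shape `(B, num_codebooks, frame_stacking_factor)` rather than listing two alternatives.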

subhankar-ghosh and others added 3 commits March 10, 2026 23:46
@github-actions (Contributor)

[🤖]: Hi @subhankar-ghosh 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

@github-actions github-actions bot removed the Run CICD label Mar 11, 2026
@rfejgin (Collaborator) left a comment

Looks good to me.

@subhankar-ghosh subhankar-ghosh merged commit 80062ee into main Mar 11, 2026
131 checks passed
@subhankar-ghosh subhankar-ghosh deleted the magpietts_longform_unify branch March 11, 2026 18:11
nune-tadevosyan pushed a commit to nune-tadevosyan/NeMo that referenced this pull request Mar 13, 2026
* Refactor audio processing to include frame lengths for Framestacking

Signed-off-by: Subhankar Ghosh <[email protected]>

* Adding back Fix Japanese transcript normalization issue

Added regex to remove spaces in Japanese transcripts as a workaround for a bug in the Ja normalizer.

Signed-off-by: Subhankar Ghosh <[email protected]>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <[email protected]>

---------

Signed-off-by: Subhankar Ghosh <[email protected]>
Signed-off-by: subhankar-ghosh <[email protected]>
Co-authored-by: subhankar-ghosh <[email protected]>

4 participants