
Add Parakeet Hybrid RNNT CTC BPE Model with Prompt support#14561

Merged
ko3n1g merged 59 commits into NVIDIA-NeMo:main from ealbasiri:hybrid-parakeet-tgt-lang-apr30
Oct 18, 2025

Conversation

@ealbasiri
Contributor

Note: This is a reopened version of #13360 with all reviewer feedback addressed:

  • Fixed use_cer configuration issue (changed from wer.use_cer to use_cer)
  • Added comprehensive prompt model documentation to docs/
  • Rebased with latest main branch

What does this PR do?

This PR adds support for Hybrid RNNT-CTC BPE Model with Prompt Feature (EncDecHybridRNNTCTCBPEModelWithPrompt), enabling flexible ASR and AST tasks through prompt-based conditioning.

Key Features

  • Architecture: Hybrid RNNT-CTC model with prompt vector conditioning
    • Prompt vector (one-hot encoded) is concatenated to ASR embeddings from FastConformer
    • Concatenated vector is fed into decoder for prompt-aware processing
  • Tasks: supports both ASR (speech recognition) and AST (speech translation)
  • Inference Modes: Supports both buffered streaming and offline inference
  • Scalable Design: Can support multilingual ASR and AST tasks
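The concatenation step described above can be sketched as follows. This is a minimal illustration, not the NeMo implementation; the tensor dimensions, the prompt ids, and the projection layer sizes are all assumptions:

```python
import torch

# Assumed dimensions, for illustration only
batch, time, enc_hidden, num_prompts = 2, 50, 512, 4

enc_out = torch.randn(batch, time, enc_hidden)         # FastConformer encoder output
prompt = torch.nn.functional.one_hot(
    torch.tensor([0, 2]), num_classes=num_prompts       # one prompt id per utterance
).float()                                               # (batch, num_prompts)

# Repeat the one-hot prompt over the time axis and concatenate with the encoder frames
prompt_rep = prompt.unsqueeze(1).expand(-1, time, -1)   # (batch, time, num_prompts)
conditioned = torch.cat([enc_out, prompt_rep], dim=-1)  # (batch, time, enc_hidden + num_prompts)

# A projection (in the spirit of the model's prompt_kernel MLP) maps back to enc_hidden
proj = torch.nn.Sequential(
    torch.nn.Linear(enc_hidden + num_prompts, enc_hidden * 2),
    torch.nn.ReLU(),
    torch.nn.Linear(enc_hidden * 2, enc_hidden),
)
out = proj(conditioned)  # (batch, time, enc_hidden)
```

The decoder then consumes the prompt-aware frames in place of the plain encoder output.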

Prompt-Based Conditioning

  • Target language prompt: Required input that conditions the model behavior
  • Source language detection: Not required; the model handles the source language automatically
  • Task determination:
    • Same source/target → ASR (transcription)
    • Different source/target → AST (translation)
  • Supported languages: multilingual ASR/AST
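The target-language prompt can be thought of as a one-hot vector over the supported target languages. The mapping below is hypothetical; the actual language inventory and its ordering are defined by the trained model:

```python
import torch

# Hypothetical target-language inventory; the real model defines its own
LANGS = ["en", "de", "es", "fr"]

def make_prompt(target_language: str) -> torch.Tensor:
    """One-hot prompt vector selecting the target language."""
    idx = LANGS.index(target_language)
    return torch.nn.functional.one_hot(
        torch.tensor(idx), num_classes=len(LANGS)
    ).float()

# Same source/target -> ASR; different source/target -> AST, as described above
prompt_asr = make_prompt("en")  # e.g. English audio transcribed to English
prompt_ast = make_prompt("de")  # e.g. English audio translated to German
```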

Benefits

  • Single model deployment: One model handles multiple languages and tasks
  • Prompt-driven flexibility: No need for separate models per language pair
  • Word-level timestamps: Applicable for aligned dataset generation
  • Streaming inference: Buffered streaming with prompt conditioning

Collection: ASR

Changelog

  • Add EncDecHybridRNNTCTCBPEModelWithPrompt model with prompt conditioning
  • Add prompt vector support in Lhotse dataloader
  • Add offline and streaming inference
  • Add documentation for prompt-based model usage
  • Add unit tests for prompt model

Usage

Training

python examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe_prompt.py \
  --config-path=/config/ \
  --config-name=fastconformer_hybrid_transducer_ctc_bpe_prompt.yaml

Offline Inference

python examples/asr/transcribe_speech.py \
  model_path="/path/to/model.nemo" \
  dataset_manifest="/path/to/manifest.json" \
  output_filename="/path/to/output.json" \
  batch_size=32 \
  cuda=0 \
  amp=True \
  decoder_type=rnnt

Model Usage

from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModelWithPrompt

model = EncDecHybridRNNTCTCBPEModelWithPrompt.restore_from("path/to/model.nemo")
# Prompt-based transcription/translation
results = model.transcribe(audio_files, target_language="en")

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • [N/A] Does the PR affect components that are optional to install?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Who can review?

@nithinraok @anhnami
Anyone in the community is free to review the PR once the checks have passed.

Additional Information

This model enables efficient multilingual ASR/AST through prompt conditioning, eliminating the need for multiple specialized models while maintaining high performance across languages and tasks.

@github-actions github-actions bot added the ASR label Aug 22, 2025
proj_out_size = self._cfg.model_defaults.enc_hidden

self.prompt_kernel = torch.nn.Sequential(
    torch.nn.Linear(proj_in_size, proj_out_size * 2),

@anhnguyen-namitech anhnguyen-namitech Aug 28, 2025


Correct me if I've misunderstood your implementation.

Instead of constructing the prompt vector in the dataloader and adding a bunch of extra arguments, you may want to split the first layer of this MLP into enc_hidden_prj and prompt_linear_prj. Say the prompt does not change within a sequence (no code-switching); then the computation equivalent to your current implementation is:

linear(relu(enc_hidden_prj(enc) + prompt_linear_prj(prompt_vector)))

where prompt_linear_prj(prompt_vector) does not need to be repeated over the time-frame dimension, thanks to broadcasting. If prompt_vector is one-hot, prompt_linear_prj(prompt_vector) is simply a row of the weight matrix; in other words, this is FiLM conditioning with a unit scale vector and a task-dependent shift vector. You can activate multiple tasks at the same time if the prompt vector is two-hot or similar (e.g., X->En and PnC). You may get slightly better performance with full FiLM conditioning and a skip connection, like the below:

linear(relu(enc_hidden_prj(enc) * (1 + prompt_scale_prj(prompt_vector)) + prompt_shift_prj(prompt_vector)))
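The FiLM-style conditioning suggested here can be sketched as a small module. This is a sketch of the reviewer's suggestion, not code from the PR; the module and dimension names are assumptions:

```python
import torch
import torch.nn as nn

class FiLMPromptConditioning(nn.Module):
    """FiLM conditioning: scale and shift encoder features with prompt-derived vectors."""

    def __init__(self, enc_hidden: int, prompt_dim: int):
        super().__init__()
        self.enc_hidden_prj = nn.Linear(enc_hidden, enc_hidden)
        self.prompt_scale_prj = nn.Linear(prompt_dim, enc_hidden, bias=False)
        self.prompt_shift_prj = nn.Linear(prompt_dim, enc_hidden, bias=False)
        self.out = nn.Linear(enc_hidden, enc_hidden)

    def forward(self, enc: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # enc: (batch, time, enc_hidden); prompt: (batch, prompt_dim)
        scale = 1 + self.prompt_scale_prj(prompt).unsqueeze(1)  # broadcast over time
        shift = self.prompt_shift_prj(prompt).unsqueeze(1)
        return self.out(torch.relu(self.enc_hidden_prj(enc) * scale + shift))

film = FiLMPromptConditioning(enc_hidden=512, prompt_dim=4)
enc = torch.randn(2, 50, 512)
prompt = torch.eye(4)[:2]   # two one-hot prompts; a multi-hot vector also works
out = film(enc, prompt)     # (2, 50, 512)
```

Because the prompt projections broadcast over the time axis, the prompt never needs to be tiled per frame as in the concatenation approach.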

Contributor Author


Very good suggestion. I'm actually now working on testing code-switching within a single utterance. I'm thinking of making this type of code-switching language-agnostic and not passing any lang id. I will try your suggestion.

Member

@nithinraok nithinraok left a comment


LGTM. Resolve the CI-CD issues, then it's good to merge.

Enas Albasiri and others added 16 commits September 8, 2025 20:52
Signed-off-by: Enas Albasiri <[email protected]>
python -c "from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModelWithPrompt" && \
NEMO_NUMBA_MINVER=0.53 CUDA_VISIBLE_DEVICES=0 \
coverage run -a --data-file=/workspace/.coverage --source=/workspace/ \
-m pytest tests/collections/asr/test_asr_hybrid_rnnt_ctc_model_bpe_prompt.py \
Member


I just noticed: this is not how the CI-CD runs should be set up. See how an example script was run to check for this model here: https://github.com/ealbasiri/NeMo/blob/hybrid-parakeet-tgt-lang-apr30/tests/functional_tests/ASR_dev_run_Speech_to_Text_WPE_-_Conformer.sh

ko3n1g
ko3n1g previously approved these changes Oct 15, 2025
Member

@nithinraok nithinraok left a comment


Great work. LGTM, thanks!

@ko3n1g ko3n1g merged commit 39bd67a into NVIDIA-NeMo:main Oct 18, 2025
167 of 169 checks passed