
Add Parakeet Hybrid RNNT CTC BPE Model with Prompt support#14561

Merged
ko3n1g merged 59 commits into NVIDIA-NeMo:main from ealbasiri:hybrid-parakeet-tgt-lang-apr30
Oct 18, 2025

Conversation

@ealbasiri
Contributor

Note: This is a reopened version of #13360 with all reviewer feedback addressed:

  • Fixed use_cer configuration issue (changed from wer.use_cer to use_cer)
  • Added comprehensive prompt model documentation to docs/
  • Rebased with latest main branch

What does this PR do?

This PR adds support for Hybrid RNNT-CTC BPE Model with Prompt Feature (EncDecHybridRNNTCTCBPEModelWithPrompt), enabling flexible ASR and AST tasks through prompt-based conditioning.

Key Features

  • Architecture: Hybrid RNNT-CTC model with prompt vector conditioning
    • Prompt vector (one-hot encoded) is concatenated to ASR embeddings from FastConformer
    • Concatenated vector is fed into decoder for prompt-aware processing
  • Tasks: supports both ASR (speech recognition) and AST (speech translation)
  • Inference Modes: Supports both buffered streaming and offline inference
  • Scalable Design: Can support multilingual ASR and AST tasks
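The concatenation step described above can be sketched as follows. This is a minimal illustration, not the NeMo implementation; the tensor dimensions, the prompt ids, and the projection layer sizes are all assumptions:

```python
import torch

# Assumed dimensions, for illustration only
batch, time, enc_hidden, num_prompts = 2, 50, 512, 4

enc_out = torch.randn(batch, time, enc_hidden)         # FastConformer encoder output
prompt = torch.nn.functional.one_hot(
    torch.tensor([0, 2]), num_classes=num_prompts       # one prompt id per utterance
).float()                                               # (batch, num_prompts)

# Repeat the one-hot prompt over the time axis and concatenate with the encoder frames
prompt_rep = prompt.unsqueeze(1).expand(-1, time, -1)   # (batch, time, num_prompts)
conditioned = torch.cat([enc_out, prompt_rep], dim=-1)  # (batch, time, enc_hidden + num_prompts)

# A projection (in the spirit of the model's prompt_kernel MLP) maps back to enc_hidden
proj = torch.nn.Sequential(
    torch.nn.Linear(enc_hidden + num_prompts, enc_hidden * 2),
    torch.nn.ReLU(),
    torch.nn.Linear(enc_hidden * 2, enc_hidden),
)
out = proj(conditioned)  # (batch, time, enc_hidden)
```

The decoder then consumes the prompt-aware frames in place of the plain encoder output.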

Prompt-Based Conditioning

  • Target language prompt: Required input that conditions the model behavior
  • Source language detection: Not required; the model handles the source language automatically
  • Task determination:
    • Same source/target → ASR (transcription)
    • Different source/target → AST (translation)
  • Supported languages: multilingual ASR/AST
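The target-language prompt can be thought of as a one-hot vector over the supported target languages. The mapping below is hypothetical; the actual language inventory and its ordering are defined by the trained model:

```python
import torch

# Hypothetical target-language inventory; the real model defines its own
LANGS = ["en", "de", "es", "fr"]

def make_prompt(target_language: str) -> torch.Tensor:
    """One-hot prompt vector selecting the target language."""
    idx = LANGS.index(target_language)
    return torch.nn.functional.one_hot(
        torch.tensor(idx), num_classes=len(LANGS)
    ).float()

# Same source/target -> ASR; different source/target -> AST, as described above
prompt_asr = make_prompt("en")  # e.g. English audio transcribed to English
prompt_ast = make_prompt("de")  # e.g. English audio translated to German
```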

Benefits

  • Single model deployment: One model handles multiple languages and tasks
  • Prompt-driven flexibility: No need for separate models per language pair
  • Word-level timestamps: Applicable for aligned dataset generation
  • Streaming inference: Buffered streaming with prompt conditioning

Collection: ASR

Changelog

  • Add EncDecHybridRNNTCTCBPEModelWithPrompt model with prompt conditioning
  • Add prompt vector support in Lhotse dataloader
  • Add offline and streaming inference
  • Add documentation for prompt-based model usage
  • Add unit tests for prompt model

Usage

Training

python examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe_prompt.py \
  --config-path=/config/ \
  --config-name=fastconformer_hybrid_transducer_ctc_bpe_prompt.yaml

Offline Inference

python examples/asr/transcribe_speech.py \
  model_path="/path/to/model.nemo" \
  dataset_manifest="/path/to/manifest.json" \
  output_filename="/path/to/output.json" \
  batch_size=32 \
  cuda=0 \
  amp=True \
  decoder_type=rnnt

Model Usage

from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModelWithPrompt

model = EncDecHybridRNNTCTCBPEModelWithPrompt.restore_from("path/to/model.nemo")
# Prompt-based transcription/translation
results = model.transcribe(audio_files, target_language="en")

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • [N/A] Does the PR affect components that are optional to install?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Who can review?

@nithinraok @anhnami
Anyone in the community is free to review the PR once the checks have passed.

Additional Information

This model enables efficient multilingual ASR/AST through prompt conditioning, eliminating the need for multiple specialized models while maintaining high performance across languages and tasks.

@github-actions github-actions bot added the ASR label Aug 22, 2025
proj_out_size = self._cfg.model_defaults.enc_hidden

self.prompt_kernel = torch.nn.Sequential(
    torch.nn.Linear(proj_in_size, proj_out_size * 2),

@anhnguyen-namitech anhnguyen-namitech Aug 28, 2025


Correct me if I've misunderstood your implementation.

Instead of constructing the prompt vector in the dataloader and adding a bunch of extra arguments, you may want to split the first layer of this MLP into enc_hidden_prj and prompt_linear_prj. Say the prompt does not change within a sequence (no code-switching); then the computation equivalent to your current implementation is:

linear(relu(enc_hidden_prj(enc) + prompt_linear_prj(prompt_vector)))

where prompt_linear_prj(prompt_vector) does not need to be repeated over the time-frame dimension, thanks to broadcasting. If prompt_vector is one-hot, prompt_linear_prj(prompt_vector) is simply a row of the weight matrix; in other words, this is FiLM conditioning with a unit scale vector and a task-dependent shift vector. You can activate multiple tasks at the same time if the prompt vector is two-hot or similar (e.g., X->En and PnC). You may get slightly better performance with full FiLM conditioning and a skip connection, like the below:

linear(relu(enc_hidden_prj(enc) * (1 + prompt_scale_prj(prompt_vector)) + prompt_shift_prj(prompt_vector)))
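The FiLM-style conditioning suggested here can be sketched as a small module. This is a sketch of the reviewer's suggestion, not code from the PR; the module and dimension names are assumptions:

```python
import torch
import torch.nn as nn

class FiLMPromptConditioning(nn.Module):
    """FiLM conditioning: scale and shift encoder features with prompt-derived vectors."""

    def __init__(self, enc_hidden: int, prompt_dim: int):
        super().__init__()
        self.enc_hidden_prj = nn.Linear(enc_hidden, enc_hidden)
        self.prompt_scale_prj = nn.Linear(prompt_dim, enc_hidden, bias=False)
        self.prompt_shift_prj = nn.Linear(prompt_dim, enc_hidden, bias=False)
        self.out = nn.Linear(enc_hidden, enc_hidden)

    def forward(self, enc: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # enc: (batch, time, enc_hidden); prompt: (batch, prompt_dim)
        scale = 1 + self.prompt_scale_prj(prompt).unsqueeze(1)  # broadcast over time
        shift = self.prompt_shift_prj(prompt).unsqueeze(1)
        return self.out(torch.relu(self.enc_hidden_prj(enc) * scale + shift))

film = FiLMPromptConditioning(enc_hidden=512, prompt_dim=4)
enc = torch.randn(2, 50, 512)
prompt = torch.eye(4)[:2]   # two one-hot prompts; a multi-hot vector also works
out = film(enc, prompt)     # (2, 50, 512)
```

Because the prompt projections broadcast over the time axis, the prompt never needs to be tiled per frame as in the concatenation approach.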

Contributor Author


Very good suggestion. I'm actually now working on testing code-switching within a single utterance. I'm thinking of making this type of code-switching language-agnostic and not passing any lang id. I will try your suggestion.

Member

@nithinraok nithinraok left a comment


LGTM. Resolve the CI-CD issues, then it's good to merge.

Enas Albasiri and others added 16 commits September 8, 2025 20:52
Signed-off-by: Enas Albasiri <[email protected]>
python -c "from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModelWithPrompt" && \
NEMO_NUMBA_MINVER=0.53 CUDA_VISIBLE_DEVICES=0 \
coverage run -a --data-file=/workspace/.coverage --source=/workspace/ \
-m pytest tests/collections/asr/test_asr_hybrid_rnnt_ctc_model_bpe_prompt.py \
Member


I just noticed: this is not how the CI-CD runs should be set up. See how an example script was run to check for this model here: https://github.com/ealbasiri/NeMo/blob/hybrid-parakeet-tgt-lang-apr30/tests/functional_tests/ASR_dev_run_Speech_to_Text_WPE_-_Conformer.sh

ko3n1g
ko3n1g previously approved these changes Oct 15, 2025
Member

@nithinraok nithinraok left a comment


Great work. LGTM, thanks!

@ko3n1g ko3n1g merged commit 39bd67a into NVIDIA-NeMo:main Oct 18, 2025
167 of 169 checks passed