Merged
Changes from 40 commits
52 commits
12f72f2
update streaming ASR
stevehuang52 Jul 23, 2025
e4f5663
add voice agent
stevehuang52 Jul 23, 2025
fda5450
update readme
stevehuang52 Jul 23, 2025
f843762
update websocket
stevehuang52 Jul 24, 2025
16a27ba
update
stevehuang52 Jul 24, 2025
94b43bc
update
stevehuang52 Jul 24, 2025
6ff5302
update readme
stevehuang52 Jul 24, 2025
b45cb1a
update
stevehuang52 Jul 24, 2025
6c21c77
clean up
stevehuang52 Jul 24, 2025
6118ac9
clean up
stevehuang52 Jul 24, 2025
7e1e62a
fix typo
stevehuang52 Jul 24, 2025
ab723a0
fix codeQL
stevehuang52 Jul 24, 2025
af9b523
update cfg
stevehuang52 Jul 25, 2025
08ba934
remove unused
stevehuang52 Jul 25, 2025
d23e113
update readme
stevehuang52 Jul 25, 2025
de7cacc
change default models
stevehuang52 Jul 25, 2025
3711938
fix diar diable
stevehuang52 Jul 25, 2025
9200a52
fix diar diable
stevehuang52 Jul 25, 2025
af315bd
update ux
stevehuang52 Jul 26, 2025
c22f41e
update tts
stevehuang52 Jul 26, 2025
d2451b4
update readme
stevehuang52 Jul 26, 2025
ff705be
fix and update
stevehuang52 Jul 28, 2025
79f7996
fix asr
stevehuang52 Jul 28, 2025
04e6bb6
update readmme
stevehuang52 Jul 28, 2025
a0df9f3
update doc and llm dtype
stevehuang52 Jul 29, 2025
5c51c29
refactor and add example prompts
stevehuang52 Jul 29, 2025
025244d
update readme
stevehuang52 Jul 29, 2025
216278a
update readme
stevehuang52 Jul 29, 2025
72d0c67
clean up
stevehuang52 Jul 29, 2025
149487e
clean up
stevehuang52 Aug 1, 2025
be178a6
Merge branch 'main' into heh/nemo_voice
stevehuang52 Aug 4, 2025
9c14702
update info on streaming sortformer
stevehuang52 Aug 19, 2025
de1b138
move code to 'nemo/agents/voice_agent'
stevehuang52 Aug 20, 2025
4997feb
update doc
stevehuang52 Aug 21, 2025
f31bd13
clean up
stevehuang52 Aug 25, 2025
5cc5e26
refactor
stevehuang52 Sep 2, 2025
f56f9fc
Merge branch 'main' into heh/nemo_voice
tango4j Sep 3, 2025
55f8191
Merge branch 'main' into heh/nemo_voice
KunalDhawan Sep 4, 2025
b22a150
update doc
stevehuang52 Sep 4, 2025
6c1019a
Merge branch 'heh/nemo_voice' of https://github.com/NVIDIA/NeMo into …
stevehuang52 Sep 4, 2025
98006c0
remove the unnecessary streaming state conversion and import it from …
weiqingw4ng Sep 4, 2025
e5665b0
Apply isort and black reformatting
weiqingw4ng Sep 4, 2025
e23935d
update doc
stevehuang52 Sep 4, 2025
9e9ed7d
Merge branch 'heh/nemo_voice' of https://github.com/NVIDIA/NeMo into …
stevehuang52 Sep 4, 2025
f752abc
clean up
stevehuang52 Sep 4, 2025
1ffddb2
fix for llama-nemotron template, and refactor
stevehuang52 Sep 4, 2025
500b396
fix tts separator
stevehuang52 Sep 4, 2025
a59733c
fix for llama-nemotron
stevehuang52 Sep 5, 2025
90cbfc1
update cfg
stevehuang52 Sep 5, 2025
30a55bc
refactor and update doc
stevehuang52 Sep 5, 2025
de148ae
change default llm to qwen
stevehuang52 Sep 5, 2025
f3572f7
update doc
stevehuang52 Sep 5, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -181,3 +181,6 @@ examples/neural_graphs/*.yml
nemo_experiments/

slurm*.out

node_modules/
.vite/
175 changes: 175 additions & 0 deletions examples/voice_agent/README.md
@@ -0,0 +1,175 @@
# NeMo Voice Agent

A [Pipecat](https://github.com/pipecat-ai/pipecat) example demonstrating the simplest way to create a voice agent using NVIDIA NeMo STT/TTS services and a HuggingFace LLM. Everything is open-source and deployed locally, so you can have your own voice agent. Feel free to explore the code and see how different speech technologies can be integrated with LLMs to create a seamless conversation experience. For now, only English input and output are supported; more languages will be supported in the future.



## ✨ Key Features

- Open-source, local deployment, and flexible customization.
- Allow users to talk to most LLMs from HuggingFace with configurable prompts.
- Streaming speech recognition with low latency.
- FastPitch-HiFiGAN TTS for fast audio response generation.
- Speaker diarization for up to 4 speakers across different turns.
- WebSocket server for easy deployment.


## 💡 Coming Next
- More accurate and noise-robust streaming ASR models.
- Faster EOU detection and backchannel handling (e.g., bot will not stop speaking when user is saying something like "uhuh", "wow", "i see").
- Better streaming ASR and speaker diarization pipeline.
- Better TTS model with more natural voice.
- Joint ASR and speaker diarization model.
- Function calling, RAG, etc.



## 🚀 Quick Start

### Hardware requirements

- A computer with at least one GPU. At least 18GB VRAM is recommended for using 8B LLMs, and 10GB VRAM for 4B LLMs.
- A microphone connected to the computer.
- A speaker connected to the computer.
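
As a rough sanity check on the VRAM figures above, the sketch below estimates memory from the parameter count. It assumes 16-bit weights (2 bytes per parameter) plus about 2 GB for the speech models and runtime overhead. Both numbers are assumptions chosen for illustration, not values stated in this README, and `estimate_vram_gb` is a hypothetical helper.

```python
def estimate_vram_gb(
    params_billions: float,
    bytes_per_param: float = 2.0,  # assumption: 16-bit (bf16/fp16) weights
    overhead_gb: float = 2.0,      # assumption: ASR, TTS, diarization, activations
) -> float:
    """Very rough VRAM estimate in GB for an LLM with the given size."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    weights_gb = params_billions * bytes_per_param
    return weights_gb + overhead_gb


for size in (4, 8):
    print(f"{size}B LLM: ~{estimate_vram_gb(size):.0f} GB VRAM")
```

Actual usage depends on the dtype, context length, and which components share the GPU, so treat the result as a lower bound to verify on your own hardware.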

### Install dependencies

First, install or update npm and Node.js to the latest version, for example:

```bash
sudo apt-get update
sudo apt-get install -y npm nodejs
```

or:

```bash
curl -fsSL https://fnm.vercel.app/install | bash
. ~/.bashrc
fnm use --install-if-missing 20
```

Second, create a new conda environment with the dependencies:

```bash
conda env create -f environment.yml
```

Then you can activate the environment via `conda activate nemo-voice`.

Alternatively, you can install the dependencies manually in an existing environment via:
```bash
pip install -r requirements.txt
```
The incompatibility errors from pip can be ignored.

### Configure the server

Edit the `server/server_config.yaml` file to configure the server, for example:
- Change the LLM and the prompt you want to use, by providing either a local path to a text file or the whole prompt string. See `example_prompts/` for examples to start with.
- Configure the LLM parameters, such as temperature, max tokens, etc.
- Distribute different components to different GPUs if you have more than one.
- Adjust VAD parameters for sensitivity and end-of-turn detection timeout.
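
The settings named in this README are written as dotted paths (for example `vad.stop_secs` or `diar.enabled`). As a sketch of how such a path maps onto a nested config, the following standard-library Python applies dotted overrides to a dictionary. The dictionary layout, the values, and the `set_by_dotted_key` helper are assumptions for illustration; check `server/server_config.yaml` for the real structure.

```python
from copy import deepcopy
from typing import Any


def set_by_dotted_key(config: dict, dotted_key: str, value: Any) -> dict:
    """Return a copy of `config` with the nested `dotted_key` set to `value`."""
    updated = deepcopy(config)  # leave the caller's config untouched
    node = updated
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(name, {})
    node[leaf] = value
    return updated


# Hypothetical layout: only the dotted names come from this README.
example_config = {
    "llm": {"model": "Qwen/Qwen2.5-7B-Instruct", "generation_kwargs": {"temperature": 0.7}},
    "vad": {"stop_secs": 0.8},
    "diar": {"enabled": True},
}

longer_pause = set_by_dotted_key(example_config, "vad.stop_secs", 1.2)
no_diar = set_by_dotted_key(longer_pause, "diar.enabled", False)
```

Returning a copy keeps several config variants side by side, which is convenient when comparing settings across runs.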

**If you want to access the server from a different machine, you need to change the `baseUrl` in `client/src/app.ts` to the actual IP address of the server machine.**



### Start the server

Open a terminal and run the server via:

```bash
NEMO_PATH=??? # Use your local NeMo path for the latest version
export PYTHONPATH=$NEMO_PATH:$PYTHONPATH

# export HF_TOKEN="hf_..." # Use your own HuggingFace API token if needed, as some models may require.
# export HF_HUB_CACHE="/path/to/your/huggingface/cache" # change where HF cache is stored if you don't want to use the default cache
# export SERVER_CONFIG_PATH="/path/to/your/server_config.yaml" # change where the server config is stored if you have a couple of different configs
python ./server/server.py
```
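
The optional `SERVER_CONFIG_PATH` variable above selects which config file is used. A minimal sketch of that kind of lookup is shown below. It is not taken from `server/server.py`: the default path and the `resolve_config_path` helper are assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Assumed default; the real default used by the server may differ.
DEFAULT_CONFIG_PATH = Path("server") / "server_config.yaml"


def resolve_config_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the config file: the environment override if set, else the default."""
    env = os.environ if env is None else env
    override = env.get("SERVER_CONFIG_PATH", "").strip()
    return Path(override) if override else DEFAULT_CONFIG_PATH


print(resolve_config_path({"SERVER_CONFIG_PATH": "/tmp/alt_config.yaml"}))
```

Passing the environment as an argument makes the lookup easy to test without changing the real process environment.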

### Launch the client
In another terminal on the server machine, start the client via:

```bash
cd client
npm install
npm run dev
```

A message in the terminal should show the address and port of the client.

### Connect to the client via browser

Open the client via browser: `http://[YOUR MACHINE IP ADDRESS]:5173/` (or whatever address and port is shown in the terminal where the client was launched).

You can mute/unmute your microphone via the "Mute" button, and reset the LLM context history and speaker cache by clicking the "Reset" button.

**If you are using the Chrome browser, you need to add `http://[YOUR MACHINE IP ADDRESS]:5173/` to the allow list via `chrome://flags/#unsafely-treat-insecure-origin-as-secure`.**

If you want to use a different port for client connection, you can modify `client/vite.config.js` to change the `port` variable.

## 📑 Supported Models

### 🤖 LLM

Most LLMs from HuggingFace are supported. A few examples are:
- [nvidia/Llama-3.1-Nemotron-Nano-8B-v1](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1) (default)
- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- [nvidia/Nemotron-Mini-4B-Instruct](https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct)

Please refer to the HuggingFace webpage of each model to configure the model parameters `llm.generation_kwargs` and `llm.apply_chat_template_kwargs` in `server/server_config.yaml` as needed.

### 🎤 ASR

We use [cache-aware streaming FastConformer](https://arxiv.org/abs/2312.17279) to transcribe the user's speech into text. While new models will be released soon, we use the existing English models for now:
- [stt_en_fastconformer_hybrid_large_streaming_80ms](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_streaming_80ms) (default)
- [nvidia/stt_en_fastconformer_hybrid_large_streaming_multi](https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_streaming_multi)

### 💬 Speaker Diarization

Speaker diarization aims to distinguish different speakers in the input speech audio. We use [streaming Sortformer](http://arxiv.org/abs/2507.18446) to detect the speaker for each user turn. As of now, we only support detecting 1 speaker for a single user turn, but different turns can be from different speakers, with a maximum of 4 speakers in the whole conversation. Currently supported models are:
- [nvidia/diar_streaming_sortformer_4spk-v2](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2) (default)


Please note that in some circumstances, the diarization model might not work well in noisy environments, or it may confuse the speakers. In this case, you can disable the diarization by setting `diar.enabled` to `false` in `server/server_config.yaml`.
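
To make the "one speaker per turn, at most four speakers" behavior concrete, here is a small sketch that picks a single dominant speaker for a turn from frame-level speaker activity scores. The input format, the 0.5 threshold, and the majority rule are all assumptions for illustration; the README does not describe the model's actual output or the agent's selection logic.

```python
from typing import Optional, Sequence

MAX_SPEAKERS = 4  # the README states a maximum of 4 speakers per conversation


def dominant_speaker(
    frame_probs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed activity threshold
) -> Optional[int]:
    """Return the index of the speaker active in the most frames, or None."""
    active_frames = [0] * MAX_SPEAKERS
    for probs in frame_probs:
        if len(probs) != MAX_SPEAKERS:
            raise ValueError(f"expected {MAX_SPEAKERS} scores per frame")
        for speaker, prob in enumerate(probs):
            if prob >= threshold:
                active_frames[speaker] += 1
    # Ties resolve to the lowest speaker index.
    best = max(range(MAX_SPEAKERS), key=lambda s: active_frames[s])
    return best if active_frames[best] > 0 else None


turn = [
    [0.1, 0.9, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
    [0.7, 0.1, 0.0, 0.0],
]
speaker_for_turn = dominant_speaker(turn)
```

A turn in which no speaker crosses the threshold returns `None`, which a caller could treat as "speaker unknown" rather than guessing.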

### 🔉 TTS

We use [FastPitch-HiFiGAN](https://huggingface.co/nvidia/tts_en_fastpitch) to generate the speech for the LLM response, and it only supports English output. More TTS models will be supported in the future.


### Turn-taking

As the new turn-taking prediction model is not yet released, we use VAD-based turn-taking prediction for now. Set `vad.stop_secs` in `server/server_config.yaml` to control how much silence is needed to mark the end of a user's turn.
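
The role of `vad.stop_secs` can be illustrated with a simplified silence counter. This is a sketch under stated assumptions, not the VAD used by the agent: it assumes per-frame speech/non-speech flags and a fixed frame duration, and `end_of_turn_index` is a hypothetical helper.

```python
from typing import Iterable, Optional


def end_of_turn_index(
    speech_flags: Iterable[bool],
    frame_secs: float,
    stop_secs: float,
) -> Optional[int]:
    """Return the frame index at which the turn is considered finished.

    The turn ends once `stop_secs` of continuous silence follows some speech.
    Returns None if that never happens in the given frames.
    """
    heard_speech = False
    silent_frames = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_frames = 0  # any speech resets the silence window
        elif heard_speech:
            silent_frames += 1
            # Small tolerance guards against floating-point rounding.
            if silent_frames * frame_secs >= stop_secs - 1e-9:
                return index
    return None


flags = [False, True, True, False, False, True, False, False, False, True]
short_pause = end_of_turn_index(flags, frame_secs=0.1, stop_secs=0.2)
long_pause = end_of_turn_index(flags, frame_secs=0.1, stop_secs=0.3)
```

With the shorter timeout the two-frame pause in the middle ends the turn early, while the longer timeout waits for the three-frame pause. This is the trade-off between responsiveness and cutting the user off.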

Additionally, the voice agent supports ignoring backchannel phrases while the bot is talking, which means phrases such as "uh-huh", "yeah", and "okay" will not interrupt the bot. To control which backchannel phrases are used, set `turn_taking.backchannel_phrases` in `server/server_config.yaml` to a list of phrases or to the path of a YAML file containing the list. Setting it to `null` disables backchannel detection, so the VAD interrupts the bot immediately when the user starts speaking.
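
The backchannel rule can be sketched as a simple phrase match on the user's transcript. The normalization and exact-match comparison below are assumptions for illustration, and `should_interrupt` is a hypothetical helper; the agent's real matching logic may differ.

```python
import re
from typing import Iterable, Optional


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so 'Uh-huh!' and 'uh huh' compare equal."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def should_interrupt(
    transcript: str,
    bot_is_speaking: bool,
    backchannel_phrases: Optional[Iterable[str]],
) -> bool:
    """Decide whether user speech should stop the bot's current response."""
    if not bot_is_speaking:
        return False  # nothing to interrupt
    if backchannel_phrases is None:
        return True  # detection disabled: any user speech interrupts
    spoken = normalize(transcript)
    if not spoken:
        return False  # no words recognized yet
    ignored = {normalize(phrase) for phrase in backchannel_phrases}
    return spoken not in ignored


phrases = ["uh-huh", "yeah", "okay"]
print(should_interrupt("Uh huh!", True, phrases))      # backchannel, bot keeps talking
print(should_interrupt("Wait, stop.", True, phrases))  # real interruption
```

Passing `None` mirrors the `null` setting described above, where user speech interrupts the bot regardless of what was said.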


## 📝 Notes & FAQ
- Only one connection to the server is supported at a time. A new connection will disconnect the previous one, but the context will be preserved.
- If you load a model directly from HuggingFace and get I/O errors, you can set `llm.model=<local_path>`, where the model is downloaded using a command like `huggingface-cli download Qwen/Qwen3-8B --local-dir <local_path>`. The same applies to TTS models.
- The current ASR and diarization models are not noise-robust, so you might need to use a noise-cancelling microphone or a quiet environment. We will release better models soon.
- The diarization model works best when the speakers' voices are clearly distinct from each other, and it might not work well on some accents due to limited training data.
- If you see errors like `SyntaxError: Unexpected reserved word` when running `npm run dev`, please update the Node.js version.
- If you see the error `Error connecting: Cannot read properties of undefined (reading 'enumerateDevices')`, it usually means the browser is not allowed to access the microphone. Please check the browser settings and add `http://[YOUR MACHINE IP ADDRESS]:5173/` to the allow list, e.g., via `chrome://flags/#unsafely-treat-insecure-origin-as-secure` for the Chrome browser.
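
The single-connection behavior in the first note (a new connection replaces the previous one while the context is kept) can be pictured with the toy model below. It is only a sketch: `SingleClientSession` and its fields are hypothetical names and do not reflect the server's actual classes.

```python
from typing import List, Optional


class SingleClientSession:
    """Toy model: one active client at a time, shared conversation context."""

    def __init__(self) -> None:
        self.active_client: Optional[str] = None
        self.context: List[str] = []  # survives reconnects

    def connect(self, client_id: str) -> Optional[str]:
        """Make `client_id` active and return the displaced client, if any."""
        displaced = self.active_client if self.active_client != client_id else None
        self.active_client = client_id
        return displaced  # the caller would close this client's connection

    def add_turn(self, text: str) -> None:
        self.context.append(text)


session = SingleClientSession()
session.connect("browser-tab-1")
session.add_turn("user: hello")
displaced = session.connect("browser-tab-2")
```

After the second `connect`, the first client is reported as displaced while `session.context` still holds the earlier turn.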



## ☁️ NVIDIA NIM Services

NVIDIA also provides a variety of [NIM](https://developer.nvidia.com/nim?sortBy=developer_learning_library%2Fsort%2Ffeatured_in.nim%3Adesc%2Ctitle%3Aasc&hitsPerPage=12) services for better ASR, TTS and LLM performance with more efficient deployment on either cloud or local servers.

You can also modify `server/bot_websocket_server.py` to use NVIDIA NIM services for better LLM, ASR and TTS performance, by referring to these Pipecat services:
- [NVIDIA NIM LLM Service](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/nim/llm.py)
- [NVIDIA Riva ASR Service](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/riva/stt.py)
- [NVIDIA Riva TTS Service](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/riva/tts.py)

For details of available NVIDIA NIM services, please refer to:
- [NVIDIA NIM LLM Service](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html)
- [NVIDIA Riva ASR NIM Service](https://docs.nvidia.com/nim/riva/asr/latest/overview.html)
- [NVIDIA Riva TTS NIM Service](https://docs.nvidia.com/nim/riva/tts/latest/overview.html)


85 changes: 85 additions & 0 deletions examples/voice_agent/client/index.html
@@ -0,0 +1,85 @@
<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>AI Chatbot</title>
<style>
.server-selection {
display: flex;
align-items: center;
gap: 8px;
margin-bottom: 10px;
}

.server-selection label {
font-weight: bold;
color: #333;
}

.server-selection select {
padding: 6px 12px;
border: 1px solid #ccc;
border-radius: 4px;
background-color: white;
font-size: 14px;
cursor: pointer;
}

.server-selection select:focus {
outline: none;
border-color: #2196F3;
box-shadow: 0 0 0 2px rgba(33, 150, 243, 0.2);
}

.server-selection select:disabled {
background-color: #f5f5f5;
cursor: not-allowed;
opacity: 0.6;
}
</style>
</head>

<body>
<div class="container">
<div class="status-bar">
<div class="status">
Transport: <span id="connection-status">Disconnected</span>
</div>
<div class="server-selection">
<label for="server-select">Server:</label>
<select id="server-select">
<option value="websocket">WebSocket Server (Port 8765)</option>
<option value="fastapi">FastAPI Server (Port 8000)</option>
</select>
</div>
<div class="controls">
<button id="connect-btn">Connect</button>
<button id="disconnect-btn" disabled>Disconnect</button>
<button id="mute-btn" disabled>Mute</button>
<button id="reset-btn" disabled>Reset</button>
</div>
</div>

<div class="volume-indicator">
<div class="volume-label">Microphone Volume:</div>
<div class="volume-bar-container">
<div class="volume-bar" id="volume-bar"></div>
</div>
<div class="volume-text" id="volume-text">0%</div>
</div>

<audio id="bot-audio" autoplay></audio>

<div class="debug-panel">
<h3>Debug Info</h3>
<div id="debug-log"></div>
</div>
</div>

<script type="module" src="/src/app.ts"></script>
<link rel="stylesheet" href="/src/style.css">
</body>

</html>