Add NeMo Voice Agent #14325
Merged
52 commits:

- 12f72f2 update streaming ASR (stevehuang52)
- e4f5663 add voice agent (stevehuang52)
- fda5450 update readme (stevehuang52)
- f843762 update websocket (stevehuang52)
- 16a27ba update (stevehuang52)
- 94b43bc update (stevehuang52)
- 6ff5302 update readme (stevehuang52)
- b45cb1a update (stevehuang52)
- 6c21c77 clean up (stevehuang52)
- 6118ac9 clean up (stevehuang52)
- 7e1e62a fix typo (stevehuang52)
- ab723a0 fix codeQL (stevehuang52)
- af9b523 update cfg (stevehuang52)
- 08ba934 remove unused (stevehuang52)
- d23e113 update readme (stevehuang52)
- de7cacc change default models (stevehuang52)
- 3711938 fix diar diable (stevehuang52)
- 9200a52 fix diar diable (stevehuang52)
- af315bd update ux (stevehuang52)
- c22f41e update tts (stevehuang52)
- d2451b4 update readme (stevehuang52)
- ff705be fix and update (stevehuang52)
- 79f7996 fix asr (stevehuang52)
- 04e6bb6 update readmme (stevehuang52)
- a0df9f3 update doc and llm dtype (stevehuang52)
- 5c51c29 refactor and add example prompts (stevehuang52)
- 025244d update readme (stevehuang52)
- 216278a update readme (stevehuang52)
- 72d0c67 clean up (stevehuang52)
- 149487e clean up (stevehuang52)
- be178a6 Merge branch 'main' into heh/nemo_voice (stevehuang52)
- 9c14702 update info on streaming sortformer (stevehuang52)
- de1b138 move code to 'nemo/agents/voice_agent' (stevehuang52)
- 4997feb update doc (stevehuang52)
- f31bd13 clean up (stevehuang52)
- 5cc5e26 refactor (stevehuang52)
- f56f9fc Merge branch 'main' into heh/nemo_voice (tango4j)
- 55f8191 Merge branch 'main' into heh/nemo_voice (KunalDhawan)
- b22a150 update doc (stevehuang52)
- 6c1019a Merge branch 'heh/nemo_voice' of https://github.com/NVIDIA/NeMo into … (stevehuang52)
- 98006c0 remove the unnecessary streaming state conversion and import it from … (weiqingw4ng)
- e5665b0 Apply isort and black reformatting (weiqingw4ng)
- e23935d update doc (stevehuang52)
- 9e9ed7d Merge branch 'heh/nemo_voice' of https://github.com/NVIDIA/NeMo into … (stevehuang52)
- f752abc clean up (stevehuang52)
- 1ffddb2 fix for llama-nemotron template, and refactor (stevehuang52)
- 500b396 fix tts separator (stevehuang52)
- a59733c fix for llama-nemotron (stevehuang52)
- 90cbfc1 update cfg (stevehuang52)
- 30a55bc refactor and update doc (stevehuang52)
- de148ae change default llm to qwen (stevehuang52)
- f3572f7 update doc (stevehuang52)
`.gitignore`:

```diff
@@ -181,3 +181,6 @@ examples/neural_graphs/*.yml
 nemo_experiments/
+slurm*.out
+node_modules/
+.vite/
```
New file: README (175 lines added)
# NeMo Voice Agent

A [Pipecat](https://github.com/pipecat-ai/pipecat) example demonstrating the simplest way to create a voice agent using the NVIDIA NeMo STT/TTS service and a HuggingFace LLM. Everything is open-source and deployed locally, so you can run your own voice agent. Feel free to explore the code and see how different speech technologies can be integrated with LLMs to create a seamless conversation experience. For now, only English input and output are supported; more languages will be added in the future.

## ✨ Key Features

- Open-source, local deployment, and flexible customization.
- Lets users talk to most LLMs from HuggingFace with configurable prompts.
- Streaming speech recognition with low latency.
- FastPitch-HiFiGAN TTS for fast audio response generation.
- Speaker diarization for up to 4 speakers across different turns.
- WebSocket server for easy deployment.

## 💡 Upcoming Next

- More accurate and noise-robust streaming ASR models.
- Faster EOU (end-of-utterance) detection and backchannel handling (e.g., the bot will not stop speaking when the user says something like "uh-huh", "wow", or "I see").
- Better streaming ASR and speaker diarization pipeline.
- Better TTS model with a more natural voice.
- Joint ASR and speaker diarization model.
- Function calling, RAG, etc.
## 🚀 Quick Start

### Hardware requirements

- A computer with at least one GPU. At least 18GB of VRAM is recommended for 8B LLMs, and 10GB for 4B LLMs.
- A microphone connected to the computer.
- A speaker connected to the computer.
### Install dependencies

First, install or update npm and Node.js to the latest version, for example:

```bash
sudo apt-get update
sudo apt-get install -y npm nodejs
```

or:

```bash
curl -fsSL https://fnm.vercel.app/install | bash
. ~/.bashrc
fnm use --install-if-missing 20
```

Second, create a new conda environment with the dependencies:

```bash
conda env create -f environment.yml
```

Then activate the environment via `conda activate nemo-voice`.

Alternatively, you can install the dependencies manually in an existing environment via:

```bash
pip install -r requirements.txt
```

Incompatibility errors from pip can be ignored.
### Configure the server

Edit the `server/server_config.yaml` file to configure the server. For example, you can:

- Change the LLM and the prompt to use, providing either a local path to a text file or the whole prompt string. See `example_prompts/` for examples to start with.
- Configure the LLM parameters, such as temperature, max tokens, etc.
- Distribute different components across GPUs if you have more than one.
- Adjust VAD parameters for sensitivity and end-of-turn detection timeout.

**If you want to access the server from a different machine, change the `baseUrl` in `client/src/app.ts` to the actual IP address of the server machine.**
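As a rough illustration, a fragment of `server_config.yaml` might look like the following. The key paths (`llm.generation_kwargs`, `vad.stop_secs`, `diar.enabled`) and the default model name are taken from this README; the nesting and the example values are assumptions, since the actual file is not shown on this page.

```yaml
# Hypothetical sketch of server/server_config.yaml; the file in the repo is
# the source of truth. Nesting and values here are illustrative assumptions.
llm:
  model: "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"  # default LLM named in this README
  generation_kwargs:            # key documented in this README; values are examples
    temperature: 0.7
    max_new_tokens: 512

vad:
  stop_secs: 0.8                # key documented in this README; silence (s) that ends a user turn

diar:
  enabled: true                 # key documented in this README; set false in noisy environments
```

Start from the file shipped in the repo rather than this sketch when configuring a real deployment.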
### Start the server

Open a terminal and run the server via:

```bash
NEMO_PATH=??? # Use your local NeMo path for the latest version
export PYTHONPATH=$NEMO_PATH:$PYTHONPATH

# export HF_TOKEN="hf_..." # Use your own HuggingFace API token if needed, as some models may require one.
# export HF_HUB_CACHE="/path/to/your/huggingface/cache" # Change where the HF cache is stored if you don't want the default.
# export SERVER_CONFIG_PATH="/path/to/your/server_config.yaml" # Point to a different server config if you keep several.
python ./server/server.py
```
### Launch the client

In another terminal on the server machine, start the client via:

```bash
cd client
npm install
npm run dev
```

A message in the terminal will show the address and port of the client.
### Connect to the client via browser

Open the client in a browser: `http://[YOUR MACHINE IP ADDRESS]:5173/` (or whatever address and port is shown in the terminal where the client was launched).

You can mute/unmute your microphone via the "Mute" button, and reset the LLM context history and speaker cache by clicking the "Reset" button.

**If using the Chrome browser, you need to add `http://[YOUR MACHINE IP ADDRESS]:5173/` to the allow list via `chrome://flags/#unsafely-treat-insecure-origin-as-secure`.**

If you want to use a different port for the client connection, modify the `port` variable in `client/vite.config.js`.
## 📑 Supported Models

### 🤖 LLM

Most LLMs from HuggingFace are supported. A few examples are:

- [nvidia/Llama-3.1-Nemotron-Nano-8B-v1](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1) (default)
- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- [nvidia/Nemotron-Mini-4B-Instruct](https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct)

Please refer to each model's HuggingFace page to configure the model parameters `llm.generation_kwargs` and `llm.apply_chat_template_kwargs` in `server/server_config.yaml` as needed.
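For instance, the two documented keys could be populated from a model card like this. Only the key paths `llm.generation_kwargs` and `llm.apply_chat_template_kwargs` come from this README; the nesting, values, and the assumption that they are forwarded to HuggingFace `generate()` and `apply_chat_template()` are illustrative guesses.

```yaml
# Illustrative fragment only; check the real server_config.yaml for the
# actual structure. Values below are typical HF sampling/chat-template
# arguments, not defaults taken from this PR.
llm:
  generation_kwargs:
    temperature: 0.6
    top_p: 0.95
  apply_chat_template_kwargs:
    add_generation_prompt: true   # common HF chat-template argument
```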
### 🎤 ASR

We use [cache-aware streaming FastConformer](https://arxiv.org/abs/2312.17279) to transcribe the user's speech into text. While new models will be released soon, we use the existing English models for now:

- [stt_en_fastconformer_hybrid_large_streaming_80ms](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_streaming_80ms) (default)
- [nvidia/stt_en_fastconformer_hybrid_large_streaming_multi](https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_streaming_multi)
### 💬 Speaker Diarization

Speaker diarization distinguishes different speakers in the input speech audio. We use [streaming Sortformer](http://arxiv.org/abs/2507.18446) to detect the speaker for each user turn. For now, only one speaker is detected per user turn, but different turns can come from different speakers, with a maximum of 4 speakers in the whole conversation. Currently supported models are:

- [nvidia/diar_streaming_sortformer_4spk-v2](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2) (default)

Please note that in some circumstances, such as noisy environments, the diarization model might not work well or may confuse speakers. In that case, you can disable diarization by setting `diar.enabled` to `false` in `server/server_config.yaml`.
### 🔉 TTS

We use [FastPitch-HiFiGAN](https://huggingface.co/nvidia/tts_en_fastpitch) to generate the speech for the LLM response; it only supports English output. More TTS models will be supported in the future.
### Turn-taking

As the new turn-taking prediction model is not yet released, we use VAD-based turn-taking prediction for now. Set `vad.stop_secs` in `server/server_config.yaml` to control the amount of silence needed to indicate the end of a user's turn.

Additionally, the voice agent supports ignoring backchannel phrases while the bot is talking, meaning phrases such as "uh-huh", "yeah", and "okay" will not interrupt the bot. To control which backchannel phrases are used, set `turn_taking.backchannel_phrases` in `server/server_config.yaml` to the desired list of phrases, or to a path to a YAML file containing the list. Setting it to `null` disables backchannel detection, so the VAD will interrupt the bot as soon as the user starts speaking.
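To make the two settings above concrete, here is a hedged sketch. The key paths `vad.stop_secs` and `turn_taking.backchannel_phrases` come from this README; the nesting and the example phrase list are assumptions.

```yaml
# Illustrative fragment of server/server_config.yaml (structure assumed).
vad:
  stop_secs: 1.0                 # longer value = more silence before a turn is considered over

turn_taking:
  backchannel_phrases:           # inline list; per this README, a path to a YAML file also works
    - "uh-huh"
    - "yeah"
    - "okay"
  # backchannel_phrases: null    # disables backchannel detection; VAD interrupts immediately
```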
## 📝 Notes & FAQ

- Only one connection to the server is supported at a time; a new connection will disconnect the previous one, but the context will be preserved.
- If loading directly from HuggingFace fails with I/O errors, you can set `llm.model=<local_path>`, where the model is downloaded using a command like `huggingface-cli download Qwen/Qwen3-8B --local-dir <local_path>`. The same applies to TTS models.
- The current ASR and diarization models are not noise-robust, so you might need a noise-cancelling microphone or a quiet environment. Better models will be released soon.
- The diarization model works best when the speakers' voices are clearly distinct from each other; it might not work well on some accents due to limited training data.
- If you see errors like `SyntaxError: Unexpected reserved word` when running `npm run dev`, please update your Node.js version.
- If you see the error `Error connecting: Cannot read properties of undefined (reading 'enumerateDevices')`, it usually means the browser is not allowed to access the microphone. Please check the browser settings and add `http://[YOUR MACHINE IP ADDRESS]:5173/` to the allow list, e.g., via `chrome://flags/#unsafely-treat-insecure-origin-as-secure` for Chrome.
## ☁️ NVIDIA NIM Services

NVIDIA also provides a variety of [NIM](https://developer.nvidia.com/nim?sortBy=developer_learning_library%2Fsort%2Ffeatured_in.nim%3Adesc%2Ctitle%3Aasc&hitsPerPage=12) services for better ASR, TTS, and LLM performance, with more efficient deployment on either cloud or local servers.

You can also modify `server/bot_websocket_server.py` to use NVIDIA NIM services for better LLM, ASR, and TTS performance, by referring to these Pipecat services:

- [NVIDIA NIM LLM Service](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/nim/llm.py)
- [NVIDIA Riva ASR Service](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/riva/stt.py)
- [NVIDIA Riva TTS Service](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/riva/tts.py)

For details on available NVIDIA NIM services, please refer to:

- [NVIDIA NIM LLM Service](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html)
- [NVIDIA Riva ASR NIM Service](https://docs.nvidia.com/nim/riva/asr/latest/overview.html)
- [NVIDIA Riva TTS NIM Service](https://docs.nvidia.com/nim/riva/tts/latest/overview.html)
New file: client HTML page (85 lines added)
```html
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>AI Chatbot</title>
    <style>
        .server-selection {
            display: flex;
            align-items: center;
            gap: 8px;
            margin-bottom: 10px;
        }

        .server-selection label {
            font-weight: bold;
            color: #333;
        }

        .server-selection select {
            padding: 6px 12px;
            border: 1px solid #ccc;
            border-radius: 4px;
            background-color: white;
            font-size: 14px;
            cursor: pointer;
        }

        .server-selection select:focus {
            outline: none;
            border-color: #2196F3;
            box-shadow: 0 0 0 2px rgba(33, 150, 243, 0.2);
        }

        .server-selection select:disabled {
            background-color: #f5f5f5;
            cursor: not-allowed;
            opacity: 0.6;
        }
    </style>
</head>

<body>
    <div class="container">
        <div class="status-bar">
            <div class="status">
                Transport: <span id="connection-status">Disconnected</span>
            </div>
            <div class="server-selection">
                <label for="server-select">Server:</label>
                <select id="server-select">
                    <option value="websocket">WebSocket Server (Port 8765)</option>
                    <option value="fastapi">FastAPI Server (Port 8000)</option>
                </select>
            </div>
            <div class="controls">
                <button id="connect-btn">Connect</button>
                <button id="disconnect-btn" disabled>Disconnect</button>
                <button id="mute-btn" disabled>Mute</button>
                <button id="reset-btn" disabled>Reset</button>
            </div>
        </div>

        <div class="volume-indicator">
            <div class="volume-label">Microphone Volume:</div>
            <div class="volume-bar-container">
                <div class="volume-bar" id="volume-bar"></div>
            </div>
            <div class="volume-text" id="volume-text">0%</div>
        </div>

        <audio id="bot-audio" autoplay></audio>

        <div class="debug-panel">
            <h3>Debug Info</h3>
            <div id="debug-log"></div>
        </div>
    </div>

    <script type="module" src="/src/app.ts"></script>
    <link rel="stylesheet" href="/src/style.css">
</body>

</html>
```