
A voice agent that uses the Nemotron model to generate responses to voice commands.
Voice is the most natural human interface, allowing for efficient and high-speed communication. This developer example provides a comprehensive, end-to-end voice agent blueprint built with NVIDIA Nemotron state-of-the-art open models, as NVIDIA NIM for acceleration and scaling. It is designed to guide developers through the creation of a cascaded pipeline, integrating Nemotron ASR, LLM, and TTS, while solving for the complexities of streaming, interruptible conversations. By leveraging NVIDIA NIM microservices, this developer example enables developers to accelerate the deployment of high-performance voice AI solutions.
| Model / API | Reasoning Mode | Text Only Standalone LLM (%) | LLM In Voice Agent Pipeline (%) |
|---|---|---|---|
| Nemotron 49B | Reasoning ON | 91.90 | 81.30 |
| Nemotron 49B | Reasoning OFF | 82.70 | 60.30 |
| Nemotron 30B | Reasoning ON Reasoning Budget - 500 | 78.76 | 75.60 |
| Nemotron 30B | Reasoning OFF | 56.50 | 50.40 |
Benchmarks based on internal testing. Evaluation source code provided on GitHub.
| Parallel Streams | E2E Latency | ASR Latency | TTS TTFB | LLM TTFT | LLM first-sentence latency |
|---|---|---|---|---|---|
| 1 | 0.79 | 0.04 | 0.078 | 0.126 | 0.138 |
| 4 | 0.76 | 0.046 | 0.066 | 0.061 | 0.181 |
| 8 | 0.77 | 0.052 | 0.066 | 0.062 | 0.136 |
| 16 | 0.91 | 0.057 | 0.068 | 0.105 | 0.208 |
| 32 | 0.8 | 0.061 | 0.08 | 0.073 | 0.294 |
| 64 | 1 | 0.067 | 0.11 | 0.156 | 0.386 |
The benchmark table demonstrates that the NVIDIA Nemotron Voice Agent achieves sub-second End-to-End Latency across up to 64 parallel streams with a setup utilizing 3xH100 GPUs (one for Parakeet CTC 1.1B, one for Magpie TTS, and two for Nemotron-3-Nano LLM model) with speculative speech processing enabled.
This developer example is powered by a suite of NVIDIA-optimized microservices designed for maximum throughput and minimal latency.
| Category | Component | Recommended Model |
|---|---|---|
| Speech-to-Text or Automatic Speech Recognition | ASR / AST | NVIDIA Nemotron Speech (RNNT or CTC) |
| Logic & Reasoning | LLM | Nemotron Nano / Nemotron Super |
| Text-to-Speech | TTS | Magpie TTS Multilingual |
| Control | Behavioral Logic | VAD, SVAD, EOU |
To achieve sub-second response times and high-fidelity audio handling, the following hardware configurations are recommended for local deployment.
| Service | Use Case | Recommended GPU |
|---|---|---|
| Nemotron Speech ASR/TTS | Audio Transcription & Synthesis | 1x L40, A100 (80GB), or H100 |
| Reasoning Model | LLM & Agentic Logic | 2x H100 (80GB) or 4x A100 (80GB) |
| Voice Agent | Entire workflow | Jetson Thor |
The code base serves as a playground to test new models and expand basic ASR/LLM/TTS flows.
Clone the Repo: Access the public reference code on GitHub.
Setup NVIDIA NIM: Deploy your local system/pipeline using NVIDIA NIM microservices.
This developer example empowers developers to rapidly build, customize, and deploy enterprise-grade voice agents for customer service, and user interactions in healthcare, telecom, retail, and financial services.
Explore the Ambient Healthcare Agents blueprint to deploy ambient agents that assist with patient intake, symptom triage, and compliance for HIPAA/PCI. This blueprint has tightly integrated, healthcare-tuned models (e.g., clinical LLMs, medical diarization, guardrails for HIPAA alignment, SOAP/ICD form automation). It is designed to be “out-of-the-box” for clinical scenarios.
NVIDIA believes trustworthy AI is a shared responsibility. When using this example in accordance with our terms of service, work with your model and compliance teams to ensure the system meets requirements for your industry and use case. Report security or AI concerns here.