
NVIDIA Nemotron™ 3 VoiceChat is a 12B end-to-end, real-time, full-duplex speech-to-speech model for conversational AI that jointly performs streaming speech understanding and speech generation [1]. Unlike traditional cascaded stacks (ASR → LLM → TTS), it handles full-duplex, real-time voice interaction in one unified architecture, which removes the need for multiple models or API handoffs and reduces end-to-end latency.
It sets new benchmarks by bringing open, robust, and highly natural conversation capabilities to enterprises.
The model operates on audio signals, which are encoded by a Fast Conformer module. The resulting audio tokens are fed into a Nemotron Nano V2 9B LLM backbone that predicts text tokens, and these are passed to a TTS decoder [2] that predicts the audio codes used to generate the agent's speech. The model supports persona control through text-based role prompts, building on NVIDIA PersonaPlex [3]. Nemotron 3 VoiceChat offers an unprecedented trade-off between intelligence and latency among open-source voice agents, as highlighted by the benchmark results below.
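The data flow described above can be sketched as three stages chained per audio chunk. This is only an illustration of the ordering of the stages: `DuplexPipeline`, `step`, and the `toy_*` stand-ins are hypothetical names invented for this sketch, not part of any NVIDIA API, and the real encoder, LLM backbone, and TTS decoder are neural networks rather than the toy functions used here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class DuplexPipeline:
    """Hypothetical wrapper that only mirrors the stage order in this card."""
    encode_audio: Callable[[Sequence[float]], List[int]]   # audio -> audio tokens
    predict_text: Callable[[str, List[int]], List[int]]    # role prompt + audio tokens -> text tokens
    decode_speech: Callable[[List[int]], List[int]]        # text tokens -> audio codes

    def step(self, role_prompt: str, user_chunk: Sequence[float]) -> Tuple[List[int], List[int]]:
        audio_tokens = self.encode_audio(user_chunk)
        text_tokens = self.predict_text(role_prompt, audio_tokens)
        audio_codes = self.decode_speech(text_tokens)
        return text_tokens, audio_codes


# Toy stand-ins (assumptions) so the sketch runs without any model weights.
def toy_encode(chunk: Sequence[float]) -> List[int]:
    return [len(chunk)]


def toy_predict(prompt: str, tokens: List[int]) -> List[int]:
    return [len(prompt)] + tokens


def toy_decode(text_tokens: List[int]) -> List[int]:
    return [t % 7 for t in text_tokens]


pipeline = DuplexPipeline(toy_encode, toy_predict, toy_decode)
text_tokens, audio_codes = pipeline.step("Be concise.", [0.1, -0.2])
```

In a full-duplex setting a step like this would run continuously while the user is still speaking; the chunk size and scheduling are not specified in this card.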
This model is ready for early access evaluation purposes only.
GOVERNING TERMS: Your use of this API is governed by the NVIDIA API Trial Service Terms of Use; use of the NIM container and of this model is governed by the NVIDIA Software and Model Evaluation License.
Nemotron 3 VoiceChat is intended for researchers, developers, and professionals in natural language processing (NLP) and speech technology, for purposes such as automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voice assistant development.
Global
NGC [03/16/2026] via Nemotron Voicechat Model on NGC
[1] SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model
[2] Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
[3] PersonaPlex: Voice and role control for full duplex conversational speech models
Architecture Type: Hybrid Mamba/Transformer
Network Architecture:
Nemotron VoiceChat uses:
Input Type(s): Text (prompt), Audio (user speech)
Input Format: String, WAV/WebAudio
Input Parameters: One-Dimensional (1D), One-Dimensional (1D)
Other Properties Related to Input: 16 kHz sample rate for audio.
Output Type(s): Text (agent text), Audio (agent speech), Text (user speech transcription)
Output Format: String, WAV/WebAudio
Output Parameters: One-Dimensional (1D), One-Dimensional (1D), One-Dimensional (1D)
Other Properties Related to Output: 22.05 kHz sample rate for audio.
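Because the input and output sample rates differ (16 kHz in, 22.05 kHz out), client code usually needs to resample captured audio before sending it. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, linear interpolation is chosen for brevity rather than quality, and mono 16-bit PCM is an assumption, since this card specifies only the WAV container and the sample rates.

```python
import io
import math
import struct
import wave

INPUT_RATE_HZ = 16_000    # input sample rate stated in this card
OUTPUT_RATE_HZ = 22_050   # output sample rate stated in this card


def resample_linear(samples, src_rate, dst_rate):
    """Resample mono float samples by linear interpolation (illustrative only)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def to_wav_bytes(samples, rate):
    """Encode mono floats in [-1, 1] as 16-bit PCM WAV (bit depth is an assumption)."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    pcm = struct.pack(f"<{len(ints)}h", *ints)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def wav_info(wav_bytes):
    """Return (sample_rate, frame_count) of a WAV byte string."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate(), wav.getnframes()


# One second of a 440 Hz tone captured at 44.1 kHz, converted to the 16 kHz input rate.
tone_44k = [0.5 * math.sin(2 * math.pi * 440 * n / 44_100) for n in range(44_100)]
tone_16k = resample_linear(tone_44k, 44_100, INPUT_RATE_HZ)
request_audio = to_wav_bytes(tone_16k, INPUT_RATE_HZ)
```

On the receiving side, `wav_info` can be used to check that returned audio reports the 22.05 kHz rate before playback. For production use, a band-limited resampler from a dedicated audio library would be a better choice than linear interpolation.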
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Runtime Engine: vLLM
Supported Hardware Microarchitecture Compatibility:
Preferred/Supported Operating System(s):
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Data Modality: Audio (speech) and Text
Audio Training Data Size: 10,000 to 1 Million Hours
VoiceChat has been trained on a blend of different datasets comprising both real audio datasets and synthetic speech datasets generated using various TTS systems on text training corpora, including:
Data Collection Method by dataset: Hybrid: Human, Synthetic, Automated.
Labeling Method by dataset: Automated.
Link: VoiceBench
Data Collection Method by dataset: Hybrid: Human, Synthetic, Automated.
Labeling Method by dataset: Automated.
Properties: VoiceChat ranks #1 among all open full-duplex models on VoiceBench, a benchmark developed to evaluate large language model (LLM)-based voice assistants on real-world spoken interactions rather than only text or clean speech recognition. The benchmark combines audio and text data, with subsets covering open-ended questions, multiple-choice QA, instruction following, and adversarial cases, sourced from both real human speech and synthetic text-to-speech examples.
Benchmark Scores:
| Metric | Value |
|---|---|
| Text-output average accuracy | 58.1 |
Link: FullDuplexBench 1.0
Data Collection Method by dataset: Hybrid: Human, Synthetic, Automated.
Labeling Method by dataset: Automated.
Properties: Nemotron 3 VoiceChat ranks #2 among all open models on FullDuplexBench 1.0, a benchmark designed to evaluate the interactive capabilities of full-duplex spoken dialogue models like VoiceChat. It measures natural, human-like conversational behaviors such as pause handling, backchanneling, smooth turn-taking, and user interruption management, using automatic metrics to provide consistent, reproducible assessments of model performance.
Benchmark Scores:
| Metric | Value |
|---|---|
| Pause Handling (Synthetic): TOR↓ | 0.55 |
| Pause Handling (Candor): TOR↓ | 0.69 |
| Smooth Turn Taking: TOR↑ | 1.00 |
| Smooth Turn Taking: Latency↓ | 0.26 |
| User Interruption: TOR↑ | 1.00 |
| User Interruption: GPT-4o↑ | 4.18 |
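To help read the TOR rows above, the sketch below assumes TOR denotes the takeover rate used by FullDuplexBench, that is, the fraction of test samples in which the model takes the turn. Under that assumption a lower value is better for pause handling (the model should stay silent while the user pauses) and a higher value is better for turn-taking and interruptions. The function name and the sample data are made up for illustration and are not benchmark results.

```python
def takeover_rate(took_turn):
    """Fraction of samples in which the model started speaking (assumed TOR definition)."""
    flags = list(took_turn)
    if not flags:
        raise ValueError("need at least one sample")
    return sum(1 for flag in flags if flag) / len(flags)


# Made-up toy data, not benchmark output.
pause_samples = [True, False, True, False]   # model spoke during 2 of 4 user pauses
turn_samples = [True, True, True, True]      # model responded at all 4 turn ends

pause_tor = takeover_rate(pause_samples)     # lower is better here
turn_tor = takeover_rate(turn_samples)       # higher is better here
```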
Artificial Analysis shows that Nemotron 3 VoiceChat offers the best balance of conversational dynamics and speech reasoning among all open models.
Acceleration Engine: vLLM
Test Hardware: NVIDIA H100
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.