Model Overview

Description:

NVIDIA Nemotron™ 3 VoiceChat is a 12B-parameter, end-to-end, real-time, full-duplex speech-to-speech model for conversational AI that jointly performs streaming speech understanding and speech generation [1]. Unlike traditional cascaded stacks (ASR → LLM → TTS), it handles full-duplex voice interaction in a single unified architecture, which removes the need for multiple models or API handoffs and reduces end-to-end latency.
It ranks #1 among open full-duplex models on VoiceBench and #2 among open models on FullDuplexBench 1.0, bringing open, robust, and natural conversational capabilities to enterprises.

The model operates on audio signals, which are encoded by a Fast Conformer module. The resulting audio tokens are fed into a Nemotron Nano v2 9B LLM backbone that predicts text tokens, which are in turn passed to a TTS decoder [2] that predicts the audio codes used to generate the agent's speech. The model supports persona control through text-based role prompts, based on NVIDIA PersonaPlex [3]. Among open-source voice agents, Nemotron 3 VoiceChat offers a strong trade-off between intelligence and latency, as shown by the benchmark results below.
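The order of stages described above can be pictured with a toy sketch. Everything in it is hypothetical: the function names (`encode_audio`, `predict_text_tokens`, `decode_speech`), the frame size, and the stand-in arithmetic are illustrative assumptions, not the model's API or internals. It only shows how one streaming step passes user audio through the encoder, the LLM backbone, and the TTS decoder.

```python
from typing import Dict, List

# Assumed frame size (80 ms at 16 kHz); illustrative only, not from the model card.
FRAME_SAMPLES = 1280


def encode_audio(frame: List[float]) -> List[int]:
    """Stand-in for the speech encoder: maps a frame to dummy audio tokens."""
    return [int(abs(x) * 100) % 256 for x in frame[:4]]


def predict_text_tokens(audio_tokens: List[int]) -> List[int]:
    """Stand-in for the LLM backbone: maps audio tokens to dummy text tokens."""
    return [sum(audio_tokens) % 1000]


def decode_speech(text_tokens: List[int]) -> List[int]:
    """Stand-in for the TTS decoder: maps text tokens to dummy audio codes."""
    return [t % 64 for t in text_tokens]


def duplex_step(frame: List[float]) -> Dict[str, List[int]]:
    """One streaming step: user audio in, agent text tokens and audio codes out."""
    audio_tokens = encode_audio(frame)
    text_tokens = predict_text_tokens(audio_tokens)
    audio_codes = decode_speech(text_tokens)
    return {"text_tokens": text_tokens, "audio_codes": audio_codes}


def run_stream(samples: List[float]) -> List[Dict[str, List[int]]]:
    """Process the input frame by frame, producing output at every step."""
    steps = []
    for start in range(0, len(samples) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        steps.append(duplex_step(samples[start:start + FRAME_SAMPLES]))
    return steps
```

The point of the sketch is that listening and speaking share one loop: each incoming frame yields both text and audio output, rather than waiting for a separate ASR, LLM, and TTS service to finish in sequence.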

This model is ready for early access evaluation purposes only.

License/Terms of Use:

GOVERNING TERMS: Your use of this API is governed by the NVIDIA API Trial Service Terms of Use; use of the NIM container and of this model is governed by the NVIDIA Software and Model Evaluation License.

Use Case:

Nemotron 3 VoiceChat is intended for researchers, developers, and professionals in natural language processing (NLP) and speech technology, for purposes such as automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voice assistant development.

Deployment Geography:

Global

Release Date:

NGC [03/16/2026] via Nemotron Voicechat Model on NGC

Reference(s):

[1] SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model
[2] Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
[3] PersonaPlex: Voice and role control for full duplex conversational speech models

Model Architecture:

Architecture Type: Hybrid Mamba/Transformer

Network Architecture:

Nemotron 3 VoiceChat uses:

Fast Conformer Speech Encoder from Nemotron-Speech-Streaming-En-0.6b
NVIDIA Nemotron Nano v2 LLM backbone
NVIDIA TTS decoder and codec [2]

Number of model parameters: 12B

Input(s):

Input Type(s): Text (prompt), Audio (user speech)
Input Format: String, WAV/WebAudio
Input Parameters: One-Dimensional (1D), One-Dimensional (1D)
Other Properties Related to Input: 16 kHz sample rate for audio.
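Because the model expects 16 kHz audio, it can help to check a recording's header before sending it. The following is a minimal sketch using only Python's standard `wave` module; the helper names `check_input_wav` and `make_silent_wav` are hypothetical, and the mono, 16-bit PCM layout used by the test helper is an assumption, since the card only states the sample rate.

```python
import io
import wave

EXPECTED_INPUT_RATE = 16_000  # Hz, from the input properties above


def check_input_wav(wav_bytes: bytes) -> dict:
    """Read WAV header fields and flag whether the sample rate is 16 kHz."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        info = {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
        }
    info["rate_ok"] = rate == EXPECTED_INPUT_RATE
    return info


def make_silent_wav(rate: int, seconds: float) -> bytes:
    """Build an in-memory mono 16-bit WAV of silence (helper for trying this out)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono
        wav.setsampwidth(2)   # assumption: 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    return buf.getvalue()
```

A recording whose `rate_ok` is false would need resampling to 16 kHz before use; resampling itself is outside the scope of this sketch.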

Output(s):

Output Type(s): Text (agent text), Audio (agent speech), Text (user speech transcription)
Output Format: String, WAV/WebAudio, String
Output Parameters: One-Dimensional (1D), One-Dimensional (1D), One-Dimensional (1D)
Other Properties Related to Output: 22.05 kHz sample rate for audio.
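Note that the output rate (22.05 kHz) differs from the input rate (16 kHz), so agent audio should be stored or played back with its own rate rather than the input's. A minimal sketch of wrapping raw samples in a WAV container at the output rate follows; the helper name `pcm16_to_wav` is hypothetical, and mono 16-bit PCM is an assumption not stated in the card.

```python
import io
import struct
import wave

OUTPUT_RATE = 22_050  # Hz, from the output properties above


def pcm16_to_wav(samples, rate: int = OUTPUT_RATE) -> bytes:
    """Wrap signed 16-bit mono samples in a WAV container at `rate` Hz."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono
        wav.setsampwidth(2)   # assumption: 16-bit PCM
        wav.setframerate(rate)
        # Little-endian signed 16-bit integers, one per sample.
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()
```

Writing the header with 16 kHz instead would make one second of agent speech (22,050 samples) play back over roughly 1.38 seconds at a lower pitch.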

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference than CPU-only solutions.

Software Integration:

Runtime Engine: vLLM

Supported Hardware Microarchitecture Compatibility:

NVIDIA Hopper (H100)

Preferred/Supported Operating System(s):

Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

v1.0

Training, Testing, and Evaluation Datasets:

Training Dataset:

Data Modality: Audio (speech) and Text
Audio Training Data Size: 10,000 to 1 Million Hours

Nemotron 3 VoiceChat was trained on a blend of real speech datasets and synthetic speech datasets, the latter generated by running various TTS systems over text training corpora. The blend includes:

Nemotron 5.5 pre-training and SFT text data
Brainy-mantis text data
Gretel AI v1 and v2 text data
Ultrachat text data
Blackwell studio recordings real speech data
Fisher real speech data
LibriVox
LibriTTS
HiFi-TTS
Riva Speakers: Internal Dataset
Publicly available internet scale data
PromptTTS
VCTK
Voxmovies
JL-Corpus
Nemotron Nano v3 function calling data
PersonaPlex training datasets

Data Collection Method by dataset: Hybrid: Human, Synthetic, Automated.
Labeling Method by dataset: Automated.

Testing/Evaluation Dataset:

VoiceBench

Link: VoiceBench
Data Collection Method by dataset: Hybrid: Human, Synthetic, Automated.
Labeling Method by dataset: Automated.
Properties: Nemotron 3 VoiceChat ranks #1 among open full-duplex models on VoiceBench, a benchmark developed to evaluate large language model (LLM)-based voice assistants on real-world spoken interactions rather than only text or clean speech recognition. It combines audio and text data, and its subsets cover tasks such as open-ended questions, multiple-choice QA, instruction following, and adversarial cases, drawn from both real human speech and synthetic text-to-speech examples.

Benchmark Scores:

Metric	Value
Text-output average accuracy	58.1

FullDuplexBench 1.0

Link: FullDuplexBench 1.0
Data Collection Method by dataset: Hybrid: Human, Synthetic, Automated.
Labeling Method by dataset: Automated.
Properties: Nemotron 3 VoiceChat ranks #2 among open models on FullDuplexBench 1.0, a benchmark designed to evaluate the interactive capabilities of full-duplex spoken dialogue models. It measures natural, human-like conversational behaviors such as pause handling, backchanneling, smooth turn-taking, and user interruption management, using automatic metrics to provide consistent, reproducible assessments of model performance.

Benchmark Scores:

Metric	Value
Pause Handling (Synthetic): TOR↓	0.55
Pause Handling (Candor): TOR↓	0.69
Smooth Turn Taking: TOR↑	1.00
Smooth Turn Taking: Latency↓	0.26
User Interruption: TOR↑	1.00
User Interruption: GPT-4o↑	4.18
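To read the TOR rows, it helps to see what is being counted. The sketch below assumes TOR denotes takeover rate, i.e. the fraction of evaluated samples in which the model takes the speaking turn. The per-sample judgments would come from the benchmark's own tooling, which is not reproduced here; `takeover_rate` and the example data are hypothetical.

```python
from typing import Sequence


def takeover_rate(took_turn: Sequence[bool]) -> float:
    """Fraction of samples in which the model took the speaking turn."""
    if not took_turn:
        raise ValueError("need at least one evaluated sample")
    return sum(took_turn) / len(took_turn)


# Hypothetical per-sample outcomes for a pause-handling subset:
# True means the model started speaking during the user's pause.
pause_outcomes = [True, False, False, True]
pause_tor = takeover_rate(pause_outcomes)
```

Under this reading, a lower TOR is better for pause handling, where the model should stay silent while the user pauses, and a higher TOR is better for turn taking and user interruption, where the model is expected to respond.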

According to Artificial Analysis, Nemotron 3 VoiceChat offers the best balance of conversational dynamics and speech reasoning among open models.

Inference:

Acceleration Engine: vLLM
Test Hardware: NVIDIA H100

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When this model is downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure it meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
