
Nemotron 3 ASR (Nemotron-ASR-Streaming) is a streaming Automatic Speech Recognition (ASR) model engineered to deliver high-quality English transcription across both low-latency streaming and high-throughput batch workloads. Developed by NVIDIA, this 600M-parameter model transcribes speech into text with native support for punctuation and capitalization.
By leveraging a state-of-the-art Cache-Aware FastConformer-RNNT architecture, the model eliminates redundant overlapping computations common in traditional "buffered" streaming. This allows it to process only new audio chunks while reusing cached encoder context, significantly improving computational efficiency and minimizing end-to-end delay without sacrificing accuracy.
It transcribes speech using the English alphabet, spaces, and apostrophes, with full punctuation and capitalization. Trained on ASRSet, a dataset of approximately 250,000 hours of US English (en-US) speech, it is engineered to perform well across diverse and challenging acoustic conditions.
This model is ready for commercial/non-commercial use.
Governing Terms: This trial service is governed by the NVIDIA API Trial Terms of Service. The NIM container is governed by the NVIDIA Software License Agreement; use of the model is governed by the NVIDIA Open Model License Agreement.
Global
This model is for transcription of English audio.
(URLs to be added.)
Architecture Type: FastConformer-CacheAware-RNNT
The model is based on the Cache-Aware [1] FastConformer [2] architecture with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The cache-aware streaming design enables efficient chunked processing of audio while maintaining context from previous frames. Unlike buffered inference, this model maintains caches for all encoder self-attention and convolution layers, allowing hidden states to be reused at every streaming step. Because cached activations eliminate redundant computation, each streaming step processes strictly non-overlapping frames.
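The efficiency argument above can be illustrated with a small counting sketch. This is a conceptual illustration only (not the NeMo or Riva API): `buffered_steps` and `cache_aware_steps` are hypothetical helpers that count how many encoder frames each approach processes per step, assuming a buffered decoder re-encodes a left-context window at every step while a cache-aware decoder encodes only the new chunk.

```python
# Conceptual sketch: frames encoded per streaming step under two strategies.
# Buffered streaming re-encodes an overlapping left-context window each step;
# cache-aware streaming carries encoder state forward and encodes only the
# new, non-overlapping chunk.

def buffered_steps(num_frames, chunk, context):
    """Frames encoded per step with buffered streaming: new chunk + re-encoded left context."""
    steps = []
    for start in range(0, num_frames, chunk):
        left = min(context, start)                      # overlapping frames re-encoded
        steps.append(left + min(chunk, num_frames - start))
    return steps

def cache_aware_steps(num_frames, chunk):
    """Frames encoded per step with cached state: only the new chunk, no overlap."""
    return [min(chunk, num_frames - s) for s in range(0, num_frames, chunk)]

total_buffered = sum(buffered_steps(1000, chunk=100, context=300))
total_cached = sum(cache_aware_steps(1000, chunk=100))
# The cache-aware total equals num_frames exactly: no frame is encoded twice.
```

With these example numbers, the buffered approach encodes 3,400 frames for 1,000 frames of audio, while the cache-aware approach encodes exactly 1,000, which is the source of the computational savings the model card describes.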
Network Architecture:
This model was developed based on nvidia/nemotron-speech-streaming-en-0.6b [1].
Number of model parameters: 600M
Input Type(s): Audio
Input Format(s): wav
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: Maximum audio length in seconds is limited by available GPU memory; no pre-processing needed; mono channel required. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times than CPU-only solutions.
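A quick way to check that an input file satisfies the WAV/mono requirement above is with Python's standard-library `wave` module. This is a minimal sketch: `check_input_wav` is a hypothetical helper, and the expected sample rate is an assumption not stated in this card (16 kHz is common for ASR models, but consult the model documentation).

```python
import wave

def check_input_wav(path):
    """Verify a WAV file meets the model's input requirements (mono channel).

    Returns basic properties of the file. The sample-rate requirement is an
    assumption; consult the model documentation for the expected rate.
    """
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        if channels != 1:
            raise ValueError(f"expected mono audio, got {channels} channels")
        rate = wf.getframerate()
        frames = wf.getnframes()
        return {
            "sample_rate": rate,
            "num_samples": frames,
            "duration_s": frames / rate,
        }
```

A file that fails the mono check would need to be downmixed before being sent to the model, since the card states stereo input is not supported.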
Output Type(s): Text String in English
Output Format(s): String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: No maximum character length; transcripts include punctuation and capitalization. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times than CPU-only solutions.
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
Preferred/Supported Operating System(s):
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
nemotron-asr-streaming-riva-v1
Data Modality: Audio
Audio Training Data Size: 10,000 to 1 Million Hours
Data Collection Method by dataset:
Labeling Method by dataset:
Properties: In excess of 250,000 hours of English (en-US) copyright-protected speech comprising harvested, internal, and public datasets, normalized to spoken forms in text with punctuation and capitalization.
Data Collection Method by dataset:
Labeling Method by dataset:
Properties: A dynamic blend of public and internal proprietary datasets, normalized to spoken forms in text with punctuation and capitalization.
Data Collection Method by dataset:
Labeling Method by dataset:
Properties: A dynamic blend of public, internal proprietary, and customer datasets, normalized to spoken forms in text with punctuation and capitalization.
Acceleration Engine: Triton
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.
Please report model quality, risk, security vulnerabilities, or NVIDIA AI Concerns here.
Get access to knowledge base articles and support cases or submit a ticket.