
Nemotron 3 ASR (Nemotron-ASR-Streaming) is a streaming Automatic Speech Recognition (ASR) model engineered to deliver high-quality English transcription across both low-latency streaming and high-throughput batch workloads. Developed by NVIDIA, this 600M-parameter model transcribes speech into text with native support for punctuation and capitalization.
By leveraging a state-of-the-art Cache-Aware FastConformer-RNNT architecture, the model eliminates redundant overlapping computations common in traditional "buffered" streaming. This allows it to process only new audio chunks while reusing cached encoder context, significantly improving computational efficiency and minimizing end-to-end delay without sacrificing accuracy.
It transcribes speech using the English alphabet, spaces, and apostrophes, with full punctuation and capitalization. Trained on ASRSet, a dataset of approximately 250,000 hours of US English (en-US) speech, it is engineered to perform well across diverse and challenging acoustic conditions.
This model is ready for commercial/non-commercial use.
Governing Terms: This trial service is governed by the NVIDIA API Trial Terms of Service. The NIM container is governed by the NVIDIA Software License Agreement; use of the model is governed by the NVIDIA Open Model License Agreement.
Global
This model is for transcription of English audio.
(URLs to be added.)
Architecture Type: FastConformer-CacheAware-RNNT
The model is based on the Cache-Aware [1] FastConformer [2] architecture with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The cache-aware streaming design enables efficient chunked processing of audio while maintaining context from previous frames. Unlike buffered inference, this model maintains caches for all encoder self-attention and convolution layers, allowing hidden states to be reused at every streaming step. Because cached activations eliminate redundant computation, each streaming step processes strictly non-overlapping frames.
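The efficiency argument above can be illustrated with a small counting sketch. This is a conceptual illustration only (not the NeMo or Riva API): `buffered_steps` and `cache_aware_steps` are hypothetical helpers that count how many encoder frames each approach processes per step, assuming a buffered decoder re-encodes a left-context window at every step while a cache-aware decoder encodes only the new chunk.

```python
# Conceptual sketch: frames encoded per streaming step under two strategies.
# Buffered streaming re-encodes an overlapping left-context window each step;
# cache-aware streaming carries encoder state forward and encodes only the
# new, non-overlapping chunk.

def buffered_steps(num_frames, chunk, context):
    """Frames encoded per step with buffered streaming: new chunk + re-encoded left context."""
    steps = []
    for start in range(0, num_frames, chunk):
        left = min(context, start)                      # overlapping frames re-encoded
        steps.append(left + min(chunk, num_frames - start))
    return steps

def cache_aware_steps(num_frames, chunk):
    """Frames encoded per step with cached state: only the new chunk, no overlap."""
    return [min(chunk, num_frames - s) for s in range(0, num_frames, chunk)]

total_buffered = sum(buffered_steps(1000, chunk=100, context=300))
total_cached = sum(cache_aware_steps(1000, chunk=100))
# The cache-aware total equals num_frames exactly: no frame is encoded twice.
```

With these example numbers, the buffered approach encodes 3,400 frames for 1,000 frames of audio, while the cache-aware approach encodes exactly 1,000, which is the source of the computational savings the model card describes.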
Network Architecture:
This model was developed based on nvidia/nemotron-speech-streaming-en-0.6b [1].
Number of model parameters: 600M
Input Type(s): Audio
Input Format(s): wav
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: Maximum audio length in seconds is limited by available GPU memory; no pre-processing needed; mono channel required. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times than CPU-only solutions.
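A quick way to check that an input file satisfies the WAV/mono requirement above is with Python's standard-library `wave` module. This is a minimal sketch: `check_input_wav` is a hypothetical helper, and the expected sample rate is an assumption not stated in this card (16 kHz is common for ASR models, but consult the model documentation).

```python
import wave

def check_input_wav(path):
    """Verify a WAV file meets the model's input requirements (mono channel).

    Returns basic properties of the file. The sample-rate requirement is an
    assumption; consult the model documentation for the expected rate.
    """
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        if channels != 1:
            raise ValueError(f"expected mono audio, got {channels} channels")
        rate = wf.getframerate()
        frames = wf.getnframes()
        return {
            "sample_rate": rate,
            "num_samples": frames,
            "duration_s": frames / rate,
        }
```

A file that fails the mono check would need to be downmixed before being sent to the model, since the card states stereo input is not supported.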
Output Type(s): Text String in English
Output Format(s): String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: No maximum character length; transcripts include punctuation and capitalization. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times than CPU-only solutions.
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
Preferred/Supported Operating System(s):
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
nemotron-asr-streaming-riva-v1
Data Modality: Audio
Audio Training Data Size: 10,000 to 1 Million Hours
Data Collection Method by dataset:
Labeling Method by dataset:
Properties: In excess of 250,000 hours of English (en-US) copyright-protected speech comprising harvested, internal, and public datasets, normalized to spoken forms in text with punctuation and capitalization.
Data Collection Method by dataset:
Labeling Method by dataset:
Properties: A dynamic blend of public and internal proprietary datasets, normalized to spoken forms in text with punctuation and capitalization.
Data Collection Method by dataset:
Labeling Method by dataset:
Properties: A dynamic blend of public, internal proprietary, and customer datasets, normalized to spoken forms in text with punctuation and capitalization.
Acceleration Engine: Triton
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.
Please report model quality, risk, security vulnerabilities, or NVIDIA AI Concerns here.
Get access to knowledge base articles and support cases or submit a ticket.