Copyright © 2025 NVIDIA Corporation

Accurate and optimized English transcriptions with punctuation and word timestamps

ASR, English, NVIDIA NIM, NVIDIA Riva, speech-to-text

Parakeet-tdt-0.6b-v2 English speech to text model

Description:

Parakeet-tdt-0.6b-v2 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction.

This XL variant of the FastConformer architecture integrates the TDT decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass.
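For recordings longer than the 24-minute single-pass limit, the audio has to be split before transcription. The following is a minimal sketch of planning such a split. The function name `plan_segments` and the fixed-boundary strategy are assumptions made for illustration and are not part of NeMo or Riva; a production pipeline would more likely cut at silences so that words are not split across segments.

```python
# Sketch only: plan segments no longer than the 24-minute single-pass limit
# quoted in the description. Names are illustrative, not a NeMo or Riva API.
MAX_SEGMENT_SECONDS = 24 * 60  # 24 minutes, expressed in seconds


def plan_segments(total_seconds, max_seconds=MAX_SEGMENT_SECONDS):
    """Return (start, end) pairs in seconds that cover the whole recording."""
    if total_seconds < 0 or max_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and max_seconds must be > 0")
    segments = []
    start = 0.0
    while start < total_seconds:
        # Each segment ends at the limit or at the end of the recording.
        end = min(start + max_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


# A one-hour recording needs three segments: two full ones and a remainder.
segments = plan_segments(60 * 60)
for start, end in segments:
    print(f"{start:8.1f} s -> {end:8.1f} s")
```

Each planned span can then be extracted and transcribed separately, with the segment start added back to any timestamps the model returns.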

Key Features

  • Accurate word-level timestamp predictions
  • Automatic punctuation and capitalization
  • Robust performance on spoken numbers and song lyrics
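As an illustration of how word-level timestamps can be used, the sketch below groups timestamped words into SubRip (SRT) subtitle cues. The input structure (a list of dicts with `word`, `start`, and `end` keys, times in seconds) is an assumption made for this example; the exact shape of the model's timestamp output depends on the runtime and should be checked against its documentation.

```python
# Sketch only: turn word-level timestamps into SRT subtitle cues.
# The {"word", "start", "end"} layout is a hypothetical input format.
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group words into numbered cues of at most max_words words each."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w["word"] for w in group)
        start = format_srt_time(group[0]["start"])
        end = format_srt_time(group[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)


example_words = [
    {"word": "Hello,", "start": 0.0, "end": 0.4},
    {"word": "world.", "start": 0.5, "end": 0.9},
]
print(words_to_srt(example_words))
```

Grouping by a fixed word count keeps the example short; splitting cues at punctuation or long pauses would usually read better on screen.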

This model is ready for commercial/non-commercial use.

License/Terms of Use:

GOVERNING TERMS: Use of this model is governed by the NVIDIA Community Model License Agreement.

Deployment Geography:

Global

Use Case:

This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms.

Model Architecture:

Architecture Type:

FastConformer-TDT

Network Architecture:

  • This model is built on the FastConformer encoder architecture and the TDT decoder.
  • This model has 600 million parameters.

Input:

Input Type(s): 16 kHz audio
Input Format(s): .wav and .flac audio formats
Input Parameters: 1D (audio signal)
Other Properties Related to Input: Single-channel (mono) audio
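Before sending audio to the model, it can help to confirm that a file matches these input properties. The sketch below uses Python's standard `wave` module to check the sample rate and channel count of a .wav file. The helper name `check_wav` is an assumption for illustration; the standard library does not read .flac, so FLAC files would need a separate tool.

```python
# Sketch only: verify that a WAV file is 16 kHz mono, as the model expects.
import wave

EXPECTED_SAMPLE_RATE = 16_000  # Hz
EXPECTED_CHANNELS = 1          # mono


def check_wav(source):
    """Return a list of problems; an empty list means the file looks fine.

    `source` may be a file path or a binary file-like object.
    """
    problems = []
    with wave.open(source, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    if rate != EXPECTED_SAMPLE_RATE:
        problems.append(
            f"sample rate is {rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz"
        )
    if channels != EXPECTED_CHANNELS:
        problems.append(f"found {channels} channels, expected mono")
    return problems
```

A file that fails either check would need to be resampled or downmixed before transcription.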

Output:

Output Type(s): Text
Output Format: String
Output Parameters: 1D (text)
Other Properties Related to Output: Punctuation and capitalization included.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • Riva

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Volta

Supported Operating System(s):

  • Linux

Hardware Specific Requirements:

At least 2 GB of RAM is required to load the model. The more RAM available, the longer the audio input the model can handle.

Model Version

Current version: parakeet-tdt-0.6b-v2

Training and Evaluation Datasets:

Training

This model was trained using the NeMo toolkit [3], following the strategies below:

  • Initialized from a wav2vec SSL checkpoint pretrained on the LibriLight dataset [7].
  • Trained for 150,000 steps on 128 A100 GPUs.
  • Dataset corpora were balanced using a temperature sampling value of 0.5.
  • Stage 2 fine-tuning was performed for 2,500 steps on 4 A100 GPUs using approximately 500 hours of high-quality, human-transcribed data from NeMo ASR Set 3.0.
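The corpus balancing step above can be sketched as follows. The card does not give the exact formula, so this example assumes one common convention: each corpus is sampled with probability proportional to its share of the data raised to the power of the temperature, which with a value of 0.5 upweights smaller corpora. The corpus names and hour counts are hypothetical.

```python
# Sketch only: temperature-based sampling weights for balancing corpora.
# Assumed convention: weight_i is proportional to (hours_i / total) ** temperature.
def temperature_weights(hours_by_corpus, temperature=0.5):
    """Return a dict of sampling probabilities that sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(hours_by_corpus.values())
    if total <= 0:
        raise ValueError("total hours must be positive")
    scaled = {
        name: (hours / total) ** temperature
        for name, hours in hours_by_corpus.items()
    }
    norm = sum(scaled.values())
    return {name: value / norm for name, value in scaled.items()}


# Hypothetical corpora: one holds 90% of the data, the other 10%.
weights = temperature_weights({"large": 9000, "small": 1000})
# With temperature 0.5 the split moves from 90/10 to approximately 75/25.
print(weights)
```

A temperature of 1.0 under this convention reproduces the natural proportions, and values closer to 0 approach uniform sampling across corpora.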

Training was conducted using this example script and TDT configuration.

The tokenizer was constructed from the training set transcripts using this script.

Data Collection Method by dataset

  • Hybrid: Automated, Human

Labeling Method by dataset

  • Hybrid: Synthetic, Human

Properties:

  • Noise-robust data from various sources
  • Single-channel, 16 kHz sampled data

References

[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

[2] Efficient Sequence Transduction by Jointly Predicting Tokens and Durations

[3] NVIDIA NeMo Toolkit

[4] Youtube-commons: A massive open corpus for conversational and multimodal data

[5] Yodas: Youtube-oriented dataset for audio and speech

[6] HuggingFace ASR Leaderboard

[7] MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages

Inference:

Engine:

  • NVIDIA NeMo

Test Hardware:

  • NVIDIA A10
  • NVIDIA A100
  • NVIDIA A30
  • NVIDIA H100
  • NVIDIA L4
  • NVIDIA L40
  • NVIDIA Turing T4
  • NVIDIA Volta V100

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards here.

Please report security vulnerabilities or NVIDIA AI Concerns here.