
Copyright © 2026 NVIDIA Corporation

nvidia

studiovoice


Enhance input speech recorded with low-quality microphones in noisy or reverberant environments, producing studio-quality speech.

Tags: communications, mic quality, nvidia ai for media, speech enhancement

Model Overview

Description:

Studio Voice is a speech enhancement model from NVIDIA. It enhances input speech recorded with low-quality microphones in noisy or reverberant environments, producing studio-quality speech.

This model is ready for commercial use.

License/Terms of Use:

Use of this model is governed by the NVIDIA Open Model License.

Deployment Geography:

Global

Use Case:

The Studio Voice model removes microphone artifacts and room reverberation to produce studio-quality voice. It is intended for use by content developers and broadcasters.

Release Date:

NGC [03/16/2026] via afx_studio_voice

Model Architecture

Architecture Type: Convolutional Neural Networks (CNNs), Transformers, Generative Adversarial Networks (GANs)
Network Architecture: Encoder-Decoder
Number of model parameters: 183M

Input(s):

Input Type(s): Audio
Input Format(s): PCM F32
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: Pulse Code Modulation (PCM) audio samples with no encoding or pre-processing; 16 kHz or 48 kHz sampling rate required.

Output(s):

Output Type(s): Audio
Output Format: PCM F32
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: PCM audio samples at input sampling rate with no encoding or post-processing.
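
As an illustration of the input requirements above, the following is a minimal sketch of preparing audio for the model. It assumes the source audio is mono, little-endian, 16-bit integer PCM, and that dividing by 32768 is an acceptable way to map it to float32; neither assumption comes from this card. The function names `int16_bytes_to_f32` and `prepare_input` are hypothetical and are not part of any NVIDIA API.

```python
import struct

# Sampling rates listed in this card as required for the input.
SUPPORTED_RATES_HZ = (16_000, 48_000)


def int16_bytes_to_f32(pcm_bytes):
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0).

    Assumption: scaling by 32768 is a common convention, not something
    this card specifies.
    """
    if len(pcm_bytes) % 2:
        raise ValueError("16-bit PCM needs an even number of bytes")
    count = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % count, pcm_bytes)
    return [s / 32768.0 for s in samples]


def prepare_input(pcm_bytes, sample_rate_hz):
    """Check the sampling rate and return packed float32 (PCM F32) bytes."""
    if sample_rate_hz not in SUPPORTED_RATES_HZ:
        raise ValueError("unsupported sampling rate: %d Hz" % sample_rate_hz)
    floats = int16_bytes_to_f32(pcm_bytes)
    return struct.pack("<%df" % len(floats), *floats)
```

Audio at any other rate (for example 44.1 kHz) would need to be resampled to 16 kHz or 48 kHz before this step; resampling is outside the scope of this sketch.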

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

Runtime Engine(s):

  • Audio Effects 2.1.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Turing

Preferred/Supported Operating System(s):

  • Linux
  • Windows

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.

Model Version(s)

StudioVoice v0.5.1

Training, Testing, and Evaluation Datasets

Data Modality:

  • Audio

Audio Training Data Size:

  • Less than 10,000 Hours

Dataset partition: Training [80%], Testing [10%], Validation [10%]
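
The 80/10/10 partition above can be sketched as a seeded shuffle followed by slicing. This is an illustration only: the card does not describe how the split was actually performed (for example, whether it was done per file or per speaker), and `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(items, train_frac=0.8, test_frac=0.1, seed=0):
    """Shuffle items reproducibly and split into train/test/validation.

    Whatever remains after the train and test slices becomes validation,
    so every item lands in exactly one partition.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation
```

With 100 items and the default fractions this yields partitions of 80, 10, and 10 items. For speech data, splitting by speaker rather than by utterance is often preferred to avoid leakage between partitions, but that is a design choice this sketch does not make.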

NVIDIA models are trained on a diverse set of public and proprietary datasets. The Studio Voice model is trained on a dataset that comprises diverse English accents and different types of microphone devices.

Data Collection Method by dataset: [Hybrid: Human, Synthetic]
Labeling Method by dataset: [Hybrid: Human, Synthetic]

Link: DAPS
Properties:
The DAPS dataset has 15 versions of audio (3 professional versions and 12 consumer device/real-world environment combinations). Each version consists of about 4.5 hours of data (about 14 minutes from each of 20 speakers).

Link: LibriTTS
Properties:
LibriTTS is a multi-speaker corpus of approximately 585 hours of read English speech, resampled to 16 kHz.

Link: VCTK
Properties:
The CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the Rainbow Passage, and an elicitation paragraph used for the Speech Accent Archive.

Link: HiFi-TTS
Properties:
A multi-speaker English dataset for training text-to-speech models. The HiFi-TTS dataset contains about 291.6 hours of speech from 10 speakers with at least 17 hours per speaker sampled at 44.1 kHz.

Link: Device Recorded VCTK (DR-VCTK)
Properties:
Device-recorded version of the VCTK dataset, captured on common consumer devices (laptop, tablet, and smartphone) in an office environment. This dataset contains 109 English speakers with different accents, with around 400 sentences available from each speaker. Eight different microphones were used for the recording. This dataset contains around 250 Gb of re-recorded speech.

Link: Dataset of impulse responses from variable acoustics room Arni at Aalto Acoustic Labs
Properties:
A dataset of impulse responses collected in the variable acoustics laboratory Arni at the Acoustics Lab of Aalto University, Espoo, Finland. IRs of 5342 configurations of sound absorption in Arni are included in the dataset. Each configuration was measured using an omnidirectional sound source and 5 sound receivers. For each configuration, 5 impulse responses (IRs) were captured. The total number of measurements in the dataset is 132,037.
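
This card does not say how the impulse responses are used. One common use of room impulse responses in speech enhancement work is to convolve clean speech with an IR to simulate a reverberant recording; the sketch below illustrates only that general operation and should not be read as a description of the Studio Voice training pipeline. The function name `convolve` is hypothetical, and the direct loop is chosen for clarity rather than speed.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of a signal with an impulse response.

    The output has len(signal) + len(impulse_response) - 1 samples.
    Each input sample contributes a scaled, delayed copy of the IR.
    """
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out
```

For example, convolving a single unit impulse with an IR reproduces the IR itself, which is what makes a measured IR a compact description of a room's acoustics. For real audio lengths, an FFT-based convolution would be far faster than this quadratic loop.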

Link: Room Impulse Response and Noise Database
Properties:
A database of simulated and real room impulse responses, and isotropic and point-source noises. All audio files use a 16 kHz sampling rate and 16-bit precision.

Link: DNS Challenge 5
Properties:
Collated dataset of clean speech, noise, and impulse responses provided by Microsoft for the ICASSP 2023 Deep Noise Suppression Challenge.

Link: AudioSet
Properties:
AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from publicly available internet scale data.

Link: Multi-Language and Emotions Speech Dataset
Properties:
The Multi-Language and Emotions Speech Dataset contains approximately 140 hours of high-quality speech from approximately 80 unique speakers. It covers different English accents and different emotions. This dataset is taken from publicly available internet-scale data.

Link: Audio2Gesture Bones Dataset
Properties:
The Audio2Gesture Bones Dataset is a speech dataset purchased from Bones.studio. It contains only 2 speakers covering different types of emotions. It is a small dataset of around 4 GB containing 7 hours of speech data.

Testing Datasets

Data Collection Method by dataset: [Hybrid: Human, Synthetic]
Labeling Method by dataset: [Hybrid: Human, Synthetic]

Properties:

The Studio Voice model is tested on a dataset that comprises diverse English accents and different types of microphone devices. Test data is taken by sampling 10% of the training dataset mentioned above. The modality and data type are the same as those of the training dataset.

Evaluation Datasets

Data Collection Method by dataset: [Hybrid: Human, Synthetic]
Labeling Method by dataset: [Hybrid: Human, Synthetic]

Properties:

The Studio Voice model is evaluated on a dataset that comprises diverse English accents and different types of microphone devices. Evaluation data is taken by sampling 10% of the training dataset mentioned above. The modality and data type are the same as those of the training dataset.

Inference

Acceleration Engine: TensorRT, Triton
Test Hardware:

  • T4, T10, A30, A100, A2, A10, A16, A40, L4, L40, H100, B40, B100
  • RTX 4080, RTX 4090, RTX 5070, RTX 5080, RTX 5090

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.