nvidia/studiovoice
Enhance speech by correcting common audio degradations to create studio quality speech output.
Model Overview
Description
Maxine Studio Voice enhances speech recorded through low-quality microphones in noisy or reverberant environments, producing studio-recorded quality output.
Studio Voice is available under NVIDIA Maxine, a developer platform for deploying AI features that enhance audio and video and create new experiences in real-time audio-video communication. Maxine's state-of-the-art models deliver high-quality AI effects using standard microphones and cameras, without additional specialized equipment.
NVIDIA Maxine is available exclusively as part of NVIDIA AI Enterprise for production workflows, an extensive library of full-stack software that includes AI solution workflows, frameworks, pre-trained models, and infrastructure optimization.
Terms of Use
NVIDIA Maxine's Studio Voice is available as a demonstration of the input and output of the Studio Voice generative model. The user may upload an audio file or select one of the sample inputs, then download the generated audio for evaluation under the terms of the NVIDIA MAXINE EVALUATION LICENSE AGREEMENT.
Model Architecture
Architecture Type: Convolutional Neural Networks (CNNs), Transformers,
Generative Adversarial Networks (GANs)
Network Architecture: Encoder-Decoder
Model Version: 0.2
Input:
Input Type(s): Ordered List (audio samples)
Input Format(s): FP32 (-1.0 to 1.0)
Other Properties Related to Input: Pulse Code Modulation (PCM) audio samples
with no encoding or pre-processing; 16 kHz or 48 kHz sampling rate required.
Output:
Output Type(s): Ordered List (audio samples)
Output Format: FP32 (-1.0 to 1.0)
Other Properties Related to Output: PCM audio samples at the input sampling rate
with no encoding or post-processing.
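Because the model expects raw FP32 PCM in [-1.0, 1.0] at 16 kHz or 48 kHz, input audio generally needs to be decoded and resampled before inference. Below is a minimal sketch of that preparation step, assuming the common soundfile and soxr packages (neither is a Maxine requirement):

```python
# Sketch: decode an audio file to mono float32 PCM in [-1.0, 1.0]
# at a sampling rate the model accepts (16 kHz or 48 kHz).
import numpy as np
import soundfile as sf
import soxr

TARGET_RATE = 48000  # or 16000; the model requires one of the two

def load_pcm(path: str) -> np.ndarray:
    samples, rate = sf.read(path, dtype="float32")  # soundfile scales to [-1.0, 1.0]
    if samples.ndim > 1:                            # downmix multi-channel to mono
        samples = samples.mean(axis=1)
    if rate != TARGET_RATE:                         # resample if rates differ
        samples = soxr.resample(samples, rate, TARGET_RATE)
    return np.clip(samples, -1.0, 1.0)
```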
Software Integration
Supported Hardware Platform(s): Hopper, Ada, Ampere, Turing, Volta
Test Hardware: A10, L40, T10
Supported Operating System(s): Linux, Windows
Training & Evaluation
Datasets
NVIDIA models are trained on a diverse set of public and proprietary datasets. The Studio Voice model is trained on a dataset comprising diverse English accents.
Link: DAPS
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The DAPS dataset has 15 versions of audio (3 professional versions and
12 consumer device/real-world environment combinations). Each version consists
of about 4.5 hours of data (about 14 minutes from each of 20 speakers).
Link: LibriTTS
Properties (Quantity, Dataset Descriptions, Sensor(s)):
LibriTTS is a multi-speaker English corpus of approximately 585 hours of read
English speech, resampled to 16 kHz.
Link: VCTK
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The CSTR VCTK Corpus includes speech data uttered by 110 English speakers with
various accents. Each speaker reads about 400 sentences selected from a
newspaper, the Rainbow Passage, and an elicitation paragraph used for the
Speech Accent Archive.
Link: HiFi-TTS
Properties (Quantity, Dataset Descriptions, Sensor(s)):
A multi-speaker English dataset for training text-to-speech models.
The HiFi-TTS dataset contains about 291.6 hours of speech from 10 speakers,
with at least 17 hours per speaker, sampled at 44.1 kHz.
Link: Device Recorded VCTK (DR-VCTK)
Properties (Quantity, Dataset Descriptions, Sensor(s)):
A device-recorded version of the VCTK dataset, captured on common consumer
devices (laptop, tablet, and smartphone) in an office environment. The dataset
contains 109 English speakers with different accents, with around 400 sentences
available from each speaker. Eight different microphones were used for the
recordings, yielding around 250 GB of re-recorded speech.
Link: Dataset of impulse responses from variable acoustics room Arni at Aalto Acoustic Labs
Properties (Quantity, Dataset Descriptions, Sensor(s)):
A dataset of impulse responses collected in the variable acoustics laboratory
Arni at the Acoustics Lab of Aalto University, Espoo, Finland. The dataset
covers 5,342 configurations of sound absorption in Arni, each measured using an
omnidirectional sound source and 5 sound receivers. For each configuration,
5 impulse responses (IRs) were captured, for a total of 132,037 measurements.
Link: Room Impulse Response and Noise Database
Properties (Quantity, Dataset Descriptions, Sensor(s)):
A database of simulated and real room impulse responses, plus isotropic and
point-source noises. All audio files are sampled at 16 kHz with 16-bit
precision.
Link: DNS Challenge 5
Properties (Quantity, Dataset Descriptions, Sensor(s)):
A collated dataset of clean speech, noise, and impulse responses provided by
Microsoft for the ICASSP 2023 Deep Noise Suppression Challenge.
Link: AudioSet
Properties (Quantity, Dataset Descriptions, Sensor(s)):
AudioSet consists of an expanding ontology of 632 audio event classes and
a collection of 2,084,320 human-labeled 10-second sound clips drawn from
YouTube videos.
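Several of the datasets above (Arni, the Room Impulse Response and Noise Database, DNS Challenge 5, AudioSet) supply room impulse responses and noise rather than studio-quality speech. In speech-enhancement training pipelines, such material is typically used to synthesize degraded inputs by convolving clean speech with a room IR and mixing in noise at a target SNR. The sketch below illustrates that general technique; the exact augmentation recipe used for Studio Voice is not published:

```python
# Sketch: synthesize a degraded training input from clean speech, a room
# impulse response, and a noise clip. Illustrative only; not the actual
# Studio Voice training recipe.
import numpy as np
from scipy.signal import fftconvolve

def degrade(clean: np.ndarray, ir: np.ndarray, noise: np.ndarray,
            snr_db: float = 10.0) -> np.ndarray:
    reverberant = fftconvolve(clean, ir)[: len(clean)]  # apply room acoustics
    noise = np.resize(noise, len(clean))                # loop/trim noise to length
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10 * log10(speech_power / noise_power) == snr_db
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + gain * noise
```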
Inference
Engine: Triton
Test Hardware: A10, L40, T10
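Since the model is served through Triton, inference can be driven with the standard tritonclient package. A minimal gRPC client sketch follows; the model name ("studiovoice") and tensor names ("INPUT_AUDIO", "OUTPUT_AUDIO") are hypothetical placeholders, so check the deployed model's configuration for the actual values:

```python
# Sketch: send FP32 PCM audio to a Triton-served model over gRPC.
# Model and tensor names below are placeholders, not the real configuration.
import soundfile as sf
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Assumes a mono file already at 16 kHz or 48 kHz (see the loading sketch above).
audio, rate = sf.read("noisy_input.wav", dtype="float32")
audio = audio.reshape(1, -1)  # add a batch dimension

infer_input = grpcclient.InferInput("INPUT_AUDIO", list(audio.shape), "FP32")
infer_input.set_data_from_numpy(audio)

result = client.infer(model_name="studiovoice",
                      inputs=[infer_input],
                      outputs=[grpcclient.InferRequestedOutput("OUTPUT_AUDIO")])
enhanced = result.as_numpy("OUTPUT_AUDIO")  # FP32 PCM at the input sampling rate
sf.write("enhanced_output.wav", enhanced.squeeze(), rate)
```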
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.