
Model Overview
Description
The Maxine Background Noise Removal (BNR) model is an audio denoising model from NVIDIA. It removes a variety of background noises from audio recordings while retaining emotive tones in speech, such as happy, sad, excited, and angry tones.
BNR is available as part of NVIDIA Maxine, a developer platform for deploying AI features that enhance audio and video and create new experiences in real-time audio-video communication. Maxine's state-of-the-art models produce high-quality AI effects using standard microphones and cameras, without additional specialized equipment.
For production workflows, NVIDIA Maxine is available exclusively through NVIDIA AI Enterprise, an extensive library of full-stack software that includes AI solution workflows, frameworks, pre-trained models, and infrastructure optimization.
Terms of use
NVIDIA Maxine's BNR is provided here as a demonstration of the inputs and outputs of the BNR generative model. As such, you may upload an audio file or select one of the sample inputs and download the generated audio for evaluation under the terms of the NVIDIA MAXINE EVALUATION LICENSE AGREEMENT.
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.
Model Architecture:
Architecture Type: Residual Convolutional Recurrent Neural Network (CRNN)
Network Architecture: SEASR
Model Version: 1.0
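The card names the network family (residual CRNN) but does not publish the SEASR topology itself. Purely as an illustration of what a residual convolutional recurrent network for speech enhancement can look like, the hypothetical PyTorch sketch below combines residual 1-D convolution blocks with a GRU that predicts a per-frame spectral mask; every layer size, the number of blocks, and the masking head are assumptions for the example, not the actual BNR architecture.

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """1-D convolution block with a residual (skip) connection."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.act(self.norm(self.conv(x)))     # residual connection

class ToyResidualCRNN(nn.Module):
    """Hypothetical residual CRNN: conv front-end -> GRU -> per-frame spectral mask."""
    def __init__(self, features: int = 257, channels: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Conv1d(features, channels, kernel_size=1)
        self.res_blocks = nn.Sequential(*[ResidualConvBlock(channels) for _ in range(3)])
        self.rnn = nn.GRU(channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, features)

    def forward(self, spec):                    # spec: (batch, features, frames)
        x = self.res_blocks(self.embed(spec))
        x, _ = self.rnn(x.transpose(1, 2))      # (batch, frames, channels)
        mask = torch.sigmoid(self.head(x))      # per-frame mask in [0, 1]
        return spec * mask.transpose(1, 2)      # apply mask to the noisy spectrum
```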
Input:
Input Type(s): Ordered List (audio samples)
Input Format(s): FP32 (-1.0 to 1.0)
Other Properties Related to Input: Pulse Code Modulation (PCM) audio samples
with no encoding or pre-processing; 16 kHz or 48 kHz sampling rate required.
Output:
Output Type(s): Ordered List (audio samples)
Output Format: FP32 (-1.0 to 1.0)
Other Properties Related to Output: PCM audio samples at input sampling rate
with no encoding or post-processing.
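As a concrete illustration of the input/output contract above (un-encoded FP32 PCM samples in the -1.0 to 1.0 range at 16 kHz or 48 kHz), the sketch below loads an arbitrary audio file into that format; the soundfile and scipy dependencies, the helper name, and the file names are assumptions for the example and are not part of the Maxine SDK.

```python
from math import gcd

import numpy as np
import soundfile as sf                   # assumed available for audio file I/O
from scipy.signal import resample_poly   # polyphase resampling

TARGET_SR = 48_000  # BNR requires 16 kHz or 48 kHz PCM input

def load_pcm_float32(path: str, target_sr: int = TARGET_SR) -> np.ndarray:
    """Load an audio file as mono FP32 PCM samples in [-1.0, 1.0] at target_sr."""
    samples, sr = sf.read(path, dtype="float32", always_2d=True)
    samples = samples.mean(axis=1)                        # down-mix to mono
    if sr != target_sr:
        g = gcd(sr, target_sr)                            # e.g. 44100 -> 48000
        samples = resample_poly(samples, target_sr // g, sr // g).astype(np.float32)
    return np.clip(samples, -1.0, 1.0)                    # keep samples in the FP32 range

# Hypothetical usage: prepare the input buffer, run BNR, write the result back out.
noisy = load_pcm_float32("meeting_recording.wav")
# denoised = run_bnr(noisy)                               # placeholder for the BNR call
# sf.write("meeting_recording_clean.wav", denoised, TARGET_SR)
```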
Software Integration
Supported Hardware Microarchitecture Compatibility:
- Volta
- Turing
- Ada
- Ampere
- Hopper
- Blackwell
Supported Operating System(s):
- Linux
Training & Evaluation
Datasets
NVIDIA models are trained on a diverse set of public and proprietary datasets. The BNR model is trained on a wide range of English-language accents, several European and Asian languages, and 29 different noise profiles that are commonly heard in the real world.
Data Collection Method by dataset: [Hybrid: Human, Synthetic]
Labeling Method by dataset: [Hybrid: Human, Synthetic]
Link: AudioSet
Properties (Quantity, Dataset Descriptions, Sensor(s)):
AudioSet consists of an expanding ontology of 632 audio event classes and
a collection of 2,084,320 human-labeled 10-second sound clips drawn from
YouTube videos.
Link: CREMA-D
Properties (Quantity, Dataset Descriptions, Sensor(s)):
CREMA-D is a data set of 7,442 original clips from 91 actors. These clips were
from 48 male and 43 female actors between the ages of 20 and 74 coming from a
variety of races and ethnicities (African American, Asian, Caucasian, Hispanic,
and Unspecified). Actors spoke from a selection of 12 sentences. The sentences
were presented using one of six different emotions (Anger, Disgust, Fear, Happy,
Neutral, and Sad) and four different emotion levels (Low, Medium, High, and
Unspecified).
Link: Crowdsourced high-quality UK and Ireland English Dialect speech data set
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The dataset contains high-quality recordings of male and female English speakers
from various dialects of the UK and Ireland, for a total of 17,877 lines.
Link: CSR-I WSJ0
Properties (Quantity, Dataset Descriptions, Sensor(s)):
A corpus created by the DARPA Spoken Language Program to support research on
large-vocabulary Continuous Speech Recognition (CSR) systems. The first two CSR
Corpora consist primarily of read speech with texts drawn from a machine-
readable corpus of Wall Street Journal news text and are thus often known as
WSJ0 and WSJ1. WSJ0 consists of 123 speakers.
Link: CSTR VCTK
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The CSTR VCTK Corpus includes speech data uttered by 110 English speakers with
various accents. Each speaker reads out about 400 sentences, which were selected
from a newspaper, the rainbow passage and an elicitation paragraph used for the
speech accent archive.
Link: DAPS
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The DAPS dataset has 15 versions of audio (3 professional versions and
12 consumer device/real-world environment combinations). Each version consists
of about 4.5 hours of data (about 14 minutes from each of 20 speakers).
Link: DEMAND
Properties (Quantity, Dataset Descriptions, Sensor(s)):
DEMAND (Diverse Environments Multichannel Acoustic Noise Database) is a collection of multichannel recordings of acoustic noise in diverse environments, intended for testing algorithms against real-world noise in a variety of settings. This version provides 15 recordings, all made with a 16-channel microphone array in which the smallest distance between microphones is 5 cm and the largest is 21.8 cm.
Link: Edinburgh 56 speaker dataset
Properties (Quantity, Dataset Descriptions, Sensor(s)):
Clean and noisy parallel speech database from 56 speakers designed to train and
test speech enhancement methods that operate at 48kHz.
Link: FreeField
Properties (Quantity, Dataset Descriptions, Sensor(s)):
A dataset of 7,690 standardised 10-second excerpts from Freesound field recordings.
Link: Freesound
Properties (Quantity, Dataset Descriptions, Sensor(s)):
Freesound is a collaborative collection of 620,291 free sounds, which includes speakers talking with different emotions as well as female speakers with high-pitched voices. The audio data also contains a few noise profiles.
Link: GTC Dataset
Properties (Quantity, Dataset Descriptions, Sensor(s)):
A collection of talks from NVIDIA GTC conferences, with a total of 103 speakers.
Link: HiFi-TTS
Properties (Quantity, Dataset Descriptions, Sensor(s)):
A multi-speaker English dataset for training text-to-speech models. The HiFi-TTS dataset contains about 291.6 hours of speech from 10 speakers with at least 17 hours per speaker sampled at 44.1 kHz.
Link: LibriTTS
Properties (Quantity, Dataset Descriptions, Sensor(s)):
LibriTTS is a multi-speaker English corpus of approximately 585 hours of read
English speech from 200 speakers, resampled to 16 kHz.
Link: VocalSet
Properties (Quantity, Dataset Descriptions, Sensor(s)):
VocalSet is a singing voice dataset consisting of 10.1 hours of monophonic
recorded audio of professional singers demonstrating both standard and extended
vocal techniques on all 5 vowels. Existing singing voice datasets aim to capture
a focused subset of singing voice characteristics, and generally consist of just
a few singers. VocalSet contains recordings from 20 different singers (9 male,
11 female) and a range of voice types. VocalSet aims to improve the state of
existing singing voice datasets and singing voice research by capturing not only
a range of vowels, but also a diverse set of voices on many different vocal
techniques, sung in contexts of scales, arpeggios, long tones, and excerpts.
Inference
Engine: Triton
Test Hardware: A10, A100, A16, A2, A30, A40, H100, L4, L40, RTX 4080, RTX 4090, RTX 5070, RTX 5080, RTX 5090, T4, V100
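Since the card lists Triton as the inference engine but does not publish the deployed model or tensor names, the following sketch only illustrates what a generic Triton HTTP client request for such an audio model might look like; the model name (maxine_bnr), tensor names (AUDIO_IN, AUDIO_OUT), and the single-batch shape are hypothetical placeholders.

```python
import numpy as np
import tritonclient.http as httpclient   # Triton HTTP client (pip install tritonclient[http])

# Hypothetical names: the actual deployed model and tensor names are not published here.
MODEL_NAME = "maxine_bnr"
INPUT_NAME, OUTPUT_NAME = "AUDIO_IN", "AUDIO_OUT"

def denoise(samples: np.ndarray, url: str = "localhost:8000") -> np.ndarray:
    """Send FP32 PCM samples to a Triton server and return the denoised samples."""
    client = httpclient.InferenceServerClient(url=url)
    batch = samples.astype(np.float32).reshape(1, -1)            # (batch, samples)
    infer_input = httpclient.InferInput(INPUT_NAME, batch.shape, "FP32")
    infer_input.set_data_from_numpy(batch)
    response = client.infer(
        model_name=MODEL_NAME,
        inputs=[infer_input],
        outputs=[httpclient.InferRequestedOutput(OUTPUT_NAME)],
    )
    return response.as_numpy(OUTPUT_NAME).reshape(-1)
```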
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.