
Model Overview
Description
The Maxine Background Noise Removal (BNR) model is an audio denoising model from NVIDIA. It removes a variety of background noises from audio recordings while retaining emotive tones in speech, such as happy, sad, excited, and angry tones.
BNR is available as part of NVIDIA Maxine, a developer platform for deploying AI features that enhance audio and video and create new experiences in real-time audio-video communication. Maxine's state-of-the-art models produce high-quality AI effects using standard microphones and cameras, without additional specialized equipment.
For production workflows, NVIDIA Maxine is available exclusively through NVIDIA AI Enterprise, an extensive library of full-stack software that includes AI solution workflows, frameworks, pre-trained models, and infrastructure optimization.
Terms of use
NVIDIA Maxine's BNR is provided here as a demonstration of the inputs and outputs of the BNR generative model. As such, you may upload an audio file or select one of the sample inputs and download the generated audio for evaluation under the terms of the NVIDIA MAXINE EVALUATION LICENSE AGREEMENT.
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.
Model Architecture:
Architecture Type: Residual Convolutional Recurrent Neural Network (CRNN)
Network Architecture: SEASR
Model Version: 1.0
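The card names the network family (residual CRNN) but does not publish the SEASR topology itself. Purely as an illustration of what a residual convolutional recurrent network for speech enhancement can look like, the hypothetical PyTorch sketch below combines residual 1-D convolution blocks with a GRU that predicts a per-frame spectral mask; every layer size, the number of blocks, and the masking head are assumptions for the example, not the actual BNR architecture.

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """1-D convolution block with a residual (skip) connection."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.act(self.norm(self.conv(x)))     # residual connection

class ToyResidualCRNN(nn.Module):
    """Hypothetical residual CRNN: conv front-end -> GRU -> per-frame spectral mask."""
    def __init__(self, features: int = 257, channels: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Conv1d(features, channels, kernel_size=1)
        self.res_blocks = nn.Sequential(*[ResidualConvBlock(channels) for _ in range(3)])
        self.rnn = nn.GRU(channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, features)

    def forward(self, spec):                    # spec: (batch, features, frames)
        x = self.res_blocks(self.embed(spec))
        x, _ = self.rnn(x.transpose(1, 2))      # (batch, frames, channels)
        mask = torch.sigmoid(self.head(x))      # per-frame mask in [0, 1]
        return spec * mask.transpose(1, 2)      # apply mask to the noisy spectrum
```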
Input:
Input Type(s): Ordered List (audio samples)
Input Format(s): FP32 (-1.0 to 1.0)
Other Properties Related to Input: Pulse Code Modulation (PCM) audio samples
with no encoding or pre-processing; 16 kHz or 48 kHz sampling rate required.
Output:
Output Type(s): Ordered List (audio samples)
Output Format: FP32 (-1.0 to 1.0)
Other Properties Related to Output: PCM audio samples at input sampling rate
with no encoding or post-processing.
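As a concrete illustration of the input/output contract above (un-encoded FP32 PCM samples in the -1.0 to 1.0 range at 16 kHz or 48 kHz), the sketch below loads an arbitrary audio file into that format; the soundfile and scipy dependencies, the helper name, and the file names are assumptions for the example and are not part of the Maxine SDK.

```python
from math import gcd

import numpy as np
import soundfile as sf                   # assumed available for audio file I/O
from scipy.signal import resample_poly   # polyphase resampling

TARGET_SR = 48_000  # BNR requires 16 kHz or 48 kHz PCM input

def load_pcm_float32(path: str, target_sr: int = TARGET_SR) -> np.ndarray:
    """Load an audio file as mono FP32 PCM samples in [-1.0, 1.0] at target_sr."""
    samples, sr = sf.read(path, dtype="float32", always_2d=True)
    samples = samples.mean(axis=1)                        # down-mix to mono
    if sr != target_sr:
        g = gcd(sr, target_sr)                            # e.g. 44100 -> 48000
        samples = resample_poly(samples, target_sr // g, sr // g).astype(np.float32)
    return np.clip(samples, -1.0, 1.0)                    # keep samples in the FP32 range

# Hypothetical usage: prepare the input buffer, run BNR, write the result back out.
noisy = load_pcm_float32("meeting_recording.wav")
# denoised = run_bnr(noisy)                               # placeholder for the BNR call
# sf.write("meeting_recording_clean.wav", denoised, TARGET_SR)
```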
Software Integration
Supported Hardware Microarchitecture Compatibility:
- Volta
- Turing
- Ada
- Ampere
- Hopper
- Blackwell
Supported Operating System(s):
- Linux
Training & Evaluation
Datasets
NVIDIA models are trained on a diverse set of public and proprietary datasets. The BNR model is trained on a wide range of English-language accents, several European and Asian languages, and 29 different noise profiles that are commonly heard in the real world.
Data Collection Method by dataset: [Hybrid: Human, Synthetic]
Labeling Method by dataset: [Hybrid: Human, Synthetic]
Link: AudioSet
Properties (Quantity, Dataset Descriptions, Sensor(s)):
AudioSet consists of an expanding ontology of 632 audio event classes and
a collection of 2,084,320 human-labeled 10-second sound clips drawn from
YouTube videos.
Link: CREMA-D
Properties (Quantity, Dataset Descriptions, Sensor(s)):
CREMA-D is a data set of 7,442 original clips from 91 actors. These clips were
from 48 male and 43 female actors between the ages of 20 and 74 coming from a
variety of races and ethnicities (African American, Asian, Caucasian, Hispanic,
and Unspecified). Actors spoke from a selection of 12 sentences. The sentences
were presented using one of six different emotions (Anger, Disgust, Fear, Happy,
Neutral, and Sad) and four different emotion levels (Low, Medium, High, and
Unspecified).
Link: Crowdsourced high-quality UK and Ireland English Dialect speech data set
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The dataset contains high-quality recordings of male and female English speakers
from various dialects of the UK and Ireland, for a total of 17,877 lines.
Link: CSR-I WSJ0
Properties (Quantity, Dataset Descriptions, Sensor(s)):
A corpus created by the DARPA Spoken Language Program to support research on
large-vocabulary Continuous Speech Recognition (CSR) systems. The first two CSR
Corpora consist primarily of read speech with texts drawn from a machine-
readable corpus of Wall Street Journal news text and are thus often known as
WSJ0 and WSJ1. WSJ0 consists of 123 speakers.
Link: CSTR VCTK
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The CSTR VCTK Corpus includes speech data uttered by 110 English speakers with
various accents. Each speaker reads out about 400 sentences, which were selected
from a newspaper, the rainbow passage and an elicitation paragraph used for the
speech accent archive.
Link: DAPS
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The DAPS dataset has 15 versions of audio (3 professional versions and
12 consumer device/real-world environment combinations). Each version consists
of about 4.5 hours of data (about 14 minutes from each of 20 speakers).
Link: DEMAND
Properties (Quantity, Dataset Descriptions, Sensor(s)):
DEMAND (Diverse Environments Multichannel Acoustic Noise Database) is a collection of multichannel recordings of acoustic noise in diverse environments, intended for testing algorithms against real-world noise in a variety of settings. This version provides 15 recordings, all made with a 16-channel microphone array in which the smallest distance between microphones is 5 cm and the largest is 21.8 cm.
Link: Edinburgh 56 speaker dataset
Properties (Quantity, Dataset Descriptions, Sensor(s)):
Clean and noisy parallel speech database from 56 speakers designed to train and
test speech enhancement methods that operate at 48kHz.
Link: FreeField
Properties (Quantity, Dataset Descriptions, Sensor(s)):
A dataset of 7,690 standardised 10-second excerpts from Freesound field recordings.
Link: Freesound
Properties (Quantity, Dataset Descriptions, Sensor(s)):
Freesound is a collaborative collection of 620,291 free sounds, which includes speakers talking with different emotions as well as female speakers with high-pitched voices. The audio data also contains a few noise profiles.
Link: GTC Dataset
Properties (Quantity, Dataset Descriptions, Sensor(s)):
A collection of talks from NVIDIA GTC conferences, with a total of 103 speakers.
Link: HiFi-TTS
Properties (Quantity, Dataset Descriptions, Sensor(s)):
A multi-speaker English dataset for training text-to-speech models. The HiFi-TTS dataset contains about 291.6 hours of speech from 10 speakers with at least 17 hours per speaker sampled at 44.1 kHz.
Link: LibriTTS
Properties (Quantity, Dataset Descriptions, Sensor(s)):
LibriTTS is a multi-speaker English corpus of approximately 585 hours of read
English speech from 200 speakers, resampled to 16 kHz.
Link: VocalSet
Properties (Quantity, Dataset Descriptions, Sensor(s)):
VocalSet is a singing voice dataset consisting of 10.1 hours of monophonic
recorded audio of professional singers demonstrating both standard and extended
vocal techniques on all 5 vowels. Existing singing voice datasets aim to capture
a focused subset of singing voice characteristics, and generally consist of just
a few singers. VocalSet contains recordings from 20 different singers (9 male,
11 female) and a range of voice types. VocalSet aims to improve the state of
existing singing voice datasets and singing voice research by capturing not only
a range of vowels, but also a diverse set of voices on many different vocal
techniques, sung in contexts of scales, arpeggios, long tones, and excerpts.
Inference
Engine: Triton
Test Hardware: A10, A100, A16, A2, A30, A40, H100, L4, L40, RTX 4080, RTX 4090, RTX 5070, RTX 5080, RTX 5090, T4, V100
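Since the card lists Triton as the inference engine but does not publish the deployed model or tensor names, the following sketch only illustrates what a generic Triton HTTP client request for such an audio model might look like; the model name (maxine_bnr), tensor names (AUDIO_IN, AUDIO_OUT), and the single-batch shape are hypothetical placeholders.

```python
import numpy as np
import tritonclient.http as httpclient   # Triton HTTP client (pip install tritonclient[http])

# Hypothetical names: the actual deployed model and tensor names are not published here.
MODEL_NAME = "maxine_bnr"
INPUT_NAME, OUTPUT_NAME = "AUDIO_IN", "AUDIO_OUT"

def denoise(samples: np.ndarray, url: str = "localhost:8000") -> np.ndarray:
    """Send FP32 PCM samples to a Triton server and return the denoised samples."""
    client = httpclient.InferenceServerClient(url=url)
    batch = samples.astype(np.float32).reshape(1, -1)            # (batch, samples)
    infer_input = httpclient.InferInput(INPUT_NAME, batch.shape, "FP32")
    infer_input.set_data_from_numpy(batch)
    response = client.infer(
        model_name=MODEL_NAME,
        inputs=[infer_input],
        outputs=[httpclient.InferRequestedOutput(OUTPUT_NAME)],
    )
    return response.as_numpy(OUTPUT_NAME).reshape(-1)
```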
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.