---
title: "audio2face-3d"
publisher: "nvidia"
type: "endpoint"
updated: "2025-06-14T16:38:20.480Z"
description: "Converts streamed audio to facial blendshapes for realtime lipsyncing and facial performances."
canonical: "https://build.nvidia.com/nvidia/audio2face-3d"
---

# Model Overview

## Description

NVIDIA Audio2Face-3D is a microservice for animating a 3D character's facial characteristics to match any audio track, whether for a game, film, or real-time digital assistant. This model is designed for commercial use.

NVIDIA Audio2Emotion is embedded within Audio2Face and is designed to automatically recognize the emotions in human speech. These predictions are used to drive the Audio2Face avatar’s facial expressions to make them appear more natural.

## Licenses

EULA information is available [here](https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/). Customer will use the Software exclusively for authorized purposes, consistent with the Agreement’s terms and all applicable laws, regulations and the rights of others.

## Model Architecture

**Architecture Type:** <br>

* Audio2Face: CNN <br>
* Audio2Emotion: Transformer <br>

**Network Architecture** <br>

* Audio2Face: wav2vec2.0 <br>
* Audio2Emotion: wav2vec2.0 <br>

## Input

**Input Type(s):** Audio <br>
**Input Format:** .wav<br>
**Input Parameters: 2D:** (Tuning Parameters and Audio) <br>
**Other Properties Related to Input:** Supported sampling rates: 16 kHz, 22.05 kHz, and 44.1 kHz. All audio is resampled to 16 kHz. There is no maximum audio length. <br>
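
The sampling-rate constraint above can be checked on the client side before audio is sent. The following is a minimal sketch using only Python's standard `wave` module. The function name `check_wav_sample_rate` and the constants are illustrative assumptions, not part of any NVIDIA API.

```python
import wave

# Sampling rates listed in this model card, in Hz.
SUPPORTED_RATES_HZ = {16000, 22050, 44100}
# The card states that all audio is resampled to this rate.
TARGET_RATE_HZ = 16000


def check_wav_sample_rate(path):
    """Return (rate_hz, is_supported, needs_resampling) for a PCM .wav file.

    Hypothetical helper: it only inspects the file header and does not
    resample or upload anything.
    """
    with wave.open(path, "rb") as wav_file:
        rate_hz = wav_file.getframerate()
    is_supported = rate_hz in SUPPORTED_RATES_HZ
    needs_resampling = rate_hz != TARGET_RATE_HZ
    return rate_hz, is_supported, needs_resampling
```

A file recorded at 44.1 kHz would be reported as supported but subject to resampling, while a 48 kHz file would be flagged as outside the listed rates.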

## Output

**Output Type(s):** <br>

* Audio2Face: Blendshape Coefficients representing 3D facial animation throughout time<br>
* Audio2Emotion: Emotion Probability Coefficients representing 1D emotion values throughout time<br>

**Output Format:** Custom Protobuf Format <br>
**Output Parameters: 2D:** Custom Protobuf Format <br>
**Other Properties Related to Output:** N/A <br>
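
The card describes the outputs as coefficients "throughout time" but does not specify the Protobuf schema. As a rough sketch of how a client might hold frames once they are decoded, the Python below uses an entirely hypothetical layout: the class `AnimationFrame`, its field names, and the helper `coefficient_track` are assumptions for illustration and do not reflect the actual message format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnimationFrame:
    """One decoded output frame (hypothetical layout, not the real schema)."""
    time_s: float                                   # frame timestamp in seconds
    blendshapes: Dict[str, float] = field(default_factory=dict)  # name -> coefficient
    emotions: Dict[str, float] = field(default_factory=dict)     # label -> probability


def coefficient_track(frames: List[AnimationFrame], name: str) -> List[float]:
    """Extract one blendshape's coefficient over time (0.0 where absent)."""
    return [frame.blendshapes.get(name, 0.0) for frame in frames]
```

Keeping each frame's timestamp alongside its coefficients lets a renderer interpolate between frames at its own frame rate.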

## Software Integration

**Runtime Engine(s):**

* DeepStream-7.1 <br>

**Supported Hardware Microarchitecture Compatibility:** <br>

* NVIDIA Ampere <br>
* NVIDIA Hopper <br>
* NVIDIA Lovelace <br>
* NVIDIA Pascal <br>
* NVIDIA Turing <br>

## Preferred/Supported Operating System(s)

* Linux <br>
* WSL <br>

## Model Versions

**Audio2Face:** <br>

* Mark v2.3 <br>
* Claire v2.3 <br>
* James v2.3 <br>

**Audio2Emotion:** <br>

* v1.0 <br>

## Training and Evaluation Dataset

Data Collection Method by dataset:  <br>

* Audio2Face: Human <br>
* Audio2Emotion: Automated <br>

Labeling Method by dataset:  <br>

* Human <br>

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** 

* Audio2Face: Multi-speaker English audio from microphone, resampled to 16 kHz, across multiple audio types and frequency ranges. <br>
* Audio2Emotion: Multi-speaker English audio from microphone, resampled to 16 kHz, across multiple audio types and frequency ranges. The training data combines multiple datasets, including RAVDESS, CREMA-D, JL, and Lindy & Rodney. Total quantity: ~18,000 samples. <br>

## Inference

**Engine:** TensorRT <br>
**Test Hardware:** A100 <br>

## Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

You may not use the Software or any of its components for the purpose of emotion recognition. Any technology included in the Software may only be used as fully integrated in the Software and consistent with all applicable documentation.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [here](https://build.nvidia.com/nvidia/audio2face-3d/modelcard/). 

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Disclaimer

AI models generate responses and outputs based on complex algorithms and machine learning techniques, and those responses or outputs may be inaccurate or indecent. By testing this model, you assume the risk of any harm caused by any response or output of the model. Please do not upload any confidential information or personal data. Your use is logged for security.

## Bias

|Field                                                                                               |  Response                                                                        |
|:---------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|
|Participation considerations from adversely impacted groups ([protected classes](https://www.senate.ca.gov/content/protected-classes)) in model design and testing:  |  A2F Model: Age, Gender, Race, Skin Color; A2E (Embedded in A2F) Model: Gender, Accent Variation, Voice Tone Variations |
|Measures taken to mitigate against unwanted bias:                                                   |    A2F Model: Custom dataset collected with a range of voices, racial backgrounds, facial structures, and performances to mitigate bias towards one facial structure. A2E (Embedded in A2F) Model: Used a dataset of more than 100 speakers with different vocal timbres, microphones, genders, and accents to make predictions more accurate for a wider range of speakers. |

## Explainability

|Field                                                                                                  |  Response|
|:------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|
|Intended Applications & Domains:                                                                       |  Video Communication, Teleconferencing, Customer Service, and Game Avatar Creation|
|Type:                                                                                                  |  A2F Model: Facial Animation; A2E (Embedded in A2F) Model: Speech Emotion Recognition, Avatar Animation|
|Intended Users:                                                                                        |  This model is intended for developers who want to animate the face of virtual avatars in real-time, web camera-based applications.|
|Output:                                                                                                |  A2F Model: Blendshape Coefficients; A2E (Embedded in A2F) Model: Probability distribution over emotional classes.|
|Disclaimer:                                                                                            | Please do not upload any confidential information or personal data. Your use is logged for security. By testing this model, you assume the risk of any harm caused by any response or output of the model. |
|Describe how the model works:                                                                          |  Creates a vertex map of facial animation from audio frame-by-frame and converts vertex data to solved blendshape coefficients|
|Technical Limitations:                                                                                 |  Can be sensitive to varied emotion and audio quality, and emotion recognition may not work well for some voices. AI models generate responses and outputs based on complex algorithms and machine learning techniques, and those responses or outputs may be inaccurate or indecent. |
|Verified to have met prescribed NVIDIA quality standards:                                                     | Yes|
|Performance Metrics:                                                                                   | Accuracy, Visual Inspection|
|Potential Known Risks:                                                                                 | Could incorrectly emote or have incorrect lip sync. |
|Licensing:                                                                                             |   [Community License for Models](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/) |
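
The table above says the model produces solved blendshape coefficients. As background, the sketch below shows the standard linear blendshape model, in which each coefficient scales a per-vertex offset that is added to a neutral mesh. It is an assumption that a consuming application applies the coefficients this way; the function `apply_blendshapes` and the shape names are hypothetical.

```python
from typing import Dict, List, Tuple

Vertex = Tuple[float, float, float]


def apply_blendshapes(
    neutral: List[Vertex],
    deltas: Dict[str, List[Vertex]],
    coefficients: Dict[str, float],
) -> List[Vertex]:
    """Deform a mesh: vertex = neutral + sum(coefficient * delta)."""
    result = [list(vertex) for vertex in neutral]
    for name, weight in coefficients.items():
        shape = deltas.get(name)
        if shape is None:
            continue  # ignore coefficients for shapes this mesh lacks
        for i, (dx, dy, dz) in enumerate(shape):
            result[i][0] += weight * dx
            result[i][1] += weight * dy
            result[i][2] += weight * dz
    return [tuple(vertex) for vertex in result]
```

With a coefficient of 0 the mesh stays at its neutral pose, and with a coefficient of 1 the full offset for that shape is applied.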

## Privacy

|Field                                                                                                                              |  Response|
|:----------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------|
|Generatable or reverse engineerable personally-identifiable information?                                                           |  None|
|Was consent obtained for any personal data used?                                                                                             |  A2F Model: Yes; A2E (Embedded in A2F) Model: Not Applicable|
|Protected class data used to create this model?                                                                                    |  Yes|
|How often is dataset reviewed?                                                                                                     |   Before release  |
|Is a mechanism in place to honor data subject right of access or deletion of personal data?                                        |  A2F Model: Yes; A2E (Embedded in A2F) Model: No|
|If personal data collected for the development of the model, was it collected directly by NVIDIA?                                            |  A2F Model: Yes; A2E (Embedded in A2F) Model: Not Applicable|
|If personal data collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects?  |  A2F Model: Yes; A2E (Embedded in A2F) Model: Not Applicable|
|If personal data collected for the development of this AI model, was it minimized to only what was required?                                 |  A2F Model: Yes; A2E (Embedded in A2F) Model: Not Applicable|
|Is there provenance for all datasets used in training?                                                                                                                          |  Yes|
|Does data labeling (annotation, metadata) comply with privacy laws?                                                                |  Yes, for internal datasets|
|Is data compliant with data subject requests for data correction or removal, if such a request was made?                           |  Yes |

## Safety & Security

|Field                                               |  Response                        |
|:---------------------------------------------------|:----------------------------------|
|Model Application(s):                               | Speech Emotion Recognition, Lip Syncing, Facial Animation|
|Describe the life-critical impacts (if present).    |  This model is not designed for life-critical applications.|
|Use Case Restriction(s):                            |  See [A2F License](https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/)|
|Model and Dataset Restriction(s):                   |  The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions on dataset access are enforced during training, and dataset license constraints are adhered to.|