nvidia/audio2face-2d
Create facial animations using a portrait photo and synchronize mouth movement with audio.
Model Overview
Description:
NVIDIA Maxine Audio2Face-2D is a generative model that creates facial animations from a portrait photo and a driving audio, synchronizing the mouth movement in the photo with the speech in the provided audio.
The model uses the input audio to estimate landmark motions that represent the mouth movements articulating the words in the audio. These landmarks are then encoded into latent representations that are passed to a generative model to animate the input portrait.
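The data flow described above can be illustrated with a minimal sketch. Every function name below (predict_landmark_motion, encode_landmarks, animate_portrait) is a hypothetical placeholder for illustration only and is not part of any published Maxine API; the per-frame chunking and the 30 fps frame rate are assumptions.

```python
import numpy as np

# Hypothetical stages of the Audio2Face-2D pipeline, for illustration only.
AUDIO_SR = 16000   # required: 16 kHz mono, 32-bit float PCM
FPS = 30           # assumed output frame rate

def predict_landmark_motion(audio_chunk: np.ndarray) -> np.ndarray:
    """Placeholder: map one audio frame to 2D mouth-landmark offsets."""
    return np.zeros((68, 2), dtype=np.float32)

def encode_landmarks(landmarks: np.ndarray) -> np.ndarray:
    """Placeholder: encode landmark motion into a latent vector."""
    return np.zeros(128, dtype=np.float32)

def animate_portrait(portrait: np.ndarray, latent: np.ndarray) -> np.ndarray:
    """Placeholder: render one 512x512 RGB frame from the portrait and latent."""
    return np.zeros((512, 512, 3), dtype=np.float32)

def audio_to_frames(portrait: np.ndarray, audio: np.ndarray) -> list[np.ndarray]:
    samples_per_frame = AUDIO_SR // FPS
    frames = []
    for start in range(0, len(audio) - samples_per_frame + 1, samples_per_frame):
        chunk = audio[start:start + samples_per_frame]
        landmarks = predict_landmark_motion(chunk)          # audio -> landmark motion
        latent = encode_landmarks(landmarks)                # landmarks -> latent code
        frames.append(animate_portrait(portrait, latent))   # latent -> animated frame
    return frames
```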
Terms of use
NVIDIA Maxine Audio2Face-2D is made available as a demonstration of the input and output of the Audio2Face-2D generative model. As such, the user may submit a reference “driving” audio or use the sample “driving” audio and download the generated video for evaluation under the terms of the NVIDIA MAXINE EVALUATION LICENSE AGREEMENT.
Model Architecture:
Architecture Type: Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), Generative Adversarial Network (GAN)
Network Architecture: Encoder-Decoder
Input:
Input Format: RGB image (portrait photo), float vector containing 32-bit float Pulse Code Modulation (PCM) data (driving audio)
Input Parameters: 720p to 4K (portrait image resolution)
Other Properties Related to Input: Input images are pre-processed using a proprietary technique; the portrait photo supports 3-channel, 32-bit images; PCM audio samples with no encoding or pre-processing; a 16 kHz sampling rate and a mono channel are required for audio.
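As a hedged example, the following sketch prepares inputs matching the stated requirements (3-channel RGB portrait, 32-bit float PCM audio at 16 kHz mono). The file names are placeholders, and the use of soundfile, librosa, and Pillow is an assumption; any equivalent tooling works.

```python
import numpy as np
import soundfile as sf
import librosa
from PIL import Image

# Portrait: 3-channel RGB image, resolution anywhere from 720p to 4K.
portrait = np.asarray(Image.open("portrait.png").convert("RGB"), dtype=np.float32)

# Driving audio: raw PCM as 32-bit float, 16 kHz sampling rate, mono channel.
audio, sr = sf.read("driving_audio.wav", dtype="float32")
if audio.ndim > 1:                       # downmix stereo to mono
    audio = audio.mean(axis=1)
if sr != 16000:                          # resample to the required 16 kHz
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
audio = audio.astype(np.float32)
```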
Output:
Output Format: RGB image
Output Parameters: 512 x 512
Other Properties Related to Output: Output images are post-processed using a proprietary technique; 3-channel, 32-bit images are supported.
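A minimal sketch for handling the 512 x 512 RGB output frames is shown below, assuming they are collected into a video with OpenCV; the 30 fps frame rate and the float-to-uint8 conversion are assumptions not stated by the model card.

```python
import cv2
import numpy as np

def write_video(frames: list[np.ndarray], path: str = "output.mp4", fps: int = 30) -> None:
    """Encode 512x512 RGB float frames into an MP4 file (assumed settings)."""
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (512, 512))
    for frame in frames:
        frame_u8 = np.clip(frame * 255.0, 0, 255).astype(np.uint8)  # assumes floats in [0, 1]
        writer.write(cv2.cvtColor(frame_u8, cv2.COLOR_RGB2BGR))     # OpenCV expects BGR order
    writer.release()
```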
Software Integration:
0.8.4.0
Supported Operating System(s):
Linux
Model Version(s):
0.8.4.0
Supported Hardware Microarchitecture Compatibility:
- [Volta]
- [Turing]
- [Ampere]
- [Ada]
Training and Evaluation Dataset:
Data Collection Method by dataset: Automated
Properties (Quantity, Dataset Descriptions, Sensor(s)):
Datasets used in speech-live-portrait training are as follows:
One dataset includes 7,356 files collected from 24 professional actors (12 female and 12 male) with different expressions, head poses, and backgrounds.
One dataset consists of about 160,000 videos of different speakers recorded in different environments (outdoor and indoor) and covering different phonemes. It is made up of audio-visual data consisting of short clips of human speech extracted from interview videos.
Evaluation Dataset:
Data Collection Method by dataset: Automated, Human
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The dataset consists of 5,000 samples and captures variety across different speakers, languages, and phonemes.
NVIDIA models are trained on a diverse set of public and proprietary datasets. This model was trained on a dataset containing facial images of people covering different attributes such as expressions, head poses, backgrounds etc. NVIDIA is committed to the responsible development of AI Foundation models and conducts reviews of all datasets included in training.
Inference:
Engine: TensorRT, Triton
Test Hardware:
- Hardware compatible with CUDA 11.8.
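Since inference is served through TensorRT and Triton, a request could be issued with the Triton HTTP client as sketched below. The model name and the input/output tensor names ("audio2face_2d", "PORTRAIT", "AUDIO", "ANIMATED_FRAMES") are hypothetical placeholders; the actual names are defined by the deployed model configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Hypothetical Triton inference request; model and tensor names are placeholders.
client = httpclient.InferenceServerClient(url="localhost:8000")

portrait = np.zeros((720, 1280, 3), dtype=np.float32)   # RGB portrait, 720p to 4K
audio = np.zeros((16000,), dtype=np.float32)             # 1 s of 16 kHz mono float PCM

inputs = [
    httpclient.InferInput("PORTRAIT", list(portrait.shape), "FP32"),
    httpclient.InferInput("AUDIO", list(audio.shape), "FP32"),
]
inputs[0].set_data_from_numpy(portrait)
inputs[1].set_data_from_numpy(audio)

outputs = [httpclient.InferRequestedOutput("ANIMATED_FRAMES")]

result = client.infer(model_name="audio2face_2d", inputs=inputs, outputs=outputs)
frames = result.as_numpy("ANIMATED_FRAMES")               # e.g. (N, 512, 512, 3) frames
```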
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.