nvidia/audio2face-3d
Converts streamed audio to facial blendshapes for realtime lipsyncing and facial performances.
Model Overview
Description
NVIDIA Audio2Face-3D is a microservice that animates a 3D character's face to match any audio track, whether for a game, film, or real-time digital assistant. This model is designed for commercial use.
NVIDIA Audio2Emotion is embedded within Audio2Face and automatically recognizes the emotions in human speech. These predictions drive the Audio2Face avatar's facial expressions, making the performance even more natural.
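The service consumes audio as a stream. As a rough illustration, a client might read a WAV file and emit fixed-size PCM chunks. This is a minimal sketch only: the chunk size and the plain Python generator below are assumptions for illustration, not the service's actual streaming protocol.

```python
import wave

# 100 ms of samples at 16 kHz; an assumed chunk size, not a service requirement
CHUNK_FRAMES = 1600

def stream_wav_chunks(path: str):
    """Yield raw PCM chunks from a WAV file, the way a client might
    feed audio to a streaming endpoint."""
    with wave.open(path, "rb") as wav:
        while True:
            chunk = wav.readframes(CHUNK_FRAMES)
            if not chunk:
                break
            yield chunk
```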
Licenses
EULA information is available here. Customer will use the Software exclusively for authorized purposes, consistent with the Agreement's terms and all applicable laws, regulations, and the rights of others.
Model Architecture
Architecture Type:
- Audio2Face: CNN
- Audio2Emotion: Transformer
Network Architecture:
- Audio2Face: wav2vec2.0
- Audio2Emotion: wav2vec2.0
Input
Input Type(s): Audio
Input Format: .wav
Input Parameters: 2D (tuning parameters and audio)
Other Properties Related to Input: Supported sampling rates: 16 kHz, 22.05 kHz, and 44.1 kHz; all audio is resampled to 16 kHz (see the preprocessing sketch below). There is no maximum audio length.
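Since all audio is resampled to 16 kHz internally, clients can resample up front. A minimal sketch, assuming the third-party librosa and soundfile packages (not part of the service):

```python
import librosa
import soundfile as sf

def prepare_audio(in_path: str, out_path: str, target_sr: int = 16_000) -> None:
    """Downmix a WAV file to mono and resample it to 16 kHz PCM."""
    # librosa.load resamples to the requested rate and downmixes to mono
    audio, _ = librosa.load(in_path, sr=target_sr, mono=True)
    sf.write(out_path, audio, target_sr, subtype="PCM_16")
```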
Output
Output Type(s):
- Audio2Face: Blendshape coefficients representing 3D facial animation over time
- Audio2Emotion: Emotion probability coefficients representing 1D emotion values over time
Output Format: Custom Protobuf Format (see the illustrative sketch after this list)
Output Parameters: 2D
Other Properties Related to Output: N/A
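The exact message schema is NVIDIA's custom protobuf format and is not reproduced here. Purely to illustrate the shape of the data, the hypothetical structure below models what a decoded frame carries: per-frame blendshape weights from Audio2Face and per-frame emotion probabilities from Audio2Emotion. All field names are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class AnimationFrame:
    """Hypothetical stand-in for one decoded output frame; the real
    service uses a custom protobuf schema not shown here."""
    time_sec: float                                              # timestamp on the audio timeline
    blendshapes: dict[str, float] = field(default_factory=dict)  # e.g. {"jawOpen": 0.42}
    emotions: dict[str, float] = field(default_factory=dict)     # e.g. {"joy": 0.81, "neutral": 0.12}

def dominant_emotion(frame: AnimationFrame) -> str:
    # Return the highest-probability emotion label for this frame.
    return max(frame.emotions, key=frame.emotions.get)
```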
Software Integration
Runtime Engine(s):
- DeepStream-7.1
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Hopper
- NVIDIA Ada Lovelace
- NVIDIA Pascal
- NVIDIA Turing
Preferred/Supported Operating System(s):
- Linux
- WSL
Model Versions:
Audio2Face:
- Mark v2.3
- Claire v2.3
- James v2.3
Audio2Emotion:
- v1.0
Training and Evaluation Dataset
Data Collection Method by dataset:
- Audio2Face: Human
- Audio2Emotion: Automated
Labeling Method by dataset:
- Human
Properties (Quantity, Dataset Descriptions, Sensor(s)):
- Audio2Face: Multi-speaker English audio from microphone, resampled to 16 kHz, across multiple audio types and frequency ranges.
- Audio2Emotion: Multi-speaker English audio from microphone, resampled to 16 kHz, across multiple audio types and frequency ranges. The training data comprises multiple datasets, including RAVDESS, CREMA-D, JL, and Lindy & Rodney. Total quantity: ~18,000 samples.
Inference
Engine: TensorRT
Test Hardware: A100
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards here.
Please report security vulnerabilities or NVIDIA AI Concerns here.
Disclaimer
AI models generate responses and outputs based on complex algorithms and machine learning techniques, and those responses or outputs may be inaccurate or indecent. By testing this model, you assume the risk of any harm caused by any response or output of the model. Please do not upload any confidential information or personal data. Your use is logged for security.