
Lip Sync is a generative model that lip-syncs a video containing a human face to a target audio containing human speech, so that the mouth movements in the generated video match the speech in the provided audio.
The model takes a speech segment and an image containing a human face as input and morphs the lip movements to generate an image whose lips are synchronized with the input audio. The head pose, background, and quality of the output image are preserved from the input image. This model is ready for commercial use.
License/Terms of Use: NVIDIA Software and Model Evaluation License
Deployment Geography: Global
Use Case: Maxine AR SDK developers enabling LipSync capabilities in content localization workflows.
Release Date: NGC [03/12/2026] via URL
Architecture Type: Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs)
Network Architecture: Encoder-Decoder
Number of model parameters: 2.5 x 10^8 (250 million)
Cumulative Compute: 1.45974 x 10^21 FLOPs
Estimated Energy and Emissions for Model Training:
Estimated Energy for model training = 1074.136 kWh
Emissions for model training = 0.44093 tCO2e
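The emissions figure follows from the energy estimate once a grid carbon intensity is assumed. A quick sanity check of the two reported numbers (the carbon intensity below is inferred from them, not stated in the card):

```python
# Sanity-check the training energy/emissions figures reported above.
energy_kwh = 1074.136       # estimated energy for model training (kWh)
emissions_tco2e = 0.44093   # estimated emissions for model training (tCO2e)

# Implied grid carbon intensity (kg CO2e per kWh); this value is an
# inference from the two reported figures, not part of the model card.
intensity_kg_per_kwh = emissions_tco2e * 1000 / energy_kwh
print(f"Implied carbon intensity: {intensity_kg_per_kwh:.3f} kg CO2e/kWh")
```

The implied intensity (roughly 0.41 kg CO2e/kWh) is consistent with a typical grid mix, suggesting the two figures were derived from one another.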
Input Type(s): Audio, Image
Input Format: Audio: float vector containing 32-bit float Pulse Code Modulation (PCM) samples; Image: Red, Green, Blue (RGB)
Input Parameters: Audio (1D), Image (2D)
Other Properties Related to Input: Up to 4K image resolution; 16 kHz sampling rate for audio. The input image must contain exactly one complete face. The input audio must be single-channel (mono). Input images are pre-processed using a proprietary technique; PCM audio samples are expected raw, without any encoding or pre-processing.
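The input constraints above can be checked before inference. A minimal sketch, assuming NumPy arrays and 4K UHD (3840 x 2160) as the resolution limit; `validate_inputs` is a hypothetical helper, and face detection is out of scope here:

```python
import numpy as np

def validate_inputs(audio: np.ndarray, image: np.ndarray, sample_rate: int) -> None:
    """Check audio/image tensors against the model card's input constraints.

    Hypothetical helper: shape/dtype checks only; verifying that the image
    contains exactly one complete face would require a face detector.
    """
    # Audio: 1-D mono vector of 32-bit float PCM at 16 kHz
    if audio.ndim != 1:
        raise ValueError("audio must be a 1-D (mono) vector")
    if audio.dtype != np.float32:
        raise ValueError("audio must be 32-bit float PCM")
    if sample_rate != 16_000:
        raise ValueError("audio must be sampled at 16 kHz")

    # Image: H x W x 3 RGB, up to 4K resolution (assumed 3840 x 2160 here)
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("image must be H x W x 3 (RGB)")
    height, width = image.shape[:2]
    if width > 3840 or height > 2160:
        raise ValueError("image resolution must not exceed 4K")

# Usage: one second of silence plus a 720p RGB frame pass the checks.
audio = np.zeros(16_000, dtype=np.float32)
image = np.zeros((720, 1280, 3), dtype=np.uint8)
validate_inputs(audio, image, sample_rate=16_000)
print("inputs OK")
```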
Output Type(s): Image
Output Format: Image: Red, Green, Blue (RGB)
Output Parameters: 2D. Same resolution as input image.
Other Properties Related to Output: Output images are post-processed using a proprietary technique to blend the lip-synced face back into the original frame.
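The actual post-processing technique is proprietary. For illustration only, blending a generated face crop back into a frame is commonly done with a soft alpha mask; a generic sketch, assuming NumPy arrays (all names here are hypothetical):

```python
import numpy as np

def blend_face(frame: np.ndarray, face: np.ndarray, mask: np.ndarray,
               top: int, left: int) -> np.ndarray:
    """Alpha-blend a generated face crop back into the original frame.

    Illustrative only: the model's real post-processing is proprietary.
    `mask` is a float array in [0, 1]; soft (feathered) edges avoid seams.
    """
    out = frame.astype(np.float32).copy()
    h, w = face.shape[:2]
    region = out[top:top + h, left:left + w]
    alpha = mask[..., None]  # broadcast the single-channel mask over RGB
    out[top:top + h, left:left + w] = alpha * face + (1 - alpha) * region
    return out.clip(0, 255).astype(np.uint8)
```

With a mask of ones the crop replaces the region outright; a Gaussian-feathered mask would transition smoothly into the surrounding frame.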
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
Preferred/Supported Operating System(s):
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Maxine AR SDK 1.1.0.0
Link:
Data Modality
Audio Training Data Size
Video Training Data Size
Data Collection Method by dataset
Labeling Method by dataset
Properties (Quantity, Dataset Descriptions, Sensor(s)): The training dataset contains approximately 160,000 videos of different speakers in different environments (outdoor and indoor recordings), covering different phonemes. It consists of audio-visual data made up of short clips of human speech.
Link:
Data Collection Method by dataset
Labeling Method by dataset
Properties (Quantity, Dataset Descriptions, Sensor(s)): Around 100 videos of varying lengths, each showing a single person in front of the camera conducting a video conference or broadcast. The dataset varies in quality, lighting, head pose, gaze angles, and other diversity factors such as race, eye color, and gender. The audio in the videos has been translated into 9 languages.
Acceleration Engine: TensorRT, Triton
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
Please make sure you have proper rights and permissions for all input image and video content. If the image or video includes people, personal health information, or intellectual property, the generated image or video will not blur or maintain the proportions of the subjects included.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.