
LipSync is a generative model that lip-syncs a video containing a human face to a target audio clip containing human speech, so that the mouth movements in the generated video match the speech in the provided audio.
The model takes a speech segment and an image containing a human face as input and morphs the lip movements to generate an image whose lips are synchronized with the input audio. The head pose, background, and quality of the output image are kept the same as in the input image.
This model is ready for commercial use. To try out this model, you must join the Private Access Program. You may request access on the Private Access Program page.
To download the model, use the feature installation script bundled with the NVIDIA AR SDK (features/install_feature.[ps1|sh]). Please refer to the LipSync Collection page or the AR SDK Documentation for more details. You can navigate to the collection page by clicking the Related Collections tab at the top of this page.
Use of the NVIDIA LipSync models is governed by the NVIDIA Open Model License.
Global
NVIDIA AR SDK developers adding LipSync capabilities to content localization workflows.
10/17/2025
Architecture Type: Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs)
Network Architecture: Encoder-Decoder
Number of model parameters: 2.5 × 10⁸ (250 million)
Input Type(s): Image, Audio
Input Format: Float vector containing 32-bit float Pulse Code Modulation (PCM) data, RGB Image
Input Parameters: Audio (1D), Image (2D)
Other Properties Related to Input: Up to 4K image resolution. 16 kHz sampling rate for audio. The input image should contain exactly one complete face. The input audio should be single-channel (mono). Input images are pre-processed using a proprietary technique; PCM audio samples are expected raw, without any encoding or pre-processing.
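The audio constraints above (16 kHz, mono, 32-bit float PCM with no encoding) can be illustrated with a small validation/conversion sketch. This is a hypothetical helper, not part of the AR SDK API: it converts interleaved 16-bit PCM bytes into the float samples in [-1, 1] that a float-PCM input buffer would carry, rejecting audio that is not 16 kHz mono.

```python
# Hypothetical pre-processing sketch; the AR SDK's actual input API is not shown here.
import struct

TARGET_RATE = 16_000  # LipSync expects 16 kHz, single-channel audio


def pcm16_to_float(raw: bytes, sample_rate: int, channels: int) -> list[float]:
    """Convert interleaved little-endian 16-bit PCM to float samples in [-1, 1]."""
    if sample_rate != TARGET_RATE:
        raise ValueError(f"expected {TARGET_RATE} Hz audio, got {sample_rate}")
    if channels != 1:
        raise ValueError("LipSync expects single-channel (mono) audio")
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s / 32768.0 for s in ints]
```

For example, the 16-bit samples (0, 16384, -32768) map to the float samples (0.0, 0.5, -1.0).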
Output Type(s): Image
Output Format: Image: Red, Green, Blue (RGB)
Output Parameters: 2D. Same resolution as input image.
Other Properties Related to Output: Output images are post-processed using a proprietary technique to blend the lip-synced face back into the original frame.
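The SDK's blending step is proprietary and not described here; as a generic illustration only, compositing a generated face region back onto the original frame can be sketched as a per-pixel alpha blend, where the alpha weight would come from a face-region mask:

```python
# Generic alpha-blend sketch; NOT the SDK's proprietary blending technique.
def alpha_blend(face_px, frame_px, alpha):
    """Blend one RGB pixel of the generated face over the original frame.

    alpha = 1.0 keeps the generated face pixel; 0.0 keeps the original frame pixel.
    """
    return tuple(
        round(alpha * f + (1.0 - alpha) * g)
        for f, g in zip(face_px, frame_px)
    )
```

At alpha = 1.0 the output is the generated face pixel, at 0.0 the original frame pixel, and intermediate values smooth the boundary between the two regions.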
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Runtime Engine(s):
NVIDIA AR SDK
Supported Hardware Microarchitecture Compatibility:
Preferred/Supported Operating System(s):
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Datasets used:
Data Modality:
Audio Training Data Size:
Less than 10,000 Hours
Video Training Data Size:
Less than 10,000 Hours
Data collection method by dataset:
Automated
Labeling method by dataset:
Not Applicable
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The training dataset contains approximately 160,000 videos of different speakers in different environments, such as outdoor and indoor recordings, covering different phonemes. It consists of audio-visual data made up of short clips of human speech.
Datasets used:
Data Collection Method by dataset:
Hybrid: Automated and Human
Labeling method by dataset:
Not Applicable
Properties (Quantity, Dataset Descriptions, Sensor(s)):
Around 40 videos of various lengths, each of a single person in front of the camera conducting a video conference or broadcast. The dataset varies in quality, lighting, head pose, gaze angle, and other diversity factors such as race, eye color, and gender. The audio in the videos has been translated into 5 languages.
Engine: TensorRT, Triton
Test Hardware:
Desktops and Servers with following GPU architectures:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.