Lipsync

Downloadable

NVIDIA Maxine

Model Overview

Description:

Lip Sync is a generative model that lip-syncs a video of a human face to a target audio track containing human speech, such that the mouth movements in the generated video match the speech in the provided audio.

The model takes a speech segment and an image containing a human face as input and morphs the lip movements to generate an image whose lips are synchronized with the input audio. The head pose, background, and image quality of the output match those of the input image. This model is ready for commercial use.

License/Terms of Use

NVIDIA Software and Model Evaluation License

Deployment Geography:

Global

Use Case:

Maxine AR SDK developers adding LipSync capabilities to content localization workflows.

Release Date:

NGC [03/12/2026] via URL

Reference(s):

NVIDIA Maxine

Model Architecture:

Architecture Type: Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs)

Network Architecture: Encoder-Decoder

Number of model parameters: 2.5 × 10^8 (≈250 million)

Computational Load

Cumulative Compute: 1.45974 × 10^21 FLOPs
Estimated Energy and Emissions for Model Training:

  • Estimated energy for model training: 1074.136 kWh
  • Emissions for model training: 0.44093 tCO2e
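
As a quick sanity check (our arithmetic, not a figure reported on the card), dividing the reported emissions by the reported energy gives the implied carbon intensity of the training energy mix:

    # Implied carbon intensity of the training energy (derived, not reported)
    energy_kwh = 1074.136        # reported training energy
    emissions_tco2e = 0.44093    # reported training emissions
    kg_per_kwh = emissions_tco2e * 1000 / energy_kwh
    print(f"{kg_per_kwh:.3f} kgCO2e/kWh")  # ≈ 0.410, a plausible grid average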

Input:

Input Type(s): Audio, Image
Input Format: Float vector containing 32-bit float Pulse Code Modulation (PCM) data, RGB Image
Input Parameters: Audio (1D), Image (2D)
Other Properties Related to Input: Up to 4K image resolution; 16 kHz sampling rate for audio. The input image must contain exactly one complete face. The input audio must be single channel (mono). Input images are pre-processed using a proprietary technique; PCM audio samples are expected raw, without any encoding or pre-processing. A sketch of preparing both inputs follows below.
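
As a concrete illustration of this input contract, the sketch below prepares both inputs with common Python libraries (librosa, Pillow, NumPy). These libraries are our choice for illustration; the Maxine AR SDK's own loaders may differ:

    import numpy as np
    import librosa
    from PIL import Image

    # Speech: mono float32 PCM at the 16 kHz rate the model expects.
    # librosa.load resamples to sr= and downmixes to mono by default.
    audio, sr = librosa.load("speech.wav", sr=16000, mono=True)
    assert sr == 16000 and audio.dtype == np.float32 and audio.ndim == 1

    # Face frame: an RGB array, up to 4K resolution, containing exactly
    # one complete face (the card disallows multi-face frames).
    frame = np.asarray(Image.open("face.png").convert("RGB"))
    assert frame.ndim == 3 and frame.shape[2] == 3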

Output:

Output Type(s): Image
Output Format: Image: Red, Green, Blue (RGB)
Output Parameters: 2D. Same resolution as input image.
Other Properties Related to Output: Output images are post-processed using a proprietary technique to blend the lip-synced face back into the original frame.
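
Continuing the input sketch above: the Maxine AR SDK's real interface is a C/C++ API, so run_lipsync below is a hypothetical stand-in used only to show the output contract the card describes:

    def run_lipsync(frame, audio):
        """Hypothetical stand-in for the SDK's LipSync entry point;
        the actual Maxine AR SDK interface is C/C++ and differs."""
        raise NotImplementedError

    out = run_lipsync(frame, audio)  # RGB, same resolution as the input
    assert out.shape == frame.shape  # per the card: resolution is preserved
    # The result is already blended into the original frame, so it can be
    # written out directly.
    Image.fromarray(out).save("lipsynced.png")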

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • Maxine AR SDK 1.1.0.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Turing

Preferred/Supported Operating System(s):

  • Ubuntu 20.04
  • Ubuntu 22.04
  • Ubuntu 24.04
  • Debian 12
  • Rocky/RHEL 8.*
  • Rocky/RHEL 9.*
  • Windows 10
  • Windows 11

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

Maxine AR SDK 1.1.0.0

Training and Evaluation Datasets:

Training Dataset

Link:

  • VoxCeleb dataset
  • VFHQ dataset
  • HDVILA dataset

Data Modality

  • Audio
  • Video

Audio Training Data Size

  • Less than 10,000 Hours

Video Training Data Size

  • Less than 10,000 Hours

Data Collection Method by dataset

  • Automated

Labeling Method by dataset

  • Not Applicable

Properties (Quantity, Dataset Descriptions, Sensor(s)): The training dataset contains approximately 160,000 videos of different speakers recorded in varied environments (indoor and outdoor) and covering a wide range of phonemes. It consists of audio-visual data made up of short clips of human speech.


Evaluation Dataset

Link:

  • Internal Capture
  • Publicly available internet scale data

Data Collection Method by dataset

  • Hybrid: Automated and Human

Labeling Method by dataset

  • Not Applicable

Properties (Quantity, Dataset Descriptions, Sensor(s)): Around 100 videos of varying lengths, each showing a single person in front of the camera conducting a video conference or broadcast. The dataset varies in quality, lighting, head pose, and gaze angle, as well as in diversity factors such as race, eye color, and gender. The audio in the videos has been translated into 9 languages.


Inference:

Acceleration Engine: TensorRT, Triton
Test Hardware:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Turing

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please make sure you have the proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, the generated image or video will not blur the subjects or preserve their proportions.

Please report model quality issues, risks, security vulnerabilities, or NVIDIA AI concerns here.