Lipsync

Downloadable

NVIDIA Maxine

Model Overview

Description:

Lip Sync is a generative model that lip-syncs a video of a human face to a target audio track containing human speech, such that the mouth movements in the generated video match the speech in the provided audio.

The model takes a speech segment and an image containing a human face as input and morphs the lip movements to generate an image whose lips are synchronized with the input audio. The head pose, background, and image quality of the output match those of the input image. This model is ready for commercial use.

License/Terms of Use

NVIDIA Software and Model Evaluation License

Deployment Geography:

Global

Use Case:

Maxine AR SDK developers adding LipSync capabilities to content localization workflows.

Release Date:

NGC [03/12/2026] via URL

Reference(s):

NVIDIA Maxine

Model Architecture:

Architecture Type: Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs)

Network Architecture: Encoder-Decoder

Number of model parameters: 2.5 × 10^8 (≈250 million)

Computational Load

Cumulative Compute: 1.45974 × 10^21 FLOPs
Estimated Energy and Emissions for Model Training:

  • Estimated energy for model training: 1074.136 kWh
  • Emissions for model training: 0.44093 tCO2e
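
As a quick sanity check (our arithmetic, not a figure reported on the card), dividing the reported emissions by the reported energy gives the implied carbon intensity of the training energy mix:

    # Implied carbon intensity of the training energy (derived, not reported)
    energy_kwh = 1074.136        # reported training energy
    emissions_tco2e = 0.44093    # reported training emissions
    kg_per_kwh = emissions_tco2e * 1000 / energy_kwh
    print(f"{kg_per_kwh:.3f} kgCO2e/kWh")  # ≈ 0.410, a plausible grid average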

Input:

Input Type(s): Audio, Image
Input Format: Float vector containing 32-bit float Pulse Code Modulation (PCM) data, RGB Image
Input Parameters: Audio (1D), Image (2D)
Other Properties Related to Input: Up to 4K image resolution; 16 kHz sampling rate for audio. The input image must contain exactly one complete face. The input audio must be single channel (mono). Input images are pre-processed using a proprietary technique; PCM audio samples are expected raw, without any encoding or pre-processing. A sketch of preparing both inputs follows below.
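
As a concrete illustration of this input contract, the sketch below prepares both inputs with common Python libraries (librosa, Pillow, NumPy). These libraries are our choice for illustration; the Maxine AR SDK's own loaders may differ:

    import numpy as np
    import librosa
    from PIL import Image

    # Speech: mono float32 PCM at the 16 kHz rate the model expects.
    # librosa.load resamples to sr= and downmixes to mono by default.
    audio, sr = librosa.load("speech.wav", sr=16000, mono=True)
    assert sr == 16000 and audio.dtype == np.float32 and audio.ndim == 1

    # Face frame: an RGB array, up to 4K resolution, containing exactly
    # one complete face (the card disallows multi-face frames).
    frame = np.asarray(Image.open("face.png").convert("RGB"))
    assert frame.ndim == 3 and frame.shape[2] == 3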

Output:

Output Type(s): Image
Output Format: Image: Red, Green, Blue (RGB)
Output Parameters: 2D. Same resolution as input image.
Other Properties Related to Output: Output images are post-processed using a proprietary technique to blend the lip-synced face back into the original frame.
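
Continuing the input sketch above: the Maxine AR SDK's real interface is a C/C++ API, so run_lipsync below is a hypothetical stand-in used only to show the output contract the card describes:

    def run_lipsync(frame, audio):
        """Hypothetical stand-in for the SDK's LipSync entry point;
        the actual Maxine AR SDK interface is C/C++ and differs."""
        raise NotImplementedError

    out = run_lipsync(frame, audio)  # RGB, same resolution as the input
    assert out.shape == frame.shape  # per the card: resolution is preserved
    # The result is already blended into the original frame, so it can be
    # written out directly.
    Image.fromarray(out).save("lipsynced.png")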

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • Maxine AR SDK 1.1.0.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Turing

Preferred/Supported Operating System(s):

  • Ubuntu 20.04
  • Ubuntu 22.04
  • Ubuntu 24.04
  • Debian 12
  • Rocky/RHEL 8.*
  • Rocky/RHEL 9.*
  • Windows 10
  • Windows 11

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

Maxine AR SDK 1.1.0.0

Training and Evaluation Datasets:

Training Dataset

Link:

  • VoxCeleb dataset
  • VFHQ dataset
  • HDVILA dataset

Data Modality

  • Audio
  • Video

Audio Training Data Size

  • Less than 10,000 Hours

Video Training Data Size

  • Less than 10,000 Hours

Data Collection Method by dataset

  • Automated

Labeling Method by dataset

  • Not Applicable

Properties (Quantity, Dataset Descriptions, Sensor(s)): The training dataset contains approximately 160,000 videos of different speakers recorded in varied environments (indoor and outdoor) and covering a wide range of phonemes. It consists of audio-visual data made up of short clips of human speech.


Evaluation Dataset

Link:

  • Internal Capture
  • Publicly available internet scale data

Data Collection Method by dataset

  • Hybrid: Automated and Human

Labeling Method by dataset

  • Not Applicable

Properties (Quantity, Dataset Descriptions, Sensor(s)): Around 100 videos of varying lengths, each showing a single person in front of the camera conducting a video conference or broadcast. The dataset varies in quality, lighting, head pose, and gaze angle, as well as in diversity factors such as race, eye color, and gender. The audio in the videos has been translated into 9 languages.


Inference:

Acceleration Engine: TensorRT, Triton
Test Hardware:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Turing

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please make sure you have the proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, the generated image or video will not blur the subjects or preserve their proportions.

Please report model quality issues, risks, security vulnerabilities, or NVIDIA AI concerns here.