nvidia/eyecontact
Estimate the gaze angles of a person in a video and redirect the gaze to make it frontal.
Model Overview
Description:
The Maxine Eye Contact model redirects eye gaze for video conference applications.
The model estimates the gaze direction from a region of interest around the eyes, known as an eye patch, and synthesizes a gaze-redirected version of that patch.
The encoder encodes the image’s contents into latent representations for eye gaze, head pose, and environmental conditions.
A transformation is applied to these latent representations to align them with the redirection angle provided by the user. The transformed representations, when fed to the decoder, produce an output eye patch with the eyes redirected at the desired angle.
More information about Eye Contact can be found in the NVIDIA developer blog (see References).
## Terms of use
The use of NVIDIA Maxine Eye Contact is available as a demonstration of the input and output of the Gaze Estimation transformer model. As such, the user may submit a reference video and download the generated gaze-redirected video for evaluation under the terms of the NVIDIA MAXINE EVALUATION LICENSE AGREEMENT.
Reference(s):
- Rochelle, et al. “Improve Human Connection in Video Conferences with NVIDIA Maxine Eye Contact”, Jan 21, 2023, NVIDIA Technical Blog.
Model Architecture:
Architecture Type: Convolutional Neural Network (CNN)
Network Architecture: Encoder-Decoder
The network architecture includes a transforming encoder and decoder network. The encoder encodes the image’s contents into latent representations for (a) the image’s non-subject-related factors (e.g., environmental lighting, shadows, image white balance and hue, blurriness); (b) subject-related factors (e.g., skin color, face/eye shape, eyeglasses); (c) eye gaze; and (d) head pose.
A rotation applied to an individual latent factor effects a corresponding change in the appearance of the image with respect to that factor. In this application, we apply a rotation to the gaze-related latent factor. The transformed latent factors are fed to the decoder to generate a transformed version of the input with the gaze redirected to the desired angle, as sketched below.
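A minimal structural sketch of this encode-rotate-decode flow is shown below; the module names, latent factor split, and rotation interface are illustrative assumptions, not the actual Maxine network:

```python
import torch
import torch.nn as nn

class GazeRedirector(nn.Module):
    """Illustrative encoder-decoder with a rotatable gaze latent (hypothetical layout)."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module, rotator: nn.Module):
        super().__init__()
        self.encoder = encoder  # eye patch -> (environment, subject, gaze, head pose) latents
        self.decoder = decoder  # latent factors -> redirected eye patch
        self.rotator = rotator  # applies a rotation to the gaze-related latent

    def forward(self, eye_patch: torch.Tensor, target_angles: torch.Tensor) -> torch.Tensor:
        # Encode the patch into separate latent factors.
        env, subject, gaze, head = self.encoder(eye_patch)
        # Rotate only the gaze-related factor toward the requested pitch/yaw.
        gaze = self.rotator(gaze, target_angles)
        # Decode the partially transformed factors back into an eye patch.
        return self.decoder(env, subject, gaze, head)
```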
Input:
Input Type(s): Image, Angle Vector
Input Format(s): Image: Red, Green, Blue (RGB); Angle Vector: Radians
Input Parameters: Image: 256 x 64 (2D); Redirection Angle Vector: [2, 1]
Other Properties Related to Input: Input image of resolution 256 x 64, normalized RGB in the range [0, 1]. The redirection gaze angle consists of float32 values representing the pitch and yaw angles in radians.
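A hedged preprocessing sketch based on the properties above; the crop/resize policy, channel order, and angle ordering are assumptions, not the documented SDK interface:

```python
import numpy as np
import cv2  # assumed available for resizing; any image library works

def preprocess(eye_patch_bgr: np.ndarray, pitch_rad: float, yaw_rad: float):
    """Prepare one 256 x 64 RGB eye patch and a [2, 1] redirection angle vector."""
    rgb = cv2.cvtColor(eye_patch_bgr, cv2.COLOR_BGR2RGB)
    rgb = cv2.resize(rgb, (256, 64), interpolation=cv2.INTER_LINEAR)  # width x height
    image = rgb.astype(np.float32) / 255.0                            # normalize to [0, 1]
    angles = np.array([[pitch_rad], [yaw_rad]], dtype=np.float32)     # [2, 1] in radians
    return image, angles
```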
Output:
Output Type(s): Image, Estimated Angle Vector, Landmarks Vector
Output Format: Image: Red, Green, Blue (RGB); Estimated Angle Vector: Radians; Landmarks Vector: Float32
Output Parameters: Image: [256 x 64]; Estimated Angle Vector: [2, 1]; Landmarks Vector: [2, 1]
Other Properties Related to Output: The output image is normalized by dividing by 255. The estimated gaze angle consists of float32 values representing the pitch and yaw angles in radians. Redirected eye landmarks are represented by 12 fiducial points marking the outline of the eyes, each given by the x and y positions of the point in float32 format.
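A short sketch of how these outputs could be consumed, assuming the image arrives in the [0, 1] range and the landmarks as 12 (x, y) points; exact tensor layouts are assumptions:

```python
import numpy as np

def postprocess(image_01: np.ndarray, gaze_rad: np.ndarray, landmarks_xy: np.ndarray):
    """Convert the model outputs back into display-friendly values."""
    # Output image is normalized; scale back to 8-bit for display or encoding.
    image_u8 = np.clip(image_01 * 255.0, 0, 255).astype(np.uint8)
    # Estimated gaze: pitch and yaw in radians, converted to degrees for logging.
    pitch_deg, yaw_deg = np.degrees(gaze_rad.reshape(2))
    # 12 fiducial points outlining the eyes, each as (x, y) in float32.
    points = landmarks_xy.reshape(-1, 2)
    return image_u8, (pitch_deg, yaw_deg), points
```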
Software Integration:
0.8.4.0
Supported Operating System(s):
- Ubuntu 18.04
- Ubuntu 20.04
- Ubuntu 22.04
- Debian 11
- Rocky 8.7
- Windows 10
- Windows 11
Model Version(s):
[Maxine 0.8.4]
Supported Hardware Microarchitecture Compatibility:
- [Volta]
- [Turing]
- [Ampere]
- [Ada]
- [Hopper]
Training and Evaluation Dataset:
Data Collection Method by dataset:
- Hybrid: Human and Synthetic
Labeling Method by dataset:
- Automated
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The training dataset contains approximately 2 million images of rectangular eye patches of people, captured under different environmental conditions, head poses, and gaze angles. The dataset contains a combination of real data and synthetic data.
Evaluation Dataset:
Data Collection Method by dataset: Automated, Human
Properties (Quantity, Dataset Descriptions, Sensor(s)):
Internally captured dataset of 65 videos, each between 30 seconds and two (2) minutes long, of a single person seated 1 to 3 feet in front of the camera conducting a video conference. The dataset varies in terms of quality, lighting, head pose, gaze angles, and other diversity factors such as race, eye color, and gender.
Inference:
Engine: TensorRT, Triton
Test Hardware:
- CUDA 11.8-compatible desktop and server hardware.
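Because the model is served with TensorRT through Triton, a minimal Triton HTTP client call might look like the sketch below; the model name, tensor names, shapes, and datatypes are placeholders rather than the published Triton configuration:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical tensor names and shapes; check the deployed model's config.pbtxt.
eye_patch = np.random.rand(1, 64, 256, 3).astype(np.float32)  # normalized RGB eye patch
angles = np.zeros((1, 2, 1), dtype=np.float32)                # redirection pitch/yaw in radians

inputs = [
    httpclient.InferInput("input_image", list(eye_patch.shape), "FP32"),
    httpclient.InferInput("redirect_angles", list(angles.shape), "FP32"),
]
inputs[0].set_data_from_numpy(eye_patch)
inputs[1].set_data_from_numpy(angles)

outputs = [
    httpclient.InferRequestedOutput("output_image"),
    httpclient.InferRequestedOutput("estimated_angles"),
    httpclient.InferRequestedOutput("landmarks"),
]

result = client.infer(model_name="eyecontact", inputs=inputs, outputs=outputs)
redirected = result.as_numpy("output_image")  # gaze-redirected eye patch
```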
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.