NVIDIA Active Speaker Detection
Active Speaker Detection combines multiple models to perform the Active Speaker Detection effect, which processes video and multiple audio-track inputs to detect, identify, and track speaker identities across video frames. These models include the AdaFace model, the Face and Landmarks Detector (FLD) model, and the SyncDiscriminator model.
This feature and all associated models are available for commercial/non-commercial use.
License/Terms of use:
The use of NVIDIA Active Speaker Detection is governed by the NVIDIA SOFTWARE LICENSE AGREEMENT and Product-Specific Terms for NVIDIA AI Products.
Please see the additional sections below for each model for the governing terms of their use.
Adaface
Model Overview
Adaface generates embeddings for identifying people captured in different scenes. This model does not include biometric data.
License/Terms of use
Use of the Adaface model is governed by the NVIDIA Open Model License. Additional Information: MIT.
Third-Party Community Consideration:
This model is not owned or developed by NVIDIA. This model has been developed and built to a third party's requirements for this application and use case; see link to Non-NVIDIA model card here: CVLFace Adaface IR101 WebFace12M.
Deployment Geography:
Global
Use Case:
- Generates embeddings for identifying people captured in different scenes for video dubbing
- Based on ResNet101 architecture with Additive Angular Margin Loss
- Optimized for minimal computational overhead
- Supports various demographics, lighting conditions, and image qualities
- Produces 512-dimensional face embeddings for identification tasks
- Compatible with the InsightFace framework; replaces the ArcFace/AuraFace model
Release Date:
HuggingFace repo June 5, 2024
NGC: 03/12/2026 - TensorRT optimized
References
Model Architecture
Architecture Type
Convolutional Neural Network (CNN)
Network Architecture
- Backbone: IResNet101
- Loss Function: Additive Angular Margin Loss (ArcFace)
- Embedding Dimension: 512-dimensional face embeddings
- Framework: ONNX format compatible with InsightFace ArcFace, TensorRT optimized
- Number of model parameters: 65.2M
Input
Input Type
Image
Input Format
- Color Space: RGB (Red, Green, Blue)
- Image Dimensions: 112x112 pixels
- Data Type: float32 (normalized)
Input Parameters
- Batch Size: Configurable (1 or more)
- Image: Two-Dimensional (2D)
- Channels: 3 (RGB)
- Layout: NHWC
Preprocessing Requirements
- Face detection and alignment
- Cropping to standard size (112x112)
- Normalization to [0, 1] range
Other Properties Related to Input
- Input images should contain a single face, properly aligned
- Images with multiple faces should be preprocessed to identify individual faces
- Optimal performance with well-lit, frontal face images
- Model can handle various lighting conditions and poses due to training data diversity
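The preprocessing requirements above can be sketched as follows. This is an illustrative helper, not part of any SDK; it assumes the face has already been detected, aligned, and cropped to 112x112 as described.

```python
import numpy as np

def preprocess_face(crop_rgb: np.ndarray) -> np.ndarray:
    """Prepare an aligned 112x112 RGB face crop for embedding extraction.

    `crop_rgb` is assumed to be a uint8 HxWx3 array that has already been
    face-detected, aligned, and cropped to 112x112.
    """
    assert crop_rgb.shape == (112, 112, 3), "crop must be 112x112x3"
    x = crop_rgb.astype(np.float32) / 255.0  # normalize to [0, 1]
    return x[np.newaxis, ...]                # add batch dim -> NHWC (1, 112, 112, 3)
```

Images containing multiple faces should be split into one aligned crop per face before this step.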
Output
Output Type
Embedding Vector
Output Format
- Embedding Dimension: 512-dimensional normalized feature vector
- Data Type: float32
- Normalization: L2-normalized (unit vector)
Output Parameters
Other Properties Related to Output
- Output embeddings can be compared using cosine similarity or Euclidean distance
- Typical similarity threshold for same person: >0.5 (cosine similarity)
- Embeddings are designed to cluster faces of the same identity closely in feature space
- Can be used for face verification, identification, and clustering tasks
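The comparison described above can be sketched as follows. The helper names are illustrative (not SDK APIs), and the 0.5 threshold is the typical value quoted above; it should be tuned for each deployment.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two face embeddings (L2-normalized first)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

def same_person(a: np.ndarray, b: np.ndarray, threshold: float = 0.5) -> bool:
    """Verification decision using the typical same-person threshold."""
    return cosine_similarity(a, b) > threshold
```

Because the model's outputs are already L2-normalized, cosine similarity and Euclidean distance give equivalent rankings; the extra normalization here just makes the helper safe for arbitrary vectors.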
Training & Evaluation Data
Training Dataset
Dataset Sources
The model was trained on a commercial dataset comprising face images from various publicly available and licensed sources:
- Commercially licensed face datasets
- Publicly available face identification benchmarks
Dataset Characteristics
AdaFace was trained on a commercial dataset comprising face images from various sources, including synthetic images. The dataset includes a wide range of demographics, lighting conditions, and image qualities to ensure robust performance across different scenarios.
- Data Collection Method:
- Hybrid: Synthetic, Undisclosed
- Labeling Method: Undisclosed
Data Preprocessing
- Normalization: All images were normalized to standard size and format
- Augmentation: Rotation, flipping, and scaling were used to improve generalization
- Face Alignment: All training faces were detected and aligned using standard landmarks
Dataset Limitations
Due to commercial licensing requirements, the training dataset may not extensively cover all global ethnicities and demographic groups. Users should conduct their own assessments to confirm the model's performance in their specific application context.
Evaluation Benchmarks
AdaFace has been tested on multiple face recognition benchmarks:
- LFW: 0.99650
- CFP-FP: 0.95186
- AGEDB: 0.96100
- CALFW: 0.94700
- CPLFW: 0.90933
Intended Use
Primary Use Cases
Media and Entertainment
- Face-based search in media libraries
- Content personalization
Bias and Fairness
Known Limitations
- Demographic Representation: The training dataset attempts to include diverse demographics, but may have limited representation of certain ethnicities due to dataset availability and commercial licensing constraints
- Performance Variation: Model efficacy in identity preservation may vary based on ethnicity, age, gender, and other demographic factors
- Lighting Sensitivity: Performance may degrade in extreme lighting conditions not well-represented in training data
Mitigation Efforts
- Training data included diverse demographics, lighting conditions, and image qualities
- Continuous monitoring and evaluation across demographic groups
- Open-source release to enable community testing and feedback
- Commercial licensing ensures proper data usage rights
Fairness Considerations
Efforts have been made to ensure that AdaFace performs equitably across different demographic groups. However, users should:
- Conduct their own fairness assessments in their specific application context
- Test model performance across relevant demographic groups for their use case
- Implement appropriate thresholds and decision-making processes
- Monitor for bias in production deployments
Privacy Considerations
Data Privacy
- AdaFace characterizes faces, which is considered sensitive personal information in many jurisdictions
- Users must ensure compliance with relevant privacy laws (GDPR, CCPA, BIPA, etc.)
- Implement appropriate data protection measures (encryption, access controls)
- Obtain necessary consent from individuals whose faces are processed
Recommendations
- Store face embeddings rather than raw images when possible
- Provide transparency to users about face identification usage
- Enable user consent and opt-out mechanisms
- Conduct Privacy Impact Assessments (PIA) for your application
Security Considerations
Potential Vulnerabilities
- Presentation Attacks: The model may be vulnerable to spoofing attacks using photos, videos, or masks (requires additional liveness detection)
- Adversarial Attacks: Like all neural networks, the model may be susceptible to adversarial perturbations
- Model Extraction: Publicly available model weights could be subject to model stealing attacks
Security Best Practices
- Implement liveness detection for security-critical applications
- Use multi-factor authentication rather than face identification alone for high-security scenarios
- Monitor for unusual patterns or potential attacks
- Keep model and dependencies updated
- Implement proper access controls for model deployment
Limitations and Recommendations
Known Limitations
- Generalization: The model's generalization is limited by the scope of the training data
- Perfect Accuracy: The model does not achieve perfect photorealism and identity consistency in all cases
- Demographic Variance: Performance may vary across different demographic groups
- Environmental Conditions: Extreme lighting, poses, or occlusions may reduce accuracy
- Performance Gap: Does not match the performance of the original ArcFace due to smaller commercial training dataset
- Real-time Requirements: May require GPU acceleration for real-time applications with high throughput
Recommendations for Users
- Validation: Conduct thorough testing in your specific deployment environment
- Threshold Tuning: Adjust similarity thresholds based on your use case
- Demographic Testing: Test across relevant demographic groups for your application
- Liveness Detection: Implement additional liveness detection for security applications
- Monitoring: Continuously monitor model performance in production
- Legal Compliance: Ensure compliance with all relevant regulations (GDPR, CCPA, BIPA, etc.)
- Human Oversight: Implement human review for high-stakes decisions
- Feedback Loop: Establish mechanisms to collect and address user feedback
Performance Optimization Tips
- Use GPU inference for batch processing and real-time applications
- Implement face detection caching for video streams
- Optimize image preprocessing pipeline
- Use TensorRT for additional performance gains on NVIDIA hardware
- Consider model quantization (FP16, INT8) for edge deployment
Model Governance
Model Maintenance
- Updates: The model is currently at version 1.0, with community feedback guiding future improvements
- Issue Reporting: Users can report issues on the HuggingFace model page or GitHub repository
- Community Contributions: The open-source nature enables community improvements and extensions
Compliance and Regulations
Users are responsible for ensuring their use of AdaFace complies with:
- Local and international privacy laws (GDPR, CCPA, BIPA, etc.)
- Biometric data regulations
- Industry-specific requirements
- Ethical AI guidelines
Changelog and Versioning
Version 1.0 (Initial Release)
- ResNet101 architecture with Additive Angular Margin Loss
- Trained on commercially available face datasets
- ONNX format for broad framework compatibility
- Compatible with the InsightFace framework; replaces ArcFace
Contact and Support
Model Developers
Additional Resources
Documentation
Citations
If you use AdaFace in your research or applications, please cite:
@misc{adaface2022,
  title={AdaFace: Quality Adaptive Margin for Face Recognition},
  author={Kim, Minchul and Jain, Anil K. and Liu, Xiaoming},
  year={2022},
  howpublished={\url{https://arxiv.org/abs/2204.00964}}
}
@inproceedings{deng2019arcface,
  title={ArcFace: Additive angular margin loss for deep face recognition},
  author={Deng, Jiankang and Guo, Jia and Xue, Niannan and Zafeiriou, Stefanos},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={4690--4699},
  year={2019}
}
Face and Landmarks Detector (FLD)
Model Overview
The Face and Landmarks Detector (FLD) is a high-performance face detection and facial landmark localization model based on the SCRFD (Sample and Computation Redistribution for Face Detection) architecture. SCRFD is an efficient face detector with landmark localization capabilities, designed for real-time performance while maintaining high accuracy. The model simultaneously detects faces and predicts facial keypoints, making it ideal for comprehensive facial analysis pipelines.
License/Terms of Use:
GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License.
Deployment Geography:
Global
Use Case:
Primary Applications
- Facial Recognition Pipelines: Face detection and landmark extraction as preprocessing for recognition/verification
- Face Alignment: Landmark-based face normalization for downstream tasks
- Facial Expression Analysis: Face detection with keypoints for emotion recognition
- Video Analytics: Real-time face and landmark detection in video streams
- Media Organization: Automated face detection with alignment for media libraries
Supported Domains
- Broadcasting and media production
- Entertainment
- Human-Computer Interaction
Release Date:
03/12/2026
Model Architecture
Architecture Details
- Backbone: ResNet-inspired efficient backbone
- Detection Framework: SCRFD (Sample and Computation Redistribution for Face Detection)
- Detection Method: Anchor-free detection with stride-based feature pyramids
- Feature Pyramid Network: Multi-scale feature extraction
- Landmark Prediction: 5-point facial landmarks (two eyes, nose, two mouth corners)
- Number of model parameters: 3.9 million (10 GFLOPS variant)
SCRFD Architecture Highlights
- Sample Redistribution: Efficient positive/negative sample assignment strategy
- Computation Redistribution: Optimized computation allocation across detection heads
- Anchor-Free Design: Direct prediction without predefined anchors
- Multi-Task Learning: Simultaneous face detection and landmark localization
Input(s):
Input Type: Image
Input Format:
- Input name: input_image
- Color format: Red, Green, Blue (RGB)
- Data type: fp32, fp16
- Shape: [1, 3, 640, 640] (NCHW format)
Input Parameters:
- Image: Two-Dimensional (2D) (width, height) = (640, 640)
- Pre-processing:
- Resize image to (640, 640) using bilinear interpolation, preserving aspect ratio with zero-padding
- Mean normalization: [127.5, 127.5, 127.5] (applied to raw 0-255 pixel values)
- Standard deviation: [128.0, 128.0, 128.0]
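A minimal sketch of the pre-processing above, assuming the mean/std normalization is applied to raw 0-255 pixel intensities (the usual SCRFD convention). Nearest-neighbor indexing stands in for bilinear interpolation to keep the example dependency-free; `preprocess_fld` is an illustrative helper, not an SDK function.

```python
import numpy as np

def preprocess_fld(img_rgb: np.ndarray, size: int = 640):
    """Letterbox an RGB image to (size, size) and normalize for detection.

    Returns the NCHW float32 tensor and the scale factor needed to map
    detected coordinates back to the original image.
    """
    h, w = img_rgb.shape[:2]
    scale = size / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbor resample via index arithmetic (bilinear in production)
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = img_rgb[ys][:, xs]
    canvas = np.zeros((size, size, 3), dtype=np.float32)  # zero padding
    canvas[:nh, :nw] = resized
    canvas = (canvas - 127.5) / 128.0                     # mean/std normalization
    tensor = canvas.transpose(2, 0, 1)[np.newaxis]        # NCHW: (1, 3, 640, 640)
    return tensor, scale
```

Detections in the 640x640 space can be mapped back to the original image by dividing coordinates by the returned `scale`.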
Output(s)
Output Type(s):
For each detected face:
- Bounding boxes (x1, y1, x2, y2) in image space
- Confidence score
- 5 facial landmarks (x, y coordinates for each keypoint: eye centers, nose, mouth corners)
Output Format(s):
- Bounding Box: 4 fp32/fp16 values per detection
- Confidence score: 1 fp32/fp16 value per detection
- Landmarks: 10 fp32/fp16 values per detection (5 points × 2D coordinates)
Output Parameters:
bboxes: Nx4 fp32/fp16 array (N detections, 4 coordinates each)
scores: Nx1 fp32/fp16 array (N confidence scores)
landmarks: Nx10 fp32/fp16 array (N detections, 10 landmark coordinates)
Post-processing
- Score Thresholding: Applied to filter low-confidence detections (default: 0.5)
- Non-Maximum Suppression (NMS): Applied with default IoU threshold of 0.4
- Landmark Refinement: Direct landmark coordinate prediction in image space
- Optional Face Selection: Supports limiting detections to top-N faces by confidence or area
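The thresholding and NMS steps above can be sketched as standard greedy NMS. This is illustrative, not the SDK implementation; the default thresholds match the values documented above.

```python
import numpy as np

def nms(bboxes, scores, score_thr=0.5, iou_thr=0.4):
    """Filter detections by confidence, then apply greedy NMS.

    bboxes: Nx4 array of (x1, y1, x2, y2); scores: length-N array.
    Returns indices (into the original arrays) of the kept detections.
    """
    idx = np.where(scores >= score_thr)[0]
    order = idx[np.argsort(-scores[idx])]  # highest confidence first
    kept = []
    while order.size:
        i = order[0]
        kept.append(i)
        rest = order[1:]
        # Intersection-over-union of box i with the remaining boxes
        x1 = np.maximum(bboxes[i, 0], bboxes[rest, 0])
        y1 = np.maximum(bboxes[i, 1], bboxes[rest, 1])
        x2 = np.minimum(bboxes[i, 2], bboxes[rest, 2])
        y2 = np.minimum(bboxes[i, 3], bboxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (bboxes[i, 2] - bboxes[i, 0]) * (bboxes[i, 3] - bboxes[i, 1])
        area_r = (bboxes[rest, 2] - bboxes[rest, 0]) * (bboxes[rest, 3] - bboxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]  # drop boxes overlapping the kept one
    return kept
```

The kept indices can then be used to slice the `bboxes`, `scores`, and `landmarks` arrays consistently.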
Performance
Inference Characteristics
- Optimized Runtime: TensorRT (NVIDIA), ONNX Runtime
- Supported Hardware: NVIDIA GPUs with compute capability 7.5 and up, with TensorRT support
- Precision: FP32 (supports FP16/INT8 optimization)
- Typical Latency: Real-time performance on modern NVIDIA GPUs (~5-10ms on RTX 3080)
- Memory Footprint: Lightweight design optimized for edge and datacenter deployment
Detection Metrics
- Default Confidence Threshold: 0.50
- Default IoU Threshold: 0.40 (NMS)
- Detection Capabilities: Multiple face detection per frame with landmarks
- Landmark Points: 5 keypoints (left eye, right eye, nose tip, left mouth corner, right mouth corner)
- Face Selection Modes:
  - max: Select faces by maximum area
  - center: Select faces closest to image center with area weighting
Limitations
Known Constraints
- Occlusion Sensitivity: Detection and landmark accuracy may degrade with partial face occlusions (masks, sunglasses, hands)
- Lighting Conditions: Reduced accuracy in extreme low-light or high-contrast scenarios
- Face Pose: Optimized for frontal and near-frontal faces; extreme profile views (>60° yaw) may have lower detection rates and less accurate landmarks
- Face Size: Performance varies based on face size in the image; very small faces (<20×20 pixels) may not be detected reliably
- Landmark Accuracy: 5-point landmarks are sufficient for alignment but not for detailed facial analysis requiring dense landmarks
Bias and Fairness Considerations
- Model performance may vary across different demographic groups depending on training data composition
- Users should evaluate model performance on their specific target populations
- Consider using diverse test datasets to assess fairness metrics across age, gender, ethnicity, and skin tone
Training Data
Dataset Characteristics
The model has been trained on diverse face detection and landmark datasets including:
- Multiple age groups (children, adults, elderly)
- Various ethnicities and skin tones
- Different lighting conditions (indoor, outdoor, natural, artificial)
- Indoor and outdoor scenarios
- Various face poses and expressions
- Occluded and non-occluded faces
Data Collection Method by dataset
- Hybrid: Human, Synthetic images and videos
Labeling Method by dataset
- Hybrid: Human annotators, Automatic/Sensors
Note: Training typically involves publicly available datasets such as WIDER FACE, and additional proprietary datasets.
Inference
- Acceleration Engine: TensorRT 10.9 - 10.13, ONNX Runtime 1.15+
- Test Hardware: Tesla T4, A100, GeForce RTX 3080, GeForce RTX 4090, L40, GH100, Jetson AGX Orin
Key Considerations
- Privacy: Ensure compliance with applicable privacy laws and regulations (GDPR, CCPA, BIPA, etc.)
- Consent: Obtain necessary consent for face detection and biometric data collection in applicable jurisdictions
- Bias Mitigation: Regularly evaluate and monitor for demographic performance disparities
- Transparency: Inform end-users when face detection and landmark extraction technology is in use
- Data Retention: Establish policies for retention and deletion of facial data
References
Technical Papers
- SCRFD: Guo, J., Deng, J., Lattas, A., & Zafeiriou, S. (2021). "Sample and Computation Redistribution for Efficient Face Detection." arXiv preprint arXiv:2105.04714. Paper Link
Related Resources
Biometric Data Considerations
This model extracts facial landmarks which may be considered biometric data under certain regulations (e.g., BIPA in Illinois). Users must:
- Obtain informed consent before collecting biometric data
- Clearly disclose the purpose and duration of data storage
- Implement appropriate security measures to protect biometric data
- Provide mechanisms for data subject rights (access, deletion)
SyncDiscriminator
Model Overview:
The SyncDiscriminator model determines whether a visible person in a video frame is actively speaking, using both audio and visual cues. The model processes face crops from video frames alongside corresponding audio Mel-Frequency Cepstral Coefficient (MFCC) features to produce a per-frame speaking detection score.
This model is ready for commercial/non-commercial use.
License/Terms of Use
GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License. Additional Information: MIT.
Deployment Geography:
Global
Use Case:
NVIDIA AR SDK developers building video communication and content creation applications that require identifying which person in a scene is currently speaking.
Release Date:
NGC - 03/12/2026 via URL
Reference(s):
Model Architecture:
Architecture Type: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN)
Network Architecture: 3D CNN, 2D CNN, Gated Recurrent Unit (GRU)
- This model was developed based on a lightweight audio-visual architecture with separate visual and audio encoders, gated concatenation fusion, and a bidirectional GRU temporal detector.
- Number of model parameters: 8.4 x 10^5
Input(s):
Input Type(s): Video (Face Crops), Audio
Input Format(s):
- Video: Grayscale
- Audio: MFCC (Mel-Frequency Cepstral Coefficients)
Input Parameters:
- Video: Three-Dimensional (3D)
- Audio: Two-Dimensional (2D)
Other Properties Related to Input: Video: Grayscale face crops resized to 112x112 pixels with zero-centered pixel intensity normalization; variable length up to 6 seconds, variable frame rates supported. Audio: MFCC coefficients extracted to yield 4 MFCC frames per video frame.
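The 4-MFCC-frames-per-video-frame alignment described above can be sketched as a simple reshape, assuming the MFCC hop length was chosen to match the video frame rate (e.g. a 10 ms hop against 25 fps video). `mfcc_frames_for_video` is a hypothetical helper for illustration only.

```python
import numpy as np

def mfcc_frames_for_video(mfcc: np.ndarray, num_video_frames: int,
                          mfcc_per_frame: int = 4) -> np.ndarray:
    """Group a (T_audio, n_mfcc) MFCC matrix into per-video-frame chunks.

    Assumes exactly `mfcc_per_frame` audio frames fall within each video
    frame. Returns shape (num_video_frames, mfcc_per_frame, n_mfcc).
    """
    needed = num_video_frames * mfcc_per_frame
    assert mfcc.shape[0] >= needed, "audio track shorter than video"
    return mfcc[:needed].reshape(num_video_frames, mfcc_per_frame, -1)
```

Each chunk pairs with the corresponding grayscale 112x112 face crop for that video frame.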
Output(s):
Output Type(s): Score
Output Format(s):
Output Parameters:
Other Properties Related to Output: Per-frame active speaker detection score (raw logit). A score greater than 0.0 indicates the person is speaking. One score is produced per input video frame.
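A minimal sketch of applying the documented decision rule (raw logit greater than 0.0 means speaking); the sigmoid probability is added here only for display purposes and is not part of the documented output.

```python
import numpy as np

def speaking_decisions(logits: np.ndarray):
    """Per-frame speaking flags from raw logits, plus display probabilities.

    The documented decision rule is simply logit > 0.0; the sigmoid is an
    optional convenience for showing a confidence-like value.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    return logits > 0.0, probs
```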
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Training and Evaluation Datasets:
- The total size (in number of data points): 28,726 training tracks and 7,753 validation tracks (AVA-ActiveSpeaker), plus 6,635 evaluation clips (VoxCeleb2/TalkSet) and Columbia ASD (5 speakers)
- Total number of datasets: 4 (AVA-ActiveSpeaker train, AVA-ActiveSpeaker validation, Columbia ASD, VoxCeleb2/TalkSet)
- Dataset partition: Training [79%], evaluation [21%] (AVA train/val split; Columbia ASD and VoxCeleb2/TalkSet are evaluation-only datasets)
Training Dataset:
- AVA-ActiveSpeaker Dataset
Data Collection Method by dataset
Hybrid: Human and Automated
Labeling Method by dataset
Hybrid: Human and Automated
Properties:
The AVA-ActiveSpeaker dataset contains face tracks and corresponding audio from Hollywood movie clips. Each face track is annotated with a binary speaking/not-speaking label at the frame level. The training set contains approximately 3.6 million entity-level annotations across 120 15-minute movie segments. Face crops are extracted using S3FD face detection and tracking.
Data Modality
Video Training Data Size
Evaluation Dataset:
- AVA-ActiveSpeaker Validation Set
- Columbia ASD Dataset
- VoxCeleb2/TalkSet Validation Set
Benchmark Score:
AVA validation: mAP = 94.44%. Columbia ASD: F1 = 87.23% (averaged across 5 speakers, evaluated with 3-second duration).
Data Collection Method by dataset
Hybrid: Human and Automated
Labeling Method by dataset
Hybrid: Human and Automated
Properties:
AVA validation set contains 7,753 tracks across ~33 movie segments with frame-level annotations. Columbia ASD contains panel discussion videos with 5 speakers. VoxCeleb2/TalkSet validation set contains 6,635 clips (TAudio: 2,494, FAudio: 2,072, TFAudio: 2,069) derived from VoxCeleb2 celebrity interview videos with constructed speaking/non-speaking scenarios.
Software Integration:
Runtime Engine(s):
NVIDIA AR SDK
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Turing
Preferred/Supported Operating System(s):
- Ubuntu 20.04
- Ubuntu 22.04
- Ubuntu 24.04
- Debian 12
- Rocky/RHEL 8.*
- Rocky/RHEL 9.*
- Windows 10
- Windows 11
Inference:
Engine: TensorRT, Triton
Test Hardware:
Desktops and Servers with following GPU architectures:
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Ampere
- NVIDIA Turing
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
Contact and Support
For questions, issues, or support: