
Detect and track speaker identities across video frames.
Follow the steps below to download and run the NVIDIA NIM inference microservice for this model on your infrastructure of choice.
```shell
$ docker login nvcr.io
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
```
The NVIDIA Active Speaker Detection NIM uses gRPC APIs for inference requests.
An NGC API key is required to download the appropriate models and resources when starting the NIM.
If you are not familiar with creating the `NGC_API_KEY` environment variable, the simplest way is to export it in your terminal:
```shell
export NGC_API_KEY=<PASTE_API_KEY_HERE>
```
Run one of the following commands to make the key available at startup:
```shell
# If using bash
echo "export NGC_API_KEY=<value>" >> ~/.bashrc

# If using zsh
echo "export NGC_API_KEY=<value>" >> ~/.zshrc
```
Other, more secure options include saving the value in a file, so that you can retrieve it with `cat $NGC_API_KEY_FILE`, or using a password manager.
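As a sketch of the file-based option, the key can be kept in a file with restricted permissions and loaded into the environment before starting the NIM. The file path and helper name below are illustrative, not part of the NIM:

```python
import os
from pathlib import Path

def load_ngc_key(key_file: str) -> str:
    """Return the NGC API key from a file if present, else from the environment."""
    path = Path(key_file).expanduser()
    if path.is_file():
        return path.read_text().strip()
    return os.environ.get("NGC_API_KEY", "")

# Illustrative usage: load the key so child processes (e.g. docker run) inherit it
os.environ["NGC_API_KEY"] = load_ngc_key("~/.ngc/api_key")
```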
The following command launches the NVIDIA Active Speaker Detection NIM with the gRPC service. For a reference to the container's runtime parameters, see the runtime parameters documentation.

```shell
docker run -it --rm --name=active-speaker-detection-nim \
  --runtime=nvidia \
  --gpus all \
  --shm-size=8GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_MANIFEST_PROFILE=$MANIFEST_PROFILE_ID \
  -e NIM_HTTP_API_PORT=8000 \
  -e NIM_GRPC_API_PORT=8001 \
  -p 8000:8000 \
  -p 8001:8001 \
  nvcr.io/nim/nvidia/active-speaker-detection:latest
```
Ensure you use the appropriate `MANIFEST_PROFILE_ID` for your GPU. For more information about `MANIFEST_PROFILE_ID`, refer to the Model Manifest Profiles table.
`MANIFEST_PROFILE_ID` is an optional parameter. If the manifest profile ID is not supplied, the NIM automatically selects a matching profile ID based on the target hardware architecture.
However, if `MANIFEST_PROFILE_ID` is used, ensure that the associated GPU architecture is compatible with the target hardware. If an incorrect manifest profile ID is used, a deserialization error occurs on inference.
Note that the `--gpus all` flag assigns all available GPUs to the Docker container. On a multi-GPU machine, this fails unless all GPUs are identical. To assign specific GPUs to the container (for example, when the machine has a mix of GPU models), use `--gpus '"device=0,1,2..."'`.
If the command runs successfully, the output ends with lines similar to the following:
```
I1027 22:31:44.952125 123 grpc_server.cc:2560] "Started GRPCInferenceService at 127.0.0.1:9001"
I1027 22:31:44.952247 123 http_server.cc:4755] "Started HTTPService at 127.0.0.1:9000"
I1027 22:31:44.993329 123 http_server.cc:358] "Started Metrics Service at 127.0.0.1:9002"
Triton server is ready
[INFO AI4M BASE LOGGER 2026-03-11 10:55:57.487 PID:195] Listening to 0.0.0.0:8001
```
By default, the Active Speaker Detection NIM gRPC service is hosted on port 8001. Use this port for inference requests. The port is configurable via the `NIM_GRPC_API_PORT` environment variable.
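Before sending requests, you may want to wait until the gRPC port accepts TCP connections. A minimal sketch using only the Python standard library (the helper name is illustrative, not part of the client):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout_s: float = 60.0) -> bool:
    """Poll until a TCP connection to host:port succeeds, or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(1.0)
    return False

# Illustrative usage: block until the NIM's gRPC endpoint is reachable
# ready = wait_for_port("127.0.0.1", 8001)
```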
We have provided a sample client script in our GitHub repo. The script can be used to invoke the Docker container by following the instructions below.
Download the Active Speaker Detection NIM client code by cloning the gRPC client repository (NVIDIA-Maxine/nim-clients):
```shell
git clone https://github.com/NVIDIA-Maxine/nim-clients.git

# Go to the 'active-speaker-detection' folder
cd nim-clients/active-speaker-detection/
```
Install the required dependencies.
For the Python client:
```shell
# Install pip for Python 3 (Debian/Ubuntu)
sudo apt-get install python3-pip
pip install -r requirements.txt
```
```shell
cd scripts
python active_speaker_detection.py \
  --target 127.0.0.1:8001 \
  --video-input ../assets/sample_video_streamable.mp4 \
  --audio-input ../assets/sample_audio.wav \
  --diarization-input ../assets/sample_diarization.json \
  --output speaker_detection_output.mp4
```
The output video will have green bounding boxes on speaking faces and red on non-speaking faces.
To skip sending a separate audio stream (using the audio embedded in the video):

```shell
python active_speaker_detection.py \
  --target 127.0.0.1:8001 \
  --video-input ../assets/sample_video_streamable.mp4 \
  --diarization-input ../assets/sample_diarization.json \
  --skip-audio
```
| Argument | Description | Default |
|---|---|---|
| `--target` | Target gRPC server address (`host:port`). | `127.0.0.1:8001` |
| `--video-input` | Path to the input video file (MP4 format). | `../assets/sample_video_streamable.mp4` |
| `--audio-input` | Path to the input audio file (WAV/MP3/OPUS format). | `../assets/sample_audio.wav` |
| `--diarization-input` | Path to the diarization JSON file with word-level speaker info. | `../assets/sample_diarization.json` |
| `--output` | Path for the output video file with speaker bounding boxes. | `speaker_detection_output.mp4` |
| `--skip-audio` | Skip sending separate audio; use the audio embedded in the video stream. | `False` |
| `--preview-mode` | Send the request to the preview NVCF NIM server. | `False` |
| `--api-key` | NGC API key for authentication. Required in preview mode. | `None` |
| `--function-id` | NVCF function ID for the service. Required in preview mode. | `None` |
| `--ssl-mode` | SSL mode: `DISABLED`, `TLS`, or `MTLS`. | `DISABLED` |
| `--ssl-key` | Path to the SSL private key (required for MTLS). | `../ssl_key/ssl_key_client.pem` |
| `--ssl-cert` | Path to the SSL certificate chain (required for MTLS). | `../ssl_key/ssl_cert_client.pem` |
| `--ssl-root-cert` | Path to the SSL root certificate (required for TLS/MTLS). | `../ssl_key/ssl_ca_cert.pem` |
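The table above can be sketched as an `argparse` definition, which shows how the defaults and flag types fit together. This mirrors the table only; the actual script in the `nim-clients` repo is authoritative, and only a subset of flags is shown:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the client CLI, mirroring a subset of the argument table above."""
    p = argparse.ArgumentParser(description="Active Speaker Detection NIM client (sketch)")
    p.add_argument("--target", default="127.0.0.1:8001",
                   help="Target gRPC server address (host:port)")
    p.add_argument("--video-input", default="../assets/sample_video_streamable.mp4")
    p.add_argument("--audio-input", default="../assets/sample_audio.wav")
    p.add_argument("--diarization-input", default="../assets/sample_diarization.json")
    p.add_argument("--output", default="speaker_detection_output.mp4")
    p.add_argument("--skip-audio", action="store_true")
    p.add_argument("--ssl-mode", default="DISABLED", choices=["DISABLED", "TLS", "MTLS"])
    return p

# Illustrative usage: parse the embedded-audio invocation shown above
args = build_parser().parse_args(["--target", "127.0.0.1:8001", "--skip-audio"])
```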
Refer to the docs for more information.