
Content Localization
Localize and translate media and sync multiple speakers’ lips to translated audio.
The Content-Localization Blueprint is a reference architecture designed for media producers and creators that deliver news, sports, movies, and television programming. It is specifically engineered to help these companies localize content for global audiences, thereby unlocking new revenue opportunities without requiring the duplication of their existing production infrastructure. The blueprint offers a modular, extensible, and scalable NIM-centric design that supports post-production localization workflows for both audio and video. It achieves this by orchestrating a suite of NVIDIA and partner AI microservices to enable key features like speech translation, active speaker detection, and AI-driven lip-sync for media.
Architecture Diagram
Features
The Content-Localization Blueprint (gRPC) is a modular reference architecture that orchestrates NVIDIA and partner AI microservices to enable media localization.
The blueprint enables:
- Localization across audio and video, with optional graphics integrations
- Automated speech‑to‑speech translation, including single‑speaker and multi‑speaker workflows
- AI‑driven lip‑sync animation of presenters using translated audio
- Flexible deployment modes, including streaming and transactional, file-based processing
The blueprint is built around composable NVIDIA Inference Microservices (NIMs), custom controller logic, and client services, allowing customers and partners to integrate localization capabilities into existing broadcast or streaming pipelines without re‑architecting production workflows. The blueprint integrates third-party speech-to-speech dubbing providers such as CAMB.AI and ElevenLabs alongside NVIDIA Riva.
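To make the composable design concrete, the sketch below models the blueprint's stages as plain Python callables. This is purely illustrative: the function names (`asr`, `translate`, `tts`, `lip_sync`, `localize`) are hypothetical placeholders, not the blueprint's actual NIM or gRPC interfaces.

```python
# Illustrative sketch only: models the blueprint's composable stage design
# with plain Python callables. All names are hypothetical placeholders,
# not the actual NIM gRPC interfaces.

def asr(audio: str) -> str:
    # Placeholder for a Riva ASR NIM call: audio -> source-language text
    return f"transcript({audio})"

def translate(text: str, target_lang: str) -> str:
    # Placeholder for a translation step: text -> target-language text
    return f"{target_lang}:{text}"

def tts(text: str) -> str:
    # Placeholder for a TTS step (e.g. Riva, CAMB.AI, or ElevenLabs)
    return f"speech({text})"

def lip_sync(video: str, dubbed_audio: str) -> str:
    # Placeholder for the LipSync NIM: animates lips to the dubbed track
    return f"synced({video}, {dubbed_audio})"

def localize(video: str, audio: str, target_lang: str) -> str:
    # Chain the stages in sequence, as a controller service might
    dubbed = tts(translate(asr(audio), target_lang))
    return lip_sync(video, dubbed)

print(localize("clip.mp4", "clip.wav", "es"))
```

Because each stage is a separate callable, a partner TTS provider can be swapped in without touching the rest of the chain, which mirrors how the blueprint lets individual microservices be replaced.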
To get access to the LipSync feature of the Content-Localization Blueprint, please request to join our NVIDIA AI for Media Private Access Program.
What the Blueprint Enables
- Speech‑to‑speech translation pipelines using NVIDIA Riva and partner TTS models like CAMB.AI and ElevenLabs
- Active Speaker Detection and diarization to support accurate localization in video content
- LipSync animation driven by translated audio and speaker bounding boxes
- Background audio preservation, enabling localized speech while retaining music and ambience
- End‑to‑end multi-threaded gRPC workflows optimized for content-creation post-production
The architecture supports both single‑speaker and multi‑speaker scenarios, and is designed to evolve as additional NIM capabilities become available.
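For multi‑speaker content, diarization output can route each speaker's segments to per‑speaker dubbing, roughly as in this illustrative sketch. The segment structure and the `dub` helper are assumptions for exposition, not the blueprint's actual API.

```python
# Illustrative sketch: routing diarized segments to per-speaker dubbing.
# The Segment structure and dub() helper are hypothetical.

from typing import NamedTuple

class Segment(NamedTuple):
    speaker: str   # diarization label, e.g. "spk0"
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    text: str      # transcribed source-language text

def dub(segment: Segment, target_lang: str) -> dict:
    # Placeholder for per-speaker translation plus voice-matched TTS
    return {
        "speaker": segment.speaker,
        "start": segment.start,
        "end": segment.end,
        "audio": f"{target_lang}-speech({segment.text})",
    }

def dub_multi_speaker(segments: list, target_lang: str) -> list:
    # Each speaker's segments keep their own timing and voice identity,
    # so the dubbed track can be reassembled in playback order.
    return [dub(s, target_lang) for s in sorted(segments, key=lambda s: s.start)]

segments = [
    Segment("spk1", 2.0, 4.5, "Back to you."),
    Segment("spk0", 0.0, 2.0, "Good evening."),
]
for d in dub_multi_speaker(segments, "fr"):
    print(d["speaker"], d["audio"])
```

The single‑speaker case is simply the degenerate list with one diarization label; the routing logic is unchanged.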
Benefits
- Expand global audience reach without duplicating production pipelines per language
- Reduce localization costs by replacing manual, multi‑production workflows with AI‑driven services
- Integrate incrementally with existing broadcast, OTT, or cloud media architectures
- Future‑proof localization workflows as NIM capabilities evolve across versions
License/Terms of Use
GOVERNING TERMS (restricted access): The blueprint software is governed by the Apache License 2.0 and enables use of separate open-source and proprietary software, models, and services governed by their respective licenses, including those below.
- Active Speaker Detection NIM
- LipSync NIM
- Riva ASR NIM
- Riva Magpie-TTS-Zeroshot
- ElevenLabs API service
- CAMB.AI service
Sample Assets: Use of the assets is governed by the NVIDIA Sample Data License.
Deployment Geography
Global
Use Cases
- Speech‑to‑speech translation for film, television, and other video content
- AI‑driven lip‑sync of localized presenter audio
- Scalable localization for global sports, film and video content, podcasts, and news distribution
Who Is It For
The Content-Localization Blueprint is designed for engineering‑led media organizations evaluating or deploying AI‑driven localization within audio and video pipelines.
Software Integration
Runtime Engine(s): NVIDIA Dynamo-Triton (formerly NVIDIA Triton Inference Server)
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Turing
- NVIDIA Ampere
- NVIDIA Ada Lovelace
- NVIDIA Blackwell
Supported Operating System(s):
- Linux
Inference
Acceleration Engine: TensorRT, Triton
Test Hardware (per NIM):
- T4
- A2, A10, A16, A40
- L4, L40, L40s
- NVIDIA RTX PRO 6000 Blackwell Server Edition
For entire blueprint:
- A40, L40, L40s, NVIDIA RTX PRO 6000 Blackwell Server Edition
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
You may not directly or indirectly use this Content Localization Blueprint to alter the name, likeness, image, or voice of any person in violation of applicable law or regulation or without the person’s express consent.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
Please report security vulnerabilities or NVIDIA AI Concerns here.
