Insightful, accurate, and interactive video analytics AI agents enable a range of industries to make better decisions faster. These AI agents are given tasks through natural language and can perform complex operations like video summarization and visual question-answering, unlocking entirely new application possibilities. The NVIDIA AI Blueprint makes it easy to get started building and customizing video analytics AI agents for video search and summarization — all powered by generative AI, vision language models (VLMs) like Cosmos Nemotron VLMs, large language models (LLMs) like Llama Nemotron LLMs, NVIDIA NeMo Retriever, and NVIDIA NIM.
The following NIM microservices are used in this blueprint:
cosmos-nemotron-34b
meta/llama-3.1-70b-instruct
llama-3_2-nv-embedqa-1b-v2
llama-3_2-nv-rerankqa-1b-v2
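NIM language-model microservices expose an OpenAI-compatible chat-completions API, so requests to the models above share one schema. The sketch below builds such a request payload; the field set shown is the common minimal subset, and any endpoint details beyond it are assumptions for illustration, not the blueprint's exact client code.

```python
# Minimal sketch of an OpenAI-compatible chat request payload, as accepted by
# NIM LLM microservices. Endpoint URLs, auth headers, and extra sampling
# parameters are omitted; this only shows the request body shape.
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> str:
    payload = {
        "model": model,                                   # e.g. "meta/llama-3.1-70b-instruct"
        "messages": [{"role": "user", "content": prompt}],  # single-turn user message
        "max_tokens": max_tokens,                         # cap on generated tokens
    }
    return json.dumps(payload)
```

The same body shape works for summarization prompts and Q&A prompts alike; only the `model` name and message content change.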
The core video search and summarization blueprint pipeline supports the following hardware:
NVIDIA AI Blueprints are customizable agentic workflow examples that equip enterprise developers with NIM microservices, reference code, documentation, and a Helm chart for deployment. This blueprint offers a reference architecture for deploying a visual agent adept at understanding long-form videos. The agent integrates a scalable video ingestion pipeline, a context manager, a vision language model (VLM), a large language model (LLM), and a context-aware retrieval-augmented generation (CA-RAG) module. The CA-RAG module uses dense captions stored in vector and graph databases as its primary sources for video understanding.
The user selects an example video and a prompt to guide the agent in generating a detailed summary. The agent splits the input video into smaller chunks that are processed by a VLM (the preview uses OpenAI's GPT-4o). These chunks are captioned in parallel by the VLM pipeline, producing detailed descriptions of the events in each chunk in a scalable and efficient manner. The agent then recursively summarizes the dense captions with an LLM, generating a final summary for the entire video once all chunk captions have been processed.
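The chunk-caption-then-reduce flow above can be sketched as a small map-reduce: caption chunks in parallel, then repeatedly batch captions through the LLM until one summary remains. In this sketch, `caption_chunk` and `summarize` are hypothetical placeholders standing in for the VLM and LLM NIM calls, not the blueprint's actual API.

```python
# Hedged sketch of the pipeline: parallel VLM captioning followed by
# recursive LLM summarization. The two model calls are stubbed out.
from concurrent.futures import ThreadPoolExecutor

def caption_chunk(chunk: str) -> str:
    # Placeholder for a VLM call (e.g. cosmos-nemotron-34b) on one video chunk.
    return f"caption({chunk})"

def summarize(texts: list[str]) -> str:
    # Placeholder for an LLM call (e.g. llama-3.1-70b-instruct) over captions.
    return "summary[" + " + ".join(texts) + "]"

def summarize_video(chunks: list[str], batch_size: int = 2) -> str:
    # 1. Caption all chunks in parallel, mirroring the scalable VLM pipeline.
    with ThreadPoolExecutor() as pool:
        captions = list(pool.map(caption_chunk, chunks))
    # 2. Recursively reduce caption batches with the LLM until one summary remains.
    while len(captions) > 1:
        captions = [summarize(captions[i:i + batch_size])
                    for i in range(0, len(captions), batch_size)]
    return captions[0]
```

The recursive reduction keeps each LLM call's input bounded by `batch_size` captions, which is what lets the approach scale to long-form video.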
Additionally, these captions are stored in vector and graph databases to power the Q&A feature of this blueprint, allowing the user to ask any open-ended questions about the video.
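To make the Q&A step concrete: at query time, the stored captions most similar to the user's question are retrieved and passed to the LLM as context. The blueprint does this with NeMo Retriever embedding/reranking NIMs and real vector and graph databases; the toy sketch below substitutes a bag-of-words vector and cosine similarity for both, purely to illustrate the retrieval idea.

```python
# Toy caption retrieval for open-ended Q&A. A bag-of-words Counter stands in
# for a dense embedding model, and a sorted list stands in for a vector DB.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for an embedding NIM: token-count vector of the text.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, captions: list[str], top_k: int = 1) -> list[str]:
    # Rank stored captions by similarity to the question and keep the top_k.
    q = embed(question)
    ranked = sorted(captions, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]
```

The retrieved captions would then be inserted into the LLM prompt, grounding the answer in what was actually observed in the video.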
Use of the models in this blueprint is governed by the NVIDIA AI Foundation Models Community License.
GOVERNING TERMS: This preview is governed by the NVIDIA API Trial Terms of Service.
For models that include a Llama 3.1 model: Llama 3.1 Community License Agreement. Built with Llama.
For the NVIDIA Retrieval QA Mistral 4B Reranking: Apache license.
For the NVIDIA Retrieval QA E5 Embedding v5: NV-EmbedQA-E5-v5: MIT license; NV-EmbedQA-Mistral7B-v2: Apache 2.0 license; and Snowflake arctic-embed-l: Apache 2.0 license.
Ingest massive volumes of live or archived videos and extract insights for summarization and interactive Q&A