Copyright © 2026 NVIDIA Corporation


Build a Video Search and Summarization (VSS) Agent

Ingest massive volumes of live or archived videos and extract insights for summarization and interactive Q&A


Blueprint Overview

Use Case Description

The NVIDIA AI Blueprint for Video Search and Summarization (VSS) makes it easy to start building and customizing video analytics AI agents. These insightful, accurate, and interactive agents are powered by generative AI, vision language models (VLMs), large language models (LLMs), and NVIDIA NIM™ microservices—helping a variety of industries make better decisions, faster. They can be given tasks through natural language and perform complex operations like video summarization and visual question answering, unlocking entirely new application possibilities.

Test the VSS blueprint in the cloud with NVIDIA Launchables, pre-configured sandbox instances that let you quickly try the blueprint without bringing your own compute infrastructure.

Key Features

  • Real-time and batch processing modes
  • Video search
  • Video summarization
  • Interactive question answering (Q&A)
  • Alerts
  • Event reviewer and verification
  • Object tracking
  • Multimodal model fusion

Key Benefits

  • Build video analytics AI agents that can analyze, interpret, and process vast amounts of video data at scale.
  • Produce summaries of long videos up to 100X faster than going through the videos manually.
  • Accelerate development time by bringing together various generative AI models and services to quickly build AI agents.
  • Augment traditional computer vision pipelines with VLMs to provide deep video understanding.
  • Provide a range of optimized deployments, from the enterprise edge to cloud.

Architecture Diagram

Included NIM

The following NIM microservices are used in this blueprint:

  • cosmos-reason2-8b
  • nemotron-nano-9b-v2

Minimum System Requirements

Core engine

The core video search and summarization blueprint pipeline supports the following hardware:

  • RTX Pro 6000 WS/SE
  • DGX Spark
  • Jetson Thor
  • B200
  • H200
  • H100
  • A100
  • L40/L40S
  • A6000

Hosted NIMs

  • NVIDIA-Nemotron-Nano-9B-v2 requires the following minimum GPU configuration based on this support matrix.
  • Cosmos Reason 2 VLM requires a minimum of 1x L40S GPU.

Minimum Local Deployment Configuration

The following configurations have been validated as minimal, local deployments.

  • 1 x RTX Pro 6000 WS/SE/DGX Spark/Jetson Thor/B200/H100/H200/A100 (80 GB)
  • 4 x L40/L40S/A6000

What’s included in the Blueprint

NVIDIA AI Blueprints are customizable agentic workflow examples that include NIM microservices, reference code, documentation, and Docker Compose files for deployment. This blueprint gives you a reference architecture to deploy a visual agent that can quickly generate insights from stored and streamed video through a scalable video ingestion pipeline, VLMs, and hybrid-RAG modules.

Agent Workflows

VSS contains multiple agent workflows: end-to-end use cases that define how the VSS agent processes requests and coordinates between microservices. The blueprint example demonstrates the video summarization agent workflow.

Additional workflows include:

  • Search: The search agent workflow enables natural language queries across video archives to locate specific events, objects, or actions. It uses multi-embedding ingestion, retrieval, and fusion of results. Once highly relevant results are identified, the agent critiques the clips against the user query before giving the final response.

  • Alerts: There are two alert-based agent workflows that leverage VLMs: alert verification and real-time alerting. Alert verification uses a VLM to analyze the video snippets behind alerts generated upstream by computer vision, confirming or rejecting each alert. The real-time alerting workflow uses a VLM to continuously process segments from a video source (e.g., a camera) at periodic intervals based on a user-defined chunk duration.
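
The "fusion of results" step in the search workflow can be illustrated with a small sketch. Reciprocal rank fusion (RRF) is one common way to merge ranked lists from multiple embedding indexes; the function name, clip IDs, and the choice of RRF itself are illustrative assumptions here, not the blueprint's actual implementation.

```python
# Hypothetical sketch of merging ranked clip lists from multiple
# embedding indexes via reciprocal rank fusion. All names are
# illustrative, not the blueprint's API.

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of clip IDs into one fused ranking."""
    scores = {}
    for ranking in ranked_lists:
        for rank, clip_id in enumerate(ranking):
            # Clips ranked highly in several lists accumulate more score.
            scores[clip_id] = scores.get(clip_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Results from two hypothetical embedding indexes (visual and caption text):
visual_hits = ["clip_07", "clip_03", "clip_12"]
text_hits = ["clip_07", "clip_09", "clip_03"]
fused = reciprocal_rank_fusion([visual_hits, text_hits])
print(fused[0])  # the clip ranked highly by both indexes rises to the top
```

A critique pass over the top fused clips, as the search workflow describes, would then run on `fused[:n]` before the final response.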

Core Technology

VSS is broken down into three major areas of video processing and analysis:

  • Real-time video intelligence: The Real-Time Video Intelligence layer extracts rich visual features, semantic embeddings, and contextual understanding from video data in real-time, publishing results to a message broker for downstream analytics and agentic workflows. It provides three core microservices for processing video streams.
  • Downstream analytics: The Downstream Analytics layer processes and enriches the metadata streams generated by real-time video intelligence microservices, transforming raw detections into actionable insights and verified alerts.
  • Agent and offline processing: The top-level agent leverages the Model Context Protocol (MCP) to access video analytics data, incident records, and vision processing capabilities through a unified tool interface. It integrates multiple vision-based tools including video understanding with Vision Language Models (VLMs), semantic video search using embeddings, long video summarization for extended footage analysis, and video snapshot/clip retrieval.
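
As a rough sketch of what a unified tool interface buys the agent, the registry below dispatches named tools through one entry point. It mimics the MCP pattern in spirit only; every name and stub below is hypothetical, not the blueprint's or MCP's actual API.

```python
# Minimal sketch of a unified tool interface: each vision capability is
# registered under a name, and the agent calls all of them through one
# dispatcher. Names and stubs are hypothetical.

TOOLS = {}

def tool(name):
    """Decorator that registers a function as a named tool."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("video_search")
def video_search(query):
    return [f"clip matching {query!r}"]  # stub result

@tool("summarize")
def summarize(video_id):
    return f"summary of {video_id}"  # stub result

def call_tool(name, **kwargs):
    """Single entry point the agent uses for every capability."""
    return TOOLS[name](**kwargs)

print(call_tool("summarize", video_id="cam_01"))
```

The point of the pattern is that adding a capability (snapshot retrieval, long-video summarization, and so on) only means registering another tool, not changing the agent's dispatch logic.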

Example Walkthrough

The user selects an example video and a prompt to guide the agent in generating a detailed summary. The agent splits the input video into smaller segments, which are processed in parallel by a VLM pipeline (the preview uses OpenAI's GPT-4o) to produce detailed captions describing the events of each chunk in a scalable and efficient manner. The agent then recursively summarizes the dense captions using an LLM, generating a final summary for the entire video once all chunk captions are processed.
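
The chunk-and-recursively-summarize flow described above can be sketched as follows, with stand-in functions in place of the real VLM and LLM calls; the chunk duration, batch size, and all names are assumptions for illustration only.

```python
# Illustrative sketch of the walkthrough: split the video timeline into
# chunks, caption each chunk (done in parallel in the real pipeline),
# then recursively merge captions until one summary remains.

def chunk_boundaries(video_seconds, chunk_seconds):
    """Split a video timeline into (start, end) chunk intervals."""
    return [(t, min(t + chunk_seconds, video_seconds))
            for t in range(0, video_seconds, chunk_seconds)]

def fake_vlm_caption(interval):
    """Stand-in for a VLM captioning one video chunk."""
    return f"events from {interval[0]}s to {interval[1]}s"

def fake_llm_summarize(texts):
    """Stand-in for an LLM summarizing a batch of captions."""
    return "summary[" + "; ".join(texts) + "]"

def summarize_video(video_seconds, chunk_seconds=60, batch_size=2):
    # Map step: caption every chunk.
    captions = [fake_vlm_caption(iv)
                for iv in chunk_boundaries(video_seconds, chunk_seconds)]
    # Reduce step: recursively summarize captions in batches until
    # a single summary covers the whole video.
    while len(captions) > 1:
        captions = [fake_llm_summarize(captions[i:i + batch_size])
                    for i in range(0, len(captions), batch_size)]
    return captions[0]

final_summary = summarize_video(180, chunk_seconds=60)
```

The recursion keeps each LLM call's input bounded regardless of video length, which is what lets the approach scale to very long footage.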

We also provide examples demonstrating the computer vision pipeline with object tracking, as well as audio support.

Additionally, these captions are stored in vector and graph databases to power the Q&A feature of this blueprint, allowing the user to ask any open-ended questions about the video.
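
To illustrate the retrieval idea behind the Q&A feature, the toy sketch below "embeds" captions with a bag-of-words counter and returns the caption closest to the question's embedding. A real deployment would use an embedding model and a vector (or graph) database; every name here is a hypothetical stand-in.

```python
# Toy sketch of caption retrieval for Q&A: embed stored captions, embed
# the question, and hand the nearest caption to the LLM as context.
# The bag-of-words "embedding" stands in for a real embedding model.

from collections import Counter
import math

def embed(text):
    """Stand-in embedding: word-count vector of the text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Captions produced during ingestion, stored alongside their embeddings:
captions = ["a truck parks at the dock",
            "workers unload boxes",
            "the lot is empty at night"]
index = [(c, embed(c)) for c in captions]

def retrieve(question):
    """Return the stored caption most similar to the question."""
    q = embed(question)
    return max(index, key=lambda item: cosine(q, item[1]))[0]

best = retrieve("when do workers unload the boxes")
```

In the blueprint's hybrid-RAG design, the retrieved captions (and graph relations between them) would be passed to the LLM to answer open-ended questions about the video.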

License

Use of the models in this blueprint is governed by the NVIDIA AI Foundation Models Community License.

Terms of Use

GOVERNING TERMS: This preview is governed by the NVIDIA API Trial Terms of Service.

Additional Information:

For models that include a Llama 3.1 model: Llama 3.1 Community License Agreement. Built with Llama.

For the NVIDIA Retrieval QA Llama 3.2 1B Embedding v2 and NVIDIA Retrieval QA Llama 3.2 1B Reranking v2: Llama 3.2 Community License Agreement, Built with Llama.

For https://github.com/google-research/big_vision/blob/main/LICENSE and https://github.com/01-ai/Yi/blob/main/LICENSE: Apache 2.0 license.