Try NVIDIA NIM APIs

Search Results

Searching for: video understanding

Sorting by Most Recent

nvidia cosmos-reason2-8b

Vision language model that excels in understanding the physical world using structured reasoning on videos or images.

video understanding Synthetic Data Generation autonomous vehicles industrial Physical AI vision language model reasoning robotics smart cities

nvidia Cosmos Dataset Search

Accelerate post-training of end-to-end autonomous vehicle stacks with vector search and retrieval for large video datasets.

blueprint Autonomous Vehicles data Physical AI Search Enterprise Cosmos NVIDIA AI

nvidia nemotron-nano-12b-v2-vl

Nemotron Nano 12B v2 VL enables multi-image and video understanding, along with visual Q&A and summarization capabilities.

language generation chat Image-to-Text vision assistant visual question answering

nvidia cosmos-reason1-7b

Reasoning vision language model (VLM) for physical AI and robotics.

video understanding Synthetic Data Generation autonomous vehicles industrial Physical AI vision language model reasoning robotics smart cities

nvidia cosmos-transfer1-7b

Generates physics-aware video world states for physical AI development using text prompts and multiple spatial control inputs derived from real-world data or simulation.

Synthetic Data Generation Autonomous Vehicles Physical AI robotics video-to-world

nvidia llama-3.1-nemotron-nano-vl-8b-v1

Multi-modal vision-language model that understands text and images and creates informative responses.

doc intelligence chat multiple image understanding OCR

mistralai mistral-small-3.1-24b-instruct-2503

Efficient multimodal model excelling at multilingual tasks, image understanding, and fast responses.

language generation chat multimodal image understanding

nvidia cosmos-predict1-5b

Generates future frames of a physics-aware world state from just an image or short video prompt for physical AI development.

Synthetic Data Generation Physical AI policy evaluation robotics video-to-world

microsoft phi-4-multimodal-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from image and audio inputs.

Speech Recognition Visual QA chat Language Generation Image-to-Text Chart and Table Understanding

nvidia cosmos-nemotron-34b

Multi-modal vision-language model that understands text, images, and video and creates informative responses.

VLM Vision language model image caption image to text

nvidia Build a Video Search and Summarization (VSS) Agent

Ingest massive volumes of live or archived videos and extract insights for summarization and interactive Q&A

vision video-to-text generative AI Launchable Blueprint chat Enterprise NVIDIA AI

meta llama-3.2-3b-instruct

Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation.

chat Code Generation Text-to-Text Language Generation

meta llama-3.2-1b-instruct

Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation.

chat Code Generation Text-to-Text Language Generation

rakuten rakutenai-7b-instruct

Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.

chat Text-to-Text Language Generation Large Language Models

rakuten rakutenai-7b-chat

Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.

chat Text-to-Text Language Generation Large Language Models

nvidia eyecontact

Estimate gaze angles of a person in a video and redirect to make it frontal.

telepresence Nvidia Maxine Digital Human

meta llama-3.1-70b-instruct

Powers complex conversations with superior contextual understanding, reasoning and text generation.

chat Code Generation Text-to-Text Language Generation

meta llama-3.1-8b-instruct

Advanced state-of-the-art model with language understanding, superior reasoning, and text generation.

chat Code Generation Text-to-Text Language Generation Run-on-RTX

google gemma-2-27b-it

Cutting-edge text generation model for text understanding, transformation, and code generation.

chat Code Generation Text-to-Text Language Generation

google gemma-2-9b-it

Cutting-edge text generation model for text understanding, transformation, and code generation.

chat Code Generation Text-to-Text Language Generation

nvidia ocdrnet

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition respectively.

Optical Character Recognition image Optical Character Detection cv vlm computer vision TAO Toolkit video

nvidia visual-changenet

Visual Changenet detects pixel-level changes between two images and outputs a semantic change segmentation mask.

image Image Generation cv Image Segmentation vlm computer vision TAO Toolkit video NVIDIA NIM

nvidia retail-object-detection

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

Object Detection image cv vlm computer vision TAO Toolkit video NVIDIA NIM

google paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses

image cv Vision Assistant vlm Visual Question Answering computer vision Language Generation Image-to-Text video

meta llama3-70b-instruct

Powers complex conversations with superior contextual understanding, reasoning and text generation.

chat Large Language models Code Generation Text-to-Text Language Generation

meta llama3-8b-instruct

Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.

chat Code Generation Text-to-Text Language Generation Large Language Models

google gemma-7b

Cutting-edge text generation model for text understanding, transformation, and code generation.

chat Code Generation Text-to-Text Language Generation