NVIDIA

Copyright © 2026 NVIDIA Corporation

Search Results

Searching for: video understanding
Sorting by Most Recent

nvidia / cosmos-reason2-8b

Vision language model that excels in understanding the physical world using structured reasoning on videos or images.

video understanding, Synthetic Data Generation, autonomous vehicles, industrial, Physical AI, vision language model, reasoning, robotics, smart cities

nvidia / Cosmos Dataset Search

Accelerate post-training of end-to-end autonomous vehicle stacks with vector search and retrieval for large video datasets.

blueprint, Autonomous Vehicles, data, Physical AI, Search, Enterprise, Cosmos, NVIDIA AI

nvidia / nemotron-nano-12b-v2-vl

Nemotron Nano 12B v2 VL enables multi-image and video understanding, along with visual Q&A and summarization capabilities.

language generation, chat, Image-to-Text, vision assistant, visual question answering

nvidia / cosmos-reason1-7b

Reasoning vision language model (VLM) for physical AI and robotics.

video understanding, Synthetic Data Generation, autonomous vehicles, industrial, Physical AI, vision language model, reasoning, robotics, smart cities

nvidia / cosmos-transfer1-7b

Generates physics-aware video world states for physical AI development using text prompts and multiple spatial control inputs derived from real-world data or simulation.

Synthetic Data Generation, Autonomous Vehicles, Physical AI, robotics, video-to-world

nvidia / llama-3.1-nemotron-nano-vl-8b-v1

Multi-modal vision-language model that understands text and images and creates informative responses.

doc intelligence, chat, multiple image understanding, OCR

mistralai / mistral-small-3.1-24b-instruct-2503

Efficient multimodal model excelling at multilingual tasks, image understanding, and fast responses.

language generation, chat, multimodal, image understanding

nvidia / cosmos-predict1-5b

Generates future frames of a physics-aware world state from just an image or short video prompt for physical AI development.

Synthetic Data Generation, Physical AI, policy evaluation, robotics, video-to-world

microsoft / phi-4-multimodal-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from image and audio inputs.

Speech Recognition, Visual QA, chat, Language Generation, Image-to-Text, Chart and Table Understanding

nvidia / cosmos-nemotron-34b

Multi-modal vision-language model that understands text, images, and video and creates informative responses.

VLM, Vision language model, image caption, image to text

nvidia / Build a Video Search and Summarization (VSS) Agent

Ingest massive volumes of live or archived videos and extract insights for summarization and interactive Q&A

vision, video-to-text, generative AI, Launchable, Blueprint, chat, Enterprise, NVIDIA AI

meta / llama-3.2-3b-instruct

Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation.

chat, Code Generation, Text-to-Text, Language Generation

meta / llama-3.2-1b-instruct

Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation.

chat, Code Generation, Text-to-Text, Language Generation

rakuten / rakutenai-7b-instruct

Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.

chat, Text-to-Text, Language Generation, Large Language Models

rakuten / rakutenai-7b-chat

Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.

chat, Text-to-Text, Language Generation, Large Language Models

nvidia / eyecontact

Estimate gaze angles of a person in a video and redirect to make it frontal.

telepresence, Nvidia Maxine, Digital Human

meta / llama-3.1-70b-instruct

Powers complex conversations with superior contextual understanding, reasoning and text generation.

chat, Code Generation, Text-to-Text, Language Generation

meta / llama-3.1-8b-instruct

Advanced state-of-the-art model with language understanding, superior reasoning, and text generation.

chat, Code Generation, Text-to-Text, Language Generation, Run-on-RTX

google / gemma-2-27b-it

Cutting-edge text generation model for text understanding, transformation, and code generation.

chat, Code Generation, Text-to-Text, Language Generation

google / gemma-2-9b-it

Cutting-edge text generation model for text understanding, transformation, and code generation.

chat, Code Generation, Text-to-Text, Language Generation

nvidia / ocdrnet

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition respectively.

Optical Character Recognition, image, Optical Character Detection, cv, vlm, computer vision, TAO Toolkit, video

nvidia / visual-changenet

Visual Changenet detects pixel-level change maps between two images and outputs a semantic change segmentation mask

image, Image Generation, cv, Image Segmentation, vlm, computer vision, TAO Toolkit, video, NVIDIA NIM

nvidia / retail-object-detection

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

Object Detection, image, cv, vlm, computer vision, TAO Toolkit, video, NVIDIA NIM

google / paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses

image, cv, Vision Assistant, vlm, Visual Question Answering, computer vision, Language Generation, Image-to-Text, video

meta / llama3-70b-instruct

Powers complex conversations with superior contextual understanding, reasoning and text generation.

chat, Large Language models, Code Generation, Text-to-Text, Language Generation

meta / llama3-8b-instruct

Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.

chat, Code Generation, Text-to-Text, Language Generation, Large Language Models

google / gemma-7b

Cutting-edge text generation model for text understanding, transformation, and code generation.

chat, Code Generation, Text-to-Text, Language Generation