NVIDIA
Copyright © 2025 NVIDIA Corporation

Search Results

Searching for: video understanding
Sorting by Most Recent

nvidia / cosmos-reason1-7b

Reasoning vision language model (VLM) for physical AI and robotics.

video understanding, synthetic data generation, autonomous vehicles, industrial, physical ai, vision language model, reasoning, robotics, smart cities, nvidia

nvidia / cosmos-transfer1-7b

Generates physics-aware video world states for physical AI development using text prompts and multiple spatial control inputs derived from real-world data or simulation.

synthetic data generation, autonomous vehicles, physical ai, robotics, video-to-world, nvidia

nvidia / llama-3.1-nemotron-nano-vl-8b-v1

Multi-modal vision-language model that understands text and images and creates informative responses.

doc intelligence, multiple image understanding, ocr, nvidia

mistralai / mistral-small-3.1-24b-instruct-2503

Efficient multimodal model excelling at multilingual tasks, image understanding, and fast responses.

language generation, multimodal, image understanding, mistralai

nvidia / cosmos-predict1-7b

Generalist model to generate future world state as videos from text and image prompts to create synthetic training data for robots and autonomous vehicles.

synthetic data generation, autonomous vehicles, physical ai, robotics, text-to-world, image-to-world, nvidia

nvidia / cosmos-predict1-5b

Generates future frames of a physics-aware world state from just an image or short video prompt for physical AI development.

synthetic data generation, physical ai, policy evaluation, robotics, video-to-world, nvidia

microsoft / phi-4-multimodal-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from image and audio inputs.

speech recognition, visual qa, language generation, image-to-text, chart and table understanding, microsoft

nvidia / cosmos-nemotron-34b

Multi-modal vision-language model that understands text, images, and video and creates informative responses.

vlm, vision language model, image caption, image to text, nvidia

nvidia / Build a Video Search and Summarization (VSS) Agent

Ingest massive volumes of live or archived videos and extract insights for summarization and interactive Q&A.

vision, video-to-text, generative ai, launchable, blueprint, chat, enterprise, nvidia ai, nvidia

meta / llama-3.2-3b-instruct

Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation.

chat, code generation, text-to-text, language generation, meta

meta / llama-3.2-1b-instruct

Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation.

chat, code generation, text-to-text, language generation, meta

nvidia / vila

Multi-modal vision-language model that understands text, images, and video and creates informative responses.

vlm, vision language model, image caption, image to text, nvidia

rakuten / rakutenai-7b-instruct

Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.

chat, text-to-text, language generation, large language models, rakuten

rakuten / rakutenai-7b-chat

Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.

chat, text-to-text, language generation, large language models, rakuten

nvidia / eyecontact

Estimate gaze angles of a person in a video and redirect to make it frontal.

telepresence, nvidia maxine, digital human, nvidia

meta / llama-3.1-70b-instruct

Powers complex conversations with superior contextual understanding, reasoning and text generation.

code generation, chat, text-to-text, language generation, meta

meta / llama-3.1-8b-instruct

Advanced state-of-the-art model with language understanding, superior reasoning, and text generation.

code generation, chat, text-to-text, language generation, run-on-rtx, meta

google / gemma-2-27b-it

Cutting-edge text generation model for text understanding, transformation, and code generation.

chat, code generation, text-to-text, language generation, google

google / gemma-2-9b-it

Cutting-edge text generation model for text understanding, transformation, and code generation.

chat, code generation, text-to-text, language generation, google

nvidia / ocdrnet

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition respectively.

optical character recognition, image, optical character detection, cv, vlm, computer vision, tao toolkit, video, nvidia

nvidia / visual-changenet

Visual Changenet detects pixel-level change maps between two images and outputs a semantic change segmentation mask.

image, image generation, cv, image segmentation, vlm, computer vision, tao toolkit, video, nvidia nim, nvidia

nvidia / retail-object-detection

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

object detection, image, cv, vlm, computer vision, tao toolkit, video, nvidia nim, nvidia

google / paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses.

image, cv, vision assistant, vlm, visual question answering, computer vision, language generation, image-to-text, video, google

databricks / dbrx-instruct

A general-purpose LLM with state-of-the-art performance in language understanding, coding, and RAG.

chat, text-to-text, language generation, large language models, databricks

meta / llama3-70b-instruct

Powers complex conversations with superior contextual understanding, reasoning and text generation.

chat, large language models, code generation, text-to-text, language generation, meta

meta / llama3-8b-instruct

Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.

chat, code generation, text-to-text, language generation, large language models, meta

microsoft / kosmos-2

Groundbreaking multimodal model designed to understand and reason about visual elements in images.

image, cv, multimodal, vlm, visual question answering, computer vision, image understanding, image-to-text, video, microsoft

google / deplot

Translate images of plots into tables with one-shot visual language understanding.

nemo retriever, multimodal, data ingestion, image-to-text, extraction, google

nvidia / neva-22b

Multi-modal vision-language model that understands text/images and generates informative responses.

image, cv, vision assistant, non-commercial use only, vlm, visual question answering, computer vision, image-to-text, video, nvidia

adept / fuyu-8b

Multi-modal model for a wide range of tasks, including image understanding and language generation.

image, cv, multimodal, vlm, computer vision, image understanding, language generation, image-to-text, video, adept

google / gemma-7b

Cutting-edge text generation model for text understanding, transformation, and code generation.

chat, code generation, text-to-text, language generation, google