NVIDIA
Explore Models Blueprints GPUs Docs
Terms of Use

|

Privacy Policy

|

Manage My Privacy

|

Contact

Copyright © 2025 NVIDIA Corporation

Deploy Models Now with NVIDIA NIM

Optimized inference for the world’s leading models
Free serverless APIs for developmentAccelerated by DGX Cloud
Self-Host on your GPU infrastructure
Continuous vulnerability fixes
Discover
Models
Blueprints
GPUs
Docs
Forums
models
ReasoningVisionVisual DesignRetrievalSpeechBiologySimulationClimate & WeatherSafety & Moderation
industries
AutomotiveGamingHealthcareIndustrialRobotics

Vision

Explore NVIDIA Blueprints

Comprehensive reference workflows that accelerate application development and deployment, featuring NVIDIA acceleration libraries, APIs, and microservices for AI agents, digital twins, and more.

nvidiaBuild a Video Search and Summarization (VSS) Agent

Ingest massive volumes of live or archived videos and extract insights for summarization and interactive Q&A

blueprintenterpriselaunchablenvidia aichatgenerative aivideo-to-textvision

Vision Language Models (VLM)

Multimodal models that can reason against image and video inputs and perform descriptive language generation​

Run Anywhere

metallama-3.2-90b-vision-instruct

Cutting-edge vision-Language model exceling in high-quality reasoning from images.

image-text retrievalvisual groundingvisual qaimage captioningimage-to-text
Run Anywhere

metallama-3.2-11b-vision-instruct

Cutting-edge vision-language model exceling in high-quality reasoning from images.

image captioningimage-text retrievalvisual groundingvisual qaimage-to-text
PREVIEW

nvidiavila

Multi-modal vision-language model that understands text/img/video and creates informative responses

vlmvision language modelimage captionimage to text
PREVIEW

microsoftflorence-2

Vision foundation model capable of performing diverse computer vision and vision language tasks.

language generationmultimodalvision assistantvisual question answeringcomputer visioncvimageimage classificationimage-to-textobject detectiontext-to-imagevlm
PREVIEW

nvidianv-dinov2

NV-DINOv2 is a visual foundation model that generates vector embeddings for the input image.

nvidia nimcomputer visiondeepstreamobject classificationimage-to-embedding
PREVIEW

nvidianv-grounding-dino

Grounding dino is an open vocabulary zero-shot object detection model.

nvidia nimobject detectioncomputer visiondeepstream
PREVIEW

nvidianeva-22b

Multi-modal vision-language model that understands text/images and generates informative responses

non-commercial use onlyvision assistantvisual question answeringcomputer visioncvimageimage-to-textvideovlm
PREVIEW

microsoftphi-3-vision-128k-instruct

Cutting-edge open multimodal model exceling in high-quality reasoning from images.

language generationvision assistantvisual question answeringcomputer visioncvimageimage-to-textvideovlm
PREVIEW

googlepaligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses

language generationvision assistantvisual question answeringcomputer visioncvimageimage-to-textvideovlm
PREVIEW

microsoftkosmos-2

Groundbreaking multimodal model designed to understand and reason about visual elements in images.

image understandingmultimodalvisual question answeringcomputer visioncvimageimage-to-textvideovlm
PREVIEW

adeptfuyu-8b

Multi-modal model for a wide range of tasks, including image understanding and language generation.

image understandinglanguage generationmultimodalcomputer visioncvimageimage-to-textvideovlm

Specialized Foundation Models

Computer vision models that excel at particular visual perception tasks

PREVIEW

metasam2

SAM 2 is a segmentation model that enables fast, precise selection of any object in any video or image.

computer visionmetasegmentationvideo
PREVIEW

nvidiaocdrnet

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition respectively.

optical character detectionoptical character recognitiontao toolkitcomputer visioncvimagevideovlm
PREVIEW

nvidiavisual-changenet

Visual Changenet detects pixel-level change maps between two images and outputs a semantic change segmentation mask

image segmentationnvidia nimtao toolkitcomputer visioncvimageimage generationvideovlm
Run Anywhere

nvidianvclip

NV-CLIP is a multimodal embeddings model for image and text.

computer visionnvidia nimrun on rtxmultimodal embeddingstext and image
PREVIEW

nvidiaretail-object-detection

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

nvidia nimobject detectiontao toolkitcomputer visioncvimagevideovlm