Explore Vision Models | Try NVIDIA NIM APIs

Vision

Copyright © 2025 NVIDIA Corporation

Explore NVIDIA Blueprints

Comprehensive reference workflows that accelerate application development and deployment, featuring NVIDIA acceleration libraries, APIs, and microservices for AI agents, digital twins, and more.

Enterprise

nvidia Build a Video Search and Summarization (VSS) Agent

Ingest massive volumes of live or archived videos and extract insights for summarization and interactive Q&A

Specialized Foundation Models

Computer vision models that excel at particular visual perception tasks

nvidia ocdrnet

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition, respectively.

nvidia visual-changenet

Visual ChangeNet detects pixel-level changes between two images and outputs a semantic change segmentation mask.

Run Anywhere

nvidia nvclip

NV-CLIP is a multimodal embeddings model for image and text.
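
A multimodal embeddings model is typically used by comparing a text vector against image vectors in the same space. The following is a minimal sketch of that comparison step only, assuming the embeddings have already been obtained from the model. It does not call NV-CLIP or any NVIDIA API; the function names, file names, and toy three-dimensional vectors are hypothetical and chosen purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(text_embedding, vector))
        for name, vector in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors standing in for real model output.
# Real embeddings have many more dimensions.
text_query = [1.0, 0.0, 0.0]
images = {
    "shelf.jpg": [0.9, 0.1, 0.0],
    "street.jpg": [0.0, 1.0, 0.0],
    "mixed.jpg": [0.5, 0.5, 0.0],
}

ranking = rank_images(text_query, images)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Cosine similarity is a common choice for this kind of retrieval because it ignores vector magnitude and compares direction only; whether it is the recommended metric for a given model should be checked against that model's documentation.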

nvidia retail-object-detection

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

Vision Language Models (VLM)

Multimodal models that can reason over image and video inputs and generate descriptive language

Run Anywhere

nvidia cosmos-reason1-7b

Reasoning vision language model (VLM) for physical AI and robotics.

Run Anywhere

meta llama-3.2-90b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Run Anywhere

meta llama-3.2-11b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Deprecation in 23 days

nvidia vila

Multimodal vision-language model that understands text, images, and video and generates informative responses

nvidia nv-dinov2

NV-DINOv2 is a visual foundation model that generates vector embeddings for the input image.

nvidia nv-grounding-dino

Grounding DINO is an open-vocabulary, zero-shot object detection model.

google paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses