Vision


Explore NVIDIA Blueprints

Comprehensive reference workflows that accelerate application development and deployment, featuring NVIDIA acceleration libraries, APIs, and microservices for AI agents, digital twins, and more.

Enterprise

nvidia / Build a Video Search and Summarization (VSS) Agent

Ingest massive volumes of live or archived videos and extract insights for summarization and interactive Q&A.

Tags: Blueprint, Enterprise, Launchable, NVIDIA AI, chat, generative AI, video-to-text, vision
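
For orientation, here is a minimal sketch of how an application might call a deployed VSS agent over REST to get a summary and then ask a follow-up question. The base URL, routes, and JSON fields are assumptions for illustration, not the blueprint's documented API; consult the blueprint documentation for the real interface.

import requests

# Hypothetical base URL for a self-hosted VSS deployment (assumption for illustration).
VSS_URL = "http://localhost:8100"

# Ask the agent to summarize a previously ingested video.
summary = requests.post(
    f"{VSS_URL}/summarize",  # hypothetical route
    json={"video_id": "dock-camera-2024-06-01", "prompt": "Summarize the main events."},
    timeout=300,
).json()
print(summary)

# Follow up with an interactive question about the same video.
answer = requests.post(
    f"{VSS_URL}/chat",  # hypothetical route
    json={"video_id": "dock-camera-2024-06-01", "question": "When did the forklift arrive?"},
    timeout=120,
).json()
print(answer)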

Specialized Foundation Models

Computer vision models that excel at particular visual perception tasks

nvidia / ocdrnet

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition, respectively.

Tags: Optical Character Detection, Optical Character Recognition, TAO Toolkit, computer vision, cv, image, video, vlm
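
The two models are meant to be chained: OCDNet localizes text regions, and OCRNet reads each cropped region. A minimal sketch of that pipeline follows; detect_text_boxes and read_text are placeholders for whatever inference wrapper (TAO, DeepStream, or a NIM endpoint) actually serves the models.

from PIL import Image

def detect_text_boxes(image):
    """Placeholder for OCDNet inference: returns a list of (x0, y0, x1, y1) boxes."""
    raise NotImplementedError

def read_text(crop):
    """Placeholder for OCRNet inference: returns the decoded string for one crop."""
    raise NotImplementedError

def run_ocr(path):
    image = Image.open(path).convert("RGB")
    results = []
    # Stage 1: optical character detection -- find where the text is.
    for box in detect_text_boxes(image):
        # Stage 2: optical character recognition -- read what the text says.
        crop = image.crop(box)
        results.append((box, read_text(crop)))
    return results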

nvidia / visual-changenet

Visual Changenet detects pixel-level change maps between two images and outputs a semantic change segmentation mask.

Tags: Image Segmentation, NVIDIA NIM, TAO Toolkit, computer vision, cv, image, Image Generation, video, vlm
Run Anywhere

nvidia / nvclip

NV-CLIP is a multimodal embeddings model for image and text.

Tags: Computer vision, Run-on-rtx, multimodal embeddings, text and image
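
Because text and images land in the same embedding space, NV-CLIP can score how well a caption matches a picture. A minimal sketch, assuming an OpenAI-style /v1/embeddings route and the request fields shown here (the model identifier, data-URL image format, and response shape are assumptions; check the model card for the actual hosted API):

import base64
import math
import requests

API_KEY = "nvapi-..."  # your NVIDIA API key
URL = "https://integrate.api.nvidia.com/v1/embeddings"  # assumed hosted endpoint

def embed(inputs):
    resp = requests.post(
        URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "nvidia/nvclip", "input": inputs, "encoding_format": "float"},
        timeout=60,
    )
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

with open("shelf.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# One image and one caption in a single batch; passing the image as a data URL is an assumption.
img_vec, txt_vec = embed([f"data:image/jpeg;base64,{image_b64}", "an empty retail shelf"])
print("similarity:", cosine(img_vec, txt_vec))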

nvidia / retail-object-detection

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

Tags: NVIDIA NIM, Object Detection, TAO Toolkit, computer vision, cv, image, video, vlm

Vision Language Models (VLM)

Multimodal models that can reason over image and video inputs and perform descriptive language generation

nvidia / cosmos-reason2-8b

Vision language model that excels in understanding the physical world using structured reasoning on videos or images.

Tags: Physical AI, autonomous vehicles, industrial, reasoning, robotics, smart cities, Synthetic Data Generation, video understanding, vision language model
Run Anywhere

meta / llama-3.2-90b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Tags: Image-Text Retrieval, Visual Grounding, Visual QA, image captioning, Image-to-Text
Run Anywhere

meta / llama-3.2-11b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Tags: Image Captioning, Image-Text Retrieval, Visual Grounding, Visual QA, Image-to-Text
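
The vision-instruct models above (and the other VLMs in this section) are typically called through a chat-completions style API that accepts an image alongside the text prompt. A minimal sketch, assuming an OpenAI-compatible endpoint and that the image can be passed as a base64 data URL in an image_url content part; the exact payload accepted by the hosted models may differ, so treat this as illustrative:

import base64
import requests

API_KEY = "nvapi-..."  # your NVIDIA API key
URL = "https://integrate.api.nvidia.com/v1/chat/completions"  # assumed hosted endpoint

with open("chart.png", "rb") as f:
    image_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

payload = {
    "model": "meta/llama-3.2-11b-vision-instruct",
    "messages": [
        {
            "role": "user",
            # OpenAI-style multimodal content parts; an assumption about the accepted format.
            "content": [
                {"type": "text", "text": "Describe the trend shown in this chart."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
    "max_tokens": 256,
}

resp = requests.post(URL, headers={"Authorization": f"Bearer {API_KEY}"}, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])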

nvidia / nv-dinov2

NV-DINOv2 is a visual foundation model that generates vector embeddings for the input image.

Tags: NVIDIA NIM, computer vision, deepstream, Object Classification, Image-to-Embedding

nvidia / nv-grounding-dino

Grounding DINO is an open-vocabulary, zero-shot object detection model.

Tags: NVIDIA NIM, Object Detection, computer vision, deepstream
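
Open-vocabulary here means the classes are supplied at inference time as a text prompt rather than fixed at training time. A minimal sketch of that interaction; the endpoint, field names, and response shape are assumptions for illustration rather than the NIM's documented schema:

import base64
import requests

API_KEY = "nvapi-..."  # your NVIDIA API key
URL = "https://example.invalid/v1/grounding-dino/infer"  # hypothetical endpoint

with open("warehouse.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Zero-shot: no retraining -- just name the categories you want detected.
payload = {
    "image": image_b64,
    "prompt": "a forklift. a pallet. a person wearing a safety vest.",
    "confidence_threshold": 0.3,
}

resp = requests.post(URL, headers={"Authorization": f"Bearer {API_KEY}"}, json=payload, timeout=60)
resp.raise_for_status()
for det in resp.json().get("detections", []):  # assumed response shape
    print(det["label"], det["score"], det["bbox"])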

google / paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses.

Tags: Language Generation, Vision Assistant, Visual Question Answering, computer vision, cv, image, Image-to-Text, video, vlm