Copyright © 2025 NVIDIA Corporation

Models

Deploy and scale models on your GPU infrastructure of choice with NVIDIA NIM inference microservices.

nvidia/nemotron-nano-12b-v2-vl

Nemotron Nano 12B v2 VL enables multi-image and video understanding, along with visual Q&A and summarization capabilities.

language generation, chat, Image-to-Text, vision assistant, visual question answering

google/gemma-3n-e4b-it

An edge-computing AI model that accepts text, audio, and image input, ideal for resource-constrained environments.

language generation, speech recognition, Visual QA, chat

google/gemma-3n-e2b-it

An edge-computing AI model that accepts text, audio, and image input, ideal for resource-constrained environments.

language generation, speech recognition, Visual QA, chat

mistralai/mistral-medium-3-instruct

Powerful, multimodal language model designed for enterprise applications, including software development, data analysis, and reasoning.

language generation, chat, Image-to-Text, multimodal, visual question answering

meta/llama-4-maverick-17b-128e-instruct

A general-purpose multimodal, multilingual mixture-of-experts (MoE) model with 128 experts and 17B active parameters.

language generation, chat, Image-to-Text, vision assistant, visual question answering

meta/llama-4-scout-17b-16e-instruct

A multimodal, multilingual mixture-of-experts (MoE) model with 16 experts and 17B active parameters.

language generation, chat, Image-to-Text, vision assistant, visual question answering

google/gemma-3-27b-it

Cutting-edge open multimodal model excelling in high-quality reasoning from images.

Vision Assistant, chat, Visual Question Answering, Language Generation, Image-to-Text

microsoft/phi-4-multimodal-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from image and audio inputs.

Speech Recognition, Visual QA, chat, Language Generation, Image-to-Text, Chart and Table Understanding

meta/llama-3.2-11b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Image-Text Retrieval, Visual QA, chat, Image-to-Text, Image Captioning, Visual Grounding

meta/llama-3.2-90b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Image-Text Retrieval, Visual QA, image captioning, chat, Image-to-Text, Visual Grounding

microsoft/phi-3.5-vision-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from images.

Vision Assistant, Visual Question Answering, Language Generation, Image-to-Text

nvidia/nv-dinov2

NV-DINOv2 is a visual foundation model that generates vector embeddings for the input image.

Image-to-Embedding, computer vision, deepstream, NVIDIA NIM, object Classification

nvidia/nv-grounding-dino

Grounding DINO is an open-vocabulary, zero-shot object detection model.

Object Detection, computer vision, deepstream, NVIDIA NIM

nvidia/usdvalidate

Verify compatibility of OpenUSD assets with instant RTX render and rule-based validation.

Validation, OpenUSD, Synthetic Data Generation, Digital Twin, USD, Visualization 3D

nvidia/visual-changenet

Visual ChangeNet detects pixel-level changes between two images and outputs a semantic change segmentation mask.

image, Image Generation, cv, Image Segmentation, vlm, computer vision, TAO Toolkit, video, NVIDIA NIM

google/paligemma

Vision-language model adept at comprehending text and visual inputs to produce informative responses.

image, cv, Vision Assistant, vlm, Visual Question Answering, computer vision, Language Generation, Image-to-Text, video