Models
Deploy and scale models on your GPU infrastructure of choice with NVIDIA NIM inference microservices
Optimized by NVIDIA
13 models
Sorted by: Most Recent
NVIDIA
cosmos-reason2-8b
Vision language model that excels in understanding the physical world using structured reasoning on videos or images.
video understanding
+8
205K
2mo
NVIDIA
nemotron-parse
Cutting-edge vision-language model excelling in retrieving text and metadata from images.
text and table extraction
+2
448K
4mo
NVIDIA
cosmos-reason1-7b
Reasoning vision language model (VLM) for physical AI and robotics.
video understanding
+8
15.93K
6mo
NVIDIA
llama-3.1-nemotron-nano-vl-8b-v1
Multimodal vision-language model that understands text and images and creates informative responses.
doc intelligence
+3
7.51M
8mo
Meta
llama-4-maverick-17b-128e-instruct
A general-purpose multimodal, multilingual mixture-of-experts (MoE) model with 128 experts and 17B active parameters.
language generation
+4
3.01M
7mo
Meta
llama-4-scout-17b-16e-instruct
A multimodal, multilingual mixture-of-experts (MoE) model with 16 experts and 17B active parameters.
language generation
+4
210K
7mo
Google
gemma-3-27b-it
Cutting-edge open multimodal model excelling in high-quality reasoning from images.
Vision Assistant
+4
5.48M
9mo
NVIDIA
nemoretriever-parse
Cutting-edge vision-language model excelling in retrieving text and metadata from images.
optical character recognition
+4
295K
9mo
NVIDIA
cosmos-nemotron-34b
Multimodal vision-language model that understands text, images, and video and creates informative responses.
VLM
+3
6
1y
Meta
llama-3.2-11b-vision-instruct
Cutting-edge vision-language model excelling in high-quality reasoning from images.
Image-Text Retrieval
+5
676K
9mo
Meta
llama-3.2-90b-vision-instruct
Cutting-edge vision-language model excelling in high-quality reasoning from images.
Image-Text Retrieval
+5
579K
9mo
Microsoft
phi-3.5-vision-instruct
Cutting-edge open multimodal model excelling in high-quality reasoning from images.
Vision Assistant
+3
521K
1y
Google
paligemma
Vision-language model adept at comprehending text and visual inputs to produce informative responses.
image
+8
330K
1y