Try NVIDIA NIM APIs

13 results for

Meta

Downloadable

llama-3.2-11b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Model

Image-Text Retrieval

1.56M

Meta

Downloadable

llama-3.2-90b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Model

Image-Text Retrieval

2.22M

Google

Free Endpoint

paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses

Model

image

15.84K

NVIDIA

Downloadable

ising-calibration-1-35b-a3b

Open VLM for quantum computer calibration chart understanding across a range of qubit modalities.

Model

Quantum

352K

1mo

Meta

Free Endpoint

llama-4-maverick-17b-128e-instruct

A general-purpose multimodal, multilingual MoE model with 128 experts and 17B active parameters.

Model

language generation

26.82M

10mo

NVIDIA

Downloadable

nemotron-nano-12b-v2-vl

Nemotron Nano 12B v2 VL enables multi-image and video understanding, along with visual Q&A and summarization capabilities.

Model

language generation

2.89M

7mo

NVIDIA

Downloadable

nvclip

NV-CLIP is a multimodal embeddings model for image and text.

Model

Computer vision

11mo

Stepfun-ai

Downloadable, Free Endpoint

step-3.7-flash

A sparse MoE multimodal reasoning model good for enterprise, agentic and coding tasks.

Model

B200

NVIDIA

Downloadable

cosmos-reason2-8b

Vision language model that excels in understanding the physical world using structured reasoning on videos or images.

Model

B200

505K

5mo

NVIDIA

Downloadable

llama-3.1-nemotron-nano-vl-8b-v1

Multimodal vision-language model that understands text and images and creates informative responses.

Model

doc intelligence

8.58M

11mo

NVIDIA

Downloadable

nemoretriever-parse

Cutting-edge vision-language model excelling in retrieving text and metadata from images.

Model

optical character recognition

125K

11mo

NVIDIA

Downloadable

nemotron-parse

Cutting-edge vision-language model excelling in retrieving text and metadata from images.

Model

text and table extraction

293K

7mo

Qwen

Downloadable

qwen3.5-397b-a17b

Next-gen Qwen 3.5 VLM (400B MoE) brings advanced vision, chat, RAG, and agentic capabilities.

Model

MoE

12.01M

3mo