Try NVIDIA NIM APIs

Copyright © 2026 NVIDIA Corporation

16 results for

Labels (1)

Image-to-Text

University at Buffalo

API Endpoint

cached

Context-aware chart extraction that can detect 18 classes of basic chart elements, excluding plot elements.

737

1y

API Endpoint

cosmos-nemotron-34b

Multimodal vision-language model that understands text, images, and video and creates informative responses.

6

1y

API Endpoint

gemma-3-27b-it

Cutting-edge open multimodal model excelling in high-quality reasoning from images.

5.79M

9mo

Downloadable

kimi-k2.5

1T multimodal MoE for high‑capacity video and image understanding with efficient inference.

22.84M

1mo

Downloadable

llama-3.2-11b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

844K

9mo

Downloadable

llama-3.2-90b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

624K

9mo

API Endpoint

llama-4-maverick-17b-128e-instruct

A general-purpose multimodal, multilingual 128-expert MoE model with 17B parameters.

3.25M

7mo

Downloadable, API Endpoint

llama-4-scout-17b-16e-instruct

A multimodal, multilingual 16-expert MoE model with 17B parameters.

156K

7mo

API Endpoint

ministral-14b-instruct-2512

A general-purpose VLM ideal for chat and instruction-based use cases.

4.67M

3mo

API Endpoint

mistral-large-3-675b-instruct-2512

A state-of-the-art general-purpose MoE VLM ideal for chat, agentic, and instruction-based use cases.

6.69M

3mo

API Endpoint

mistral-medium-3-instruct

Powerful, multimodal language model designed for enterprise applications, including software development, data analysis, and reasoning.

5.28M

8mo

Downloadable

nemotron-nano-12b-v2-vl

Nemotron Nano 12B v2 VL enables multi-image and video understanding, along with visual Q&A and summarization capabilities.

1.4M

4mo

API Endpoint

paligemma

Vision-language model adept at comprehending text and visual inputs to produce informative responses.

335K

1y

API Endpoint

phi-3.5-vision-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from images.

Vision Assistant

592K

1y

API Endpoint

phi-4-multimodal-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from image and audio inputs.

Speech Recognition

532K

9mo

API Endpoint

qwen3.5-122b-a10b

122B MoE LLM (10B active) for coding, reasoning, multimodal chat. Agent-ready.

1.49M

1w
