Try NVIDIA NIM APIs

16 results for

Labels (1)

Image-to-Text

University at Buffalo

cached

Context-aware chart extraction that can detect 18 classes for chart basic elements, excluding plot elements.

Model

nemo retriever

738

NVIDIA

cosmos-nemotron-34b

Multimodal vision-language model that understands text, images, and video and creates informative responses.

Model

VLM

Google

gemma-3-27b-it

Cutting-edge open multimodal model excelling in high-quality reasoning from images.

Model

Vision Assistant

5.33M

9mo

Moonshotai

kimi-k2.5

1T multimodal MoE for high‑capacity video and image understanding with efficient inference.

Model

Multimodal

19.83M

1mo

Meta

llama-3.2-11b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Model

Image-Text Retrieval

617K

9mo

Meta

llama-3.2-90b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Model

Image-Text Retrieval

568K

9mo

Meta

llama-4-maverick-17b-128e-instruct

A general-purpose multimodal, multilingual MoE model with 128 experts and 17B active parameters.

Model

language generation

2.75M

7mo

Meta

llama-4-scout-17b-16e-instruct

A multimodal, multilingual MoE model with 16 experts and 17B active parameters.

Model

language generation

263K

7mo

Mistral AI

ministral-14b-instruct-2512

A general-purpose VLM ideal for chat and instruction-based use cases.

Model

language generation

3.6M

3mo

Mistral AI

mistral-large-3-675b-instruct-2512

A state-of-the-art general-purpose MoE VLM ideal for chat, agentic, and instruction-based use cases.

Model

language generation

4.89M

3mo

Mistral AI

mistral-medium-3-instruct

Powerful, multimodal language model designed for enterprise applications, including software development, data analysis, and reasoning.

Model

language generation

3.69M

7mo

NVIDIA

nemotron-nano-12b-v2-vl

Nemotron Nano 12B v2 VL enables multi-image and video understanding, along with visual Q&A and summarization capabilities.

Model

language generation

1.55M

4mo

Google

paligemma

Vision-language model adept at comprehending text and visual inputs to produce informative responses.

Model

image

327K

Microsoft

phi-3.5-vision-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from images.

Model

Vision Assistant

451K

Microsoft

phi-4-multimodal-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from image and audio inputs.

Model

Speech Recognition

385K

9mo

Qwen

qwen3.5-122b-a10b

122B MoE LLM (10B active) for coding, reasoning, multimodal chat. Agent-ready.

Model

tool calling
