14 results for

DGX Spark

1 HR

Vision-Language Model Fine-tuning

Fine-tune Vision-Language Models for image and video understanding tasks using Qwen2.5-VL and InternVL3

Playbook

DGX

9mo

Meta

Downloadable, Free Endpoint

llama-3.2-11b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Model

Image-Text Retrieval

Meta

Downloadable, Free Endpoint

llama-3.2-90b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Model

Image-Text Retrieval

NVIDIA

Downloadable, Free Endpoint

ising-calibration-1-35b-a3b

Open VLM for quantum computer calibration chart understanding across a range of qubit modalities.

Model

Quantum

332K

2mo

NVIDIA

Downloadable

cosmos-reason2-8b

Vision language model that excels in understanding the physical world using structured reasoning on videos or images.

Model

video understanding

191K

6mo

NVIDIA

Downloadable, Free Endpoint

cosmos3-nano-reasoner

Vision language model that excels in understanding the physical world using structured reasoning on videos or images.

Model

video understanding

1mo

Google

Free Endpoint

paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses.

Model

image

10K

Meta

Free Endpoint

llama-4-maverick-17b-128e-instruct

A general-purpose multimodal, multilingual MoE model with 128 experts and 17B active parameters.

Model

language generation

20M

11mo

DGX Spark

20 MIN

Live VLM WebUI

Real-time Vision Language Model interaction with webcam streaming

Playbook

Vision AI

6mo

NVIDIA

Downloadable

nemoretriever-parse

Cutting-edge vision-language model excelling in retrieving text and metadata from images.

Model

optical character recognition

86K

NVIDIA

Downloadable

nemotron-parse

Cutting-edge vision-language model excelling in retrieving text and metadata from images.

Model

text and table extraction

218K

8mo

NVIDIA

Downloadable, Free Endpoint

llama-3.1-nemotron-nano-vl-8b-v1

Multimodal vision-language model that understands text and images and generates informative responses.

Model

doc intelligence

10M

Minimaxai

Free Endpoint

minimax-m3

MiniMax M3 Preview is a multimodal MoE vision-language model with strong reasoning, coding, and tool-calling capabilities.

Model

coding

10M

29d

CLIP vision-language model for image-text retrieval, zero-shot classification, embedding extraction, ONNX export, and TensorRT deployment. Use when fine-tuning or training CLIP, running zero-shot classification, computing image embeddings, or deploying CL

Skill

AI Engineer

28d