Try NVIDIA NIM APIs

Copyright © 2026 NVIDIA Corporation

4 results for

Filters

Free Endpoint (2), Partner Endpoint (2), Download Available (2)

Use Case: Image-to-Text (2)

Inference Providers: Deepinfra (2), Together AI (1)

Publisher: Meta (2), NVIDIA (2)

Audience: AI Engineer (2), Data Scientist (2), Developer (2), ML Engineer (2)

Domain: AI And Machine Learning (2)

Library: TAO Toolkit (2)

Two-step image grounding pipeline: extracts referring expressions from (image, caption) pairs and grounds them to pixel-space bounding boxes via a VLM. Use when the user wants to ground captions to bboxes, generate phrase-grounded annotations, auto-label

1K

1mo

Downloadable, Free Endpoint

llama-3.2-11b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Image-Text Retrieval

3M

1y

Downloadable, Free Endpoint

llama-3.2-90b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Image-Text Retrieval

4M

1y

Four-step image referring-expression pipeline: turns images plus KITTI bounding-box labels into region descriptions, scene captions, grounded referring expressions, and (optionally) verified expressions via VLM distillation. Use when the user wants to gen

1K

1mo