Try NVIDIA NIM APIs

⌘KCtrl+K

Your Privacy Choices

Contact

Explore

⌘KCtrl+K

32 results for

Filters

Free Endpoint

Partner Endpoint

Download Available

Use Case

Image Generation

Image-to-Text

Text-to-Image

Synthetic Data Generation

Optical Character Recognition

Inference Providers

Deep Infra

Bitdeer AI

Together AI

GMI Cloud

Vultr

Publisher

NVIDIA

Black forest labs

Qwen

Google

qwen-image

Qwen-Image is a text-to-image foundation model with advanced multilingual text rendering.

Model

Text-to-Image

1mo

Items per page

of 2 pages

Qwen

Downloadable

qwen-image-edit

Qwen-Image-Edit is an image editing model with multilingual text editing and strong subject consistency.

Model

Text-to-Image

1mo

llama-3.2-11b-vision-instruct

Cutting-edge vision-language model exceling in high-quality reasoning from images.

Model

Image-Text Retrieval

1.71M

Qwen

DownloadableFree Endpoint

qwen3.5-397b-a17b

Next-gen Qwen 3.5 VLM (400B MoE) brings advanced vision, chat, RAG, and agentic capabilities.

Model

MoE

12.46M

3mo

llama-3.2-90b-vision-instruct

Cutting-edge vision-Language model exceling in high-quality reasoning from images.

Model

Image-Text Retrieval

2.8M

NVIDIA

Free Endpoint

cosmos3-nano

Generates physics-aware videos from text prompts or an image prompt for physical AI development.

Model

autonomous vehicles

1.58K

Black-forest-labs

Downloadable

flux.2-klein-4b

FLUX.2-klein-4B is a distilled image generation and editing model, producing outputs at lighting speed

Model

image editing

270K

2mo

Microsoft

Downloadable

TRELLIS

MSFT TRELLIS is a 3D AI model that generates high-quality 3D assets from text or image inputs.

Model

text-to-3d

3.99K

9mo

NVIDIA

DownloadableFree Endpoint

llama-3.1-nemotron-nano-vl-8b-v1

Multi-modal vision-language model that understands text/img and creates informative responses

Model

doc intelligence

9.93M

11mo

Mistral AI

DownloadableFree Endpoint

mistral-small-4-119b-2603

Hybrid MoE model unifying instruct, reasoning, and coding with multimodal input and 256k context

Model

code generation

16.19M

2mo

Google

Free Endpoint

paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses

Model

image

10.29K

NVIDIA

Downloadable

vista-3d

VISTA-3D is a specialized interactive foundation model for segmenting and anotating human anatomies.

Model

Interactive Annotation

803

Qwen

DownloadableFree Endpoint

qwen3.5-122b-a10b

122B MoE LLM (10B active) for coding, reasoning, multimodal chat. Agent-ready.

Model

B200

10.34M

3mo

Baidu

Downloadable

paddleocr

Model for table extraction that receives an image as input, runs OCR on the image, and returns the text within the image and its bounding boxes.

Model

Optical Character Recognition

415K

11mo

Black-forest-labs

Downloadable

FLUX.1-dev

FLUX.1 is a state-of-the-art suite of image generation models

Model

Text-to-Image

257K

Black-forest-labs

Downloadable

FLUX.1-Kontext-dev

FLUX.1 Kontext is a multimodal model that enables in-context image generation and editing.

Model

Text-to-Image

3.63K

9mo

Black-forest-labs

Downloadable

FLUX.1-schnell

FLUX.1-schnell is a distilled image generation model, producing high quality images at fast speeds

Model

Text-to-Image

265K

Moonshotai

DownloadableFree Endpoint

kimi-k2.6

1T multimodal MoE for long-horizon coding, agentic tool use, and image/video understanding.

Model

B200

6.78M

1mo

NVIDIA

Downloadable

nemoretriever-ocr

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

Model

Table Extraction

8.97K

10mo

NVIDIA

Downloadable

nemotron-ocr-v1

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

Model

Table Extraction

351K

2mo

Microsoft

Free Endpoint

phi-4-multimodal-instruct

Cutting-edge open multimodal model exceling in high-quality reasoning from image and audio inputs.

Model

Speech Recognition

269K

Stability AI

Downloadable

stable-diffusion-3.5-large

Stable Diffusion 3.5 is a popular text-to-image generation model

Model

Text-to-Image

9mo

Google

Free Endpoint

gemma-3n-e2b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

Model

language generation

43.86M

10mo

Google

Free Endpoint

gemma-3n-e4b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

Model

language generation

3.74M

10mo