Try NVIDIA NIM APIs

google gemma-3n-e4b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

language generation speech recognition visual qa chat google

google gemma-3n-e2b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

language generation speech recognition visual qa chat google

nvidia llama-3.2-nemoretriever-1b-vlm-embed-v1

Multimodal question-answer retrieval representing user queries as text and documents as images.

nemo retriever embedding retrieval augmented generation text-to-embedding nvidia

nvidia llama-3.1-nemotron-nano-vl-8b-v1

Multi-modal vision-language model that understands text/img and creates informative responses

doc intelligence multiple image understanding ocr nvidia

black-forest-labs FLUX.1-schnell

FLUX.1-schnell is a distilled image generation model, producing high quality images at fast speeds

image generation text-to-image run-on-rtx black-forest-labs

mistralai mistral-small-3.1-24b-instruct-2503

Efficient multimodal model excelling at multilingual tasks, image understanding, and fast-responses

language generation multimodal image understanding mistralai

black-forest-labs FLUX.1-dev

FLUX.1 is a state-of-the-art suite of image generation models

image generation text-to-image run-on-rtx black-forest-labs

nvidia cosmos-predict1-7b

Generalist model to generate future world state as videos from text and image prompts to create synthetic training data for robots and autonomous vehicles.

synthetic data generation autonomous vehicles physical ai robotics text-to-world image-to-world nvidia

nvidia cosmos-predict1-5b

Generates future frames of a physics-aware world state based on simply an image or short video prompt for physical AI development.

synthetic data generation physical ai policy evaluation robotics video-to-world nvidia

nvidia nv-embedcode-7b-v1

The NV-EmbedCode model is a 7B Mistral-based embedding model optimized for code retrieval, supporting text, code, and hybrid queries.

nemo retriever embedding retrieval augmented generation nvidia

microsoft phi-4-multimodal-instruct

Cutting-edge open multimodal model exceling in high-quality reasoning from image and audio inputs.

speech recognition visual qa language generation image-to-text chart and table understanding microsoft

nvidia cosmos-nemotron-34b

Multi-modal vision-language model that understands text/img/video and creates informative responses

vlm vision language model image caption image to text nvidia

meta sam2

SAM 2 is a segmentation model that enables fast, precise selection of any object in any video or image.

meta computer vision segmentation video

nvidia llama-3.2-nv-embedqa-1b-v2

Multilingual and cross-lingual text question-answering retrieval with long context support and optimized data storage efficiency.

nemo retriever run-on-rtx embedding retrieval augmented generation text-to-embedding nvidia

nvidia llama-3.2-nv-rerankqa-1b-v2

Fine-tuned reranking model for multilingual, cross-lingual text question-answering retrieval, with long context support.

nemo retriever retrieval augmented generation reranking nvidia

university-at-buffalo cached

Context-aware chart extraction that can detect 18 classes for chart basic elements, excluding plot elements.

nemo retriever chart element detection image-to-text university-at-buffalo

baidu paddleocr

Model for table extraction that receives an image as input, runs OCR on the image, and returns the text within the image and its bounding boxes.

optical character recognition table extraction optical character detection nemo retriever data ingestion run-on-rtx extraction baidu

hive deepfake-image-detection

Advanced AI model detects faces and identifies deep fake images.

computer vision ai safety deep fake detection content moderation hive

meta llama-3.2-11b-vision-instruct

Cutting-edge vision-language model exceling in high-quality reasoning from images.

image-text retrieval visual qa image-to-text image captioning visual grounding meta

meta llama-3.2-90b-vision-instruct

Cutting-edge vision-Language model exceling in high-quality reasoning from images.

image-text retrieval visual qa image captioning image-to-text visual grounding meta

nvidia vila

Multi-modal vision-language model that understands text/img/video and creates informative responses

vlm vision language model image caption image to text nvidia

hive ai-generated-image-detection

Robust image classification model for detecting and managing AI-generated content.

image classification computer vision ai safety content moderation hive

nvidia nv-dinov2

NV-DINOv2 is a visual foundation model that generates vector embeddings for the input image.

image-to-embedding computer vision deepstream nvidia nim object classification nvidia

briaai BRIA-2.3

An enterprise-grade text-to-image model trained on a compliant dataset produces high quality images.

image generation text-to-image briaai

microsoft florence-2

Vision foundation model capable of performing diverse computer vision and vision language tasks.

image classification image object detection cv multimodal vision assistant vlm visual question answering computer vision language generation image-to-text text-to-image microsoft

nvidia usdsearch

AI-powered search for OpenUSD data, 3D models, images, and assets using text or image-based inputs.

openusd synthetic data generation digital twin usd text-to-3d nvidia nim nvidia

nvidia nv-embedqa-e5-v5

English text embedding model for question-answering retrieval.

embedding retrieval augmented generation nemo retriever text-to-embedding nvidia

nvidia nv-embedqa-mistral-7b-v2

Multilingual text question-answering retrieval, transforming textual information into dense vector representations.

nemo retriever embedding retrieval augmented generation nvidia

nvidia maisi

MAISI is a pre-trained volumetric (3D) CT Latent Diffusion Generative Model.

image generation medical imaging nvidia nim nvidia

nvidia nvclip

NV-CLIP is a multimodal embeddings model for image and text.

computer vision multimodal embeddings text and image nvidia nim run-on-rtx nvidia

stabilityai stable-diffusion-3-medium

Advanced text-to-image model for generating high quality images

image generation text-to-image stabilityai

nvidia ocdrnet

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition respectively.

optical character recognition image optical character detection cv vlm computer vision tao toolkit video nvidia

baai bge-m3

Embedding model for text retrieval tasks, excelling in dense, multi-vector, and sparse retrieval.

embeddings retrieval augmented generation text-to-embedding baai

nvidia visual-changenet

Visual Changenet detects pixel-level change maps between two images and outputs a semantic change segmentation mask

image image generation cv image segmentation vlm computer vision tao toolkit video nvidia nim nvidia

nvidia retail-object-detection

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

object detection image cv vlm computer vision tao toolkit video nvidia nim nvidia

microsoft phi-3-vision-128k-instruct

Cutting-edge open multimodal model exceling in high-quality reasoning from images.

image cv vision assistant vlm visual question answering computer vision language generation image-to-text video microsoft

google paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses

image cv vision assistant vlm visual question answering computer vision language generation image-to-text video google

nvidia embed-qa-4

GPU-accelerated generation of text embeddings used for question-answering retrieval.

embeddings retrieval augmented generation text-to-embedding nvidia

nvidia rerank-qa-mistral-4b

GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.

ranking retrieval augmented generation nvidia

microsoft kosmos-2

Groundbreaking multimodal model designed to understand and reason about visual elements in images.

image cv multimodal vlm visual question answering computer vision image understanding image-to-text video microsoft

nvidia neva-22b

Multi-modal vision-language model that understands text/images and generates informative responses

image cv vision assistant non-commercial use only vlm visual question answering computer vision image-to-text video nvidia

adept fuyu-8b

Multi-modal model for a wide range of tasks, including image understanding and language generation.

image cv multimodal vlm computer vision image understanding language generation image-to-text video adept

nvidia vista-3d

VISTA-3D is a specialized interactive foundation model for segmenting and anotating human anatomies.

interactive annotation image segmentation non-commercial use only medical imaging nvidia

stabilityai stable-video-diffusion

Stable Video Diffusion (SVD) is a generative diffusion model that leverages a single image as a conditioning frame to synthesize video sequences.

image generation text-to-image stabilityai

stabilityai sdxl-turbo

A fast generative text-to-image model that can synthesize photorealistic images from a text prompt in a single network evaluation

image generation text-to-image stabilityai