Try NVIDIA NIM APIs

Explore

Models

Blueprints

GPUs

Docs

Your Privacy Choices

Contact

Search Results

Searching for: Image Understanding

Sorting by Most Recent

microsoft TRELLIS

MSFT TRELLIS is a 3D AI model that generates high-quality 3D assets from text or image inputs.

stabilityai stable-diffusion-3.5-large

Stable Diffusion 3.5 is a popular text-to-image generation model

black-forest-labs FLUX.1-Kontext-dev

FLUX.1 Kontext is a multimodal model that enables in-context image generation and editing.

nvidia cosmos-reason1-7b

Reasoning vision language model (VLM) for physical AI and robotics.

nvidia nemoretriever-ocr-v1

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

nvidia nemoretriever-ocr

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

google gemma-3n-e4b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

google gemma-3n-e2b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

nvidia llama-3.2-nemoretriever-1b-vlm-embed-v1

Multimodal question-answer retrieval representing user queries as text and documents as images.

nvidia llama-3.1-nemotron-nano-vl-8b-v1

Multi-modal vision-language model that understands text/img and creates informative responses

black-forest-labs FLUX.1-schnell

FLUX.1-schnell is a distilled image generation model, producing high quality images at fast speeds

mistralai mistral-small-3.1-24b-instruct-2503

Efficient multimodal model excelling at multilingual tasks, image understanding, and fast-responses

nvidia 3D Guided Generative AI

Create high quality images using Flux.1 in ComfyUI, guided by 3D.

black-forest-labs FLUX.1-dev

FLUX.1 is a state-of-the-art suite of image generation models

nvidia Synthetic Manipulation Motion Generation for Robotics

Generate exponentially large amounts of synthetic motion trajectories for robot manipulation from just a few human demonstrations.

nvidia cosmos-predict1-5b

Generates future frames of a physics-aware world state based on simply an image or short video prompt for physical AI development.

google gemma-3-27b-it

Cutting-edge open multimodal model exceling in high-quality reasoning from images.

nvidia nemoretriever-parse

Cutting-edge vision-language model exceling in retrieving text and metadata from images.

microsoft phi-4-multimodal-instruct

Cutting-edge open multimodal model exceling in high-quality reasoning from image and audio inputs.

nvidia cosmos-nemotron-34b

Multi-modal vision-language model that understands text/img/video and creates informative responses

university-at-buffalo cached

Context-aware chart extraction that can detect 18 classes for chart basic elements, excluding plot elements.

baidu paddleocr

Model for table extraction that receives an image as input, runs OCR on the image, and returns the text within the image and its bounding boxes.

hive deepfake-image-detection

Advanced AI model detects faces and identifies deep fake images.

meta llama-3.2-3b-instruct

Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation.

meta llama-3.2-11b-vision-instruct

Cutting-edge vision-language model exceling in high-quality reasoning from images.

meta llama-3.2-90b-vision-instruct

Cutting-edge vision-Language model exceling in high-quality reasoning from images.

meta llama-3.2-1b-instruct

Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation.

nvidia vila

Multi-modal vision-language model that understands text/img/video and creates informative responses

hive ai-generated-image-detection

Robust image classification model for detecting and managing AI-generated content.

microsoft phi-3.5-vision-instruct

Cutting-edge open multimodal model exceling in high-quality reasoning from images.

nvidia nv-dinov2

NV-DINOv2 is a visual foundation model that generates vector embeddings for the input image.

rakuten rakutenai-7b-instruct

Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.

rakuten rakutenai-7b-chat

Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.

nvidia usdsearch

AI-powered search for OpenUSD data, 3D models, images, and assets using text or image-based inputs.

meta llama-3.1-70b-instruct

Powers complex conversations with superior contextual understanding, reasoning and text generation.

meta llama-3.1-8b-instruct

Advanced state-of-the-art model with language understanding, superior reasoning, and text generation.

nvidia maisi

MAISI is a pre-trained volumetric (3D) CT Latent Diffusion Generative Model.

google gemma-2-27b-it

Cutting-edge text generation model text understanding, transformation, and code generation.

google gemma-2-9b-it

Cutting-edge text generation model text understanding, transformation, and code generation.

nvidia nvclip

NV-CLIP is a multimodal embeddings model for image and text.

stabilityai stable-diffusion-3-medium

Advanced text-to-image model for generating high quality images

nvidia ocdrnet

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition respectively.

nvidia visual-changenet

Visual Changenet detects pixel-level change maps between two images and outputs a semantic change segmentation mask

nvidia retail-object-detection

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

google paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses

meta llama3-70b-instruct

Powers complex conversations with superior contextual understanding, reasoning and text generation.

meta llama3-8b-instruct

Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.

nvidia vista-3d

VISTA-3D is a specialized interactive foundation model for segmenting and anotating human anatomies.

google gemma-7b

Cutting-edge text generation model text understanding, transformation, and code generation.