NVIDIA

Copyright © 2025 NVIDIA Corporation

Search Results

Searching for: image caption
Sorting by Most Recent

nvidia / nemotron-parse

Cutting-edge vision-language model excelling in retrieving text and metadata from images.

text and table extraction, document parsing, supported language - english

nvidia / nemotron-nano-12b-v2-vl

Nemotron Nano 12B v2 VL enables multi-image and video understanding, along with visual Q&A and summarization capabilities.

language generation, chat, Image-to-Text, vision assistant, visual question answering

microsoft / TRELLIS

MSFT TRELLIS is a 3D AI model that generates high-quality 3D assets from text or image inputs.

text-to-3d, Run-on-RTX, image-to-3d

stabilityai / stable-diffusion-3.5-large

Stable Diffusion 3.5 is a popular text-to-image generation model.

Image Generation, Text-to-Image

black-forest-labs / FLUX.1-Kontext-dev

FLUX.1 Kontext is a multimodal model that enables in-context image generation and editing.

Image Generation, Text-to-Image, Run-on-RTX

nvidia / nemoretriever-ocr-v1

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

Optical Character Recognition, Table Extraction, nemo retriever, data ingestion, extraction

nvidia / nemoretriever-ocr

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

Optical Character Recognition, Table Extraction, nemo retriever, data ingestion, extraction

google / gemma-3n-e4b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments.

language generation, speech recognition, Visual QA, chat

google / gemma-3n-e2b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments.

language generation, speech recognition, Visual QA, chat

nvidia / llama-3.2-nemoretriever-1b-vlm-embed-v1

Multimodal question-answer retrieval representing user queries as text and documents as images.

nemo retriever, embedding, Retrieval Augmented Generation, Text-to-Embedding

nvidia / llama-3.1-nemotron-nano-vl-8b-v1

Multimodal vision-language model that understands text and images and generates informative responses.

doc intelligence, chat, multiple image understanding, OCR

black-forest-labs / FLUX.1-schnell

FLUX.1-schnell is a distilled image generation model, producing high-quality images at fast speeds.

Image Generation, Text-to-Image, Run-on-RTX

mistralai / mistral-small-3.1-24b-instruct-2503

Efficient multimodal model excelling at multilingual tasks, image understanding, and fast responses.

language generation, chat, multimodal, image understanding

nvidia / 3D Guided Generative AI

Create high-quality images using FLUX.1 in ComfyUI, guided by 3D.

Blueprint, Run-on-RTX, NVIDIA AI

black-forest-labs / FLUX.1-dev

FLUX.1 is a state-of-the-art suite of image generation models.

Image Generation, Text-to-Image, Run-on-RTX

nvidia / Synthetic Manipulation Motion Generation for Robotics

Generate exponentially large amounts of synthetic motion trajectories for robot manipulation from just a few human demonstrations.

NVIDIA Omniverse, Blueprint, synthetic data, Enterprise, robotics, physical ai, robot learning, Humanoids, NVIDIA Isaac GR00T, text-to-world, image-to-world, teleop

nvidia / cosmos-predict1-5b

Generates future frames of a physics-aware world state from just an image or short video prompt, for physical AI development.

Synthetic Data Generation, Physical AI, policy evaluation, robotics, video-to-world

google / gemma-3-27b-it

Cutting-edge open multimodal model excelling in high-quality reasoning from images.

Vision Assistant, chat, Visual Question Answering, Language Generation, Image-to-Text

nvidia / nemoretriever-parse

Cutting-edge vision-language model excelling in retrieving text and metadata from images.

optical character recognition, nemo retriever, data ingestion, table extraction, supported language - english

microsoft / phi-4-multimodal-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from image and audio inputs.

Speech Recognition, Visual QA, chat, Language Generation, Image-to-Text, Chart and Table Understanding

nvidia / cosmos-nemotron-34b

Multimodal vision-language model that understands text, images, and video and generates informative responses.

VLM, Vision language model, image caption, image to text

university-at-buffalo / cached

Context-aware chart extraction that can detect 18 classes of basic chart elements, excluding plot elements.

nemo retriever, Chart Element Detection, Image-To-Text

baidu / paddleocr

Model for table extraction that receives an image as input, runs OCR on the image, and returns the text within the image and its bounding boxes.

Optical Character Recognition, Table Extraction, Optical Character Detection, nemo retriever, data ingestion, run-on-rtx, extraction

hive / deepfake-image-detection

Advanced AI model that detects faces and identifies deepfake images.

computer vision, AI safety, deep fake detection, Content moderation

meta / llama-3.2-11b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Image-Text Retrieval, Visual QA, chat, Image-to-Text, Image Captioning, Visual Grounding

meta / llama-3.2-90b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Image-Text Retrieval, Visual QA, image captioning, chat, Image-to-Text, Visual Grounding

nvidia / vila

Multimodal vision-language model that understands text, images, and video and generates informative responses.

VLM, Vision language model, image caption, image to text

hive / ai-generated-image-detection

Robust image classification model for detecting and managing AI-generated content.

image classification, computer vision, AI safety, Content moderation

microsoft / phi-3.5-vision-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from images.

Vision Assistant, Visual Question Answering, Language Generation, Image-to-Text

nvidia / nv-dinov2

NV-DINOv2 is a visual foundation model that generates vector embeddings for the input image.

Image-to-Embedding, computer vision, deepstream, NVIDIA NIM, object Classification

nvidia / usdsearch

AI-powered search for OpenUSD data, 3D models, images, and assets using text or image-based inputs.

OpenUSD, Synthetic Data Generation, Digital Twin, USD, Text-to-3D

nvidia / maisi

MAISI is a pre-trained volumetric (3D) CT Latent Diffusion Generative Model.

Image Generation, Medical Imaging, NVIDIA NIM

nvidia / nvclip

NV-CLIP is a multimodal embeddings model for image and text.

Computer vision, multimodal embeddings, text and image, Run-on-rtx

stabilityai / stable-diffusion-3-medium

Advanced text-to-image model for generating high-quality images.

Image Generation, Text-to-Image

nvidia / ocdrnet

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition respectively.

Optical Character Recognition, image, Optical Character Detection, cv, vlm, computer vision, TAO Toolkit, video

nvidia / visual-changenet

Visual ChangeNet detects pixel-level change maps between two images and outputs a semantic change segmentation mask.

image, Image Generation, cv, Image Segmentation, vlm, computer vision, TAO Toolkit, video, NVIDIA NIM

nvidia / retail-object-detection

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

Object Detection, image, cv, vlm, computer vision, TAO Toolkit, video, NVIDIA NIM

google / paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses.

image, cv, Vision Assistant, vlm, Visual Question Answering, computer vision, Language Generation, Image-to-Text, video

nvidia / vista-3d

VISTA-3D is a specialized interactive foundation model for segmenting and annotating human anatomies.

Interactive Annotation, Image Segmentation, Non-Commercial Use Only, Medical Imaging