Try NVIDIA NIM APIs

Search Results

Searching for: Image-Text Retrieval

Sorting by Most Recent

stabilityai stable-diffusion-3.5-large

Stable Diffusion 3.5 is a popular text-to-image generation model

image generation text-to-image stabilityai

black-forest-labs FLUX.1-Kontext-dev

FLUX.1 Kontext is a multimodal model that enables in-context image generation and editing.

image generation text-to-image run-on-rtx black-forest-labs

nvidia nemoretriever-ocr-v1

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

optical character recognition table extraction nemo retriever data ingestion extraction nvidia

nvidia llama-3_2-nemoretriever-300m-embed-v1

Multilingual, cross-lingual embedding model for long-document QA retrieval, supporting 26 languages.

retrieval augmented generation text-to-embedding nemo retriever nvidia

nvidia nemoretriever-ocr

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

optical character recognition table extraction nemo retriever data ingestion extraction nvidia

google gemma-3n-e4b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

language generation speech recognition visual qa chat google

google gemma-3n-e2b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

language generation speech recognition visual qa chat google

nvidia llama-3.2-nemoretriever-1b-vlm-embed-v1

Multimodal question-answer retrieval representing user queries as text and documents as images.

nemo retriever embedding retrieval augmented generation text-to-embedding nvidia

nvidia Biomedical AI-Q Research Agent Blueprint

Build advanced AI agents within the biomedical domain using the AI-Q Blueprint and the BioNeMo Virtual Screening Blueprint

launchable agent blueprint blueprint retrieval-augmented generation llm nvidia

nvidia llama-3.1-nemotron-nano-vl-8b-v1

Multi-modal vision-language model that understands text and images and creates informative responses

doc intelligence multiple image understanding ocr nvidia

black-forest-labs FLUX.1-schnell

FLUX.1-schnell is a distilled image generation model, producing high quality images at fast speeds

image generation text-to-image run-on-rtx black-forest-labs

mistralai mistral-small-3.1-24b-instruct-2503

Efficient multimodal model excelling at multilingual tasks, image understanding, and fast responses

language generation multimodal image understanding mistralai

black-forest-labs FLUX.1-dev

FLUX.1 is a state-of-the-art suite of image generation models

image generation text-to-image run-on-rtx black-forest-labs

nvidia Build an AI Agent for Enterprise Research

Build a custom deep researcher powered by state-of-the-art models that continuously process and synthesize multimodal enterprise data, enabling reasoning, planning, and refinement to generate comprehensive reports.

nim launchable llama nemotron reasoning blueprint enterprise retrieval-augmented generation nvidia ai nemo retriever nvidia

nvidia Synthetic Manipulation Motion Generation for Robotics

Generate exponentially large amounts of synthetic motion trajectories for robot manipulation from just a few human demonstrations.

nvidia omniverse blueprint synthetic data enterprise robotics physical ai robot learning humanoids nvidia isaac gr00t text-to-world image-to-world teleop nvidia

nvidia cosmos-predict1-7b

Generalist model to generate future world state as videos from text and image prompts to create synthetic training data for robots and autonomous vehicles.

synthetic data generation autonomous vehicles physical ai robotics text-to-world image-to-world nvidia

nvidia cosmos-predict1-5b

Generates future frames of a physics-aware world state from just an image or short video prompt for physical AI development.

synthetic data generation physical ai policy evaluation robotics video-to-world nvidia

nvidia nv-embedcode-7b-v1

The NV-EmbedCode model is a 7B Mistral-based embedding model optimized for code retrieval, supporting text, code, and hybrid queries.

nemo retriever embedding retrieval augmented generation nvidia

microsoft phi-4-multimodal-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from image and audio inputs.

speech recognition visual qa language generation image-to-text chart and table understanding microsoft

nvidia Build an Enterprise RAG pipeline

Continuously extract, embed, and index multimodal data for fast, accurate semantic search. Built on world-class NeMo Retriever models, the RAG blueprint connects AI applications to multimodal enterprise data wherever it resides.

nim launchable blueprint enterprise retrieval-augmented generation nvidia ai nemo retriever nvidia

nvidia cosmos-nemotron-34b

Multi-modal vision-language model that understands text, images, and video and creates informative responses

vlm vision language model image caption image to text nvidia

nvidia llama-3.2-nv-embedqa-1b-v2

Multilingual and cross-lingual text question-answering retrieval with long context support and optimized data storage efficiency.

nemo retriever run-on-rtx embedding retrieval augmented generation text-to-embedding nvidia

nvidia llama-3.2-nv-rerankqa-1b-v2

Fine-tuned reranking model for multilingual, cross-lingual text question-answering retrieval, with long context support.

nemo retriever retrieval augmented generation reranking nvidia

university-at-buffalo cached

Context-aware chart extraction that can detect 18 classes of basic chart elements, excluding plot elements.

nemo retriever chart element detection image-to-text university-at-buffalo

baidu paddleocr

Model for table extraction that receives an image as input, runs OCR on the image, and returns the text within the image and its bounding boxes.

optical character recognition table extraction optical character detection nemo retriever data ingestion run-on-rtx extraction baidu

hive deepfake-image-detection

Advanced AI model that detects faces and identifies deepfake images.

computer vision ai safety deep fake detection content moderation hive

nvidia Build an AI Virtual Assistant

Create intelligent virtual assistants for customer service across every industry

customer service launchable blueprint retrieval-augmented generation llm contact center nvidia ai nvidia

meta llama-3.2-11b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

image-text retrieval visual qa image-to-text image captioning visual grounding meta

meta llama-3.2-90b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

image-text retrieval visual qa image captioning image-to-text visual grounding meta

nvidia vila

Multi-modal vision-language model that understands text, images, and video and creates informative responses

vlm vision language model image caption image to text nvidia

hive ai-generated-image-detection

Robust image classification model for detecting and managing AI-generated content.

image classification computer vision ai safety content moderation hive

nvidia nv-dinov2

NV-DINOv2 is a visual foundation model that generates vector embeddings for the input image.

image-to-embedding computer vision deepstream nvidia nim object classification nvidia

microsoft florence-2

Vision foundation model capable of performing diverse computer vision and vision language tasks.

image classification image object detection cv multimodal vision assistant vlm visual question answering computer vision language generation image-to-text text-to-image microsoft

nvidia usdsearch

AI-powered search for OpenUSD data, 3D models, images, and assets using text or image-based inputs.

openusd synthetic data generation digital twin usd text-to-3d nvidia nim nvidia

nvidia nv-embedqa-e5-v5

English text embedding model for question-answering retrieval.

embedding retrieval augmented generation nemo retriever text-to-embedding nvidia

nvidia nv-embedqa-mistral-7b-v2

Multilingual text question-answering retrieval, transforming textual information into dense vector representations.

nemo retriever embedding retrieval augmented generation nvidia

nvidia maisi

MAISI is a pre-trained volumetric (3D) CT Latent Diffusion Generative Model.

image generation medical imaging nvidia nim nvidia

nvidia nvclip

NV-CLIP is a multimodal embeddings model for image and text.

computer vision multimodal embeddings text and image nvidia nim run-on-rtx nvidia

stabilityai stable-diffusion-3-medium

Advanced text-to-image model for generating high quality images

image generation text-to-image stabilityai

nvidia ocdrnet

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition respectively.

optical character recognition image optical character detection cv vlm computer vision tao toolkit video nvidia

baai bge-m3

Embedding model for text retrieval tasks, excelling in dense, multi-vector, and sparse retrieval.

embeddings retrieval augmented generation text-to-embedding baai

nvidia visual-changenet

Visual Changenet detects pixel-level change maps between two images and outputs a semantic change segmentation mask

image image generation cv image segmentation vlm computer vision tao toolkit video nvidia nim nvidia

nvidia retail-object-detection

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

object detection image cv vlm computer vision tao toolkit video nvidia nim nvidia

google paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses

image cv vision assistant vlm visual question answering computer vision language generation image-to-text video google

nvidia rerank-qa-mistral-4b

GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.

ranking retrieval augmented generation nvidia

microsoft kosmos-2

Groundbreaking multimodal model designed to understand and reason about visual elements in images.

image cv multimodal vlm visual question answering computer vision image understanding image-to-text video microsoft

nvidia neva-22b

Multi-modal vision-language model that understands text/images and generates informative responses

image cv vision assistant non-commercial use only vlm visual question answering computer vision image-to-text video nvidia

adept fuyu-8b

Multi-modal model for a wide range of tasks, including image understanding and language generation.

image cv multimodal vlm computer vision image understanding language generation image-to-text video adept

nvidia vista-3d

VISTA-3D is a specialized interactive foundation model for segmenting and annotating human anatomies.

interactive annotation image segmentation non-commercial use only medical imaging nvidia