NVIDIA
Copyright © 2025 NVIDIA Corporation

Search Results

Searching for: Image-Text Retrieval
Sorting by Most Recent

stabilityai / stable-diffusion-3.5-large

Stable Diffusion 3.5 is a popular text-to-image generation model

image generation, text-to-image, stabilityai

black-forest-labs / FLUX.1-Kontext-dev

FLUX.1 Kontext is a multimodal model that enables in-context image generation and editing.

image generation, text-to-image, run-on-rtx, black-forest-labs

nvidia / nemoretriever-ocr-v1

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

optical character recognition, table extraction, nemo retriever, data ingestion, extraction, nvidia

nvidia / llama-3_2-nemoretriever-300m-embed-v1

Multilingual, cross-lingual embedding model for long-document QA retrieval, supporting 26 languages.

retrieval augmented generation, text-to-embedding, nemo retriever, nvidia

nvidia / nemoretriever-ocr

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

optical character recognition, table extraction, nemo retriever, data ingestion, extraction, nvidia

google / gemma-3n-e4b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

language generation, speech recognition, visual qa, chat, google

google / gemma-3n-e2b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

language generation, speech recognition, visual qa, chat, google

nvidia / llama-3.2-nemoretriever-1b-vlm-embed-v1

Multimodal question-answer retrieval representing user queries as text and documents as images.

nemo retriever, embedding, retrieval augmented generation, text-to-embedding, nvidia

nvidia / Biomedical AI-Q Research Agent Blueprint

Build advanced AI agents within the biomedical domain using the AI-Q Blueprint and the BioNeMo Virtual Screening Blueprint

launchable, agent blueprint, blueprint, retrieval-augmented generation, llm, nvidia

nvidia / llama-3.1-nemotron-nano-vl-8b-v1

Multi-modal vision-language model that understands text/img and creates informative responses

doc intelligence, multiple image understanding, ocr, nvidia

black-forest-labs / FLUX.1-schnell

FLUX.1-schnell is a distilled image generation model, producing high quality images at fast speeds

image generation, text-to-image, run-on-rtx, black-forest-labs

mistralai / mistral-small-3.1-24b-instruct-2503

Efficient multimodal model excelling at multilingual tasks, image understanding, and fast responses

language generation, multimodal, image understanding, mistralai

black-forest-labs / FLUX.1-dev

FLUX.1 is a state-of-the-art suite of image generation models

image generation, text-to-image, run-on-rtx, black-forest-labs

nvidia / Build an AI Agent for Enterprise Research

Build a custom deep researcher powered by state-of-the-art models that continuously process and synthesize multimodal enterprise data, enabling reasoning, planning, and refinement to generate comprehensive reports.

nim, launchable, llama nemotron, reasoning, blueprint, enterprise, retrieval-augmented generation, nvidia ai, nemo retriever, nvidia

nvidia / Synthetic Manipulation Motion Generation for Robotics

Generate exponentially large amounts of synthetic motion trajectories for robot manipulation from just a few human demonstrations.

nvidia omniverse, blueprint, synthetic data, enterprise, robotics, physical ai, robot learning, humanoids, nvidia isaac gr00t, text-to-world, image-to-world, teleop, nvidia

nvidia / cosmos-predict1-7b

Generalist model to generate future world state as videos from text and image prompts to create synthetic training data for robots and autonomous vehicles.

synthetic data generation, autonomous vehicles, physical ai, robotics, text-to-world, image-to-world, nvidia

nvidia / cosmos-predict1-5b

Generates future frames of a physics-aware world state from just an image or short video prompt for physical AI development.

synthetic data generation, physical ai, policy evaluation, robotics, video-to-world, nvidia

nvidia / nv-embedcode-7b-v1

The NV-EmbedCode model is a 7B Mistral-based embedding model optimized for code retrieval, supporting text, code, and hybrid queries.

nemo retriever, embedding, retrieval augmented generation, nvidia

microsoft / phi-4-multimodal-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from image and audio inputs.

speech recognition, visual qa, language generation, image-to-text, chart and table understanding, microsoft

nvidia / Build an Enterprise RAG pipeline

Continuously extract, embed, and index multimodal data for fast, accurate semantic search. Built on world-class NeMo Retriever models, the RAG blueprint connects AI applications to multimodal enterprise data wherever it resides.

nim, launchable, blueprint, enterprise, retrieval-augmented generation, nvidia ai, nemo retriever, nvidia

nvidia / cosmos-nemotron-34b

Multi-modal vision-language model that understands text/img/video and creates informative responses

vlm, vision language model, image caption, image to text, nvidia

nvidia / llama-3.2-nv-embedqa-1b-v2

Multilingual and cross-lingual text question-answering retrieval with long context support and optimized data storage efficiency.

nemo retriever, run-on-rtx, embedding, retrieval augmented generation, text-to-embedding, nvidia

nvidia / llama-3.2-nv-rerankqa-1b-v2

Fine-tuned reranking model for multilingual, cross-lingual text question-answering retrieval, with long context support.

nemo retriever, retrieval augmented generation, reranking, nvidia

university-at-buffalo / cached

Context-aware chart extraction that can detect 18 classes for chart basic elements, excluding plot elements.

nemo retriever, chart element detection, image-to-text, university-at-buffalo

baidu / paddleocr

Model for table extraction that receives an image as input, runs OCR on the image, and returns the text within the image and its bounding boxes.

optical character recognition, table extraction, optical character detection, nemo retriever, data ingestion, run-on-rtx, extraction, baidu

hive / deepfake-image-detection

Advanced AI model that detects faces and identifies deepfake images.

computer vision, ai safety, deep fake detection, content moderation, hive

nvidia / Build an AI Virtual Assistant

Create intelligent virtual assistants for customer service across every industry

customer service, launchable, blueprint, retrieval-augmented generation, llm, contact center, nvidia ai, nvidia

meta / llama-3.2-11b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

image-text retrieval, visual qa, image-to-text, image captioning, visual grounding, meta

meta / llama-3.2-90b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

image-text retrieval, visual qa, image captioning, image-to-text, visual grounding, meta

nvidia / vila

Multi-modal vision-language model that understands text/img/video and creates informative responses

vlm, vision language model, image caption, image to text, nvidia

hive / ai-generated-image-detection

Robust image classification model for detecting and managing AI-generated content.

image classification, computer vision, ai safety, content moderation, hive

nvidia / nv-dinov2

NV-DINOv2 is a visual foundation model that generates vector embeddings for the input image.

image-to-embedding, computer vision, deepstream, nvidia nim, object classification, nvidia

microsoft / florence-2

Vision foundation model capable of performing diverse computer vision and vision language tasks.

image classification, image, object detection, cv, multimodal, vision assistant, vlm, visual question answering, computer vision, language generation, image-to-text, text-to-image, microsoft

nvidia / usdsearch

AI-powered search for OpenUSD data, 3D models, images, and assets using text or image-based inputs.

openusd, synthetic data generation, digital twin, usd, text-to-3d, nvidia nim, nvidia

nvidia / nv-embedqa-e5-v5

English text embedding model for question-answering retrieval.

embedding, retrieval augmented generation, nemo retriever, text-to-embedding, nvidia

nvidia / nv-embedqa-mistral-7b-v2

Multilingual text question-answering retrieval, transforming textual information into dense vector representations.

nemo retriever, embedding, retrieval augmented generation, nvidia

nvidia / maisi

MAISI is a pre-trained volumetric (3D) CT Latent Diffusion Generative Model.

image generation, medical imaging, nvidia nim, nvidia

nvidia / nvclip

NV-CLIP is a multimodal embeddings model for image and text.

computer vision, multimodal embeddings, text and image, nvidia nim, run-on-rtx, nvidia

stabilityai / stable-diffusion-3-medium

Advanced text-to-image model for generating high quality images

image generation, text-to-image, stabilityai

nvidia / ocdrnet

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition, respectively.

optical character recognition, image, optical character detection, cv, vlm, computer vision, tao toolkit, video, nvidia

baai / bge-m3

Embedding model for text retrieval tasks, excelling in dense, multi-vector, and sparse retrieval.

embeddings, retrieval augmented generation, text-to-embedding, baai

nvidia / visual-changenet

Visual Changenet detects pixel-level change maps between two images and outputs a semantic change segmentation mask

image, image generation, cv, image segmentation, vlm, computer vision, tao toolkit, video, nvidia nim, nvidia

nvidia / retail-object-detection

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

object detection, image, cv, vlm, computer vision, tao toolkit, video, nvidia nim, nvidia

google / paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses

image, cv, vision assistant, vlm, visual question answering, computer vision, language generation, image-to-text, video, google

nvidia / rerank-qa-mistral-4b

GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.

ranking, retrieval augmented generation, nvidia

microsoft / kosmos-2

Groundbreaking multimodal model designed to understand and reason about visual elements in images.

image, cv, multimodal, vlm, visual question answering, computer vision, image understanding, image-to-text, video, microsoft

nvidia / neva-22b

Multi-modal vision-language model that understands text/images and generates informative responses

image, cv, vision assistant, non-commercial use only, vlm, visual question answering, computer vision, image-to-text, video, nvidia

adept / fuyu-8b

Multi-modal model for a wide range of tasks, including image understanding and language generation.

image, cv, multimodal, vlm, computer vision, image understanding, language generation, image-to-text, video, adept

nvidia / vista-3d

VISTA-3D is a specialized interactive foundation model for segmenting and annotating human anatomies.

interactive annotation, image segmentation, non-commercial use only, medical imaging, nvidia