NVIDIA
Explore
Models
Blueprints
GPUs
Docs
Terms of Use
Privacy Policy
Your Privacy Choices
Contact

Copyright © 2025 NVIDIA Corporation

Models

Deploy and scale models on your GPU infrastructure of choice with NVIDIA NIM inference microservices
Publisher
Use Case
NIM Type
Sorting by Most Recent

nvidiallama-3_2-nemoretriever-300m-embed-v2

Multilingual, cross-lingual embedding model for long-document QA retrieval, supporting 26 languages.

microsoftTRELLIS

MSFT TRELLIS is a 3D AI model that generates high-quality 3D assets from text or image inputs.

stabilityaistable-diffusion-3.5-large

Stable Diffusion 3.5 is a popular text-to-image generation model

black-forest-labsFLUX.1-Kontext-dev

FLUX.1 Kontext is a multimodal model that enables in-context image generation and editing.

nvidianemoretriever-ocr-v1

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

nvidiallama-3_2-nemoretriever-300m-embed-v1

Multilingual, cross-lingual embedding model for long-document QA retrieval, supporting 26 languages.

nvidianemoretriever-ocr

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

googlegemma-3n-e4b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

googlegemma-3n-e2b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

nvidiallama-3.2-nemoretriever-1b-vlm-embed-v1

Multimodal question-answer retrieval representing user queries as text and documents as images.

nvidiallama-3.1-nemotron-nano-vl-8b-v1

Multi-modal vision-language model that understands text/img and creates informative responses

black-forest-labsFLUX.1-schnell

FLUX.1-schnell is a distilled image generation model, producing high quality images at fast speeds

mistralaimistral-small-3.1-24b-instruct-2503

Efficient multimodal model excelling at multilingual tasks, image understanding, and fast-responses

black-forest-labsFLUX.1-dev

FLUX.1 is a state-of-the-art suite of image generation models

nvidiacosmos-predict1-5b

Generates future frames of a physics-aware world state based on simply an image or short video prompt for physical AI development.

nvidianv-embedcode-7b-v1

The NV-EmbedCode model is a 7B Mistral-based embedding model optimized for code retrieval, supporting text, code, and hybrid queries.

microsoftphi-4-multimodal-instruct

Cutting-edge open multimodal model exceling in high-quality reasoning from image and audio inputs.

nvidiacosmos-nemotron-34b

Multi-modal vision-language model that understands text/img/video and creates informative responses

nvidiallama-3.2-nv-embedqa-1b-v2

Multilingual and cross-lingual text question-answering retrieval with long context support and optimized data storage efficiency.

nvidiallama-3.2-nv-rerankqa-1b-v2

Fine-tuned reranking model for multilingual, cross-lingual text question-answering retrieval, with long context support.

university-at-buffalocached

Context-aware chart extraction that can detect 18 classes for chart basic elements, excluding plot elements.

baidupaddleocr

Model for table extraction that receives an image as input, runs OCR on the image, and returns the text within the image and its bounding boxes.

hivedeepfake-image-detection

Advanced AI model detects faces and identifies deep fake images.

metallama-3.2-11b-vision-instruct

Cutting-edge vision-language model exceling in high-quality reasoning from images.

metallama-3.2-90b-vision-instruct

Cutting-edge vision-Language model exceling in high-quality reasoning from images.

nvidiavila

Multi-modal vision-language model that understands text/img/video and creates informative responses

hiveai-generated-image-detection

Robust image classification model for detecting and managing AI-generated content.

nvidianv-dinov2

NV-DINOv2 is a visual foundation model that generates vector embeddings for the input image.

nvidiausdsearch

AI-powered search for OpenUSD data, 3D models, images, and assets using text or image-based inputs.

nvidianv-embedqa-e5-v5

English text embedding model for question-answering retrieval.

nvidianv-embedqa-mistral-7b-v2

Multilingual text question-answering retrieval, transforming textual information into dense vector representations.

nvidiamaisi

MAISI is a pre-trained volumetric (3D) CT Latent Diffusion Generative Model.

nvidianvclip

NV-CLIP is a multimodal embeddings model for image and text.

stabilityaistable-diffusion-3-medium

Advanced text-to-image model for generating high quality images

nvidiaocdrnet

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition respectively.

baaibge-m3

Embedding model for text retrieval tasks, excelling in dense, multi-vector, and sparse retrieval.

nvidiavisual-changenet

Visual Changenet detects pixel-level change maps between two images and outputs a semantic change segmentation mask

nvidiaretail-object-detection

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

googlepaligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses

nvidiarerank-qa-mistral-4b

GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.

nvidiavista-3d

VISTA-3D is a specialized interactive foundation model for segmenting and anotating human anatomies.