
Vision-language model that excels at understanding the physical world through structured reasoning over videos and images.

Cutting-edge vision-language model excelling at retrieving text and metadata from images.

Nemotron Nano 12B v2 VL enables multi-image and video understanding, along with visual Q&A and summarization capabilities.

Elevate shopping experiences online and in stores.

Reasoning vision-language model (VLM) for physical AI and robotics.

Ingest massive volumes of live or archived video and extract insights for summarization and interactive Q&A.

A general-purpose multimodal, multilingual mixture-of-experts (MoE) model with 128 experts and 17B active parameters.

Build a custom enterprise research assistant powered by state-of-the-art models that process and synthesize multimodal data, enabling reasoning, planning, and refinement to generate comprehensive reports.

A multimodal, multilingual mixture-of-experts (MoE) model with 16 experts and 17B active parameters.

Create intelligent virtual assistants for customer service across every industry.

Multimodal vision-language model that understands text and images and generates informative responses.

Cutting-edge vision-language model excelling at high-quality reasoning from images.

Cutting-edge open multimodal model excelling at high-quality reasoning from images.

Grounding DINO is an open-vocabulary, zero-shot object detection model.

Advanced AI model that detects faces and identifies deepfake images.

Robust image classification model for detecting and managing AI-generated content.

Multimodal vision-language model that understands text, images, and video and generates informative responses.

Visual ChangeNet detects pixel-level changes between two images and outputs a semantic change segmentation mask.

EfficientDet-based object detection network that detects 100 specific retail objects in an input video.

State-of-the-art small language model delivering superior accuracy for chatbots, virtual assistants, and content generation.