NVIDIA
Copyright © 2025 NVIDIA Corporation

microsoft / kosmos-2

PREVIEW

Groundbreaking multimodal model designed to understand and reason about visual elements in images.

Tags: image understanding, multimodal, visual question answering, computer vision, cv, image, image-to-text, video, vlm
Accelerated by DGX Cloud

Model Overview

Description:

Kosmos-2 is a multimodal large language model (MLLM) designed to ground text in the visual world, which lets it understand and reason about visual elements in images.

Terms of use

By using this model, you are agreeing to the terms and conditions of the license, acceptable use policy and Microsoft Research privacy policy.

Reference(s):

  • KOSMOS-2 paper

Model Architecture:

Architecture Type: Transformer
Network Architecture: GPT + CLIP

Input:

Input Format: Red, Green, Blue (RGB) Image + Text
Input Parameters: Temperature, TopP
Other Properties Related to Input: None
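
A minimal sketch of how the inputs listed above might be packaged into a single request body. The field names (`prompt`, `image`, `temperature`, `top_p`, `max_tokens`), the base64 image encoding, and the default values are assumptions made for illustration; this card does not document the request schema, so check the API reference for the actual one.

```python
import base64
import json


def build_request(image_bytes: bytes, prompt: str,
                  temperature: float = 0.2, top_p: float = 0.2,
                  max_tokens: int = 1024) -> str:
    """Serialize an image + text request as JSON.

    The schema here is hypothetical; only the parameter names
    Temperature, TopP, and max output tokens come from the card.
    """
    # Generic sampling sanity checks, not limits documented for this model.
    if temperature < 0.0:
        raise ValueError("temperature must be non-negative")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")

    payload = {
        "prompt": prompt,
        # Assumed transport encoding for the RGB image bytes.
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }
    return json.dumps(payload)
```

Lower Temperature and TopP values generally make the generated text more deterministic, which tends to suit grounding and question-answering prompts.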

Output:

Output Format: Text
Output Parameters: Max output tokens, Bounding boxes
Other Properties Related to Output: None
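
The card lists bounding boxes among the outputs but does not say how they are encoded. Assuming, purely for illustration, that each box arrives as normalized corner coordinates `(x1, y1, x2, y2)` in the range 0 to 1, the sketch below maps a box back onto the pixel grid of the original image. The `PixelBox` type and `to_pixels` helper are hypothetical names, not part of any documented API.

```python
from typing import NamedTuple, Sequence


class PixelBox(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def to_pixels(box: Sequence[float], width: int, height: int) -> PixelBox:
    """Scale an assumed normalized (x1, y1, x2, y2) box to pixel coordinates."""
    x1, y1, x2, y2 = box

    def clamp(value: float) -> float:
        # Keep slightly out-of-range predictions inside the image.
        return min(max(value, 0.0), 1.0)

    return PixelBox(
        left=round(clamp(x1) * width),
        top=round(clamp(y1) * height),
        right=round(clamp(x2) * width),
        bottom=round(clamp(y2) * height),
    )
```

For a 640 x 480 image, `to_pixels((0.25, 0.5, 0.75, 1.0), 640, 480)` yields `PixelBox(left=160, top=240, right=480, bottom=480)`. If the service returns boxes in another form, such as absolute pixels or discrete location tokens, only the scaling step would need to change.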

Supported Operating System(s):

Linux

Inference:

Engine: Triton
Test Hardware: Other