microsoft / kosmos-2

PREVIEW

Groundbreaking multimodal model designed to understand and reason about visual elements in images.

Model Overview

Description:

Kosmos-2 is a multimodal large language model (MLLM) designed to ground text in the visual world, which lets it understand and reason about visual elements in images.

Terms of use

By using this model, you agree to the terms and conditions of the license, the acceptable use policy, and the Microsoft Research privacy policy.

Reference(s):

Model Architecture:

Architecture Type: Transformer
Network Architecture: GPT + CLIP

Input:

Input Format: Red, Green, Blue (RGB) Image + Text
Input Parameters: Temperature, TopP
Other Properties Related to Input: None
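As an illustration of the input fields above, the following is a minimal sketch of how an RGB image and a text prompt might be packaged together with the Temperature and TopP parameters. The function name `build_request` and the field names `prompt`, `image_b64`, `temperature`, and `top_p` are hypothetical and are not taken from this model card; the actual request schema of the serving endpoint may differ.

```python
import base64
import json


def build_request(image_bytes: bytes, prompt: str,
                  temperature: float = 0.2, top_p: float = 0.7) -> str:
    """Serialize an image + text prompt with sampling parameters.

    The JSON field names and default values here are illustrative
    assumptions, not a documented schema.
    """
    # Temperature scales the logits before sampling, so it cannot be negative.
    if temperature < 0.0:
        raise ValueError("temperature must be non-negative")
    # TopP is a cumulative probability mass, so it must lie in (0, 1].
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in the interval (0, 1]")

    payload = {
        "prompt": prompt,
        # Base64 keeps the raw image bytes safe inside a JSON string.
        "image_b64": base64.b64encode(image_bytes).decode("ascii"),
        "temperature": temperature,
        "top_p": top_p,
    }
    return json.dumps(payload)
```

Validating the sampling parameters before sending keeps malformed requests from reaching the inference server.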

Output:

Output Format: Text
Output Parameters: Max output tokens, Bounding boxes
Other Properties Related to Output: None
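Since the output can include bounding boxes, a client will usually need to map them onto the original image. The sketch below assumes each box arrives as normalized `(x1, y1, x2, y2)` coordinates in the range 0 to 1; this model card does not specify the box format, so that layout and the helper name `to_pixel_box` are assumptions for illustration only.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def to_pixel_box(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert an assumed normalized (x1, y1, x2, y2) box to pixel coordinates."""
    x1, y1, x2, y2 = box
    # Reject boxes outside the unit square or with inverted corners.
    if not (0.0 <= x1 <= x2 <= 1.0 and 0.0 <= y1 <= y2 <= 1.0):
        raise ValueError("box must be normalized with x1 <= x2 and y1 <= y2")
    # Scale horizontal coordinates by width and vertical ones by height.
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))
```

For example, under these assumptions `to_pixel_box((0.25, 0.5, 0.75, 1.0), 640, 480)` yields `(160, 240, 480, 480)`.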

Supported Operating System(s):

Linux

Inference:

Engine: Triton
Test Hardware: Other