Groundbreaking multimodal model designed to understand and reason about visual elements in images.
Kosmos-2 is a groundbreaking multimodal large language model (MLLM) designed to ground text to the visual world, enabling it to understand and reason about visual elements in images.
By using this model, you are agreeing to the terms and conditions of the license, acceptable use policy, and Microsoft Research privacy policy.
Architecture Type: Transformer
Network Architecture: GPT + CLIP
Input Format: Red, Green, Blue (RGB) Image + Text
Input Parameters: Temperature, Top-P
Other Properties Related to Input: None
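The input spec above (an RGB image plus a text prompt, with Temperature and Top-P sampling controls) can be sketched as a request builder. This is a minimal illustration only: the field names and message layout are assumptions, not the service's actual schema.

```python
import base64
import json

def build_request(image_bytes: bytes, prompt: str,
                  temperature: float = 0.2, top_p: float = 0.7,
                  max_tokens: int = 1024) -> str:
    """Assemble a JSON request body matching the input spec:
    an RGB image plus text, with Temperature and Top-P knobs.
    Field names here are illustrative, not the real API contract."""
    payload = {
        "messages": [{
            "role": "user",
            # The image is commonly sent base64-encoded alongside the prompt.
            "content": f'{prompt} <img src="data:image/png;base64,'
                       f'{base64.b64encode(image_bytes).decode()}" />',
        }],
        "temperature": temperature,   # input parameter: Temperature
        "top_p": top_p,               # input parameter: Top-P
        "max_tokens": max_tokens,     # output parameter: max output tokens
    }
    return json.dumps(payload)

body = build_request(b"\x89PNG...", "Describe the scene and ground each object.")
```

The sampling defaults shown (0.2 / 0.7) are placeholders; tune them for your use case.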
Output Format: Text
Output Parameters: Max output tokens, Bounding boxes
Other Properties Related to Output: None
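Since the output pairs generated text with bounding boxes, a common post-processing step is scaling the boxes to pixel coordinates. The sketch below assumes each grounded phrase comes with a box normalized to [0, 1] relative to the image size, as Kosmos-2 style grounding typically produces; the exact response schema is an assumption.

```python
from typing import List, Tuple

# (x1, y1, x2, y2), each normalized to [0, 1] -- an assumed format.
Box = Tuple[float, float, float, float]

def to_pixel_boxes(entities: List[Tuple[str, Box]],
                   width: int, height: int) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Scale normalized grounding boxes to integer pixel coordinates."""
    result = []
    for phrase, (x1, y1, x2, y2) in entities:
        result.append((phrase,
                       (round(x1 * width), round(y1 * height),
                        round(x2 * width), round(y2 * height))))
    return result

# Hypothetical single-entity output for a 640x480 image.
boxes = to_pixel_boxes([("a snowman", (0.39, 0.05, 0.76, 0.87))], 640, 480)
```

Pixel-space boxes can then be drawn directly over the input image for visual verification.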
Supported Operating System(s): Linux
Engine: Triton
Test Hardware: Other