microsoft/kosmos-2
PREVIEW: Groundbreaking multimodal model designed to understand and reason about visual elements in images.
Input: (image)
Output: A young family is sitting in the grass with their dog.