microsoft/phi-3.5-vision-instruct
Cutting-edge open multimodal model excelling in high-quality reasoning from images.
microsoft/florence-2
Vision foundation model capable of performing diverse computer vision and vision-language tasks.
nvidia/nvclip
NV-CLIP is a multimodal embeddings model for image and text.
microsoft/phi-3-vision-128k-instruct
Cutting-edge open multimodal model excelling in high-quality reasoning from images.
google/deplot
One-shot visual language understanding model that translates images of plots into tables.