nvidia/vila
PREVIEW
Multi-modal vision-language model that understands text, images, and video and generates informative responses.
VLM
Vision language model
image caption
image to text