Vision-language model that excels at understanding the physical world through structured reasoning over videos or images.