Vision-language model adept at comprehending text and visual inputs to produce informative responses
The Google PaliGemma-3B-mix model is a one-shot visual language understanding solution for image-to-text generation. This model is ready for commercial use.
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see Google's PaliGemma Model Card.
By using this model, you are agreeing to the terms and conditions of the License, the Acceptable Use Policy, and the Google Research Privacy Policy.
Architecture Type: Transformer
Network Architecture: SigLIP + Gemma
Input Format: Image + Text
Input Parameters: Image: Red, Green, and Blue (RGB); Text: String
Other Properties Related to Input: A prompt instructing the model to caption the image, or a question about the image.
Output Format: Text
Output Parameters: temperature, top_p, max_tokens
Other Properties Related to Output: Stream
Engine: Triton
Test Hardware: Other
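The input and output parameters above (an RGB image plus a text prompt in; temperature, top_p, max_tokens, and streaming out) can be sketched as a request body for an OpenAI-style chat completions endpoint. This is a minimal illustration using only the Python standard library; the exact field names and image-embedding convention are assumptions, so consult the serving endpoint's API reference for the actual schema.

```python
import base64
import json

def build_request(image_bytes: bytes, prompt: str,
                  temperature: float = 0.2, top_p: float = 0.7,
                  max_tokens: int = 256, stream: bool = False) -> str:
    """Assemble a JSON request body pairing an image with a text prompt.

    Field names are illustrative, not the authoritative API schema.
    """
    # Encode the raw image bytes as base64 so they can travel in JSON.
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    payload = {
        # The prompt carries the image inline as a data URI next to the text.
        "messages": [{
            "role": "user",
            "content": f'{prompt} <img src="data:image/png;base64,{image_b64}" />',
        }],
        "temperature": temperature,  # sampling randomness
        "top_p": top_p,              # nucleus-sampling cutoff
        "max_tokens": max_tokens,    # cap on generated tokens
        "stream": stream,            # request incremental (streamed) output
    }
    return json.dumps(payload)

# Example: caption request with placeholder image bytes.
body = build_request(b"\x89PNG-placeholder", "Caption this image.")
```

A real client would POST this body to the inference endpoint with an API key; setting `stream` to true requests the token-by-token output noted under "Other Properties Related to Output".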