
paligemma
PREVIEW
Vision language model adept at comprehending text and visual inputs to produce informative responses
Model Overview
Description:
The Google PaliGemma-3B-mix model is a one-shot visual language understanding solution for image-to-text generation. This model is ready for commercial use.
Third-Party Community Consideration
This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see Google's PaliGemma Model Card.
License, Acceptable Use, and Research Privacy Policy
By using this model, you agree to the terms and conditions of the License, the Acceptable Use Policy, and the Google Research Privacy Policy.
Reference(s):
Model Architecture:
Architecture Type: Transformer
Network Architecture: SigLIP + Gemma
Input:
Input Format: Image + Text
Input Parameters: Image: Red, Green, and Blue (RGB); Text: String
Other Properties Related to Input: The text is either a prompt asking for a caption of the image or a question about the image.
Output:
Output Format: Text
Output Parameters: temperature, top_p, max_tokens
Other Properties Related to Output: Stream
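The input and output fields above can be illustrated with a minimal sketch of how a client might bundle an image, a text prompt, and the listed parameters (temperature, top_p, max_tokens, stream) into one request body. The function name build_request, the payload keys image_b64 and prompt, the default values, and the validation ranges are all assumptions made for illustration; this card does not specify the actual request schema, so consult the serving endpoint's own documentation before relying on any of them.

```python
import base64
import json


def build_request(image_bytes: bytes, prompt: str, *, temperature: float = 0.2,
                  top_p: float = 0.7, max_tokens: int = 256,
                  stream: bool = False) -> dict:
    """Assemble a hypothetical request body for an image + text model.

    The key names and defaults are illustrative assumptions, not the
    documented schema of any particular endpoint.
    """
    if not image_bytes:
        raise ValueError("image_bytes must not be empty")
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    # Generic sampling constraints (assumed, not taken from the model card).
    if temperature < 0:
        raise ValueError("temperature must be >= 0")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    if max_tokens < 1:
        raise ValueError("max_tokens must be >= 1")

    return {
        # Binary image data is base64-encoded so it can travel inside JSON.
        "image_b64": base64.b64encode(image_bytes).decode("ascii"),
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "stream": stream,
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for the contents of a real RGB image file.
    payload = build_request(b"\x89PNG placeholder", "What is in this image?")
    print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from the network call makes the parameter checks easy to test without access to a running inference server.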
Supported Operating System(s):
- Linux
Inference:
Engine: Triton
Test Hardware: Other