google

paligemma

PREVIEW

Vision language model adept at comprehending text and visual inputs to produce informative responses

Model Overview

Description:

The Google PaliGemma-3B-mix model is a one-shot visual language understanding solution for image-to-text generation. This model is ready for commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see Google's PaliGemma Model Card.

License, Acceptable Use, and Research Privacy Policy

By using this model, you are agreeing to the terms and conditions of the License, Acceptable Use Policy and Google Research Privacy Policy.

Reference(s):

Model Architecture:

Architecture Type: Transformer
Network Architecture: SigLIP + Gemma

Input:

Input Format: Image + Text
Input Parameters: Image: Red, Green, and Blue (RGB); Text: String
Other Properties Related to Input: The text is either a prompt asking the model to caption the image or a question about the image.

Output:

Output Format: Text
Output Parameters: temperature, top_p, max_tokens
Other Properties Related to Output: Stream
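
As an illustration of how the inputs and parameters listed above might be combined, the following is a minimal Python sketch that assembles a request body. It is not the documented API for this model: the function name build_request, the JSON field names ("prompt", "image", "stream"), the base64 image encoding, and the default values are all assumptions made for the example. Only the parameter names temperature, top_p, and max_tokens and the streaming option come from this card.

```python
import base64
import json


def build_request(image_bytes: bytes, prompt: str, temperature: float = 0.2,
                  top_p: float = 0.7, max_tokens: int = 256,
                  stream: bool = False) -> dict:
    """Assemble a hypothetical request body for an image + text query.

    The field names and defaults are illustrative assumptions, not a
    published schema for this model.
    """
    if not prompt:
        raise ValueError("prompt must be a non-empty caption request or question")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be between 0 and 1")
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")

    # Binary image data is not valid JSON, so encode it as base64 text (assumed transport format).
    image_b64 = base64.b64encode(image_bytes).decode("ascii")

    return {
        "prompt": prompt,            # caption request or question (Text: String)
        "image": image_b64,          # RGB image, base64-encoded (assumption)
        "temperature": temperature,  # sampling temperature
        "top_p": top_p,              # nucleus sampling threshold
        "max_tokens": max_tokens,    # upper bound on generated tokens
        "stream": stream,            # request a streamed response
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for the contents of a real RGB image file.
    body = build_request(b"\x00\x01\x02", "caption the image", stream=True)
    print(json.dumps(body, indent=2))
```

The sketch only builds and validates the payload; how it is sent (endpoint, authentication, and response format) depends on the deployment and is not specified in this card.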

Supported Operating System(s):

  • Linux

Inference:

Engine: Triton
Test Hardware: Other