NVIDIA
Copyright © 2025 NVIDIA Corporation

google

paligemma

PREVIEW

Vision language model adept at comprehending text and visual inputs to produce informative responses

language generation, vision assistant, visual question answering, computer vision, cv, image, image-to-text, video, vlm
Accelerated by DGX Cloud

Model Overview

Description:

The Google PaliGemma-3B-mix model is a one-shot visual language understanding solution for image-to-text generation. This model is ready for commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It was developed and built to a third party's requirements for this application and use case; see Google's PaliGemma Model Card.

License, Acceptable Use, and Research Privacy Policy

By using this model, you agree to the terms and conditions of the License, the Acceptable Use Policy, and the Google Research Privacy Policy.

Reference(s):

  • SigLIP paper
  • Gemma paper
  • PaliGemma on Hugging Face

Model Architecture:

Architecture Type: Transformer
Network Architecture: SigLIP + Gemma

Input:

Input Format: Image + Text
Input Parameters: Image: Red, Green, and Blue (RGB); Text: String
Other Properties Related to Input: The text is either a prompt to caption the image or a question about the image.

Output:

Output Format: Text
Output Parameters: temperature, top_p, max_tokens
Other Properties Related to Output: Stream
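
The input and output fields above can be sketched as a request body. This is a minimal illustration, not the documented API schema: only temperature, top_p, max_tokens, and streaming are named on this card. The function name build_request, the "prompt" and "image" field names, the "stream" key, the base64 image encoding, the default values, and the validation ranges are all assumptions made for this example. Consult the API Reference for the actual request format.

```python
import base64
import json


def build_request(image_bytes, prompt, temperature=1.0, top_p=0.7,
                  max_tokens=512, stream=False):
    """Assemble a JSON-serializable request body.

    Hypothetical layout: the key names "prompt", "image", and "stream",
    and all default values, are illustrative assumptions.
    """
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    if temperature < 0.0:
        raise ValueError("temperature must be non-negative")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in the interval (0, 1]")
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")

    # Encode the raw RGB image file bytes as base64 text so the body
    # can be serialized as JSON (an assumed transport encoding).
    image_b64 = base64.b64encode(image_bytes).decode("ascii")

    return {
        "prompt": prompt,            # caption instruction or question
        "image": image_b64,          # base64-encoded image file
        "temperature": temperature,  # sampling temperature
        "top_p": top_p,              # nucleus sampling threshold
        "max_tokens": max_tokens,    # cap on generated tokens
        "stream": stream,            # request a streamed response
    }


if __name__ == "__main__":
    placeholder_image = b"placeholder image bytes"  # stand-in, not a real image
    body = build_request(placeholder_image, "What is in this image?",
                         max_tokens=64)
    print(json.dumps(body, indent=2))
```

Validating the sampling parameters before sending keeps malformed requests from reaching the service; the specific ranges checked here are conventional choices, not limits stated on this card.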

Supported Operating System(s):

  • Linux

Inference:

Engine: Triton
Test Hardware: Other