Cutting-edge text generation model for text understanding, transformation, and code generation.
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained and instruction-tuned variants. Gemma models are well suited to a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources, such as a laptop, a desktop, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.
Author: Google
Model Page: Gemma
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case.
Prohibited uses of Gemma models are outlined in the Gemma Prohibited Use Policy.
Summary description and brief definition of inputs and outputs.
Input Type(s): Text
Input Format(s): String
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: Text can be a question, a prompt, or a document to be summarized.
Output Type(s): Text
Output Format(s): String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Generated English-language text in response to the input (e.g., an answer to the question, a summary of the document).
@article{gemma_2024, title={Gemma}, url={https://www.kaggle.com/m/3301}, DOI={10.34740/KAGGLE/M/3301}, publisher={Kaggle}, author={Gemma Team}, year={2024} }
These models have certain limitations that users should be aware of.
Open Large Language Models (LLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.
Data used for model training and how the data was processed.
These models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 13 trillion tokens and the 9B model was trained with 8 trillion tokens. Here are the key components:
The combination of these diverse data sources is crucial for training a powerful language model that can handle a wide variety of different tasks and text formats.
Here are the key data cleaning and filtering methods applied to the training data:
The endpoint available on the NGC catalog is accelerated by TensorRT-LLM, an open-source library for optimizing inference performance. Gemma is compatible across NVIDIA AI platforms, from the data center and cloud to local PCs with RTX GPU systems.
Gemma models use a vocabulary size of 256K and support a context length of up to 4K while using rotary positional embedding (RoPE). With support for Position Interpolation (PI) available in TensorRT-LLM, Gemma models using RoPE can support longer output sequence lengths at inference time while retaining original model architecture.
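To make the interaction between RoPE and Position Interpolation more concrete, the following is a minimal NumPy sketch, not the TensorRT-LLM implementation or its API. The function names (`rope_angles`, `apply_rope`), the base of 10000, the tiny head dimension, and the 8192-token target length are all illustrative assumptions. The interleaved pairing of features is one common convention; real implementations may pair features differently.

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0, scale=1.0):
    """Return rotation angles of shape (len(positions), head_dim // 2).

    scale < 1 applies Position Interpolation: positions are compressed
    linearly so a longer sequence maps into the trained position range.
    (base=10000.0 is an assumed, commonly used default.)
    """
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    scaled_positions = np.asarray(positions, dtype=np.float64) * scale
    return np.outer(scaled_positions, inv_freq)

def apply_rope(x, angles):
    """Rotate consecutive feature pairs of x (seq_len, head_dim) by angles."""
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

trained_len = 4096   # context length stated in the text above
target_len = 8192    # hypothetical longer sequence length
scale = trained_len / target_len  # interpolation factor (0.5 here)

positions = np.arange(target_len)
angles = rope_angles(positions, head_dim=8, scale=scale)

# The largest interpolated position stays inside the trained range.
print(positions[-1] * scale)  # 4095.5
```

Because only the position indices are rescaled, the model weights and architecture are left unchanged, which is the property the paragraph above describes.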
Training was done using JAX and ML Pathways.
JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models.
ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. It is especially suitable for foundation models, including large language models such as these.
Together, JAX and ML Pathways are used as described in the paper about the Gemini family of models: "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."
Model evaluation metrics and results.
These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:
Benchmark | Metric | Gemma PT 9B | Gemma PT 27B |
---|---|---|---|
MMLU | 5-shot, top-1 | 71.3 | 75.2 |
HellaSwag | 10-shot | 81.9 | 86.4 |
PIQA | 0-shot | 81.7 | 83.2 |
SocialIQA | 0-shot | 53.4 | 53.7 |
BoolQ | 0-shot | 84.2 | 84.8 |
WinoGrande | partial score | 80.6 | 83.7 |
ARC-e | 0-shot | 88.0 | 88.6 |
ARC-c | 25-shot | 68.4 | 71.4 |
TriviaQA | 5-shot | 76.6 | 83.7 |
Natural Questions | 5-shot | 29.2 | 34.5 |
HumanEval | pass@1 | 40.2 | 51.8 |
MBPP | 3-shot | 52.4 | 62.6 |
GSM8K | 5-shot, maj@1 | 68.6 | 74.0 |
MATH | 4-shot | 36.6 | 42.3 |
AGIEval | 3-5-shot | 52.8 | 55.1 |
BIG-Bench | 3-shot, CoT | 68.2 | 74.9 |
Ethics and safety evaluation approach and results.
Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:
The results of ethics and safety evaluations are within acceptable thresholds for meeting internal policies for categories such as child safety, content safety, representational harms, memorization, and large-scale harms. In addition to robust internal evaluations, the results of well-known safety benchmarks such as BBQ, BOLD, Winogender, Winobias, RealToxicity, and TruthfulQA are shown here.
Benchmark | Metric | Gemma 2 IT 9B | Gemma 2 IT 27B |
---|---|---|---|
RealToxicity | average | 8.25 | 8.84 |
CrowS-Pairs | top-1 | 37.47 | 36.67 |
BBQ Ambig | 1-shot, top-1 | 88.58 | 85.99 |
BBQ Disambig | top-1 | 82.67 | 86.94 |
Winogender | top-1 | 79.17 | 77.22 |
TruthfulQA | | 50.27 | 51.60 |
Winobias 1_2 | | 78.09 | 81.94 |
Winobias 2_2 | | 95.32 | 97.22 |
Toxigen | | 39.30 | 38.42 |
The development of large language models (LLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:
Risks identified and mitigations:
At the time of release, this family of models provides high-performance open large language model implementations designed from the ground up for Responsible AI development compared to similarly sized models.
Using the benchmark evaluation metrics described in this document, these models have been shown to provide superior performance to other, comparably sized open model alternatives.