Cutting-edge text generation model for text understanding, transformation, and code generation.
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights, pre-trained variants, and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.
Author: Google
Model Page: Gemma
Model Card: https://ai.google.dev/gemma/docs/model_card
Resources and Technical Documentation:
Terms of Use: Terms
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case.
Input Type(s): Text
Input Format(s): String
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: Text can be a question, a prompt, or a document to be summarized.
Output Type(s): Text
Output Format(s): String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Generated English-language text in response to the input (e.g., an answer to the question, a summary of the document).
These models have certain limitations that users should be aware of.
Open Large Language Models (LLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. Its purpose is to provide contextual information about the possible use cases that the model creators considered as part of model training and development.
These models were trained on a text dataset that includes a wide variety of sources, totaling 6 trillion tokens. Here are the primary training data sources:
The combination of these diverse data sources is crucial for training a powerful language model that can handle a wide variety of different tasks and text formats.
Here are the key data cleaning and filtering methods applied to the training data:
The endpoint available on the NGC catalog is accelerated by TensorRT-LLM, an open-source library for optimizing inference performance. Gemma is compatible across NVIDIA AI platforms, from the data center and cloud to local PCs with RTX GPU systems.
Gemma models use a vocabulary size of 256K, support a context length of up to 8K, and use rotary positional embeddings (RoPE). With support for Position Interpolation (PI) available in TensorRT-LLM, Gemma models using RoPE can support longer output sequence lengths at inference time while retaining the original model architecture.
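The idea behind Position Interpolation can be illustrated with a minimal, pure-Python sketch. It is not the TensorRT-LLM or Gemma implementation: the function names, the RoPE base of 10000, the pairing of adjacent dimensions, and the 16K target length are all assumptions chosen for illustration. The sketch shows only the core step, which is to scale token positions by `trained_length / target_length` before computing the rotation angles, so that positions beyond the trained context map back into the range seen during training.

```python
import math

def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one token position.

    scale < 1 applies position interpolation: the position is compressed
    before the angles are computed. base=10000.0 is an assumed default.
    """
    pos = position * scale
    return [pos / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]

def apply_rope(vec, position, base=10000.0, scale=1.0):
    """Rotate the pairs (x0, x1), (x2, x3), ... of vec by the RoPE angles.

    Pairing adjacent dimensions is one common convention (an assumption here).
    """
    out = []
    for i, theta in enumerate(rope_angles(position, len(vec), base, scale)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out

TRAINED_CONTEXT = 8192   # context length stated in this document
TARGET_CONTEXT = 16384   # hypothetical longer length, for illustration only
scale = TRAINED_CONTEXT / TARGET_CONTEXT

q = [1.0, 0.0, 0.5, -0.5]  # toy query vector with head_dim = 4

# With interpolation, position 16000 receives the same rotation that
# position 8000 receives without it, so it stays inside the trained range.
interpolated = apply_rope(q, 16000, scale=scale)
reference = apply_rope(q, 8000)
```

Because only the positions fed to the rotation change, the weights and layer structure are untouched, which is consistent with the statement that the original model architecture is retained.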
The model is converted to the .nemo format for easy customization with the NVIDIA NeMo framework, an end-to-end framework to curate data, tune models, and deploy anywhere. It supports various customization techniques, including RLHF, SFT, LoRA, and SteerLM.
Model evaluation metrics and results.
These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:
Benchmark | Metric | 2B Params | 7B Params |
---|---|---|---|
MMLU | 5-shot, top-1 | 42.3 | 64.3 |
HellaSwag | 0-shot | 71.4 | 81.2 |
PIQA | 0-shot | 77.3 | 81.2 |
SocialIQA | 0-shot | 59.7 | 51.8 |
BoolQ | 0-shot | 69.4 | 83.2 |
WinoGrande | partial score | 65.4 | 72.3 |
CommonsenseQA | 7-shot | 65.3 | 71.3 |
OpenBookQA | | 47.8 | 52.8 |
ARC-e | | 73.2 | 81.5 |
ARC-c | | 42.1 | 53.2 |
TriviaQA | 5-shot | 53.2 | 63.4 |
Natural Questions | 5-shot | 23 | |
HumanEval | pass@1 | 22.0 | 32.3 |
MBPP | 3-shot | 29.2 | 44.4 |
GSM8K | maj@1 | 17.7 | 46.4 |
MATH | 4-shot | 11.8 | 24.3 |
AGIEval | | 24.2 | 41.7 |
BIG-Bench | | 35.2 | 55.1 |
Average | | 54.0 | 56.4 |
Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:
The results of ethics and safety evaluations are within acceptable thresholds for meeting internal policies for categories such as child safety, content safety, representational harms, memorization, and large-scale harms. On top of robust internal evaluations, the results of well-known safety benchmarks like BBQ, BOLD, Winogender, Winobias, RealToxicity, and TruthfulQA are shown here.
Benchmark | Metric | 2B Params | 7B Params |
---|---|---|---|
RealToxicity | average | 6.86 | 7.90 |
BOLD | | 45.57 | 49.08 |
CrowS-Pairs | top-1 | 45.82 | 51.33 |
BBQ Ambig | 1-shot, top-1 | 62.58 | 92.54 |
BBQ Disambig | top-1 | 54.62 | 71.99 |
Winogender | top-1 | 51.25 | 54.17 |
TruthfulQA | | 44.84 | 31.81 |
Winobias 1_2 | | 56.12 | 59.09 |
Winobias 2_2 | | 91.10 | 92.23 |
Toxigen | | 29.77 | 39.59 |
The development of large language models (LLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:
Risks Identified and Mitigations:
Google's commitments to operate sustainably.
At the time of release, this family of models provides high-performance open large language model implementations designed from the ground up for Responsible AI development compared to similarly sized models.
Using the benchmark evaluation metrics described in this document, these models have been shown to provide superior performance to other, comparably sized open model alternatives.