Cutting-edge, lightweight open language model excelling in high-quality reasoning.
Phi-3 Medium-128K-Instruct Model Card
| Developers | Microsoft GenAI |
|---|---|
| Description | Phi-3 Medium is a lightweight, state-of-the-art open model built upon datasets used for Phi-2 - synthetic data and filtered publicly available websites - with a focus on very high-quality, reasoning-dense data. The model belongs to the Phi-3 model family, and the Medium version comes in two variants, 4K and 128K, which is the context length (in tokens) it can support. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures. |
| License | MIT |
| Third-Party Community Consideration | This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case. |
| Architecture | Phi-3 Medium has 14B parameters and is a dense decoder-only Transformer model using the same tokenizer as Phi-3 Mini. |
| Inputs | Text. It’s best suited for prompts using the chat format. |
| Context length | 128K tokens |
| GPUs | 512 H100-80G |
| Training time | 42 days |
| Training data | 4.8T tokens |
| Outputs | Generated text in response to the input |
| Dates | Our models were trained between February 2024 and April 2024 |
| Status | This is a static model trained on an offline dataset with a cutoff date of October 2023 for publicly available data. Future versions of the tuned models may be released as we improve the models. |
| Primary use cases | The model is intended for commercial and research use in English. It provides uses for applications that require 1) memory/compute-constrained environments; 2) latency-bound scenarios; 3) strong reasoning (especially math and logic). Our model is designed to accelerate research on language and multimodal models, for use as a building block for generative-AI-powered features. |
|---|---|
| Out-of-scope use cases | Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. |
Our training data includes a wide variety of sources, totaling 4.8 trillion tokens (including 10% multilingual), and is a combination of 1) publicly available documents filtered rigorously for quality, selected high-quality educational data, and code; 2) newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, and general knowledge of the world (science, daily activities, theory of mind, etc.); 3) high-quality chat-format supervised data covering various topics to reflect human preferences on different aspects such as instruction-following, truthfulness, honesty, and helpfulness.
We evaluated the model across a breadth of public and internal benchmarks to understand its capabilities in the most comprehensive way under multiple tasks and conditions.
The Phi-3 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall safety-alignment technique is a combination of SFT (Supervised Fine-Tuning) and a modified version of RLHF (Reinforcement Learning from Human Feedback) utilizing human-labeled and synthetic datasets, including publicly available datasets focusing on helpfulness and harmlessness as well as various questions and answers targeted to multiple safety categories.
Prior to release, the Phi-3 family of models followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools utilizing adversarial conversation simulation. For qualitative safety evaluation, we collaborated with the AI Red Team at Microsoft to assess safety risks posed by Phi-3-Medium in both average and adversarial user scenarios. The assessment covered eight predetermined risk categories, with automated scoring followed by thorough manual reviews of the model responses.
Please refer to the technical report for more details on safety alignment.
To understand the capabilities, we compare Phi-3 Medium with a set of models over a variety of benchmarks using Microsoft's internal benchmark platform BabelBench (See Appendix A for benchmark methodology).
Here is a high-level overview of model quality on representative benchmarks:
| Category | Benchmark | Phi-3 Medium-128K-Instruct | Command R+ 104B | Mixtral-8x22B | Llama-3-70B-Instruct | GPT-3.5-Turbo-1106 | Gemini Pro | GPT-4-Turbo-1106 (Chat) |
|---|---|---|---|---|---|---|---|---|
| Popular aggregated benchmark | AGI Eval | 49.7 | 50.1 | 54.0 | 56.9 | 48.4 | 49.0 | 59.6 |
| | MMLU | 76.6 | 73.8 | 76.2 | 80.2 | 71.4 | 66.7 | 84.0 |
| | BigBench Hard | 77.9 | 74.1 | 81.8 | 80.4 | 68.3 | 75.6 | 87.7 |
| Language Understanding | ANLI | 57.3 | 63.4 | 65.2 | 68.3 | 58.1 | 64.2 | 71.7 |
| | HellaSwag | 81.6 | 78.0 | 79.0 | 82.6 | 78.8 | 76.2 | 88.3 |
| Reasoning | ARC Challenge | 91.0 | 86.9 | 91.3 | 93.0 | 87.4 | 88.3 | 95.6 |
| | ARC Easy | 97.6 | 95.7 | 96.9 | 98.2 | 96.3 | 96.1 | 98.8 |
| | BoolQ | 86.5 | 86.1 | 82.7 | 89.1 | 79.1 | 86.4 | 91.3 |
| | CommonsenseQA | 82.2 | 82.0 | 82.0 | 84.4 | 79.6 | 81.8 | 86.7 |
| | MedQA | 67.6 | 59.2 | 67.9 | 78.5 | 63.4 | 58.2 | 83.7 |
| | OpenBookQA | 87.2 | 86.8 | 88.6 | 91.8 | 86.0 | 86.4 | 93.4 |
| | PIQA | 87.8 | 86.4 | 85.0 | 85.3 | 86.6 | 86.2 | 90.1 |
| | Social IQA | 79.0 | 75.3 | 78.2 | 81.1 | 68.3 | 75.4 | 81.7 |
| | TruthfulQA (MC2) | 74.3 | 57.8 | 67.4 | 81.9 | 67.7 | 72.6 | 85.2 |
| | WinoGrande | 78.9 | 77.0 | 75.3 | 83.3 | 68.8 | 72.2 | 86.7 |
| Factual Knowledge | TriviaQA | 73.9 | 82.8 | 84.5 | 78.5 | 85.8 | 80.2 | 73.3 |
| Math | GSM8K Chain of Thought | 87.5 | 78.3 | 83.8 | 93.5 | 78.1 | 80.4 | 94.2 |
| | MATH | 24.8 | 20.9 | 32.8 | 46.3 | 42.6 | 30.8 | 56.9 |
| Code Generation | HumanEval | 58.5 | 61.6 | 39.6 | 78.7 | 62.2 | 64.4 | 79.9 |
| | MBPP | 73.8 | 68.9 | 70.7 | 81.3 | 77.8 | 73.2 | 86.7 |
| Average | | 74.7 | 72.3 | 74.1 | 80.7 | 72.7 | 73.2 | 83.8 |
We take a closer look at different categories across 80 public benchmark datasets in the table below:
| Category | Phi-3-Medium-128K-Instruct | Command R+ 104B | Mixtral-8x22B | Llama-3-70B-Instruct | GPT-3.5-Turbo-1106 | Gemini Pro | GPT-4-Turbo-1106 (Chat) |
|---|---|---|---|---|---|---|---|
| Popular aggregated benchmark | 72.3 | 69.9 | 73.4 | 76.3 | 67.0 | 67.5 | 80.5 |
| Reasoning | 83.2 | 79.3 | 81.5 | 86.7 | 78.3 | 80.4 | 89.3 |
| Language understanding | 75.3 | 75.7 | 78.7 | 77.9 | 70.4 | 75.3 | 81.6 |
| Code generation | 64.2 | 68.6 | 60.0 | 69.3 | 70.4 | 66.7 | 76.1 |
| Math | 52.9 | 45.3 | 52.5 | 59.7 | 52.8 | 50.9 | 67.1 |
| Factual knowledge | 47.5 | 60.3 | 60.6 | 52.4 | 63.4 | 54.6 | 45.9 |
| Multilingual | 62.2 | 67.8 | 69.8 | 62.0 | 67.0 | 73.4 | 78.2 |
| Robustness | 70.2 | 57.9 | 65.5 | 78.7 | 69.3 | 69.7 | 84.6 |
Overall, Phi-3 Medium-128K-Instruct, with only 14B parameters, achieves a similar level of language understanding, code, and math capability as much larger models. Moreover, the model outperforms bigger models in reasoning capability, trailing only GPT-4-Turbo. However, it is still fundamentally limited by its size for certain tasks: the model simply does not have the capacity to store extensive world knowledge, which can be seen, for example, in its low performance on TriviaQA. We believe such weakness can be resolved by augmenting Phi-3 Medium with a search engine.
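As a rough illustration of what such search augmentation might look like, below is a minimal sketch of assembling a search-augmented user prompt. Note that `web_search` is a hypothetical retrieval function standing in for a real search API, not part of the model or its tooling:

```python
from typing import Callable, List

def build_search_augmented_prompt(
    question: str,
    web_search: Callable[[str], List[str]],  # hypothetical search API
    k: int = 3,
) -> str:
    """Prepend the top-k retrieved snippets so the model can answer from
    retrieved text rather than from its (limited) parametric memory."""
    snippets = web_search(question)[:k]
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Answer the question using the search results below.\n"
        f"Search results:\n{context}\n\n"
        f"Question: {question}"
    )
```

The returned string would then be sent as the user turn in the chat format described later in this card.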
Phi-3 Medium-128K-Instruct supports a 128K context length and is therefore capable of several long-context tasks, including long document/meeting summarization and long-document QA. With just 14B parameters, Phi-3 Medium outperforms models of a similar size and is competitive with much larger models such as Mixtral-8x22B.
| Benchmark | Phi-3-Medium-128K-Instruct | Command R+ 104B | Mixtral-8x22B | Llama-3-70B-Instruct | Gemini-Pro | GPT-4-Turbo-1106 (Chat) |
|---|---|---|---|---|---|---|
| GovReport | 25.6 | 12.7 | 18.6 | 12.5 | 25.1 | 26.2 |
| QMSum | 20.9 | 4.4 | 20.8 | 4.4 | 22.7 | 23.5 |
| Qasper | 27.3 | 16.6 | 36.5 | 30.5 | 41.4 | 42.3 |
| SQuALITY | 25.4 | 17.3 | 17.4 | 25.7 | 23.0 | 26.1 |
| SummScreenFD | 12.4 | 5.2 | 14.2 | 9.2 | 16.2 | 19.0 |
| Average | 22.3 | 11.2 | 21.5 | 16.5 | 25.7 | 27.4 |
Given the nature of the training data, the Phi-3 Medium-128K-Instruct model is best suited for prompts using the chat format as follows:
```
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>
```
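Rather than assembling these special tokens by hand, the tokenizer's built-in chat template can produce the same string from a list of messages. Here is a minimal sketch, using the same model id as the sample code below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-Medium-128K-Instruct")
messages = [{"role": "user", "content": "How to explain Internet for a medieval knight?"}]

# add_generation_prompt=True appends the trailing <|assistant|> tag
# so the model knows to begin its reply.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```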
After obtaining the Phi-3 Medium-128K-Instruct model checkpoints, users can use this sample code for inference.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model_id = "microsoft/Phi-3-Medium-128K-Instruct"

# Load the model and tokenizer; trust_remote_code is required for the
# custom Phi-3 model code bundled with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Multi-turn conversation in the chat format shown above.
messages = [
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy (deterministic) decoding; only the newly generated text is returned.
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]["generated_text"])
```
ONNX Runtime now supports Phi-3 Medium models across platforms and hardware.
Optimized Phi-3 models are also published here in ONNX format, to run with ONNX Runtime on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these targets. DirectML GPU acceleration is supported for Windows desktop GPUs (AMD, Intel, and NVIDIA).
Along with DirectML, ONNX Runtime provides cross-platform support for Phi-3 Medium across a range of devices (CPU, GPU, and mobile).
Here are some of the optimized configurations we have added: ONNX models for int4 DML (quantized to int4 via AWQ), an ONNX model for fp16 CUDA, and ONNX models for int4 CUDA and int4 CPU/mobile (quantized to int4 via RTN).
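For illustration, here is a rough sketch of running one of these ONNX builds with the `onnxruntime-genai` package; the local model path is a placeholder, and the exact generation-loop API varies between package versions:

```python
import onnxruntime_genai as og

# Path to a downloaded Phi-3 ONNX build (placeholder).
model = og.Model("./phi-3-medium-128k-instruct-onnx")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Prompt in the chat format described above.
prompt = "<|user|>\nHow to explain Internet for a medieval knight?<|end|>\n<|assistant|>\n"

params = og.GeneratorParams(model)
params.set_search_options(max_length=500)
params.input_ids = tokenizer.encode(prompt)

# Token-by-token generation loop, printing text as it streams out.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```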
Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include degraded quality of service for languages other than English, representation of harms and perpetuation of stereotypes, inappropriate or offensive content, unreliable information (plausible-sounding but inaccurate or outdated content), and limited scope for code outside Python and its most common packages.
Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Important areas for consideration include allocation scenarios with consequential impact (e.g. legal status or access to resources), other high-risk scenarios, misinformation, generation of harmful content, and misuse.
We include a brief word on methodology here - and in particular, how we think about optimizing prompts.
In an ideal world, we would never change any prompts in our benchmarks to ensure it is always an apples-to-apples comparison when comparing different models. Indeed, this is our default approach, and is the case in the vast majority of models we have run to date.
There are, however, some exceptions to this. In some cases, we see a model that performs worse than expected on a given eval due to a failure to respect the output format. For example, a model may refuse to answer (for no apparent reason), or in coding tasks may prefix its response with conversational filler such as "Sure, I can help with that...", which breaks the answer parser. In such cases, we allow adjusting the prompt (e.g., the system message) to counter such behaviors.
However, we do not change the underlying task, the few-shot examples, or the scoring logic, and we never tune prompts to coach a particular model toward correct answers; adjustments are limited to restoring output-format compliance.
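To make the output-format issue concrete, here is an illustrative sketch of the kind of multiple-choice answer extractor an evaluation harness might use; this is a stand-in for exposition, not our actual benchmark code:

```python
import re
from typing import Optional

def extract_choice(response: str) -> Optional[str]:
    """Illustrative multiple-choice extractor: returns "A".."D" or None.

    A response like "Sure, I can help! The answer is (B)." still parses,
    but a refusal or a free-form essay returns None and would be scored
    as incorrect -- the failure mode described above.
    """
    text = response.strip()
    # Preferred: an explicit "answer is X" / "answer: X" statement.
    m = re.search(r"answer\s*(?:is|:)?\s*\(?([A-D])\)?", text, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Fallback: the whole response is just the letter, e.g. "B." or "(C)".
    m = re.fullmatch(r"\(?([A-D])\)?\.?", text)
    return m.group(1) if m else None
```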